linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
@ 2020-11-18  8:27 Bharata B Rao
  2020-11-18 11:25 ` Vlastimil Babka
  2021-01-20 17:36 ` Vincent Guittot
  0 siblings, 2 replies; 37+ messages in thread
From: Bharata B Rao @ 2020-11-18  8:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, cl, rientjes, iamjoonsoo.kim, akpm, guro, vbabka,
	shakeelb, hannes, aneesh.kumar, Bharata B Rao

The page order of the slab that gets chosen for a given slab
cache depends on the number of objects that can fit in the
slab while meeting other requirements. We start with a value
of minimum objects based on nr_cpu_ids, which is driven by the
possible number of CPUs and hence could be higher than the
actual number of CPUs present in the system. This leads to
calculate_order() choosing a page order that is on the higher
side, leading to increased slab memory consumption on systems
that have bigger page sizes.

Hence rely on the number of online CPUs when determining the
minimum objects, thereby increasing the chances of choosing
a lower, more conservative page order for the slab.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
This is a generic change and I am unsure how it would affect
other archs, but as a start, here are some numbers from
PowerPC pseries KVM guest with and without this patch:

This table shows how this change has affected some of the slab
caches.
===================================================================
		Current				Patched
Cache	<objperslab> <pagesperslab>	<objperslab> <pagesperslab>
===================================================================
TCPv6		53    2			26    1
net_namespace	53    4			26    2
dtl		32    2			16    1
names_cache	32    2			16    1
task_struct	53    8			13    2
thread_stack	32    8			8     2
pgtable-2^11	16    8			8     4
pgtable-2^8	32    2			16    1
kmalloc-32k	16    8			8     4
kmalloc-16k	32    8			8     2
kmalloc-8k	32    4			8     1
kmalloc-4k	32    2			16    1
===================================================================
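
To illustrate the heuristic itself: assuming, for the sake of example,
a guest where nr_cpu_ids is 1024 while 64 CPUs are actually online, the
minimum-objects target changes as follows:

	4 * (fls(1024) + 1) = 4 * 12 = 48	based on possible CPUs
	4 * (fls(64) + 1)   = 4 * 8  = 32	based on online CPUs

The smaller target lets calculate_order() settle on a lower page order
for the larger-object caches above.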

Slab memory (kB) consumption comparison
==================================================================
			Current		Patched
==================================================================
After-boot		205760		156096
During-hackbench	629145		506752 (Avg of 5 runs)
After-hackbench		474176		331840 (after drop_caches)
==================================================================

Hackbench Time (Avg of 5 runs)
(hackbench -s 1024 -l 200 -g 200 -f 25 -P)
==========================================
Current		Patched
==========================================
10.990		11.010
==========================================

Measuring the effect due to CPU hotplug
----------------------------------------
Since the patch doesn't consider all the possible CPUs for the page
order calculation, let's see how it affects the case when CPUs are
hotplugged. Here I compare a system that is booted with 64 CPUs
against a system that is booted with 16 CPUs and then hotplugged with
48 CPUs after boot. These numbers are with the patch applied.

Slab memory (kB) consumption comparison
===================================================================
			64bootCPUs	16bootCPUs+48HotPluggedCPUs
===================================================================
After-boot		390272		159744
After-hotplug		-		251328
During-hackbench	1001267		941926 (Avg of 5 runs)
After-hackbench		913600		827200 (after drop_caches)
===================================================================

Hackbench Time (Avg of 5 runs)
(hackbench -s 1024 -l 200 -g 200 -f 25 -P)
===========================================
64bootCPUs	16bootCPUs+48HotPluggedCPUs
===========================================
12.554		12.589
===========================================
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 34dcc09e2ec9..8342c0a167b2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3433,7 +3433,7 @@ static inline int calculate_order(unsigned int size)
 	 */
 	min_objects = slub_min_objects;
 	if (!min_objects)
-		min_objects = 4 * (fls(nr_cpu_ids) + 1);
+		min_objects = 4 * (fls(num_online_cpus()) + 1);
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);
 
-- 
2.26.2



* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2020-11-18  8:27 [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order Bharata B Rao
@ 2020-11-18 11:25 ` Vlastimil Babka
  2020-11-18 19:34   ` Roman Gushchin
  2021-01-20 17:36 ` Vincent Guittot
  1 sibling, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2020-11-18 11:25 UTC (permalink / raw)
  To: Bharata B Rao, linux-kernel
  Cc: linux-mm, cl, rientjes, iamjoonsoo.kim, akpm, guro, shakeelb,
	hannes, aneesh.kumar

On 11/18/20 9:27 AM, Bharata B Rao wrote:
> The page order of the slab that gets chosen for a given slab
> cache depends on the number of objects that can fit in the
> slab while meeting other requirements. We start with a value
> of minimum objects based on nr_cpu_ids, which is driven by the
> possible number of CPUs and hence could be higher than the
> actual number of CPUs present in the system. This leads to
> calculate_order() choosing a page order that is on the higher
> side, leading to increased slab memory consumption on systems
> that have bigger page sizes.
> 
> Hence rely on the number of online CPUs when determining the
> minimum objects, thereby increasing the chances of choosing
> a lower, more conservative page order for the slab.
> 
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Ideally, we would react to hotplug events and update existing caches 
accordingly. But for that, recalculation of order for existing caches 
would have to be made safe, while not affecting hot paths. We have 
removed the sysfs interface with 32a6f409b693 ("mm, slub: remove runtime 
allocation order changes") as it didn't seem easy and worth the trouble.

In case somebody wants to start with a large order right from boot
because they know they will hotplug lots of cpus later, they can use the
slub_min_objects= boot param to override this heuristic. So in case this
change regresses somebody's performance, there's a way around it and
thus the risk is low IMHO.
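
As a concrete example (CPU count chosen arbitrarily): a machine that is
expected to eventually run 224 CPUs could boot with

	slub_min_objects=36

on the kernel command line, since 4 * (fls(224) + 1) = 36 is what the
heuristic would compute once all 224 CPUs are online.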

> [...]



* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2020-11-18 11:25 ` Vlastimil Babka
@ 2020-11-18 19:34   ` Roman Gushchin
  2020-11-18 19:53     ` David Rientjes
  0 siblings, 1 reply; 37+ messages in thread
From: Roman Gushchin @ 2020-11-18 19:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Bharata B Rao, linux-kernel, linux-mm, cl, rientjes,
	iamjoonsoo.kim, akpm, shakeelb, hannes, aneesh.kumar

On Wed, Nov 18, 2020 at 12:25:38PM +0100, Vlastimil Babka wrote:
> On 11/18/20 9:27 AM, Bharata B Rao wrote:
> > [...]
> > 
> > Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Ideally, we would react to hotplug events and update existing caches
> accordingly. But for that, recalculation of order for existing caches would
> have to be made safe, while not affecting hot paths. We have removed the
> sysfs interface with 32a6f409b693 ("mm, slub: remove runtime allocation
> order changes") as it didn't seem easy and worth the trouble.
> 
> In case somebody wants to start with a large order right from boot
> because they know they will hotplug lots of cpus later, they can use the
> slub_min_objects= boot param to override this heuristic. So in case this
> change regresses somebody's performance, there's a way around it and thus
> the risk is low IMHO.

I agree. For the absolute majority of users there will be no difference.
And there is a good workaround for the rest.

Acked-by: Roman Gushchin <guro@fb.com>


* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2020-11-18 19:34   ` Roman Gushchin
@ 2020-11-18 19:53     ` David Rientjes
  0 siblings, 0 replies; 37+ messages in thread
From: David Rientjes @ 2020-11-18 19:53 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vlastimil Babka, Bharata B Rao, linux-kernel, linux-mm, cl,
	iamjoonsoo.kim, akpm, shakeelb, hannes, aneesh.kumar

On Wed, 18 Nov 2020, Roman Gushchin wrote:

> On Wed, Nov 18, 2020 at 12:25:38PM +0100, Vlastimil Babka wrote:
> > On 11/18/20 9:27 AM, Bharata B Rao wrote:
> > > [...]
> > > 
> > > Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> > 
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > 
> > [...]
> 
> I agree. For the absolute majority of users there will be no difference.
> And there is a good workaround for the rest.
> 
> Acked-by: Roman Gushchin <guro@fb.com>
> 

Acked-by: David Rientjes <rientjes@google.com>


* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2020-11-18  8:27 [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order Bharata B Rao
  2020-11-18 11:25 ` Vlastimil Babka
@ 2021-01-20 17:36 ` Vincent Guittot
  2021-01-21  5:30   ` Bharata B Rao
  1 sibling, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2021-01-20 17:36 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, Christoph Lameter, David Rientjes,
	Joonsoo Kim, Andrew Morton, guro, vbabka, shakeelb,
	Johannes Weiner, aneesh.kumar

Hi,

On Wed, 18 Nov 2020 at 09:28, Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> [...]

I'm facing a significant performance regression on a large arm64 server
system (224 CPUs). The regression is also present on a small arm64
system (8 CPUs), but at a far smaller order of magnitude.

On the 224-CPU system: 9 iterations of hackbench -l 16000 -g 16
v5.11-rc4 : 9.135sec (+/- 0.45%)
v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
v5.10: 3.136sec (+/- 0.40%)

This is a 191% regression compared to v5.10.

The problem is that calculate_order() is called a number of times
before secondary CPUs are booted, and num_online_cpus() returns 1
instead of 224 (so the heuristic yields only 4 * (fls(1) + 1) = 8
minimum objects instead of 36). This makes the use of num_online_cpus()
irrelevant for those cases.

After adding "slub_min_objects=36" to my command line, which equals
4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224,
the regression disappears:

9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)



> diff --git a/mm/slub.c b/mm/slub.c
> index 34dcc09e2ec9..8342c0a167b2 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3433,7 +3433,7 @@ static inline int calculate_order(unsigned int size)
>          */
>         min_objects = slub_min_objects;
>         if (!min_objects)
> -               min_objects = 4 * (fls(nr_cpu_ids) + 1);
> +               min_objects = 4 * (fls(num_online_cpus()) + 1);
>         max_objects = order_objects(slub_max_order, size);
>         min_objects = min(min_objects, max_objects);
>
> --
> 2.26.2
>


* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-20 17:36 ` Vincent Guittot
@ 2021-01-21  5:30   ` Bharata B Rao
  2021-01-21  9:09     ` Vincent Guittot
  2021-01-21 10:01     ` Christoph Lameter
  0 siblings, 2 replies; 37+ messages in thread
From: Bharata B Rao @ 2021-01-21  5:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, linux-mm, Christoph Lameter, David Rientjes,
	Joonsoo Kim, Andrew Morton, guro, vbabka, shakeelb,
	Johannes Weiner, aneesh.kumar

On Wed, Jan 20, 2021 at 06:36:31PM +0100, Vincent Guittot wrote:
> Hi,
> 
> On Wed, 18 Nov 2020 at 09:28, Bharata B Rao <bharata@linux.ibm.com> wrote:
> >
> > [...]
> 
> > I'm facing a significant performance regression on a large arm64 server
> > system (224 CPUs). The regression is also present on a small arm64
> > system (8 CPUs), but at a far smaller order of magnitude.
> 
> > On the 224-CPU system: 9 iterations of hackbench -l 16000 -g 16
> v5.11-rc4 : 9.135sec (+/- 0.45%)
> v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
> v5.10: 3.136sec (+/- 0.40%)
> 
> This is a 191% regression compared to v5.10.
> 
> > The problem is that calculate_order() is called a number of times
> > before secondary CPUs are booted, and num_online_cpus() returns 1
> > instead of 224. This makes the use of num_online_cpus() irrelevant
> > for those cases.
> > 
> > After adding "slub_min_objects=36" to my command line, which equals
> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224,
> > the regression disappears:
> 
> 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)

Should we have switched to num_present_cpus() rather than
num_online_cpus()? If so, the below patch should address the
above problem.

From 252b332ccbee7152da1e18f1fff5b83f8e01b8df Mon Sep 17 00:00:00 2001
From: Bharata B Rao <bharata@linux.ibm.com>
Date: Thu, 21 Jan 2021 10:35:08 +0530
Subject: [PATCH] mm/slub: let number of present CPUs determine the slub
 page order

Commit 045ab8c9487b ("mm/slub: let number of online CPUs determine
the slub page order") changed the slub page order to depend on
num_online_cpus() instead of nr_cpu_ids. However, we find that certain
caches (kmalloc) are initialized even before the secondary CPUs are
onlined, resulting in a lower slub page order and a subsequent
performance regression.

Switch to num_present_cpus() instead.

Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Fixes: 045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page order")
---
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index d9e4e10683cc..2f3e412c849d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3433,7 +3433,7 @@ static inline int calculate_order(unsigned int size)
 	 */
 	min_objects = slub_min_objects;
 	if (!min_objects)
-		min_objects = 4 * (fls(num_online_cpus()) + 1);
+		min_objects = 4 * (fls(num_present_cpus()) + 1);
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);
 
-- 
2.26.2





* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-21  5:30   ` Bharata B Rao
@ 2021-01-21  9:09     ` Vincent Guittot
  2021-01-21 10:01     ` Christoph Lameter
  1 sibling, 0 replies; 37+ messages in thread
From: Vincent Guittot @ 2021-01-21  9:09 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, Christoph Lameter, David Rientjes,
	Joonsoo Kim, Andrew Morton, guro, vbabka, Shakeel Butt,
	Johannes Weiner, aneesh.kumar

On Thu, 21 Jan 2021 at 06:31, Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> On Wed, Jan 20, 2021 at 06:36:31PM +0100, Vincent Guittot wrote:
> > [...]
>
> Should we have switched to num_present_cpus() rather than
> num_online_cpus()? If so, the below patch should address the
> above problem.

The problem is the same with num_present_cpus(), which is initialized
at the same time as num_online_cpus(). Only num_possible_cpus() returns
a correct value from the start, just like nr_cpu_ids.

Both num_possible_cpus() and nr_cpu_ids reflect the number of CPUs of
the platform, not NR_CPUS:
num_possible_cpus() = nr_cpu_ids = 224 from the beginning, whereas
NR_CPUS=256 on my large system
num_possible_cpus() = nr_cpu_ids = 8 from the beginning, whereas
NR_CPUS=256 on my small system


>
> [...]


* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-21  5:30   ` Bharata B Rao
  2021-01-21  9:09     ` Vincent Guittot
@ 2021-01-21 10:01     ` Christoph Lameter
  2021-01-21 10:48       ` Vincent Guittot
  2021-01-21 18:19       ` Vlastimil Babka
  1 sibling, 2 replies; 37+ messages in thread
From: Christoph Lameter @ 2021-01-21 10:01 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Vincent Guittot, linux-kernel, linux-mm, David Rientjes,
	Joonsoo Kim, Andrew Morton, guro, vbabka, shakeelb,
	Johannes Weiner, aneesh.kumar

On Thu, 21 Jan 2021, Bharata B Rao wrote:

> > The problem is that calculate_order() is called a number of times
> > before secondary CPUs are booted, and num_online_cpus() returns 1
> > instead of 224. This makes the use of num_online_cpus() irrelevant
> > for those cases.
> >
> > After adding "slub_min_objects=36" to my command line, which equals
> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224,
> > the regression disappears:
> >
> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
>
> Should we have switched to num_present_cpus() rather than
> num_online_cpus()? If so, the below patch should address the
> above problem.

There is certainly an initcall after secondaries are booted where we could
redo the calculate_order?

Or the num_online_cpus needs to be up to date earlier. Why does this issue
not occur on x86? Does x86 have an up to date num_online_cpus earlier?
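
A rough sketch of that initcall idea (untested; recalculate_order() is a
hypothetical helper, which would also have to cope with the lifetime
issues discussed later in the thread):

	static int __init slub_recalc_orders(void)
	{
		struct kmem_cache *s;

		/* all secondary CPUs are online once late initcalls run */
		mutex_lock(&slab_mutex);
		list_for_each_entry(s, &slab_caches, list)
			recalculate_order(s);	/* hypothetical helper */
		mutex_unlock(&slab_mutex);
		return 0;
	}
	late_initcall(slub_recalc_orders);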




* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-21 10:01     ` Christoph Lameter
@ 2021-01-21 10:48       ` Vincent Guittot
  2021-01-21 18:19       ` Vlastimil Babka
  1 sibling, 0 replies; 37+ messages in thread
From: Vincent Guittot @ 2021-01-21 10:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Bharata B Rao, linux-kernel, linux-mm, David Rientjes,
	Joonsoo Kim, Andrew Morton, guro, vbabka, Shakeel Butt,
	Johannes Weiner, aneesh.kumar

On Thu, 21 Jan 2021 at 11:01, Christoph Lameter <cl@linux.com> wrote:
>
> [...]
>
> There is certainly an initcall after secondaries are booted where we could
> redo the calculate_order?
>
> Or the num_online_cpus needs to be up to date earlier. Why does this issue
> not occur on x86? Does x86 have an up to date num_online_cpus earlier?

I have added a printk in calculate_order():

        pr_info("SLUB calculate_order cmd %d min %d online %d present %d possible %d cpus %d",
                slub_min_objects, min_objects, num_online_cpus(),
                num_present_cpus(), num_possible_cpus(), nr_cpu_ids);

And checked with
qemu-system-x86_64 -kernel bzImage -nographic -smp 4 -append "console=ttyS0"

[    0.064927] SLUB calculate_order cmd 0 min 8 online 1 present 4 possible 4 cpus 4
[    0.064970] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1

num_online_cpus() has the same behavior as on my platform:
num_online_cpus == 1 when kmem_cache_init() is called.

Only num_present_cpus() is 4 from the beginning, but that's probably
just because it runs in a VM.

Also, it's interesting to notice that num_possible_cpus and nr_cpu_ids
are set to the correct value from the start.


* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-21 10:01     ` Christoph Lameter
  2021-01-21 10:48       ` Vincent Guittot
@ 2021-01-21 18:19       ` Vlastimil Babka
  2021-01-22  8:03         ` Vincent Guittot
                           ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Vlastimil Babka @ 2021-01-21 18:19 UTC (permalink / raw)
  To: Christoph Lameter, Bharata B Rao
  Cc: Vincent Guittot, linux-kernel, linux-mm, David Rientjes,
	Joonsoo Kim, Andrew Morton, guro, shakeelb, Johannes Weiner,
	aneesh.kumar, Jann Horn, Michal Hocko

On 1/21/21 11:01 AM, Christoph Lameter wrote:
> On Thu, 21 Jan 2021, Bharata B Rao wrote:
> 
>> > The problem is that calculate_order() is called a number of times
>> > before secondary CPUs are booted, and num_online_cpus() returns 1
>> > instead of 224. This makes the use of num_online_cpus() irrelevant
>> > for those cases.
>> >
>> > After adding "slub_min_objects=36" to my command line, which equals
>> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224,
>> > the regression disappears:
>> >
>> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)

I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
supposed to be a scheduler benchmark? What exactly is going on?

>> Should we have switched to num_present_cpus() rather than
>> num_online_cpus()? If so, the below patch should address the
>> above problem.
> 
> There is certainly an initcall after secondaries are booted where we could
> redo the calculate_order?

We could do it even in a hotplug handler. But in practice that means making sure
it's safe, i.e. all users of oo_order/oo_objects must handle the value changing.

Consider e.g. init_cache_random_seq() which uses oo_objects(s->oo) to allocate
s->random_seq when cache s is created. Then shuffle_freelist() will use the
current value of oo_objects(s->oo) to index s->random_seq, for a newly allocated
slab - what if the page order has increased meanwhile due to secondary booting
or hotplug? Array overflow. That's why I just made the former sysfs handler for
changing order read-only.
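
A simplified sketch of that hazard, paraphrasing mm/slub.c with error
handling and surrounding code omitted:

	/* at cache creation: random_seq sized for the order at that time */
	cache_random_seq_create(s, oo_objects(s->oo), GFP_KERNEL);

	/* later, shuffle_freelist() re-reads the current value */
	freelist_count = oo_objects(s->oo);	/* may have grown meanwhile */
	pos = get_random_int() % freelist_count;
	idx = s->random_seq[pos];		/* can read past the old array */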

Things would be easier if we could trust *on all arches* either

- num_present_cpus() to count what the hardware really physically has during
boot, even if not yet onlined, at the time we init slab. This would still not
handle later hotplug (probably mostly in a VM scenario, not that somebody would
bring a bunch of actual new cpu boards to a running bare metal system?).

- num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
where it's not really possible to plug more CPUs. In a VM scenario we could
still have an opposite problem, where theoretically "anything is possible" but
the virtual cpus are never added later.

We could also start questioning the very assumption that number of cpus should
affect slab page size in the first place. Should it? After all, each CPU will
have one or more slab pages privately cached, as we discuss in the other
thread... So why make the slab pages also larger?




* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-21 18:19       ` Vlastimil Babka
@ 2021-01-22  8:03         ` Vincent Guittot
  2021-01-22 12:03           ` Vlastimil Babka
  2021-01-22 13:05         ` [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order Jann Horn
  2021-01-26  8:52         ` Michal Hocko
  2 siblings, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2021-01-22  8:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter, Bharata B Rao, linux-kernel, linux-mm,
	David Rientjes, Joonsoo Kim, Andrew Morton, guro, Shakeel Butt,
	Johannes Weiner, aneesh.kumar, Jann Horn, Michal Hocko

On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> [...]
>
> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> supposed to be a scheduler benchmark? What exactly is going on?
>

From the hackbench description:
Hackbench is both a benchmark and a stress test for the Linux kernel
scheduler. Its main job is to create a specified number of pairs of
schedulable entities (either threads or traditional processes) which
communicate via either sockets or pipes and time how long it takes
for each pair to send data back and forth.

> >> Should we have switched to num_present_cpus() rather than
> >> num_online_cpus()? If so, the below patch should address the
> >> above problem.
> >
> > There is certainly an initcall after secondaries are booted where we could
> > redo the calculate_order?
>
> We could do it even in a hotplug handler. But in practice that means making sure
> it's safe, i.e. all users of oo_order/oo_objects must handle the value changing.
>
> Consider e.g. init_cache_random_seq() which uses oo_objects(s->oo) to allocate
> s->random_seq when cache s is created. Then shuffle_freelist() will use the
> current value of oo_objects(s->oo) to index s->random_seq, for a newly allocated
> slab - what if the page order has increased meanwhile due to secondary booting
> or hotplug? Array overflow. That's why I just made the former sysfs handler for
> changing order read-only.
>
> Things would be easier if we could trust *on all arches* either
>
> - num_present_cpus() to count what the hardware really physically has during
> boot, even if not yet onlined, at the time we init slab. This would still not
> handle later hotplug (probably mostly in a VM scenario, not that somebody would
> bring a bunch of actual new cpu boards to a running bare metal system?).
>
> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
> where it's not really possible to plug more CPUs. In a VM scenario we could
> still have an opposite problem, where theoretically "anything is possible" but
> the virtual cpus are never added later.

On all the systems that I have tested, num_possible_cpus()/nr_cpu_ids
were correctly initialized:

large arm64 acpi system
small arm64 DT based system
VM on x86 system



* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-22  8:03         ` Vincent Guittot
@ 2021-01-22 12:03           ` Vlastimil Babka
  2021-01-22 13:16             ` Vincent Guittot
  2021-01-23  5:16             ` Bharata B Rao
  0 siblings, 2 replies; 37+ messages in thread
From: Vlastimil Babka @ 2021-01-22 12:03 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Christoph Lameter, Bharata B Rao, linux-kernel, linux-mm,
	David Rientjes, Joonsoo Kim, Andrew Morton, guro, Shakeel Butt,
	Johannes Weiner, aneesh.kumar, Jann Horn, Michal Hocko

On 1/22/21 9:03 AM, Vincent Guittot wrote:
> On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> [...]
>>
>> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
>> supposed to be a scheduler benchmark? What exactly is going on?
>>
> 
> From the hackbench description:
> Hackbench is both a benchmark and a stress test for the Linux kernel
> scheduler. Its main job is to create a specified number of pairs of
> schedulable entities (either threads or traditional processes) which
> communicate via either sockets or pipes and time how long it takes
> for each pair to send data back and forth.

Yep, so I wonder which slab entities this is stressing that much.

>> Things would be easier if we could trust *on all arches* either
>>
>> - num_present_cpus() to count what the hardware really physically has during
>> boot, even if not yet onlined, at the time we init slab. This would still not
>> handle later hotplug (probably mostly in a VM scenario, not that somebody would
>> bring a bunch of actual new cpu boards to a running bare metal system?).
>> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
>> where it's not really possible to plug more CPUs. In a VM scenario we could
>> still have an opposite problem, where theoretically "anything is possible" but
>> the virtual cpus are never added later.
>> the virtual cpus are never added later.
> 
> > On all the systems that I have tested, num_possible_cpus()/nr_cpu_ids
> > were correctly initialized:
> 
> large arm64 acpi system
> small arm64 DT based system
> VM on x86 system

So it's just powerpc that has this issue with too large nr_cpu_ids? Is it caused
by bios or the hypervisor? How does num_present_cpus() look there?

What about a heuristic (a sketch follows below):
- num_online_cpus() > 1: we trust that and use it
- otherwise: nr_cpu_ids
Would that work? Too arbitrary?
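
A minimal sketch of that heuristic, as it could look in
calculate_order() (illustration only, not a tested patch):

	unsigned int nr_cpus = num_online_cpus();

	/*
	 * Trust num_online_cpus() once any secondary CPU is up;
	 * otherwise we are too early in boot, fall back to nr_cpu_ids.
	 */
	if (nr_cpus <= 1)
		nr_cpus = nr_cpu_ids;
	min_objects = 4 * (fls(nr_cpus) + 1);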





* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-21 18:19       ` Vlastimil Babka
  2021-01-22  8:03         ` Vincent Guittot
@ 2021-01-22 13:05         ` Jann Horn
  2021-01-22 13:09           ` Jann Horn
                             ` (2 more replies)
  2021-01-26  8:52         ` Michal Hocko
  2 siblings, 3 replies; 37+ messages in thread
From: Jann Horn @ 2021-01-22 13:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter, Bharata B Rao, Vincent Guittot, linux-kernel,
	Linux-MM, David Rientjes, Joonsoo Kim, Andrew Morton,
	Roman Gushchin, Shakeel Butt, Johannes Weiner, aneesh.kumar,
	Michal Hocko

On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> [...]
>
> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> supposed to be a scheduler benchmark? What exactly is going on?

Uuuh, I think powerpc doesn't have cmpxchg_double?

"vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
<https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
says under "POWERPC": "no DW LL/SC"

So powerpc is probably hitting the page-bitlock-based implementation
all the time for stuff like __slab_free()? Do you have detailed
profiling results from "perf top" or something like that?
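
For reference, a simplified sketch of that fallback path in mm/slub.c's
__cmpxchg_double_slab(), taken when the architecture offers no
double-word cmpxchg (the update happens under the page's bit-spinlock):

	slab_lock(page);	/* bit_spin_lock() on the struct page */
	if (page->freelist == freelist_old &&
	    page->counters == counters_old) {
		page->freelist = freelist_new;
		page->counters = counters_new;
		slab_unlock(page);
		return true;
	}
	slab_unlock(page);
	return false;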

(I actually have some WIP patches and a design document for getting
rid of cmpxchg_double in struct page that I hacked together in the
last couple days; I'm currently in the process of sending them over to
some other folks in the company who hopefully have cycles to
review/polish/benchmark them so that they can be upstreamed, assuming
that those folks think they're important enough. I don't have the
cycles for it...)


* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-22 13:05         ` [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order Jann Horn
@ 2021-01-22 13:09           ` Jann Horn
  2021-01-22 15:27           ` Vlastimil Babka
  2021-01-25  4:28           ` Bharata B Rao
  2 siblings, 0 replies; 37+ messages in thread
From: Jann Horn @ 2021-01-22 13:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter, Bharata B Rao, Vincent Guittot, linux-kernel,
	Linux-MM, David Rientjes, Joonsoo Kim, Andrew Morton,
	Roman Gushchin, Shakeel Butt, Johannes Weiner, aneesh.kumar,
	Michal Hocko

On Fri, Jan 22, 2021 at 2:05 PM Jann Horn <jannh@google.com> wrote:
> [...]
>
> (I actually have some WIP patches and a design document for getting
> rid of cmpxchg_double in struct page that I hacked together in the
> last couple days; I'm currently in the process of sending them over to
> some other folks in the company who hopefully have cycles to
> review/polish/benchmark them so that they can be upstreamed, assuming
> that those folks think they're important enough. I don't have the
> cycles for it...)

(The stuff I have in mind will only work on 64-bit though. We are
talking about PPC64 here, right?)


* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-22 12:03           ` Vlastimil Babka
@ 2021-01-22 13:16             ` Vincent Guittot
  2021-01-23  5:16             ` Bharata B Rao
  1 sibling, 0 replies; 37+ messages in thread
From: Vincent Guittot @ 2021-01-22 13:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter, Bharata B Rao, linux-kernel, linux-mm,
	David Rientjes, Joonsoo Kim, Andrew Morton, guro, Shakeel Butt,
	Johannes Weiner, aneesh.kumar, Jann Horn, Michal Hocko

On Fri, 22 Jan 2021 at 13:03, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/22/21 9:03 AM, Vincent Guittot wrote:
> > On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> >> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> >> >
> > >> >> > The problem is that calculate_order() is called a number of times
> > >> >> > before secondary CPUs are booted and it returns 1 instead of 224.
> > >> >> > This makes the use of num_online_cpus() irrelevant for those cases.
> > >> >> >
> > >> >> > After adding "slub_min_objects=36" to my command line, which equals
> > >> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224,
> > >> >> > the regression disappears:
> > >> >> >
> > >> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> >>
> >> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> >> supposed to be a scheduler benchmark? What exactly is going on?
> >>
> >
> > From the hackbench description:
> > Hackbench is both a benchmark and a stress test for the Linux kernel
> > scheduler. Its main job is to create a specified number of pairs of
> > schedulable entities (either threads or traditional processes) which
> > communicate via either sockets or pipes, and time how long it takes
> > for each pair to send data back and forth.
>
> Yep, so I wonder which slab entities this is stressing that much.
>
> >> Things would be easier if we could trust *on all arches* either
> >>
> >> - num_present_cpus() to count what the hardware really physically has during
> >> boot, even if not yet onlined, at the time we init slab. This would still not
> >> handle later hotplug (probably mostly in a VM scenario, not that somebody would
> >> bring a bunch of actual new cpu boards to a running bare metal system?).
> >>
> >> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
> >> where it's not really possible to plug more CPUs. In a VM scenario we could
> >> still have an opposite problem, where theoretically "anything is possible" but
> >> the virtual cpus are never added later.
> >
> > On all the systems that I have tested, num_possible_cpus()/nr_cpu_ids
> > were correctly initialized:
> >
> > large arm64 acpi system
> > small arm64 DT based system
> > VM on x86 system
>
> So it's just powerpc that has this issue with too large nr_cpu_ids? Is it caused
> by the BIOS or the hypervisor? How does num_present_cpus() look there?

num_present_cpus() stays at 1 until secondary CPUs boot in the arm64 case

>
> What about heuristic:
> - num_online_cpus() > 1 - we trust that and use it
> - otherwise nr_cpu_ids
> Would that work? Too arbitrary?
>
>
> >> We could also start questioning the very assumption that number of cpus should
> >> affect slab page size in the first place. Should it? After all, each CPU will
> >> have one or more slab pages privately cached, as we discuss in the other
> >> thread... So why make the slab pages also larger?
> >>
> >> > Or the num_online_cpus needs to be up to date earlier. Why does this issue
> >> > not occur on x86? Does x86 have an up to date num_online_cpus earlier?
> >> >
> >> >
> >>
> >
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-22 13:05         ` [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order Jann Horn
  2021-01-22 13:09           ` Jann Horn
@ 2021-01-22 15:27           ` Vlastimil Babka
  2021-01-25  4:28           ` Bharata B Rao
  2 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2021-01-22 15:27 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christoph Lameter, Bharata B Rao, Vincent Guittot, linux-kernel,
	Linux-MM, David Rientjes, Joonsoo Kim, Andrew Morton,
	Roman Gushchin, Shakeel Butt, Johannes Weiner, aneesh.kumar,
	Michal Hocko

On 1/22/21 2:05 PM, Jann Horn wrote:
> On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>> On 1/21/21 11:01 AM, Christoph Lameter wrote:
>> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
>> >
>> >> > The problem is that calculate_order() is called a number of times
>> >> > before secondary CPUs are booted and it returns 1 instead of 224.
>> >> > This makes the use of num_online_cpus() irrelevant for those cases.
>> >> >
>> >> > After adding "slub_min_objects=36" to my command line, which equals
>> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224,
>> >> > the regression disappears:
>> >> >
>> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
>>
>> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
>> supposed to be a scheduler benchmark? What exactly is going on?
> 
> Uuuh, I think powerpc doesn't have cmpxchg_double?

The benchmark was done by Vincent on arm64, AFAICS. PowerPC (ppc64) was what
Bharata had used to demonstrate the order calculation change in his patch.

There seems to be some implementation dependency on CONFIG_ARM64_LSE_ATOMICS but
AFAICS that doesn't determine if cmpxchg_double is provided.

> "vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
> <https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
> says under "POWERPC": "no DW LL/SC"

Interesting find in any case.

> So powerpc is probably hitting the page-bitlock-based implementation
> all the time for stuff like __slab_free()? Do you have detailed
> profiling results from "perf top" or something like that?
> 
> (I actually have some WIP patches and a design document for getting
> rid of cmpxchg_double in struct page that I hacked together in the
> last couple days; I'm currently in the process of sending them over to
> some other folks in the company who hopefully have cycles to
> review/polish/benchmark them so that they can be upstreamed, assuming
> that those folks think they're important enough. I don't have the
> cycles for it...)

I'm curious, so I hope this works out :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-22 12:03           ` Vlastimil Babka
  2021-01-22 13:16             ` Vincent Guittot
@ 2021-01-23  5:16             ` Bharata B Rao
  2021-01-23 12:32               ` Vincent Guittot
  1 sibling, 1 reply; 37+ messages in thread
From: Bharata B Rao @ 2021-01-23  5:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Vincent Guittot, Christoph Lameter, linux-kernel, linux-mm,
	David Rientjes, Joonsoo Kim, Andrew Morton, guro, Shakeel Butt,
	Johannes Weiner, aneesh.kumar, Jann Horn, Michal Hocko

On Fri, Jan 22, 2021 at 01:03:57PM +0100, Vlastimil Babka wrote:
> On 1/22/21 9:03 AM, Vincent Guittot wrote:
> > On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> >> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> >> >
> >> >> > The problem is that calculate_order() is called a number of times
> >> >> > before secondary CPUs are booted and it returns 1 instead of 224.
> >> >> > This makes the use of num_online_cpus() irrelevant for those cases.
> >> >> >
> >> >> > After adding "slub_min_objects=36" to my command line, which equals
> >> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224,
> >> >> > the regression disappears:
> >> >> >
> >> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> >>
> >> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> >> supposed to be a scheduler benchmark? What exactly is going on?
> >>
> > 
> > From the hackbench description:
> > Hackbench is both a benchmark and a stress test for the Linux kernel
> > scheduler. Its main job is to create a specified number of pairs of
> > schedulable entities (either threads or traditional processes) which
> > communicate via either sockets or pipes, and time how long it takes
> > for each pair to send data back and forth.
> 
> Yep, so I wonder which slab entities this is stressing that much.
> 
> >> Things would be easier if we could trust *on all arches* either
> >>
> >> - num_present_cpus() to count what the hardware really physically has during
> >> boot, even if not yet onlined, at the time we init slab. This would still not
> >> handle later hotplug (probably mostly in a VM scenario, not that somebody would
> >> bring a bunch of actual new cpu boards to a running bare metal system?).
> >>
> >> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
> >> where it's not really possible to plug more CPUs. In a VM scenario we could
> >> still have an opposite problem, where theoretically "anything is possible" but
> >> the virtual cpus are never added later.
> > 
> > On all the systems that I have tested, num_possible_cpus()/nr_cpu_ids
> > were correctly initialized:
> > 
> > large arm64 acpi system
> > small arm64 DT based system
> > VM on x86 system
> 
> So it's just powerpc that has this issue with too large nr_cpu_ids? Is it caused
> by the BIOS or the hypervisor? How does num_present_cpus() look there?

PowerPC PowerNV Host: (160 cpus)
num_online_cpus 1 num_present_cpus 160 num_possible_cpus 160 nr_cpu_ids 160 

PowerPC pseries KVM guest: (-smp 16,maxcpus=160)
num_online_cpus 1 num_present_cpus 16 num_possible_cpus 160 nr_cpu_ids 160 

That's what I see on powerpc, hence I thought num_present_cpus() could
be the correct one to use in slub page order calculation.

> 
> What about heuristic:
> - num_online_cpus() > 1 - we trust that and use it
> - otherwise nr_cpu_ids
> Would that work? Too arbitrary?

Looking at the following snippet from include/linux/cpumask.h, it
appears that num_present_cpus() should be a reasonable compromise
between online and possible/nr_cpu_ids to use here.

/*
 * The following particular system cpumasks and operations manage
 * possible, present, active and online cpus.
 *
 *     cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
 *     cpu_present_mask - has bit 'cpu' set iff cpu is populated
 *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
 *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
 *
 *  If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
 *
 *  The cpu_possible_mask is fixed at boot time, as the set of CPU id's
 *  that it is possible might ever be plugged in at anytime during the
 *  life of that system boot.  The cpu_present_mask is dynamic(*),
 *  representing which CPUs are currently plugged in.  And
 *  cpu_online_mask is the dynamic subset of cpu_present_mask,
 *  indicating those CPUs available for scheduling.
 *
 *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
 *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
 *  ACPI reports present at boot.
 *
 *  If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
 *  depending on what ACPI reports as currently plugged in, otherwise
 *  cpu_present_mask is just a copy of cpu_possible_mask.
 *
 *  (*) Well, cpu_present_mask is dynamic in the hotplug case.  If not
 *      hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
 */

So for host systems, present is (usually) equal to possible, and for
guest systems present should indicate the CPUs found to be present
at boot time. The intention of my original patch was to use this
metric in the slub page order calculation rather than nr_cpu_ids
or num_possible_cpus(), which could be high on guest systems that
typically support CPU hotplug.
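
For the record, fls(224) = 8, so 4 * (fls(224) + 1) = 36, which matches
the slub_min_objects=36 workaround quoted earlier in the thread. A
minimal sketch of the kind of change I mean in calculate_order()'s
min_objects computation (illustrative only, not an exact diff against
mm/slub.c):

	min_objects = slub_min_objects;
	if (!min_objects)
		/*
		 * Scale with the CPUs populated at boot rather than all
		 * hot-pluggable ones; on the KVM guest above that means
		 * 16 instead of 160.
		 */
		min_objects = 4 * (fls(num_present_cpus()) + 1);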

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-23  5:16             ` Bharata B Rao
@ 2021-01-23 12:32               ` Vincent Guittot
  2021-01-25 11:20                 ` Vlastimil Babka
  0 siblings, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2021-01-23 12:32 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Vlastimil Babka, Christoph Lameter, linux-kernel, linux-mm,
	David Rientjes, Joonsoo Kim, Andrew Morton, guro, Shakeel Butt,
	Johannes Weiner, aneesh.kumar, Jann Horn, Michal Hocko,
	Catalin Marinas, Will Deacon

+Adding arch arm64 Maintainers

On Sat, 23 Jan 2021 at 06:16, Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> On Fri, Jan 22, 2021 at 01:03:57PM +0100, Vlastimil Babka wrote:
> > On 1/22/21 9:03 AM, Vincent Guittot wrote:
> > > On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
> > >>
> > >> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > >> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> > >> >
> > >> >> > The problem is that calculate_order() is called a number of times
> > >> >> > before secondary CPUs are booted and it returns 1 instead of 224.
> > >> >> > This makes the use of num_online_cpus() irrelevant for those cases.
> > >> >> >
> > >> >> > After adding "slub_min_objects=36" to my command line, which equals
> > >> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224,
> > >> >> > the regression disappears:
> > >> >> >
> > >> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> > >>
> > >> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> > >> supposed to be a scheduler benchmark? What exactly is going on?
> > >>
> > >
> > > From the hackbench description:
> > > Hackbench is both a benchmark and a stress test for the Linux kernel
> > > scheduler. Its main job is to create a specified number of pairs of
> > > schedulable entities (either threads or traditional processes) which
> > > communicate via either sockets or pipes, and time how long it takes
> > > for each pair to send data back and forth.
> >
> > Yep, so I wonder which slab entities this is stressing that much.
> >
> > >> Things would be easier if we could trust *on all arches* either
> > >>
> > >> - num_present_cpus() to count what the hardware really physically has during
> > >> boot, even if not yet onlined, at the time we init slab. This would still not
> > >> handle later hotplug (probably mostly in a VM scenario, not that somebody would
> > >> bring a bunch of actual new cpu boards to a running bare metal system?).
> > >>
> > >> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
> > >> where it's not really possible to plug more CPUs. In a VM scenario we could
> > >> still have an opposite problem, where theoretically "anything is possible" but
> > >> the virtual cpus are never added later.
> > >
> > > On all the systems that I have tested, num_possible_cpus()/nr_cpu_ids
> > > were correctly initialized:
> > >
> > > large arm64 acpi system
> > > small arm64 DT based system
> > > VM on x86 system
> >
> > So it's just powerpc that has this issue with too large nr_cpu_ids? Is it caused
> > by the BIOS or the hypervisor? How does num_present_cpus() look there?
>
> PowerPC PowerNV Host: (160 cpus)
> num_online_cpus 1 num_present_cpus 160 num_possible_cpus 160 nr_cpu_ids 160
>
> PowerPC pseries KVM guest: (-smp 16,maxcpus=160)
> num_online_cpus 1 num_present_cpus 16 num_possible_cpus 160 nr_cpu_ids 160
>
> That's what I see on powerpc, hence I thought num_present_cpus() could
> be the correct one to use in slub page order calculation.

num_present_cpus() is set to 1 on arm64 until secondary CPUs boot

arm64 224cpus acpi host:
num_online_cpus 1 num_present_cpus 1 num_possible_cpus 224 nr_cpu_ids 224
arm64 8cpus DT host:
num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8
arm64 8cpus qemu-system-aarch64 (-smp 8,maxcpus=256)
num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8

Then present and online increase to num_possible_cpus() once all CPUs are booted

>
> >
> > What about heuristic:
> > - num_online_cpus() > 1 - we trust that and use it
> > - otherwise nr_cpu_ids
> > Would that work? Too arbitrary?
>
> Looking at the following snippet from include/linux/cpumask.h, it
> appears that num_present_cpus() should be a reasonable compromise
> between online and possible/nr_cpu_ids to use here.
>
> /*
>  * The following particular system cpumasks and operations manage
>  * possible, present, active and online cpus.
>  *
>  *     cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
>  *     cpu_present_mask - has bit 'cpu' set iff cpu is populated
>  *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
>  *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
>  *
>  *  If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
>  *
>  *  The cpu_possible_mask is fixed at boot time, as the set of CPU id's
>  *  that it is possible might ever be plugged in at anytime during the
>  *  life of that system boot.  The cpu_present_mask is dynamic(*),
>  *  representing which CPUs are currently plugged in.  And
>  *  cpu_online_mask is the dynamic subset of cpu_present_mask,
>  *  indicating those CPUs available for scheduling.
>  *
>  *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
>  *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
>  *  ACPI reports present at boot.
>  *
>  *  If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
>  *  depending on what ACPI reports as currently plugged in, otherwise
>  *  cpu_present_mask is just a copy of cpu_possible_mask.
>  *
>  *  (*) Well, cpu_present_mask is dynamic in the hotplug case.  If not
>  *      hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
>  */
>
> So for host systems, present is (usually) equal to possible and for

But "cpu_present_mask varies dynamically,  depending on what ACPI
reports as currently plugged in"

So it should vary when secondary CPUs are booted

> guest systems present should indicate the CPUs found to be present
> at boot time. The intention of my original patch was to use this
> metric in the slub page order calculation rather than nr_cpu_ids
> or num_possible_cpus(), which could be high on guest systems that
> typically support CPU hotplug.
>
> Regards,
> Bharata.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-22 13:05         ` [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order Jann Horn
  2021-01-22 13:09           ` Jann Horn
  2021-01-22 15:27           ` Vlastimil Babka
@ 2021-01-25  4:28           ` Bharata B Rao
  2 siblings, 0 replies; 37+ messages in thread
From: Bharata B Rao @ 2021-01-25  4:28 UTC (permalink / raw)
  To: Jann Horn
  Cc: Vlastimil Babka, Christoph Lameter, Vincent Guittot,
	linux-kernel, Linux-MM, David Rientjes, Joonsoo Kim,
	Andrew Morton, Roman Gushchin, Shakeel Butt, Johannes Weiner,
	aneesh.kumar, Michal Hocko

On Fri, Jan 22, 2021 at 02:05:47PM +0100, Jann Horn wrote:
> On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> > >
> > >> > The problem is that calculate_order() is called a number of times
> > >> > before secondary CPUs are booted and it returns 1 instead of 224.
> > >> > This makes the use of num_online_cpus() irrelevant for those cases.
> > >> >
> > >> > After adding "slub_min_objects=36" to my command line, which equals
> > >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224,
> > >> > the regression disappears:
> > >> >
> > >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> >
> > I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> > supposed to be a scheduler benchmark? What exactly is going on?
> 
> Uuuh, I think powerpc doesn't have cmpxchg_double?
> 
> "vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
> <https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
> says under "POWERPC": "no DW LL/SC"
> 
> So powerpc is probably hitting the page-bitlock-based implementation
> all the time for stuff like __slab_free()? Do you have detailed
> profiling results from "perf top" or something like that?

I can check that, but the current patch was aimed at reducing
the page order of the slub caches so that they don't end up
consuming more memory on 64K systems.
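
(For scale: with a 64K base page size, an order-3 slab is already
64K * 8 = 512K, so an order even one or two steps too high adds up
quickly across all the caches.)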

> 
> (I actually have some WIP patches and a design document for getting
> rid of cmpxchg_double in struct page that I hacked together in the
> last couple days; I'm currently in the process of sending them over to
> some other folks in the company who hopefully have cycles to
> review/polish/benchmark them so that they can be upstreamed, assuming
> that those folks think they're important enough. I don't have the
> cycles for it...)

Sounds interesting, I will keep an eye on it to see its effect on powerpc.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-23 12:32               ` Vincent Guittot
@ 2021-01-25 11:20                 ` Vlastimil Babka
  2021-01-26 23:03                   ` Will Deacon
  0 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2021-01-25 11:20 UTC (permalink / raw)
  To: Vincent Guittot, Bharata B Rao
  Cc: Christoph Lameter, linux-kernel, linux-mm, David Rientjes,
	Joonsoo Kim, Andrew Morton, guro, Shakeel Butt, Johannes Weiner,
	aneesh.kumar, Jann Horn, Michal Hocko, Catalin Marinas,
	Will Deacon

On 1/23/21 1:32 PM, Vincent Guittot wrote:
>> PowerPC PowerNV Host: (160 cpus)
>> num_online_cpus 1 num_present_cpus 160 num_possible_cpus 160 nr_cpu_ids 160
>>
>> PowerPC pseries KVM guest: (-smp 16,maxcpus=160)
>> num_online_cpus 1 num_present_cpus 16 num_possible_cpus 160 nr_cpu_ids 160
>>
>> That's what I see on powerpc, hence I thought num_present_cpus() could
>> be the correct one to use in slub page order calculation.
> 
> num_present_cpus() is set to 1 on arm64 until secondary CPUs boot
> 
> arm64 224cpus acpi host:
> num_online_cpus 1 num_present_cpus 1 num_possible_cpus 224 nr_cpu_ids 224
> arm64 8cpus DT host:
> num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8
> arm64 8cpus qemu-system-aarch64 (-smp 8,maxcpus=256)
> num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8

I would have expected num_present_cpus to be 224, 8, 8, respectively.

> Then present and online increase to num_possible_cpus() once all CPUs are booted
> 
>>
>> >
>> > What about heuristic:
>> > - num_online_cpus() > 1 - we trust that and use it
>> > - otherwise nr_cpu_ids
>> > Would that work? Too arbitrary?
>>
>> Looking at the following snippet from include/linux/cpumask.h, it
>> appears that num_present_cpus() should be a reasonable compromise
>> between online and possible/nr_cpu_ids to use here.
>>
>> /*
>>  * The following particular system cpumasks and operations manage
>>  * possible, present, active and online cpus.
>>  *
>>  *     cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
>>  *     cpu_present_mask - has bit 'cpu' set iff cpu is populated
>>  *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
>>  *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
>>  *
>>  *  If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
>>  *
>>  *  The cpu_possible_mask is fixed at boot time, as the set of CPU id's
>>  *  that it is possible might ever be plugged in at anytime during the
>>  *  life of that system boot.  The cpu_present_mask is dynamic(*),
>>  *  representing which CPUs are currently plugged in.  And
>>  *  cpu_online_mask is the dynamic subset of cpu_present_mask,
>>  *  indicating those CPUs available for scheduling.
>>  *
>>  *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
>>  *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
>>  *  ACPI reports present at boot.
>>  *
>>  *  If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
>>  *  depending on what ACPI reports as currently plugged in, otherwise
>>  *  cpu_present_mask is just a copy of cpu_possible_mask.
>>  *
>>  *  (*) Well, cpu_present_mask is dynamic in the hotplug case.  If not
>>  *      hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
>>  */
>>
>> So for host systems, present is (usually) equal to possible and for
> 
> But "cpu_present_mask varies dynamically,  depending on what ACPI
> reports as currently plugged in"
> 
> So it should vary when secondary CPUs are booted

Hm, but booting the secondaries is just a software (kernel) action? They are
already physically there, so it seems to me as if the cpu_present_mask is not
populated correctly on arm64, and it's just a mirror of cpu_online_mask?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-21 18:19       ` Vlastimil Babka
  2021-01-22  8:03         ` Vincent Guittot
  2021-01-22 13:05         ` [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order Jann Horn
@ 2021-01-26  8:52         ` Michal Hocko
  2021-01-26 13:38           ` Vincent Guittot
  2 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-01-26  8:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter, Bharata B Rao, Vincent Guittot, linux-kernel,
	linux-mm, David Rientjes, Joonsoo Kim, Andrew Morton, guro,
	shakeelb, Johannes Weiner, aneesh.kumar, Jann Horn

On Thu 21-01-21 19:19:21, Vlastimil Babka wrote:
[...]
> We could also start questioning the very assumption that number of cpus should
> affect slab page size in the first place. Should it? After all, each CPU will
> have one or more slab pages privately cached, as we discuss in the other
> thread... So why make the slab pages also larger?

I do agree. What is the actual justification for this scaling?
        /*
         * Attempt to find best configuration for a slab. This
         * works by first attempting to generate a layout with
         * the best configuration and backing off gradually.
         *
         * First we increase the acceptable waste in a slab. Then
         * we reduce the minimum objects required in a slab.
         */

doesn't speak about CPUs.  9b2cd506e5f2 ("slub: Calculate min_objects
based on number of processors.") does talk about hackbench ("This has
been shown to address the performance issues in hackbench on 16p etc.")
but it doesn't give any more details on _why_ that actually works.

This thread shows that this is still somehow related to performance, but
the real reason is not clear. I believe we should be focusing on the
actual reasons for the performance impact rather than playing with some
fancy math and tuning for a benchmark on a particular machine which
doesn't work for others due to subtle initialization timing issues.

Fundamentally, why should a higher number of CPUs imply a larger slab
size in the first place?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-26  8:52         ` Michal Hocko
@ 2021-01-26 13:38           ` Vincent Guittot
  2021-01-26 13:59             ` Michal Hocko
  0 siblings, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2021-01-26 13:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Christoph Lameter, Bharata B Rao, linux-kernel,
	linux-mm, David Rientjes, Joonsoo Kim, Andrew Morton, guro,
	Shakeel Butt, Johannes Weiner, aneesh.kumar, Jann Horn

On Tue, 26 Jan 2021 at 09:52, Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 21-01-21 19:19:21, Vlastimil Babka wrote:
> [...]
> > We could also start questioning the very assumption that number of cpus should
> > affect slab page size in the first place. Should it? After all, each CPU will
> > have one or more slab pages privately cached, as we discuss in the other
> > thread... So why make the slab pages also larger?
>
> I do agree. What is the actual justification for this scaling?
>         /*
>          * Attempt to find best configuration for a slab. This
>          * works by first attempting to generate a layout with
>          * the best configuration and backing off gradually.
>          *
>          * First we increase the acceptable waste in a slab. Then
>          * we reduce the minimum objects required in a slab.
>          */
>
> doesn't speak about CPUs.  9b2cd506e5f2 ("slub: Calculate min_objects
> based on number of processors.") does talk about hackbench ("This has
> been shown to address the performance issues in hackbench on 16p etc.")
> but it doesn't give any more details on _why_ that actually works.
>
> This thread shows that this is still somehow related to performance, but
> the real reason is not clear. I believe we should be focusing on the
> actual reasons for the performance impact rather than playing with some
> fancy math and tuning for a benchmark on a particular machine which
> doesn't work for others due to subtle initialization timing issues.
>
> Fundamentally, why should a higher number of CPUs imply a larger slab
> size in the first place?

A first answer is that the activity and the number of threads involved
scale with the number of CPUs. Taking the hackbench benchmark as an
example, the number of groups/threads rises to a higher level on the
server than on the small system, which doesn't seem unreasonable.

On 8 CPUs, I run hackbench with up to 16 groups, which means 16*40
threads. But I raise up to 256 groups, which means 256*40 threads, on
the 224-CPU system. In fact, hackbench -g 1 (with 1 group) doesn't
regress on the 224-CPU system. The next test with 4 groups starts to
regress by -7%. But the next one, hackbench -g 16, regresses by 187%
(duration is almost 3 times longer). It seems reasonable to assume that
the number of running threads and resources scales with the number of
CPUs because we want to run more stuff.


> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-26 13:38           ` Vincent Guittot
@ 2021-01-26 13:59             ` Michal Hocko
  2021-01-27 13:38               ` Vlastimil Babka
  2021-01-28 13:45               ` Mel Gorman
  0 siblings, 2 replies; 37+ messages in thread
From: Michal Hocko @ 2021-01-26 13:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Vlastimil Babka, Christoph Lameter, Bharata B Rao, linux-kernel,
	linux-mm, David Rientjes, Joonsoo Kim, Andrew Morton, guro,
	Shakeel Butt, Johannes Weiner, aneesh.kumar, Jann Horn

On Tue 26-01-21 14:38:14, Vincent Guittot wrote:
> On Tue, 26 Jan 2021 at 09:52, Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 21-01-21 19:19:21, Vlastimil Babka wrote:
> > [...]
> > > We could also start questioning the very assumption that number of cpus should
> > > affect slab page size in the first place. Should it? After all, each CPU will
> > > have one or more slab pages privately cached, as we discuss in the other
> > > thread... So why make the slab pages also larger?
> >
> > I do agree. What is the actual justification for this scaling?
> >         /*
> >          * Attempt to find best configuration for a slab. This
> >          * works by first attempting to generate a layout with
> >          * the best configuration and backing off gradually.
> >          *
> >          * First we increase the acceptable waste in a slab. Then
> >          * we reduce the minimum objects required in a slab.
> >          */
> >
> > doesn't speak about CPUs.  9b2cd506e5f2 ("slub: Calculate min_objects
> > based on number of processors.") does talk about hackbench ("This has
> > been shown to address the performance issues in hackbench on 16p etc.")
> > but it doesn't give any more details on _why_ that actually works.
> >
> > This thread shows that this is still somehow related to performance, but
> > the real reason is not clear. I believe we should be focusing on the
> > actual reasons for the performance impact rather than playing with some
> > fancy math and tuning for a benchmark on a particular machine which
> > doesn't work for others due to subtle initialization timing issues.
> >
> > Fundamentally, why should a higher number of CPUs imply a larger slab
> > size in the first place?
> 
> A first answer is that the activity and the number of threads involved
> scale with the number of CPUs. Taking the hackbench benchmark as an
> example, the number of groups/threads rises to a higher level on the
> server than on the small system, which doesn't seem unreasonable.
>
> On 8 CPUs, I run hackbench with up to 16 groups, which means 16*40
> threads. But I raise up to 256 groups, which means 256*40 threads, on
> the 224-CPU system. In fact, hackbench -g 1 (with 1 group) doesn't
> regress on the 224-CPU system. The next test with 4 groups starts to
> regress by -7%. But the next one, hackbench -g 16, regresses by 187%
> (duration is almost 3 times longer). It seems reasonable to assume that
> the number of running threads and resources scales with the number of
> CPUs because we want to run more stuff.

OK, I do understand that more jobs scale with the number of CPUs, but I
would also expect that higher order pages are generally more expensive
to get, so this is not really clear cut, especially under some more
demand on the memory where allocations are smooth. So the question
really is whether this is not just optimizing for artificial conditions.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-25 11:20                 ` Vlastimil Babka
@ 2021-01-26 23:03                   ` Will Deacon
  2021-01-27  9:10                     ` Christoph Lameter
  0 siblings, 1 reply; 37+ messages in thread
From: Will Deacon @ 2021-01-26 23:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Vincent Guittot, Bharata B Rao, Christoph Lameter, linux-kernel,
	linux-mm, David Rientjes, Joonsoo Kim, Andrew Morton, guro,
	Shakeel Butt, Johannes Weiner, aneesh.kumar, Jann Horn,
	Michal Hocko, Catalin Marinas

On Mon, Jan 25, 2021 at 12:20:14PM +0100, Vlastimil Babka wrote:
> On 1/23/21 1:32 PM, Vincent Guittot wrote:
> >> PowerPC PowerNV Host: (160 cpus)
> >> num_online_cpus 1 num_present_cpus 160 num_possible_cpus 160 nr_cpu_ids 160
> >>
> >> PowerPC pseries KVM guest: (-smp 16,maxcpus=160)
> >> num_online_cpus 1 num_present_cpus 16 num_possible_cpus 160 nr_cpu_ids 160
> >>
> >> That's what I see on powerpc, hence I thought num_present_cpus() could
> >> be the correct one to use in slub page order calculation.
> > 
> > num_present_cpus() is set to 1 on arm64 until secondary CPUs boot
> > 
> > arm64 224cpus acpi host:
> > num_online_cpus 1 num_present_cpus 1 num_possible_cpus 224 nr_cpu_ids 224
> > arm64 8cpus DT host:
> > num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8
> > arm64 8cpus qemu-system-aarch64 (-smp 8,maxcpus=256)
> > num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8
> 
> I would have expected num_present_cpus to be 224, 8, 8, respectively.
> 
> > Then present and online increase to num_possible_cpus() once all CPUs are booted
> > 
> >>
> >> >
> >> > What about heuristic:
> >> > - num_online_cpus() > 1 - we trust that and use it
> >> > - otherwise nr_cpu_ids
> >> > Would that work? Too arbitrary?
> >>
> >> Looking at the following snippet from include/linux/cpumask.h, it
> >> appears that num_present_cpus() should be a reasonable compromise
> >> between online and possible/nr_cpu_ids to use here.
> >>
> >> /*
> >>  * The following particular system cpumasks and operations manage
> >>  * possible, present, active and online cpus.
> >>  *
> >>  *     cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
> >>  *     cpu_present_mask - has bit 'cpu' set iff cpu is populated
> >>  *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
> >>  *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
> >>  *
> >>  *  If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
> >>  *
> >>  *  The cpu_possible_mask is fixed at boot time, as the set of CPU id's
> >>  *  that it is possible might ever be plugged in at anytime during the
> >>  *  life of that system boot.  The cpu_present_mask is dynamic(*),
> >>  *  representing which CPUs are currently plugged in.  And
> >>  *  cpu_online_mask is the dynamic subset of cpu_present_mask,
> >>  *  indicating those CPUs available for scheduling.
> >>  *
> >>  *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
> >>  *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
> >>  *  ACPI reports present at boot.
> >>  *
> >>  *  If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
> >>  *  depending on what ACPI reports as currently plugged in, otherwise
> >>  *  cpu_present_mask is just a copy of cpu_possible_mask.
> >>  *
> >>  *  (*) Well, cpu_present_mask is dynamic in the hotplug case.  If not
> >>  *      hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
> >>  */
> >>
> >> So for host systems, present is (usually) equal to possible and for
> > 
> > But "cpu_present_mask varies dynamically,  depending on what ACPI
> > reports as currently plugged in"
> > 
> > So it should vary when secondary CPUs are booted
> 
> Hm, but booting the secondaries is just a software (kernel) action? They are
> already physically there, so it seems to me as if the cpu_present_mask is not
> populated correctly on arm64, and it's just a mirror of cpu_online_mask?

I think the present_mask retains CPUs if they are hotplugged off, whereas
the online mask does not. We can't really do any better on arm64, as there's
no way of telling that a CPU is present until we've seen it.

Will

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-26 23:03                   ` Will Deacon
@ 2021-01-27  9:10                     ` Christoph Lameter
  2021-01-27 11:04                       ` Vlastimil Babka
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Lameter @ 2021-01-27  9:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: Vlastimil Babka, Vincent Guittot, Bharata B Rao, linux-kernel,
	linux-mm, David Rientjes, Joonsoo Kim, Andrew Morton, guro,
	Shakeel Butt, Johannes Weiner, aneesh.kumar, Jann Horn,
	Michal Hocko, Catalin Marinas

On Tue, 26 Jan 2021, Will Deacon wrote:

> > Hm, but booting the secondaries is just a software (kernel) action? They are
> > already physically there, so it seems to me as if the cpu_present_mask is not
> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
>
> I think the present_mask retains CPUs if they are hotplugged off, whereas
> the online mask does not. We can't really do any better on arm64, as there's
> no way of telling that a CPU is present until we've seen it.

The order of each page in a kmem cache --and therefore also the number
of objects in a slab page-- can be different because that information is
stored in the page struct.

Therefore it is possible to retune the order while the cache is in operation.

This means you can run an initcall after all cpus have been brought up to
set the order and number of objects in a slab page differently.

The older slab pages will continue to exist with the old orders until they
are freed.
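
Roughly like this (recalc_slab_order() is a hypothetical helper that
would redo the calculate_order() decision for a cache; locking against
concurrent allocations and freelist randomization are glossed over):

static int __init slab_reorder_after_smp(void)
{
	struct kmem_cache *s;

	mutex_lock(&slab_mutex);
	list_for_each_entry(s, &slab_caches, list)
		recalc_slab_order(s);	/* hypothetical */
	mutex_unlock(&slab_mutex);
	return 0;
}
late_initcall(slab_reorder_after_smp);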


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-27  9:10                     ` Christoph Lameter
@ 2021-01-27 11:04                       ` Vlastimil Babka
  2021-02-03 11:10                         ` Bharata B Rao
  0 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2021-01-27 11:04 UTC (permalink / raw)
  To: Christoph Lameter, Will Deacon
  Cc: Vincent Guittot, Bharata B Rao, linux-kernel, linux-mm,
	David Rientjes, Joonsoo Kim, Andrew Morton, guro, Shakeel Butt,
	Johannes Weiner, aneesh.kumar, Jann Horn, Michal Hocko,
	Catalin Marinas

On 1/27/21 10:10 AM, Christoph Lameter wrote:
> On Tue, 26 Jan 2021, Will Deacon wrote:
> 
>> > Hm, but booting the secondaries is just a software (kernel) action? They are
>> > already physically there, so it seems to me as if the cpu_present_mask is not
>> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
>>
>> I think the present_mask retains CPUs if they are hotplugged off, whereas
>> the online mask does not. We can't really do any better on arm64, as there's
>> no way of telling that a CPU is present until we've seen it.
> 
> The order of each page in a kmem cache --and therefore also the number
> of objects in a slab page-- can be different because that information is
> stored in the page struct.
> 
> Therefore it is possible to retune the order while the cache is in operation.

Yes, but it's tricky to do the retuning safely, e.g. if freelist randomization
is enabled, see [1].

But as a quick fix for the regression, the heuristic idea could work reasonably
on all architectures (minimal sketch below)?
- if num_present_cpus() is > 1, trust that it doesn't have the arm64-style
issue, and use it
- otherwise use nr_cpu_ids

Long-term we can attempt to make the retuning safe, or decide that the number
of cpus shouldn't determine the order...

[1] https://lore.kernel.org/linux-mm/d7fb9425-9a62-c7b8-604d-5828d7e6b1da@suse.cz/
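
The sketch, spelled out (slub_nr_cpus() is a made-up name; this is just
the heuristic above, not a tested patch):

static unsigned int slub_nr_cpus(void)
{
	/*
	 * num_present_cpus() == 1 may just mean the secondaries have
	 * not been accounted yet (as on arm64), so don't trust it then.
	 */
	if (num_present_cpus() > 1)
		return num_present_cpus();
	return nr_cpu_ids;
}

calculate_order() would then use 4 * (fls(slub_nr_cpus()) + 1) as its
min_objects default.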

> This means you can run an initcall after all cpus have been brought up to
> set the order and number of objects in a slab page differently.
> 
> The older slab pages will continue to exist with the old orders until they
> are freed.
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-26 13:59             ` Michal Hocko
@ 2021-01-27 13:38               ` Vlastimil Babka
  2021-01-28 13:45               ` Mel Gorman
  1 sibling, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2021-01-27 13:38 UTC (permalink / raw)
  To: Michal Hocko, Vincent Guittot
  Cc: Christoph Lameter, Bharata B Rao, linux-kernel, linux-mm,
	David Rientjes, Joonsoo Kim, Andrew Morton, guro, Shakeel Butt,
	Johannes Weiner, aneesh.kumar, Jann Horn

On 1/26/21 2:59 PM, Michal Hocko wrote:
>> 
>> On 8 CPUs, I run hackbench with up to 16 groups, which means 16*40
>> threads. But I raise up to 256 groups, which means 256*40 threads, on
>> the 224-CPU system. In fact, hackbench -g 1 (with 1 group) doesn't
>> regress on the 224-CPU system. The next test with 4 groups starts to
>> regress by -7%. But the next one, hackbench -g 16, regresses by 187%
>> (duration is almost 3 times longer). It seems reasonable to assume that
>> the number of running threads and resources scales with the number of
>> CPUs because we want to run more stuff.
> 
> OK, I do understand that more jobs scale with the number of CPUs, but I
> would also expect that higher order pages are generally more expensive
> to get, so this is not really clear cut, especially under some more
> demand on the memory where allocations are smooth. So the question
> really is whether this is not just optimizing for artificial conditions.

FWIW, I enabled CONFIG_SLUB_STATS and ran "hackbench -l 16000 -g 16" in a
(small) VM, and checked tools/vm/slabinfo -DA as per the config option's help,
and it seems to be these 2 caches that are stressed:

Name                   Objects      Alloc       Free   %Fast Fallb O CmpX   UL
kmalloc-512                812   25655535   25654908  71   1     0 0 20082    0
skbuff_head_cache          304   25602632   25602632  84   1     0 0 11241    0

I guess larger pages mean more batched per-cpu allocations without going to the
shared structures or even the page allocator. But a 3x longer duration is still
surprising to me. I'll dig more.
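
For anyone wanting to reproduce, the steps were roughly (assuming a
kernel built with CONFIG_SLUB_STATS=y; "-DA" is what the config option's
help text suggests):

  hackbench -l 16000 -g 16
  make -C tools/vm slabinfo
  tools/vm/slabinfo -DA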
 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-26 13:59             ` Michal Hocko
  2021-01-27 13:38               ` Vlastimil Babka
@ 2021-01-28 13:45               ` Mel Gorman
  2021-01-28 13:57                 ` Michal Hocko
  1 sibling, 1 reply; 37+ messages in thread
From: Mel Gorman @ 2021-01-28 13:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vincent Guittot, Vlastimil Babka, Christoph Lameter,
	Bharata B Rao, linux-kernel, linux-mm, David Rientjes,
	Joonsoo Kim, Andrew Morton, guro, Shakeel Butt, Johannes Weiner,
	aneesh.kumar, Jann Horn

On Tue, Jan 26, 2021 at 02:59:18PM +0100, Michal Hocko wrote:
> > > This thread shows that this is still somehow related to performance, but
> > > the real reason is not clear. I believe we should be focusing on the
> > > actual reasons for the performance impact rather than playing with some
> > > fancy math and tuning for a benchmark on a particular machine which
> > > doesn't work for others due to subtle initialization timing issues.
> > >
> > > Fundamentally, why should a higher number of CPUs imply a larger slab
> > > size in the first place?
> > 
> > A first answer is that the activity and the number of threads involved
> > scale with the number of CPUs. Taking the hackbench benchmark as an
> > example, the number of groups/threads rises to a higher level on the
> > server than on the small system, which doesn't seem unreasonable.
> >
> > On 8 CPUs, I run hackbench with up to 16 groups, which means 16*40
> > threads. But I raise up to 256 groups, which means 256*40 threads, on
> > the 224-CPU system. In fact, hackbench -g 1 (with 1 group) doesn't
> > regress on the 224-CPU system. The next test with 4 groups starts to
> > regress by -7%. But the next one, hackbench -g 16, regresses by 187%
> > (duration is almost 3 times longer). It seems reasonable to assume that
> > the number of running threads and resources scales with the number of
> > CPUs because we want to run more stuff.
> 
> OK, I do understand that more jobs scale with the number of CPUs, but I
> would also expect that higher order pages are generally more expensive
> to get, so this is not really clear cut, especially under some more
> demand on the memory where allocations are smooth. So the question
> really is whether this is not just optimizing for artificial conditions.

The flip side is that smaller orders increase zone lock contention, and
contention can scale with the number of CPUs, so it's partially related.
hackbench-sockets is an extreme case (pipetest is not affected) but it's
the messenger in this case.

On an x86-64 2-socket 40-core (80-thread) machine, comparing a revert
of the patch with vanilla 5.11-rc5 gives:

hackbench-process-sockets
                            5.11-rc5               5.11-rc5
                     revert-lockstat       vanilla-lockstat
Amean     1        1.1560 (   0.00%)      1.0633 *   8.02%*
Amean     4        2.0797 (   0.00%)      2.5470 * -22.47%*
Amean     7        3.2693 (   0.00%)      4.3433 * -32.85%*
Amean     12       5.2043 (   0.00%)      6.5600 * -26.05%*
Amean     21      10.5817 (   0.00%)     11.3320 *  -7.09%*
Amean     30      13.3923 (   0.00%)     15.5817 * -16.35%*
Amean     48      20.3893 (   0.00%)     23.6733 * -16.11%*
Amean     79      31.4210 (   0.00%)     38.2787 * -21.83%*
Amean     110     43.6177 (   0.00%)     53.8847 * -23.54%*
Amean     141     56.3840 (   0.00%)     68.4257 * -21.36%*
Amean     172     70.0577 (   0.00%)     85.0077 * -21.34%*
Amean     203     81.9717 (   0.00%)    100.7137 * -22.86%*
Amean     234     95.1900 (   0.00%)    116.0280 * -21.89%*
Amean     265    108.9097 (   0.00%)    130.4307 * -19.76%*
Amean     296    119.7470 (   0.00%)    142.3637 * -18.89%*

i.e. the patch incurs a 7% to 32% performance penalty. This bisected
cleanly yesterday when I was looking for the regression and then found
the thread.

Numerous caches change size. For example, kmalloc-512 goes from order-0
(vanilla) to order-2 with the revert.
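
(That is 4096/512 = 8 objects per order-0 slab versus 16384/512 = 32 per
order-2 slab, so vanilla takes roughly 4x as many trips to the page
allocator for the same allocation stream.)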

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

VANILLA
                             &zone->lock:       1202731        1203433           0.07         120.55     1555485.48           1.29        8920825       12537091           0.06          84.10     9855085.12           0.79
                             -----------
                             &zone->lock          61903          [<00000000b47dc96a>] free_one_page+0x3f/0x530
                             &zone->lock           7655          [<00000000099f6e05>] get_page_from_freelist+0x475/0x1370
                             &zone->lock          36529          [<0000000075b9b918>] free_pcppages_bulk+0x1ac/0x7d0
                             &zone->lock        1097346          [<00000000b8e4950a>] get_page_from_freelist+0xaf0/0x1370
                             -----------
                             &zone->lock          44716          [<00000000099f6e05>] get_page_from_freelist+0x475/0x1370
                             &zone->lock          69813          [<0000000075b9b918>] free_pcppages_bulk+0x1ac/0x7d0
                             &zone->lock          31596          [<00000000b47dc96a>] free_one_page+0x3f/0x530
                             &zone->lock        1057308          [<00000000b8e4950a>] get_page_from_freelist+0xaf0/0x1370

REVERT
                             &zone->lock:        735827         739037           0.06          66.12      699661.56           0.95        4095299        7757942           0.05          54.35     5670083.68           0.73
                             -----------
                             &zone->lock         101927          [<00000000a60d5f86>] free_one_page+0x3f/0x530
                             &zone->lock         626426          [<00000000122cecf3>] get_page_from_freelist+0xaf0/0x1370
                             &zone->lock           9207          [<0000000068b9c9a1>] free_pcppages_bulk+0x1ac/0x7d0
                             &zone->lock           1477          [<00000000f856e720>] get_page_from_freelist+0x475/0x1370
                             -----------
                             &zone->lock           6249          [<00000000f856e720>] get_page_from_freelist+0x475/0x1370
                             &zone->lock          92224          [<00000000a60d5f86>] free_one_page+0x3f/0x530
                             &zone->lock          19690          [<0000000068b9c9a1>] free_pcppages_bulk+0x1ac/0x7d0
                             &zone->lock         620874          [<00000000122cecf3>] get_page_from_freelist+0xaf0/0x1370

Each individual wait time is small but the maximum waittime-max is roughly
double (120us vanilla vs 66us reverting the patch). Total wait time is
roughly doubled also due to the patch. Acquisitions are almost doubled.

So mostly this is down to the number of times SLUB calls into the page
allocator which only caches order-0 pages on a per-cpu basis. I do have
a prototype for a high-order per-cpu allocator but it is very rough --
high watermarks stop making sense, code is rough, memory needed for the
pcpu structures quadruples etc.


-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-28 13:45               ` Mel Gorman
@ 2021-01-28 13:57                 ` Michal Hocko
  2021-01-28 14:42                   ` Mel Gorman
  0 siblings, 1 reply; 37+ messages in thread
From: Michal Hocko @ 2021-01-28 13:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Vlastimil Babka, Christoph Lameter,
	Bharata B Rao, linux-kernel, linux-mm, David Rientjes,
	Joonsoo Kim, Andrew Morton, guro, Shakeel Butt, Johannes Weiner,
	aneesh.kumar, Jann Horn

On Thu 28-01-21 13:45:12, Mel Gorman wrote:
[...]
> So mostly this is down to the number of times SLUB calls into the page
> allocator which only caches order-0 pages on a per-cpu basis. I do have
> a prototype for a high-order per-cpu allocator but it is very rough --
> high watermarks stop making sense, code is rough, memory needed for the
> pcpu structures quadruples etc.

Thanks, this is really useful. But it really begs the question whether this
is a general case or more of an exception. And as such maybe we want to
define high-throughput caches which would gain higher order pages to
keep pace with allocation and reduce the churn, or deploy some other
techniques to reduce the direct page allocator involvement.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-28 13:57                 ` Michal Hocko
@ 2021-01-28 14:42                   ` Mel Gorman
  0 siblings, 0 replies; 37+ messages in thread
From: Mel Gorman @ 2021-01-28 14:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vincent Guittot, Vlastimil Babka, Christoph Lameter,
	Bharata B Rao, linux-kernel, linux-mm, David Rientjes,
	Joonsoo Kim, Andrew Morton, guro, Shakeel Butt, Johannes Weiner,
	aneesh.kumar, Jann Horn

On Thu, Jan 28, 2021 at 02:57:10PM +0100, Michal Hocko wrote:
> On Thu 28-01-21 13:45:12, Mel Gorman wrote:
> [...]
> > So mostly this is down to the number of times SLUB calls into the page
> > allocator which only caches order-0 pages on a per-cpu basis. I do have
> > a prototype for a high-order per-cpu allocator but it is very rough --
> > high watermarks stop making sense, code is rough, memory needed for the
> > pcpu structures quadruples etc.
> 
> Thanks, this is really useful. But it really begs the question whether this
> is a general case or more of an exception. And as such maybe we want to
> define high-throughput caches which would gain higher order pages to
> keep pace with allocation and reduce the churn, or deploy some other
> techniques to reduce the direct page allocator involvement.

I don't think we want to define "high throughput caches" because it'll
be workload dependent and a game of whack-a-mole. If the "high throughput
cache" is a kmalloc cache for some set of workloads and one of the inode
caches or dcaches for another one, there will be no setting that is
universally good.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-01-27 11:04                       ` Vlastimil Babka
@ 2021-02-03 11:10                         ` Bharata B Rao
  2021-02-04  7:32                           ` Vincent Guittot
  2021-02-04  9:33                           ` Vlastimil Babka
  0 siblings, 2 replies; 37+ messages in thread
From: Bharata B Rao @ 2021-02-03 11:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter, Will Deacon, Vincent Guittot, linux-kernel,
	linux-mm, David Rientjes, Joonsoo Kim, Andrew Morton, guro,
	Shakeel Butt, Johannes Weiner, aneesh.kumar, Jann Horn,
	Michal Hocko, Catalin Marinas

On Wed, Jan 27, 2021 at 12:04:01PM +0100, Vlastimil Babka wrote:
> On 1/27/21 10:10 AM, Christoph Lameter wrote:
> > On Tue, 26 Jan 2021, Will Deacon wrote:
> > 
> >> > Hm, but booting the secondaries is just a software (kernel) action? They are
> >> > already physically there, so it seems to me as if the cpu_present_mask is not
> >> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
> >>
> >> I think the present_mask retains CPUs if they are hotplugged off, whereas
> >> the online mask does not. We can't really do any better on arm64, as there's
> >> no way of telling that a CPU is present until we've seen it.
> > 
> > The order of each page in a kmem cache --and therefore also the number
> > of objects in a slab page-- can be different because that information is
> > stored in the page struct.
> > 
> > Therefore it is possible to retune the order while the cache is in operation.
> 
> Yes, but it's tricky to do the retuning safely, e.g. if freelist randomization
> is enabled, see [1].
> 
> But as a quick fix for the regression, the heuristic idea could work reasonably
> on all architectures?
> - if num_present_cpus() is > 1, trust that it doesn't have the issue such as
> arm64, and use it
> - otherwise use nr_cpu_ids
> 
> Long-term we can attempt to make the retuning safe, or decide that the number
> of cpus shouldn't determine the order...
> 
> [1] https://lore.kernel.org/linux-mm/d7fb9425-9a62-c7b8-604d-5828d7e6b1da@suse.cz/

So what is preferable here now? The above, some other quick fix, or reverting
the original commit?

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 37+ messages in thread
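
The quick-fix heuristic quoted above amounts to a two-step choice of the CPU
count. A minimal sketch in kernel-style C (num_present_cpus() and nr_cpu_ids
are real kernel symbols; the exact placement inside calculate_order() is an
assumption at this point in the thread):

	unsigned int nr_cpus;

	/*
	 * Trust num_present_cpus() only when it is greater than 1: some
	 * architectures (e.g. arm64) mark CPUs present only as they are
	 * onlined, so a value of 1 this early in boot tells us nothing.
	 */
	nr_cpus = num_present_cpus();
	if (nr_cpus <= 1)
		nr_cpus = nr_cpu_ids;	/* fall back to possible CPUs */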

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-02-03 11:10                         ` Bharata B Rao
@ 2021-02-04  7:32                           ` Vincent Guittot
  2021-02-04  9:07                             ` Christoph Lameter
  2021-02-04  9:33                           ` Vlastimil Babka
  1 sibling, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2021-02-04  7:32 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Vlastimil Babka, Christoph Lameter, Will Deacon, linux-kernel,
	linux-mm, David Rientjes, Joonsoo Kim, Andrew Morton, guro,
	Shakeel Butt, Johannes Weiner, aneesh.kumar, Jann Horn,
	Michal Hocko, Catalin Marinas

On Wed, 3 Feb 2021 at 12:10, Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> On Wed, Jan 27, 2021 at 12:04:01PM +0100, Vlastimil Babka wrote:
> > On 1/27/21 10:10 AM, Christoph Lameter wrote:
> > > On Tue, 26 Jan 2021, Will Deacon wrote:
> > >
> > >> > Hm, but booting the secondaries is just a software (kernel) action? They are
> > >> > already physically there, so it seems to me as if the cpu_present_mask is not
> > >> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
> > >>
> > >> I think the present_mask retains CPUs if they are hotplugged off, whereas
> > >> the online mask does not. We can't really do any better on arm64, as there's
> > >> no way of telling that a CPU is present until we've seen it.
> > >
> > > The order of each page in a kmem cache --and therefore also the number
> > > of objects in a slab page-- can be different because that information is
> > > stored in the page struct.
> > >
> > > Therefore it is possible to retune the order while the cache is in operation.
> >
> > Yes, but it's tricky to do the retuning safely, e.g. if freelist randomization
> > is enabled, see [1].
> >
> > But as a quick fix for the regression, the heuristic idea could work reasonably
> > on all architectures?
> > - if num_present_cpus() is > 1, trust that it doesn't have the issue such as
> > arm64, and use it
> > - otherwise use nr_cpu_ids
> >
> > Long-term we can attempt to make the retuning safe, or decide that the
> > number of cpus shouldn't determine the order...
> >
> > [1] https://lore.kernel.org/linux-mm/d7fb9425-9a62-c7b8-604d-5828d7e6b1da@suse.cz/
>
> So what is preferable here now? The above, some other quick fix, or reverting
> the original commit?

I'm fine with whatever solution as long as we can keep using nr_cpu_ids
when other values like num_present_cpus() don't correctly reflect the
system.

Regards,
Vincent

>
> Regards,
> Bharata.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-02-04  7:32                           ` Vincent Guittot
@ 2021-02-04  9:07                             ` Christoph Lameter
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Lameter @ 2021-02-04  9:07 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Bharata B Rao, Vlastimil Babka, Will Deacon, linux-kernel,
	linux-mm, David Rientjes, Joonsoo Kim, Andrew Morton, guro,
	Shakeel Butt, Johannes Weiner, aneesh.kumar, Jann Horn,
	Michal Hocko, Catalin Marinas

On Thu, 4 Feb 2021, Vincent Guittot wrote:

> > So what is preferable here now? The above, some other quick fix, or
> > reverting the original commit?
>
> I'm fine with whatever solution as long as we can keep using nr_cpu_ids
> when other values like num_present_cpus() don't correctly reflect the
> system.

AFAICT they are correctly reflecting the current state of the system.

The problem here is the bring-up of the system and the tuning done during it.

One additional thing that may help: the slab caches can work in a degraded
mode where no fastpath allocations can occur. That mode is used primarily
for debugging, but maybe it can also help during bootstrap to avoid
having to deal with the per-cpu data and so on.

In degraded mode SLUB will take a lock for each operation on an object.

In this mode the following is true:

  kmem_cache_cpu->page == NULL
  kmem_cache_cpu->freelist == NULL
  kmem_cache_debug(s) == true

So if you define a new debug mode and include it in SLAB_DEBUG_FLAGS, then
you can force SLUB to fall back to operations where a lock is taken and
where slab allocation can be stopped. This may be OK for bring-up.

The debug flags are also tied to some wizardry that can patch the code at
runtime to optimize for debugging or for fast operations. You would tie into
that as well: start in debug mode by default and switch to fast operations
after all processors are up.



^ permalink raw reply	[flat|nested] 37+ messages in thread
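
Christoph's degraded-mode suggestion is only an outline. As a rough
illustration of the invariants he lists (the helper below is hypothetical and
not in mm/slub.c; kmem_cache_debug() and struct kmem_cache_cpu are real):

	/*
	 * Sketch only: what the degraded mode described above looks like.
	 * No per-cpu fastpath state is ever populated, because the cache
	 * carries a debug flag, so every object operation takes the slow
	 * path under a lock.
	 */
	static bool slub_degraded(struct kmem_cache *s, struct kmem_cache_cpu *c)
	{
		return !c->page && !c->freelist && kmem_cache_debug(s);
	}

A new bootstrap-style flag added to SLAB_DEBUG_FLAGS, cleared once all
processors are up, would be one way to enter and leave this mode; such a flag
does not exist today and its name would be an invention.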

* Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
  2021-02-03 11:10                         ` Bharata B Rao
  2021-02-04  7:32                           ` Vincent Guittot
@ 2021-02-04  9:33                           ` Vlastimil Babka
  2021-02-08 13:41                             ` [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order Vlastimil Babka
  1 sibling, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2021-02-04  9:33 UTC (permalink / raw)
  To: bharata
  Cc: Christoph Lameter, Will Deacon, Vincent Guittot, linux-kernel,
	linux-mm, David Rientjes, Joonsoo Kim, Andrew Morton, guro,
	Shakeel Butt, Johannes Weiner, aneesh.kumar, Jann Horn,
	Michal Hocko, Catalin Marinas

On 2/3/21 12:10 PM, Bharata B Rao wrote:
> On Wed, Jan 27, 2021 at 12:04:01PM +0100, Vlastimil Babka wrote:
>> Yes, but it's tricky to do the retuning safely, e.g. if freelist randomization
>> is enabled, see [1].
>> 
>> But as a quick fix for the regression, the heuristic idea could work reasonably
>> on all architectures?
>> - if num_present_cpus() is > 1, trust that it doesn't have the issue such as
>> arm64, and use it
>> - otherwise use nr_cpu_ids
>> 
>> Long-term we can attempt to make the retuning safe, or decide that the
>> number of cpus shouldn't determine the order...
>> 
>> [1] https://lore.kernel.org/linux-mm/d7fb9425-9a62-c7b8-604d-5828d7e6b1da@suse.cz/
> 
> So what is preferable here now? The above, some other quick fix, or reverting
> the original commit?

I would try the above first; in case it doesn't work, revert. That is the
immediate fix for the regression that people can safely backport.
Anything more complex will take more time and would be riskier to backport.

> Regards,
> Bharata.
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order
  2021-02-04  9:33                           ` Vlastimil Babka
@ 2021-02-08 13:41                             ` Vlastimil Babka
  2021-02-08 14:54                               ` Vincent Guittot
  2021-02-10 14:07                               ` Mel Gorman
  0 siblings, 2 replies; 37+ messages in thread
From: Vlastimil Babka @ 2021-02-08 13:41 UTC (permalink / raw)
  To: vbabka
  Cc: Catalin.Marinas, akpm, aneesh.kumar, bharata, cl, guro, hannes,
	iamjoonsoo.kim, jannh, linux-kernel, linux-mm, mhocko, rientjes,
	shakeelb, vincent.guittot, will, Mel Gorman, stable

When creating a new kmem cache, SLUB determines how large the slab pages will
be based on a number of inputs, including the number of CPUs in the system.
Larger slab pages mean that more objects can be allocated/freed from per-cpu
slabs before accessing shared structures, but also that potentially more
memory can be wasted due to low slab usage and fragmentation.
The rough idea of using the number of CPUs is that larger systems will be more
likely to benefit from reduced contention, and also should have enough memory
to spare.

The number of CPUs used to be determined as nr_cpu_ids, which is the number of
possible cpus, but on some systems many will never be onlined, thus commit
045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page
order") changed it to num_online_cpus(). However, for kmem caches created
early, before CPUs are onlined, this may lead to permanently low slab page
sizes.

Vincent reports a regression [1] of hackbench on arm64 systems:

> I'm facing significant performances regression on a large arm64 server
> system (224 CPUs). Regressions is also present on small arm64 system
> (8 CPUs) but in a far smaller order of magnitude

> On 224 CPUs system : 9 iterations of hackbench -l 16000 -g 16
> v5.11-rc4 : 9.135sec (+/- 0.45%)
> v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
> v5.10: 3.136sec (+/- 0.40%)

Mel reports a regression [2] of hackbench on x86_64, with lockstat suggesting
page allocator contention:

> i.e. the patch incurs a 7% to 32% performance penalty. This bisected
> cleanly yesterday when I was looking for the regression and then found
> the thread.

> Numerous caches change size. For example, kmalloc-512 goes from order-0
> (vanilla) to order-2 with the revert.

> So mostly this is down to the number of times SLUB calls into the page
> allocator which only caches order-0 pages on a per-cpu basis.

Clearly num_online_cpus() doesn't work this early in boot. We could change
the order dynamically in a memory hotplug callback, but runtime order changing
for existing kmem caches has already been shown to be dangerous, and was
removed in 32a6f409b693 ("mm, slub: remove runtime allocation order changes").
It could be resurrected in a safe manner with some effort, but to fix the
regression we need something simpler.

We could use num_present_cpus() that should be the number of physically present
CPUs even before they are onlined. That would for for PowerPC [3], which
triggered the original commit,  but that still doesn't work on arm64 [4] as
explained in [5].

So this patch tries to determine the best available value without specific arch
knowledge.
- num_present_cpus() if the number is larger than 1, as that means the arch is
likely setting it properly
- nr_cpu_ids otherwise

This should fix the reported regressions while also keeping the effect of
045ab8c9487b for PowerPC systems. It's possible there are configurations where
num_present_cpus() is 1 during boot while nr_cpu_ids is at the same time
bloated, so these (if they exist) would keep the large orders based on
nr_cpu_ids as was before 045ab8c9487b.

[1] https://lore.kernel.org/linux-mm/CAKfTPtA_JgMf_+zdFbcb_V9rM7JBWNPjAz9irgwFj7Rou=xzZg@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/20210128134512.GF3592@techsingularity.net/
[3] https://lore.kernel.org/linux-mm/20210123051607.GC2587010@in.ibm.com/
[4] https://lore.kernel.org/linux-mm/CAKfTPtAjyVmS5VYvU6DBxg4-JEo5bdmWbngf-03YsY18cmWv_g@mail.gmail.com/
[5] https://lore.kernel.org/linux-mm/20210126230305.GD30941@willie-the-truck/

Fixes: 045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page order")
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---

OK, this is a 5.11 regression, so we should try to fix it by 5.12. I've also
Cc'd stable for that reason although it's not a crash fix.
We can still try later to replace this with a safe order update in hotplug
callbacks, but that's infeasible for 5.12.

 mm/slub.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 176b1cb0d006..8fc9190e6cb3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3454,6 +3454,7 @@ static inline int calculate_order(unsigned int size)
 	unsigned int order;
 	unsigned int min_objects;
 	unsigned int max_objects;
+	unsigned int nr_cpus;
 
 	/*
 	 * Attempt to find best configuration for a slab. This
@@ -3464,8 +3465,21 @@ static inline int calculate_order(unsigned int size)
 	 * we reduce the minimum objects required in a slab.
 	 */
 	min_objects = slub_min_objects;
-	if (!min_objects)
-		min_objects = 4 * (fls(num_online_cpus()) + 1);
+	if (!min_objects) {
+		/*
+		 * Some architectures will only update present cpus when
+		 * onlining them, so don't trust the number if it's just 1. But
+		 * we also don't want to use nr_cpu_ids always, as on some other
+		 * architectures, there can be many possible cpus, but never
+		 * onlined. Here we compromise between trying to avoid too high
+		 * order on systems that appear larger than they are, and too
+		 * low order on systems that appear smaller than they are.
+		 */
+		nr_cpus = num_present_cpus();
+		if (nr_cpus <= 1)
+			nr_cpus = nr_cpu_ids;
+		min_objects = 4 * (fls(nr_cpus) + 1);
+	}
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread
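
To see what the patch's fls()-based formula yields in practice, here is a
small self-contained userspace C program (fls_() is a stand-in for the
kernel's fls(); the CPU counts are just examples):

	#include <stdio.h>

	/* Userspace stand-in for the kernel's fls(): 1-based position of
	 * the most significant set bit, 0 for an input of 0. */
	static int fls_(unsigned int x)
	{
		return x ? 32 - __builtin_clz(x) : 0;
	}

	int main(void)
	{
		unsigned int cpus[] = { 1, 8, 16, 64, 224 };
		unsigned int i;

		for (i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++)
			printf("nr_cpus=%3u -> min_objects=%2d\n",
			       cpus[i], 4 * (fls_(cpus[i]) + 1));
		return 0;
	}

This prints 8, 20, 24, 32 and 36 respectively: the target object count grows
only logarithmically with the CPU count, so even the 224-CPU machine from
Vincent's report asks for at most 36 objects per slab.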

* Re: [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order
  2021-02-08 13:41                             ` [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order Vlastimil Babka
@ 2021-02-08 14:54                               ` Vincent Guittot
  2021-02-10 14:07                               ` Mel Gorman
  1 sibling, 0 replies; 37+ messages in thread
From: Vincent Guittot @ 2021-02-08 14:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Catalin Marinas, Andrew Morton, aneesh.kumar, Bharata B Rao,
	Christoph Lameter, guro, Johannes Weiner, Joonsoo Kim, Jann Horn,
	linux-kernel, linux-mm, Michal Hocko, David Rientjes,
	Shakeel Butt, Will Deacon, Mel Gorman, # v4 . 16+

On Mon, 8 Feb 2021 at 14:41, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> When creating a new kmem cache, SLUB determines how large the slab pages will
> be based on a number of inputs, including the number of CPUs in the system.
> Larger slab pages mean that more objects can be allocated/freed from per-cpu
> slabs before accessing shared structures, but also that potentially more
> memory can be wasted due to low slab usage and fragmentation.
> The rough idea of using the number of CPUs is that larger systems will be
> more likely to benefit from reduced contention, and also should have enough
> memory to spare.
>
> The number of CPUs used to be determined as nr_cpu_ids, which is the number
> of possible cpus, but on some systems many will never be onlined, thus commit
> 045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page
> order") changed it to num_online_cpus(). However, for kmem caches created
> early, before CPUs are onlined, this may lead to permanently low slab page
> sizes.
>
> Vincent reports a regression [1] of hackbench on arm64 systems:
>
> > I'm facing significant performances regression on a large arm64 server
> > system (224 CPUs). Regressions is also present on small arm64 system
> > (8 CPUs) but in a far smaller order of magnitude
>
> > On 224 CPUs system : 9 iterations of hackbench -l 16000 -g 16
> > v5.11-rc4 : 9.135sec (+/- 0.45%)
> > v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
> > v5.10: 3.136sec (+/- 0.40%)
>
> Mel reports a regression [2] of hackbench on x86_64, with lockstat suggesting
> page allocator contention:
>
> > i.e. the patch incurs a 7% to 32% performance penalty. This bisected
> > cleanly yesterday when I was looking for the regression and then found
> > the thread.
>
> > Numerous caches change size. For example, kmalloc-512 goes from order-0
> > (vanilla) to order-2 with the revert.
>
> > So mostly this is down to the number of times SLUB calls into the page
> > allocator which only caches order-0 pages on a per-cpu basis.
>
> Clearly num_online_cpus() doesn't work this early in boot. We could change
> the order dynamically in a memory hotplug callback, but runtime order changing
> for existing kmem caches has already been shown to be dangerous, and was
> removed in 32a6f409b693 ("mm, slub: remove runtime allocation order changes").
> It could be resurrected in a safe manner with some effort, but to fix the
> regression we need something simpler.
>
> We could use num_present_cpus() that should be the number of physically present
> CPUs even before they are onlined. That would for for PowerPC [3], which

minor typo : "That would for for PowerPC" should be "That would work
for PowerPC" ?

> triggered the original commit,  but that still doesn't work on arm64 [4] as
> explained in [5].
>
> So this patch tries to determine the best available value without specific arch
> knowledge.
> - num_present_cpus() if the number is larger than 1, as that means the arch is
> likely setting it properly
> - nr_cpu_ids otherwise
>
> This should fix the reported regressions while also keeping the effect of
> 045ab8c9487b for PowerPC systems. It's possible there are configurations where
> num_present_cpus() is 1 during boot while nr_cpu_ids is at the same time
> bloated, so these (if they exist) would keep the large orders based on
> nr_cpu_ids as was before 045ab8c9487b.
>
> [1] https://lore.kernel.org/linux-mm/CAKfTPtA_JgMf_+zdFbcb_V9rM7JBWNPjAz9irgwFj7Rou=xzZg@mail.gmail.com/
> [2] https://lore.kernel.org/linux-mm/20210128134512.GF3592@techsingularity.net/
> [3] https://lore.kernel.org/linux-mm/20210123051607.GC2587010@in.ibm.com/
> [4] https://lore.kernel.org/linux-mm/CAKfTPtAjyVmS5VYvU6DBxg4-JEo5bdmWbngf-03YsY18cmWv_g@mail.gmail.com/
> [5] https://lore.kernel.org/linux-mm/20210126230305.GD30941@willie-the-truck/
>
> Fixes: 045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page order")
> Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
> Reported-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Tested on both large and small arm64 systems. There is no regression
with this patch applied.

Tested-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>
> OK, this is a 5.11 regression, so we should try to fix it by 5.12. I've also
> Cc'd stable for that reason although it's not a crash fix.
> We can still try later to replace this with a safe order update in hotplug
> callbacks, but that's infeasible for 5.12.
>
>  mm/slub.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 176b1cb0d006..8fc9190e6cb3 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3454,6 +3454,7 @@ static inline int calculate_order(unsigned int size)
>         unsigned int order;
>         unsigned int min_objects;
>         unsigned int max_objects;
> +       unsigned int nr_cpus;
>
>         /*
>          * Attempt to find best configuration for a slab. This
> @@ -3464,8 +3465,21 @@ static inline int calculate_order(unsigned int size)
>          * we reduce the minimum objects required in a slab.
>          */
>         min_objects = slub_min_objects;
> -       if (!min_objects)
> -               min_objects = 4 * (fls(num_online_cpus()) + 1);
> +       if (!min_objects) {
> +               /*
> +                * Some architectures will only update present cpus when
> +                * onlining them, so don't trust the number if it's just 1. But
> +                * we also don't want to use nr_cpu_ids always, as on some other
> +                * architectures, there can be many possible cpus, but never
> +                * onlined. Here we compromise between trying to avoid too high
> +                * order on systems that appear larger than they are, and too
> +                * low order on systems that appear smaller than they are.
> +                */
> +               nr_cpus = num_present_cpus();
> +               if (nr_cpus <= 1)
> +                       nr_cpus = nr_cpu_ids;
> +               min_objects = 4 * (fls(nr_cpus) + 1);
> +       }
>         max_objects = order_objects(slub_max_order, size);
>         min_objects = min(min_objects, max_objects);
>
> --
> 2.30.0
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order
  2021-02-08 13:41                             ` [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order Vlastimil Babka
  2021-02-08 14:54                               ` Vincent Guittot
@ 2021-02-10 14:07                               ` Mel Gorman
  1 sibling, 0 replies; 37+ messages in thread
From: Mel Gorman @ 2021-02-10 14:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Catalin.Marinas, akpm, aneesh.kumar, bharata, cl, guro, hannes,
	iamjoonsoo.kim, jannh, linux-kernel, linux-mm, mhocko, rientjes,
	shakeelb, vincent.guittot, will, stable

On Mon, Feb 08, 2021 at 02:41:08PM +0100, Vlastimil Babka wrote:
> When creating a new kmem cache, SLUB determines how large the slab pages will
> be based on a number of inputs, including the number of CPUs in the system.
> Larger slab pages mean that more objects can be allocated/freed from per-cpu
> slabs before accessing shared structures, but also that potentially more
> memory can be wasted due to low slab usage and fragmentation.
> The rough idea of using the number of CPUs is that larger systems will be
> more likely to benefit from reduced contention, and also should have enough
> memory to spare.
> 
> <SNIP>
>
> So this patch tries to determine the best available value without specific arch
> knowledge.
> - num_present_cpus() if the number is larger than 1, as that means the arch is
> likely setting it properly
> - nr_cpu_ids otherwise
> 
> This should fix the reported regressions while also keeping the effect of
> 045ab8c9487b for PowerPC systems. It's possible there are configurations where
> num_present_cpus() is 1 during boot while nr_cpu_ids is at the same time
> bloated, so these (if they exist) would keep the large orders based on
> nr_cpu_ids as was before 045ab8c9487b.
> 

Tested-by: Mel Gorman <mgorman@techsingularity.net>

Only x86-64 tested, three machines, all showing similar results as would
be expected. One example:

hackbench-process-sockets
                          5.11.0-rc7             5.11.0-rc7             5.11.0-rc7
                             vanilla            revert-v1r1        vbabka-fix-v1r1
Amean     1        0.3873 (   0.00%)      0.4060 (  -4.82%)      0.3747 (   3.27%)
Amean     4        1.3767 (   0.00%)      0.7700 *  44.07%*      0.7790 *  43.41%*
Amean     7        2.4710 (   0.00%)      1.2753 *  48.39%*      1.2680 *  48.68%*
Amean     12       3.7103 (   0.00%)      1.9570 *  47.26%*      1.9470 *  47.52%*
Amean     21       5.9790 (   0.00%)      2.9760 *  50.23%*      2.9830 *  50.11%*
Amean     30       8.0467 (   0.00%)      4.0590 *  49.56%*      4.0410 *  49.78%*
Amean     48      12.8180 (   0.00%)      6.5167 *  49.16%*      6.4070 *  50.02%*
Amean     79      20.5150 (   0.00%)     10.3580 *  49.51%*     10.3740 *  49.43%*
Amean     110     25.5320 (   0.00%)     14.0453 *  44.99%*     14.0577 *  44.94%*
Amean     141     32.4170 (   0.00%)     17.3267 *  46.55%*     17.4977 *  46.02%*
Amean     172     40.0883 (   0.00%)     21.0360 *  47.53%*     21.1480 *  47.25%*
Amean     203     47.2923 (   0.00%)     25.2367 *  46.64%*     25.4923 *  46.10%*
Amean     234     55.2623 (   0.00%)     29.0720 *  47.39%*     29.3273 *  46.93%*
Amean     265     61.4513 (   0.00%)     33.0260 *  46.26%*     33.0617 *  46.20%*
Amean     296     73.2960 (   0.00%)     36.6920 *  49.94%*     37.2520 *  49.18%*

Comparing just a revert and the patch

                          5.11.0-rc7             5.11.0-rc7
                         revert-v1r1        vbabka-fix-v1r1
Amean     1        0.4060 (   0.00%)      0.3747 (   7.72%)
Amean     4        0.7700 (   0.00%)      0.7790 (  -1.17%)
Amean     7        1.2753 (   0.00%)      1.2680 (   0.58%)
Amean     12       1.9570 (   0.00%)      1.9470 (   0.51%)
Amean     21       2.9760 (   0.00%)      2.9830 (  -0.24%)
Amean     30       4.0590 (   0.00%)      4.0410 (   0.44%)
Amean     48       6.5167 (   0.00%)      6.4070 (   1.68%)
Amean     79      10.3580 (   0.00%)     10.3740 (  -0.15%)
Amean     110     14.0453 (   0.00%)     14.0577 (  -0.09%)
Amean     141     17.3267 (   0.00%)     17.4977 *  -0.99%*
Amean     172     21.0360 (   0.00%)     21.1480 (  -0.53%)
Amean     203     25.2367 (   0.00%)     25.4923 (  -1.01%)
Amean     234     29.0720 (   0.00%)     29.3273 (  -0.88%)
Amean     265     33.0260 (   0.00%)     33.0617 (  -0.11%)
Amean     296     36.6920 (   0.00%)     37.2520 (  -1.53%)

That's a negligible difference and all but one group (141) was within the
noise. Even for 141 it's very marginal and, with the degree of overload at
that group count, it can be ignored.

Thanks!

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2021-02-10 14:08 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-18  8:27 [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order Bharata B Rao
2020-11-18 11:25 ` Vlastimil Babka
2020-11-18 19:34   ` Roman Gushchin
2020-11-18 19:53     ` David Rientjes
2021-01-20 17:36 ` Vincent Guittot
2021-01-21  5:30   ` Bharata B Rao
2021-01-21  9:09     ` Vincent Guittot
2021-01-21 10:01     ` Christoph Lameter
2021-01-21 10:48       ` Vincent Guittot
2021-01-21 18:19       ` Vlastimil Babka
2021-01-22  8:03         ` Vincent Guittot
2021-01-22 12:03           ` Vlastimil Babka
2021-01-22 13:16             ` Vincent Guittot
2021-01-23  5:16             ` Bharata B Rao
2021-01-23 12:32               ` Vincent Guittot
2021-01-25 11:20                 ` Vlastimil Babka
2021-01-26 23:03                   ` Will Deacon
2021-01-27  9:10                     ` Christoph Lameter
2021-01-27 11:04                       ` Vlastimil Babka
2021-02-03 11:10                         ` Bharata B Rao
2021-02-04  7:32                           ` Vincent Guittot
2021-02-04  9:07                             ` Christoph Lameter
2021-02-04  9:33                           ` Vlastimil Babka
2021-02-08 13:41                             ` [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order Vlastimil Babka
2021-02-08 14:54                               ` Vincent Guittot
2021-02-10 14:07                               ` Mel Gorman
2021-01-22 13:05         ` [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order Jann Horn
2021-01-22 13:09           ` Jann Horn
2021-01-22 15:27           ` Vlastimil Babka
2021-01-25  4:28           ` Bharata B Rao
2021-01-26  8:52         ` Michal Hocko
2021-01-26 13:38           ` Vincent Guittot
2021-01-26 13:59             ` Michal Hocko
2021-01-27 13:38               ` Vlastimil Babka
2021-01-28 13:45               ` Mel Gorman
2021-01-28 13:57                 ` Michal Hocko
2021-01-28 14:42                   ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).