linux-mm.kvack.org archive mirror
* [RFC PATCH v4] mm/slub: Optimize slub memory usage
@ 2023-07-20 10:23 Jay Patel
  2023-08-10 17:54 ` Hyeonggon Yoo
  0 siblings, 1 reply; 11+ messages in thread
From: Jay Patel @ 2023-07-20 10:23 UTC (permalink / raw)
  To: linux-mm
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, vbabka,
	aneesh.kumar, tsahu, piyushs, jaypatel

In the current implementation of the slub memory allocator, the slab
order selection process follows these criteria:

1) Determine the minimum order required to serve the minimum number of
objects (min_objects). This calculation is based on the formula (order
= min_objects * object_size / PAGE_SIZE).
2) If the minimum order is greater than the maximum allowed order
(slub_max_order), set slub_max_order as the order for this slab.
3) If the minimum order is less than slub_max_order, iterate from the
minimum order up to slub_max_order and check whether the condition
(rem <= slab_size / fract_leftover) holds. Here, slab_size is
(PAGE_SIZE << order), rem is (slab_size % object_size), and
fract_leftover can be 16, 8, or 4. If the condition holds, select that
order for the slab.


However, in point 3, the allowed leftover (slab_size / fract_leftover)
spans a large range of values (256 bytes to 1 KB on a 4K page size, and
4 KB to 16 KB on a 64K page size, at order 0, growing further at higher
orders) against which the remainder (rem) is compared. This can lead to
the selection of an order that wastes more memory. To mitigate such
wastage, point 3 is modified to scale fract_leftover with the page
size, while retaining the current values as the default for a 4K page
size.

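To illustrate the arithmetic, here is a minimal userspace sketch (not
kernel code; it only mirrors the leftover threshold and the fraction
scaling introduced by the patch below):

#include <stdio.h>

#define SLUB_PAGE_FRAC_SHIFT 12	/* same constant as in the patch below */

int main(void)
{
	unsigned long page_sizes[] = { 4096, 65536 };

	for (int i = 0; i < 2; i++) {
		unsigned long ps = page_sizes[i];
		/* extra fraction added by the patch; 0 for 4K pages */
		unsigned long frac_extra = (ps >> SLUB_PAGE_FRAC_SHIFT) == 1 ?
					   0 : ps >> SLUB_PAGE_FRAC_SHIFT;

		printf("page size %lu:\n", ps);
		printf("  current max leftover at fraction 16: %lu bytes\n",
		       ps / 16);
		printf("  patched starting fraction %lu -> max leftover %lu bytes\n",
		       16 + frac_extra, ps / (16 + frac_extra));
	}
	return 0;
}

For a 4K page the starting fraction stays at 16 (max leftover 256 bytes
at order 0), while for a 64K page it becomes 32, capping the order-0
leftover at 2 KB instead of 4 KB.
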
Test results are as follows:

1) On 160 CPUs with 64K Page size

+-----------------+----------------+----------------+
|          Total wastage in slub memory             |
+-----------------+----------------+----------------+
|                 | After Boot     | After Hackbench|
| Normal          | 932 KB         | 1812 KB        |
| With Patch      | 729 KB         | 1636 KB        |
| Wastage reduced | ~22%           | ~10%           |
+-----------------+----------------+----------------+

+-----------------+----------------+----------------+
|            Total slub memory                      |
+-----------------+----------------+----------------+
|                 | After Boot     | After Hackbench|
| Normal          | 1855296        | 2944576        |
| With Patch      | 1544576        | 2692032        |
| Memory reduced  | ~17%           | ~9%            |
+-----------------+----------------+----------------+

hackbench-process-sockets (columns: statistic, number of groups, time
without the patch, time with the patch, relative change; lower is better)
+-------+-----+----------+----------+-----------+
| Amean | 1   | 1.2727   | 1.2450   | ( 2.22%)  |
| Amean | 4   | 1.6063   | 1.5810   | ( 1.60%)  |
| Amean | 7   | 2.4190   | 2.3983   | ( 0.86%)  |
| Amean | 12  | 3.9730   | 3.9347   | ( 0.97%)  |
| Amean | 21  | 6.9823   | 6.8957   | ( 1.26%)  |
| Amean | 30  | 10.1867  | 10.0600  | ( 1.26%)  |
| Amean | 48  | 16.7490  | 16.4853  | ( 1.60%)  |
| Amean | 79  | 28.1870  | 27.8673  | ( 1.15%)  |
| Amean | 110 | 39.8363  | 39.3793  | ( 1.16%)  |
| Amean | 141 | 51.5277  | 51.4907  | ( 0.07%)  |
| Amean | 172 | 62.9700  | 62.7300  | ( 0.38%)  |
| Amean | 203 | 74.5037  | 74.0630  | ( 0.59%)  |
| Amean | 234 | 85.6560  | 85.3587  | ( 0.35%)  |
| Amean | 265 | 96.9883  | 96.3770  | ( 0.63%)  |
| Amean | 296 | 108.6893 | 108.0870 | ( 0.56%)  |
+-------+-----+----------+----------+-----------+

2) On 16 CPUs with 64K Page size

+----------------+----------------+----------------+
|          Total wastage in slub memory            |
+----------------+----------------+----------------+
|                | After Boot     | After Hackbench|
| Normal         | 273 KB         | 544 KB         |
| With Patch     | 260 KB         | 500 KB         |
| Wastage reduced| ~5%            | ~9%            |
+----------------+----------------+----------------+

+-----------------+----------------+----------------+
|            Total slub memory                      |
+-----------------+----------------+----------------+
|                 | After Boot     | After Hackbench|
| Normal          | 275840         | 412480         |
| With Patch      | 272768         | 406208         |
| Memory reduced  | ~1%            | ~2%            |
+-----------------+----------------+----------------+

hackbench-process-sockets (columns: statistic, number of groups, time
without the patch, time with the patch, relative change; lower is better)
+-------+----+---------+---------+-----------+
| Amean | 1  | 0.9513  | 0.9250  | ( 2.77%)  |
| Amean | 4  | 2.9630  | 2.9570  | ( 0.20%)  |
| Amean | 7  | 5.1780  | 5.1763  | ( 0.03%)  |
| Amean | 12 | 8.8833  | 8.8817  | ( 0.02%)  |
| Amean | 21 | 15.7577 | 15.6883 | ( 0.44%)  |
| Amean | 30 | 22.2063 | 22.2843 | ( -0.35%) |
| Amean | 48 | 36.0587 | 36.1390 | ( -0.22%) |
| Amean | 64 | 49.7803 | 49.3457 | ( 0.87%)  |
+-------+----+---------+---------+-----------+

Signed-off-by: Jay Patel <jaypatel@linux.ibm.com>
---
Changes from V3
1) Resolved errors and optimized the logic for all architectures.

Changes from V2
1) Removed the page-order selection logic for slab caches based on
wastage.
2) Increased the fraction size based on page size (keeping the current
value as the default for 4K pages).

Changes from V1
1) If min_objects * object_size > PAGE_ALLOC_COSTLY_ORDER, it will
return PAGE_ALLOC_COSTLY_ORDER.
2) Similarly, if min_objects * object_size < PAGE_SIZE, it will
return slub_min_order.
3) Additionally, I changed slub_max_order to 2. There is no specific
reason for using the value 2, but it provided the best performance
results without any noticeable impact.

 mm/slub.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index c87628cd8a9a..8f6f38083b94 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -287,6 +287,7 @@ static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
 #define OO_SHIFT	16
 #define OO_MASK		((1 << OO_SHIFT) - 1)
 #define MAX_OBJS_PER_PAGE	32767 /* since slab.objects is u15 */
+#define SLUB_PAGE_FRAC_SHIFT 12
 
 /* Internal SLUB flags */
 /* Poison object */
@@ -4117,6 +4118,7 @@ static inline int calculate_order(unsigned int size)
 	unsigned int min_objects;
 	unsigned int max_objects;
 	unsigned int nr_cpus;
+	unsigned int page_size_frac;
 
 	/*
 	 * Attempt to find best configuration for a slab. This
@@ -4145,10 +4147,13 @@ static inline int calculate_order(unsigned int size)
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);
 
-	while (min_objects > 1) {
+	page_size_frac = ((PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT) == 1) ? 0
+		: PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT;
+
+	while (min_objects >= 1) {
 		unsigned int fraction;
 
-		fraction = 16;
+		fraction = 16 + page_size_frac;
 		while (fraction >= 4) {
 			order = calc_slab_order(size, min_objects,
 					slub_max_order, fraction);
@@ -4159,14 +4164,6 @@ static inline int calculate_order(unsigned int size)
 		min_objects--;
 	}
 
-	/*
-	 * We were unable to place multiple objects in a slab. Now
-	 * lets see if we can place a single object there.
-	 */
-	order = calc_slab_order(size, 1, slub_max_order, 1);
-	if (order <= slub_max_order)
-		return order;
-
 	/*
 	 * Doh this slab cannot be placed using slub_max_order.
 	 */
-- 
2.39.1



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v4] mm/slub: Optimize slub memory usage
  2023-07-20 10:23 [RFC PATCH v4] mm/slub: Optimize slub memory usage Jay Patel
@ 2023-08-10 17:54 ` Hyeonggon Yoo
  2023-08-11  6:52   ` Jay Patel
  2023-08-11 15:43   ` Vlastimil Babka
  0 siblings, 2 replies; 11+ messages in thread
From: Hyeonggon Yoo @ 2023-08-10 17:54 UTC (permalink / raw)
  To: Jay Patel
  Cc: linux-mm, cl, penberg, rientjes, iamjoonsoo.kim, akpm, vbabka,
	aneesh.kumar, tsahu, piyushs

On Thu, Jul 20, 2023 at 7:24 PM Jay Patel <jaypatel@linux.ibm.com> wrote:
>
> [ ... quoted commit message, test results and changelog trimmed ... ]
>
>  mm/slub.c | 17 +++++++----------
>  1 file changed, 7 insertions(+), 10 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index c87628cd8a9a..8f6f38083b94 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -287,6 +287,7 @@ static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
>  #define OO_SHIFT       16
>  #define OO_MASK                ((1 << OO_SHIFT) - 1)
>  #define MAX_OBJS_PER_PAGE      32767 /* since slab.objects is u15 */
> +#define SLUB_PAGE_FRAC_SHIFT 12
>
>  /* Internal SLUB flags */
>  /* Poison object */
> @@ -4117,6 +4118,7 @@ static inline int calculate_order(unsigned int size)
>         unsigned int min_objects;
>         unsigned int max_objects;
>         unsigned int nr_cpus;
> +       unsigned int page_size_frac;
>
>         /*
>          * Attempt to find best configuration for a slab. This
> @@ -4145,10 +4147,13 @@ static inline int calculate_order(unsigned int size)
>         max_objects = order_objects(slub_max_order, size);
>         min_objects = min(min_objects, max_objects);
>
> -       while (min_objects > 1) {
> +       page_size_frac = ((PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT) == 1) ? 0
> +               : PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT;
> +
> +       while (min_objects >= 1) {
>                 unsigned int fraction;
>
> -               fraction = 16;
> +               fraction = 16 + page_size_frac;
>                 while (fraction >= 4) {

Sorry I'm a bit late for the review.

IIRC hexagon/powerpc can have ridiculously large page sizes (1M or 256KB)
(but I don't know if such config is actually used, tbh) so I think
there should be
an upper bound.

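(For reference, a quick sketch of the starting fraction the v4 formula
would give for such page sizes; just the arithmetic from the patch, no
claim about whether those configs actually matter:)

#include <stdio.h>

int main(void)
{
	/* page sizes are examples; 256KB/1MB per the concern above */
	unsigned long sizes[] = { 4096, 65536, 262144, 1048576 };

	for (int i = 0; i < 4; i++) {
		unsigned long ps = sizes[i];
		unsigned long frac = (ps >> 12) == 1 ? 16 : 16 + (ps >> 12);

		printf("PAGE_SIZE %7lu -> starting fraction %3lu, max leftover %lu bytes\n",
		       ps, frac, ps / frac);
	}
	return 0;
}

i.e. the fraction grows linearly with the page size (16, 32, 80, 272
here) with nothing capping it.
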
>                         order = calc_slab_order(size, min_objects,
>                                         slub_max_order, fraction);
> @@ -4159,14 +4164,6 @@ static inline int calculate_order(unsigned int size)
>                 min_objects--;
>         }
> -       /*
> -        * We were unable to place multiple objects in a slab. Now
> -        * lets see if we can place a single object there.
> -        */
> -       order = calc_slab_order(size, 1, slub_max_order, 1);
> -       if (order <= slub_max_order)
> -               return order;

I'm not sure if it's okay to remove this?
It was fine in v2 because the least wasteful order was chosen
regardless of fraction but that's not true anymore.

Otherwise, everything looks fine to me. I'm too dumb to anticipate
the outcome of increasing the slab order :P but this patch does not
sound crazy to me.

Thanks!
--
Hyeonggon


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v4] mm/slub: Optimize slub memory usage
  2023-08-10 17:54 ` Hyeonggon Yoo
@ 2023-08-11  6:52   ` Jay Patel
  2023-08-18  5:11     ` Hyeonggon Yoo
  2023-08-11 15:43   ` Vlastimil Babka
  1 sibling, 1 reply; 11+ messages in thread
From: Jay Patel @ 2023-08-11  6:52 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm, cl, penberg, rientjes, iamjoonsoo.kim, akpm, vbabka,
	aneesh.kumar, tsahu, piyushs

On Fri, 2023-08-11 at 02:54 +0900, Hyeonggon Yoo wrote:
> On Thu, Jul 20, 2023 at 7:24 PM Jay Patel <jaypatel@linux.ibm.com>
> wrote:
> > [ ... quoted commit message, test results and changelog trimmed ... ]
> >
> >  mm/slub.c | 17 +++++++----------
> >  1 file changed, 7 insertions(+), 10 deletions(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index c87628cd8a9a..8f6f38083b94 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -287,6 +287,7 @@ static inline bool
> > kmem_cache_has_cpu_partial(struct kmem_cache *s)
> >  #define OO_SHIFT       16
> >  #define OO_MASK                ((1 << OO_SHIFT) - 1)
> >  #define MAX_OBJS_PER_PAGE      32767 /* since slab.objects is u15
> > */
> > +#define SLUB_PAGE_FRAC_SHIFT 12
> > 
> >  /* Internal SLUB flags */
> >  /* Poison object */
> > @@ -4117,6 +4118,7 @@ static inline int calculate_order(unsigned
> > int size)
> >         unsigned int min_objects;
> >         unsigned int max_objects;
> >         unsigned int nr_cpus;
> > +       unsigned int page_size_frac;
> > 
> >         /*
> >          * Attempt to find best configuration for a slab. This
> > @@ -4145,10 +4147,13 @@ static inline int calculate_order(unsigned
> > int size)
> >         max_objects = order_objects(slub_max_order, size);
> >         min_objects = min(min_objects, max_objects);
> > 
> > -       while (min_objects > 1) {
> > +       page_size_frac = ((PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT) == 1)
> > ? 0
> > +               : PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT;
> > +
> > +       while (min_objects >= 1) {
> >                 unsigned int fraction;
> > 
> > -               fraction = 16;
> > +               fraction = 16 + page_size_frac;
> >                 while (fraction >= 4) {
> 
> Sorry I'm a bit late for the review.
> 
> IIRC hexagon/powerpc can have ridiculously large page sizes (1M or
> 256KB)
> (but I don't know if such config is actually used, tbh) so I think
> there should be
> an upper bound.

Hi,
I think that might not be required, as an arch with a larger page size
will require a larger fraction value per this exit condition (rem <=
slab_size / fract_leftover) in calc_slab_order.
> 
> >                         order = calc_slab_order(size, min_objects,
> >                                         slub_max_order, fraction);
> > @@ -4159,14 +4164,6 @@ static inline int calculate_order(unsigned
> > int size)
> >                 min_objects--;
> >         }
> > -       /*
> > -        * We were unable to place multiple objects in a slab. Now
> > -        * lets see if we can place a single object there.
> > -        */
> > -       order = calc_slab_order(size, 1, slub_max_order, 1);
> > -       if (order <= slub_max_order)
> > -               return order;
> 
> I'm not sure if it's okay to remove this?
> It was fine in v2 because the least wasteful order was chosen
> regardless of fraction but that's not true anymore.
> 
OK, so my thought is: if a single object per slab, with slab_size =
PAGE_SIZE << slub_max_order, wastes more than 1/4th of slab_size, then
it's better to skip this part and use MAX_ORDER instead of
slub_max_order.
Could you kindly share your perspective on this part?

Thanks
Jay Patel
> Otherwise, everything looks fine to me. I'm too dumb to anticipate
> the outcome of increasing the slab order :P but this patch does not
> sound crazy to me.
> 
> Thanks!
> --
> Hyeonggon



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v4] mm/slub: Optimize slub memory usage
  2023-08-10 17:54 ` Hyeonggon Yoo
  2023-08-11  6:52   ` Jay Patel
@ 2023-08-11 15:43   ` Vlastimil Babka
  2023-08-24 10:52     ` Jay Patel
  1 sibling, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2023-08-11 15:43 UTC (permalink / raw)
  To: Hyeonggon Yoo, Jay Patel
  Cc: linux-mm, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	aneesh.kumar, tsahu, piyushs

On 8/10/23 19:54, Hyeonggon Yoo wrote:
>>                         order = calc_slab_order(size, min_objects,
>>                                         slub_max_order, fraction);
>> @@ -4159,14 +4164,6 @@ static inline int calculate_order(unsigned int size)
>>                 min_objects--;
>>         }
>> -       /*
>> -        * We were unable to place multiple objects in a slab. Now
>> -        * lets see if we can place a single object there.
>> -        */
>> -       order = calc_slab_order(size, 1, slub_max_order, 1);
>> -       if (order <= slub_max_order)
>> -               return order;
> 
> I'm not sure if it's okay to remove this?
> It was fine in v2 because the least wasteful order was chosen
> regardless of fraction but that's not true anymore.
> 
> Otherwise, everything looks fine to me. I'm too dumb to anticipate
> the outcome of increasing the slab order :P but this patch does not
> sound crazy to me.

I wanted to have a better idea of how the orders change, so I hacked up a patch
to print them for all sizes up to 1MB (unnecessarily large, I guess) and also
for various page sizes and nr_cpus (that's however rather invasive and prone
to me missing some helper being used that still relies on the real PAGE_SHIFT),
then I applied v4 (it needed some conflict fixups with my hack) on top:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slab-orders

As expected, things didn't change with 4k PAGE_SIZE. With 64k PAGE_SIZE, I
thought the patch in v4 form would result in lower orders, but seems not always?

I.e. I can see before the patch:

 Calculated slab orders for page_shift 16 nr_cpus 1:
          8       0
       4376       1

(so until 4368 bytes it keeps order at 0)

And after:
          8       0
       2264       1
       2272       0
       2344       1
       2352       0
       2432       1

Not sure this kind of "oscillation" is helpful on a small machine (1 CPU)
with 64kB pages, where the unused part of the page is quite small.

With 16 cpus, AFAICS the orders are also larger for some sizes.
Hm but you reported reduction of total slab memory which suggests lower
orders were selected somewhere, so maybe I did some mistake.

Anyway my point here is that this evaluation approach might be useful, even
if it's a non-upstreamable hack, and some postprocessing of the output is
needed for easier comparison of before/after, so feel free to try that out.

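(If anyone wants to play with something similar without patching the
kernel, below is a rough userspace approximation of the order
calculation as discussed in this thread, with a switch for the v4
behaviour. It ignores the slub_min_order/slub_min_objects boot
parameters, MAX_OBJS_PER_PAGE and other details, and PAGE_SHIFT/NR_CPUS
are just example values, so treat its output as indicative only, not as
the output of the hack above.)

#include <stdio.h>

#define PAGE_SHIFT		16	/* example: 64K pages; change as needed */
#define PAGE_SIZE		(1UL << PAGE_SHIFT)
#define SLUB_MAX_ORDER		3	/* default slub_max_order */
#define NR_CPUS			16	/* example CPU count */
#define SLUB_PAGE_FRAC_SHIFT	12	/* from the v4 patch */

static unsigned int fls_uint(unsigned int x)	/* like the kernel's fls() */
{
	unsigned int r = 0;

	for (; x; x >>= 1)
		r++;
	return r;
}

static unsigned int order_objects(unsigned int order, unsigned int size)
{
	return (PAGE_SIZE << order) / size;
}

static unsigned int get_order_approx(unsigned long size)	/* ~get_order() */
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

static unsigned int calc_slab_order(unsigned int size, unsigned int min_objects,
				    unsigned int max_order, unsigned int fract_leftover)
{
	unsigned int order;

	for (order = get_order_approx((unsigned long)min_objects * size);
	     order <= max_order; order++) {
		unsigned long slab_size = PAGE_SIZE << order;
		unsigned long rem = slab_size % size;

		/* accept this order once the leftover is small enough */
		if (rem <= slab_size / fract_leftover)
			break;
	}
	return order;
}

static int calculate_order(unsigned int size, int patched)
{
	unsigned int min_objects = 4 * (fls_uint(NR_CPUS) + 1);
	unsigned int max_objects = order_objects(SLUB_MAX_ORDER, size);
	unsigned int frac_start = 16;
	unsigned int order, fraction;

	if (min_objects > max_objects)
		min_objects = max_objects;

	/* v4: scale the starting fraction with the page size */
	if (patched && (PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT) != 1)
		frac_start += PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT;

	/* v4 also lets min_objects go down to 1 inside this loop */
	while (min_objects >= (patched ? 1u : 2u)) {
		for (fraction = frac_start; fraction >= 4; fraction /= 2) {
			order = calc_slab_order(size, min_objects,
						SLUB_MAX_ORDER, fraction);
			if (order <= SLUB_MAX_ORDER)
				return order;
		}
		min_objects--;
	}

	if (!patched) {
		/* single-object fallback that the v4 patch removes */
		order = calc_slab_order(size, 1, SLUB_MAX_ORDER, 1);
		if (order <= SLUB_MAX_ORDER)
			return order;
	}
	return -1;	/* the kernel would retry with MAX_ORDER here */
}

int main(void)
{
	int prev_before = -2, prev_after = -2;

	for (unsigned int size = 8; size <= PAGE_SIZE; size += 8) {
		int before = calculate_order(size, 0);
		int after = calculate_order(size, 1);

		/* print only where either order changes, like the tables above */
		if (before != prev_before || after != prev_after)
			printf("%7u  before=%d  after=%d\n", size, before, after);
		prev_before = before;
		prev_after = after;
	}
	return 0;
}
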
BTW I'll be away for 2 weeks from now, so further feedback will have to come
from others in that time...

> Thanks!
> --
> Hyeonggon



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v4] mm/slub: Optimize slub memory usage
  2023-08-11  6:52   ` Jay Patel
@ 2023-08-18  5:11     ` Hyeonggon Yoo
  2023-08-18  6:41       ` Jay Patel
  0 siblings, 1 reply; 11+ messages in thread
From: Hyeonggon Yoo @ 2023-08-18  5:11 UTC (permalink / raw)
  To: jaypatel
  Cc: linux-mm, cl, penberg, rientjes, iamjoonsoo.kim, akpm, vbabka,
	aneesh.kumar, tsahu, piyushs

On Fri, Aug 11, 2023 at 3:52 PM Jay Patel <jaypatel@linux.ibm.com> wrote:
>
> On Fri, 2023-08-11 at 02:54 +0900, Hyeonggon Yoo wrote:
> > On Thu, Jul 20, 2023 at 7:24 PM Jay Patel <jaypatel@linux.ibm.com>
> > wrote:
> > > [ ... quoted commit message, test results and changelog trimmed ... ]
> > >
> > >  mm/slub.c | 17 +++++++----------
> > >  1 file changed, 7 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index c87628cd8a9a..8f6f38083b94 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -287,6 +287,7 @@ static inline bool
> > > kmem_cache_has_cpu_partial(struct kmem_cache *s)
> > >  #define OO_SHIFT       16
> > >  #define OO_MASK                ((1 << OO_SHIFT) - 1)
> > >  #define MAX_OBJS_PER_PAGE      32767 /* since slab.objects is u15
> > > */
> > > +#define SLUB_PAGE_FRAC_SHIFT 12
> > >
> > >  /* Internal SLUB flags */
> > >  /* Poison object */
> > > @@ -4117,6 +4118,7 @@ static inline int calculate_order(unsigned
> > > int size)
> > >         unsigned int min_objects;
> > >         unsigned int max_objects;
> > >         unsigned int nr_cpus;
> > > +       unsigned int page_size_frac;
> > >
> > >         /*
> > >          * Attempt to find best configuration for a slab. This
> > > @@ -4145,10 +4147,13 @@ static inline int calculate_order(unsigned
> > > int size)
> > >         max_objects = order_objects(slub_max_order, size);
> > >         min_objects = min(min_objects, max_objects);
> > >
> > > -       while (min_objects > 1) {
> > > +       page_size_frac = ((PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT) == 1)
> > > ? 0
> > > +               : PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT;
> > > +
> > > +       while (min_objects >= 1) {
> > >                 unsigned int fraction;
> > >
> > > -               fraction = 16;
> > > +               fraction = 16 + page_size_frac;
> > >                 while (fraction >= 4) {
> >
> > Sorry I'm a bit late for the review.
> >
> > IIRC hexagon/powerpc can have ridiculously large page sizes (1M or
> > 256KB)
> > (but I don't know if such config is actually used, tbh) so I think
> > there should be
> > an upper bound.
>
> Hi,
> I think that might not be required as arch with larger page size
> will required larger fraction value as per this exit condition (rem <=
> slab_size / fract_leftover) during calc_slab_order.

Okay, with 256KB pages the fraction will start from 80, and then 40,
20, 10, 5, ...
and 1/80 of 256KB is about 3KB. So the point is to waste less even when
the machine uses large page sizes, because 1/16 of 256KB is still
large, right?

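(Quick check of that sequence, assuming the fraction simply halves each
iteration as in the current loop:)

#include <stdio.h>

int main(void)
{
	unsigned long page = 262144;			/* 256KB example */
	unsigned long fraction = 16 + (page >> 12);	/* v4 start: 80 */

	for (; fraction >= 4; fraction /= 2)
		printf("fraction %3lu -> leftover accepted up to %lu bytes\n",
		       fraction, page / fraction);
	return 0;
}

which walks through 80, 40, 20, 10, 5 and accepts at most ~3.2KB of
leftover on the first pass, matching the numbers above.
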
> > >                         order = calc_slab_order(size, min_objects,
> > >                                         slub_max_order, fraction);
> > > @@ -4159,14 +4164,6 @@ static inline int calculate_order(unsigned
> > > int size)
> > >                 min_objects--;
> > >         }
> > > -       /*
> > > -        * We were unable to place multiple objects in a slab. Now
> > > -        * lets see if we can place a single object there.
> > > -        */
> > > -       order = calc_slab_order(size, 1, slub_max_order, 1);
> > > -       if (order <= slub_max_order)
> > > -               return order;
> >
> > I'm not sure if it's okay to remove this?
> > It was fine in v2 because the least wasteful order was chosen
> > regardless of fraction but that's not true anymore.
> >
> Ok, So my though are like if single object in slab with slab_size =
> PAGE_SIZE << slub_max_order and it wastage more then 1\4th of slab_size
> then it's better to skip this part and use MAX_ORDER instead of
> slub_max_order.
> Could you kindly share your perspective on this part?

I simply missed that part! :)
That looks fine to me.


> Thanks
> Jay Patel
> > Otherwise, everything looks fine to me. I'm too dumb to anticipate
> > the outcome of increasing the slab order :P but this patch does not
> > sound crazy to me.
> >
> > Thanks!
> > --
> > Hyeonggon
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v4] mm/slub: Optimize slub memory usage
  2023-08-18  5:11     ` Hyeonggon Yoo
@ 2023-08-18  6:41       ` Jay Patel
  0 siblings, 0 replies; 11+ messages in thread
From: Jay Patel @ 2023-08-18  6:41 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-mm, cl, penberg, rientjes, iamjoonsoo.kim, akpm, vbabka,
	aneesh.kumar, tsahu, piyushs

On Fri, 2023-08-18 at 14:11 +0900, Hyeonggon Yoo wrote:
> On Fri, Aug 11, 2023 at 3:52 PM Jay Patel <jaypatel@linux.ibm.com>
> wrote:
> > On Fri, 2023-08-11 at 02:54 +0900, Hyeonggon Yoo wrote:
> > > On Thu, Jul 20, 2023 at 7:24 PM Jay Patel <jaypatel@linux.ibm.com
> > > >
> > > wrote:
> > > > [ ... quoted commit message, test results and changelog trimmed ... ]
> > > >
> > > >  mm/slub.c | 17 +++++++----------
> > > >  1 file changed, 7 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > index c87628cd8a9a..8f6f38083b94 100644
> > > > --- a/mm/slub.c
> > > > +++ b/mm/slub.c
> > > > @@ -287,6 +287,7 @@ static inline bool
> > > > kmem_cache_has_cpu_partial(struct kmem_cache *s)
> > > >  #define OO_SHIFT       16
> > > >  #define OO_MASK                ((1 << OO_SHIFT) - 1)
> > > >  #define MAX_OBJS_PER_PAGE      32767 /* since slab.objects is
> > > > u15
> > > > */
> > > > +#define SLUB_PAGE_FRAC_SHIFT 12
> > > > 
> > > >  /* Internal SLUB flags */
> > > >  /* Poison object */
> > > > @@ -4117,6 +4118,7 @@ static inline int
> > > > calculate_order(unsigned
> > > > int size)
> > > >         unsigned int min_objects;
> > > >         unsigned int max_objects;
> > > >         unsigned int nr_cpus;
> > > > +       unsigned int page_size_frac;
> > > > 
> > > >         /*
> > > >          * Attempt to find best configuration for a slab. This
> > > > @@ -4145,10 +4147,13 @@ static inline int
> > > > calculate_order(unsigned
> > > > int size)
> > > >         max_objects = order_objects(slub_max_order, size);
> > > >         min_objects = min(min_objects, max_objects);
> > > > 
> > > > -       while (min_objects > 1) {
> > > > +       page_size_frac = ((PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT)
> > > > == 1)
> > > > ? 0
> > > > +               : PAGE_SIZE >> SLUB_PAGE_FRAC_SHIFT;
> > > > +
> > > > +       while (min_objects >= 1) {
> > > >                 unsigned int fraction;
> > > > 
> > > > -               fraction = 16;
> > > > +               fraction = 16 + page_size_frac;
> > > >                 while (fraction >= 4) {
> > > 
> > > Sorry I'm a bit late for the review.
> > > 
> > > IIRC hexagon/powerpc can have ridiculously large page sizes (1M
> > > or
> > > 256KB)
> > > (but I don't know if such config is actually used, tbh) so I
> > > think
> > > there should be
> > > an upper bound.
> > 
> > Hi,
> > I think that might not be required as arch with larger page size
> > will required larger fraction value as per this exit condition (rem
> > <=
> > slab_size / fract_leftover) during calc_slab_order.
> 
> Okay, with 256KB pages the fraction will start from 80, and then 40,
> 20, 10, 5, ...
> and 1/80 of 256KB is about 3KB. So it's to waste less even when the
> machine uses large page sizes,
> because 1/16 of 256KB  is still large, right?

Yes, correct. So with this approach we can reduce both memory wastage
and total slub memory when using a larger page size :)
> 
> > > >                         order = calc_slab_order(size,
> > > > min_objects,
> > > >                                         slub_max_order,
> > > > fraction);
> > > > @@ -4159,14 +4164,6 @@ static inline int
> > > > calculate_order(unsigned
> > > > int size)
> > > >                 min_objects--;
> > > >         }
> > > > -       /*
> > > > -        * We were unable to place multiple objects in a slab.
> > > > Now
> > > > -        * lets see if we can place a single object there.
> > > > -        */
> > > > -       order = calc_slab_order(size, 1, slub_max_order, 1);
> > > > -       if (order <= slub_max_order)
> > > > -               return order;
> > > 
> > > I'm not sure if it's okay to remove this?
> > > It was fine in v2 because the least wasteful order was chosen
> > > regardless of fraction but that's not true anymore.
> > > 
> > Ok, So my though are like if single object in slab with slab_size =
> > PAGE_SIZE << slub_max_order and it wastage more then 1\4th of
> > slab_size
> > then it's better to skip this part and use MAX_ORDER instead of
> > slub_max_order.
> > Could you kindly share your perspective on this part?
> 
> I simply missed that part! :)
> That looks fine to me.
> 
> 
> > Thanks
> > Jay Patel
> > > Otherwise, everything looks fine to me. I'm too dumb to
> > > anticipate
> > > the outcome of increasing the slab order :P but this patch does
> > > not
> > > sound crazy to me.
> > > 
> > > Thanks!
> > > --
> > > Hyeonggon



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v4] mm/slub: Optimize slub memory usage
  2023-08-11 15:43   ` Vlastimil Babka
@ 2023-08-24 10:52     ` Jay Patel
  2023-09-07 13:42       ` Vlastimil Babka
  0 siblings, 1 reply; 11+ messages in thread
From: Jay Patel @ 2023-08-24 10:52 UTC (permalink / raw)
  To: Vlastimil Babka, Hyeonggon Yoo
  Cc: linux-mm, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	aneesh.kumar, tsahu, piyushs

On Fri, 2023-08-11 at 17:43 +0200, Vlastimil Babka wrote:
> On 8/10/23 19:54, Hyeonggon Yoo wrote:
> > >                         order = calc_slab_order(size,
> > > min_objects,
> > >                                         slub_max_order,
> > > fraction);
> > > @@ -4159,14 +4164,6 @@ static inline int calculate_order(unsigned
> > > int size)
> > >                 min_objects--;
> > >         }
> > > -       /*
> > > -        * We were unable to place multiple objects in a slab.
> > > Now
> > > -        * lets see if we can place a single object there.
> > > -        */
> > > -       order = calc_slab_order(size, 1, slub_max_order, 1);
> > > -       if (order <= slub_max_order)
> > > -               return order;
> > 
> > I'm not sure if it's okay to remove this?
> > It was fine in v2 because the least wasteful order was chosen
> > regardless of fraction but that's not true anymore.
> > 
> > Otherwise, everything looks fine to me. I'm too dumb to anticipate
> > the outcome of increasing the slab order :P but this patch does not
> > sound crazy to me.
> 
> I wanted to have a better idea how the orders change so I hacked up a
> patch
> to print them for all sizes up to 1MB (unnecessarily large I guess)
> and also
> for various page sizes and nr_cpus (that's however rather invasive
> and prone
> to me missing some helper being used that still relies on real
> PAGE_SHIFT),
> then I applied v4 (needed some conflict fixups with my hack) on top:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slab-orders
> 
> As expected, things didn't change with 4k PAGE_SIZE. With 64k
> PAGE_SIZE, I
> thought the patch in v4 form would result in lower orders, but seems
> not always?
> 
> I.e. I can see before the patch:
> 
>  Calculated slab orders for page_shift 16 nr_cpus 1:
>           8       0
>        4376       1
> 
> (so until 4368 bytes it keeps order at 0)
> 
> And after:
>           8       0
>        2264       1
>        2272       0
>        2344       1
>        2352       0
>        2432       1
> 
> Not sure this kind of "oscillation" is helpful with a small machine
> (1CPU),
> and 64kB pages so the unused part of page is quite small.
> 
Hi Vlastimil,

With the patch, the fraction_size rises to 32 when using a 64K page
size. As a result, the maximum wastage cap for each slab cache is 2K
(64K divided by 32). Any object size exceeding this cap is moved to
order 1 or beyond, which is why this oscillation is seen.
 
> With 16 cpus, AFAICS the orders are also larger for some sizes.
> Hm but you reported reduction of total slab memory which suggests
> lower
> orders were selected somewhere, so maybe I did some mistake.

AFAIK total slab memory is reduced for two reasons (with this patch,
for larger page sizes):
1) The order for some slab caches is reduced (by increasing fraction_size).
2) I have also seen a reduction in the overall slab counts because of
the increased page order.

> 
> Anyway my point here is that this evaluation approach might be
> useful, even
> if it's a non-upstreamable hack, and some postprocessing of the
> output is
> needed for easier comparison of before/after, so feel free to try
> that out.

Thank you for this detailed test :)
> 
> BTW I'll be away for 2 weeks from now, so further feedback will have
> to come
> from others in that time...
> 
Do we have any additional feedback from others on the same matter?

Thanks

Jay Patel
> > Thanks!
> > --
> > Hyeonggon



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v4] mm/slub: Optimize slub memory usage
  2023-08-24 10:52     ` Jay Patel
@ 2023-09-07 13:42       ` Vlastimil Babka
  2023-09-14  5:40         ` Jay Patel
  0 siblings, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2023-09-07 13:42 UTC (permalink / raw)
  To: jaypatel, Hyeonggon Yoo
  Cc: linux-mm, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	aneesh.kumar, tsahu, piyushs

On 8/24/23 12:52, Jay Patel wrote:
> On Fri, 2023-08-11 at 17:43 +0200, Vlastimil Babka wrote:
>> On 8/10/23 19:54, Hyeonggon Yoo wrote:
>> > >                         order = calc_slab_order(size,
>> > > min_objects,
>> > >                                         slub_max_order,
>> > > fraction);
>> > > @@ -4159,14 +4164,6 @@ static inline int calculate_order(unsigned
>> > > int size)
>> > >                 min_objects--;
>> > >         }
>> > > -       /*
>> > > -        * We were unable to place multiple objects in a slab.
>> > > Now
>> > > -        * lets see if we can place a single object there.
>> > > -        */
>> > > -       order = calc_slab_order(size, 1, slub_max_order, 1);
>> > > -       if (order <= slub_max_order)
>> > > -               return order;
>> > 
>> > I'm not sure if it's okay to remove this?
>> > It was fine in v2 because the least wasteful order was chosen
>> > regardless of fraction but that's not true anymore.
>> > 
>> > Otherwise, everything looks fine to me. I'm too dumb to anticipate
>> > the outcome of increasing the slab order :P but this patch does not
>> > sound crazy to me.
>> 
>> I wanted to have a better idea how the orders change so I hacked up a
>> patch
>> to print them for all sizes up to 1MB (unnecessarily large I guess)
>> and also
>> for various page sizes and nr_cpus (that's however rather invasive
>> and prone
>> to me missing some helper being used that still relies on real
>> PAGE_SHIFT),
>> then I applied v4 (needed some conflict fixups with my hack) on top:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slab-orders
>> 
>> As expected, things didn't change with 4k PAGE_SIZE. With 64k
>> PAGE_SIZE, I
>> thought the patch in v4 form would result in lower orders, but seems
>> not always?
>> 
>> I.e. I can see before the patch:
>> 
>>  Calculated slab orders for page_shift 16 nr_cpus 1:
>>           8       0
>>        4376       1
>> 
>> (so until 4368 bytes it keeps order at 0)
>> 
>> And after:
>>           8       0
>>        2264       1
>>        2272       0
>>        2344       1
>>        2352       0
>>        2432       1
>> 
>> Not sure this kind of "oscillation" is helpful with a small machine
>> (1CPU),
>> and 64kB pages so the unused part of page is quite small.
>> 
> Hi Vlastimil,
>  
> With patch. it will cause the fraction_size to rise to 32
> when utilizing a 64k page size. As a result, the maximum wastage cap
> for each slab cache will be 2k (64k divided by 32). Any object size
> exceeding this cap will be moved to order 1 or beyond due to which this
> oscillation is seen.

Hi, sorry for the late reply.

>> With 16 cpus, AFAICS the orders are also larger for some sizes.
>> Hm but you reported reduction of total slab memory which suggests
>> lower
>> orders were selected somewhere, so maybe I did some mistake.
> 
> AFAIK total slab memory is reduce because of two reason (with this
> patch for larger page size) 
> 1) order for some slab cache is reduce (by increasing fraction_size)

How can an increased fraction_size ever result in a lower order? I think it
can only result in an increased (or the same) order, and the simulations with
my hack patch don't seem to provide a counterexample. Note that previously I
did expect the order to be lower (or the same) and was surprised by my
results, but now I realize I misunderstood the v4 patch.
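
To make that concrete: the acceptance test in calc_slab_order() is
rem <= slab_size / fract_leftover, so doubling fract_leftover halves the
allowed leftover and can only disqualify an order that previously passed,
never qualify a lower one. A tiny userspace illustration (64K page, order-0
slab only, sizes taken from the oscillation output quoted above, min_objects
handling left out):

#include <stdio.h>

int main(void)
{
	unsigned int sizes[] = { 2264, 2272 };	/* from the 1-cpu output above */
	unsigned int slab_size = 64 * 1024;	/* order-0 slab with 64K pages */

	for (unsigned int i = 0; i < 2; i++) {
		unsigned int rem = slab_size % sizes[i];

		printf("size %u: rem %u, limit at 1/16 = %u, at 1/32 = %u -> %s\n",
		       sizes[i], rem, slab_size / 16, slab_size / 32,
		       rem <= slab_size / 32 ? "stays at order 0"
					     : "pushed to order 1");
	}
	return 0;
}

With 1/16 both sizes stay at order 0; with 1/32 only 2272 does, which is
exactly the oscillation visible in the table quoted above.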

> 2) Have also seen reduction in overall slab cache numbers as because of
> increasing page order

I think your results might be just due to randomness and could turn out
differently if you repeat the test, or converge to the same values if you
average multiple runs. You posted them for "160 CPUs with 64K Page size" and
if I add that combination to my hack print, I see the same result before and
after your patch:

Calculated slab orders for page_shift 16 nr_cpus 160:
         8       0
      1824       1
      3648       2
      7288       3
    174768       2
    196608       3
    524296       4

Still, I might have a bug there. Can you confirm there are actual
differences in /proc/slabinfo before/after your patch? If there are
none, any differences observed have to be due to randomness, not
differences in order.
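
In case it's useful for cross-checking without my hack patch, below is a
rough standalone emulation of the calculate_order() logic (a sketch only:
it ignores slub_min_order and the MAX_OBJS_PER_PAGE clamp, and approximates
the single-object fallback with a plain get_order(size), so treat the output
as an approximation rather than the authoritative in-kernel result). It
prints, for 64K pages and 160 CPUs, the sizes at which the picked order
changes, once with the current initial fraction of 16 and once with 32:

#include <stdio.h>

#define PAGE_SHIFT	16			/* 64K pages */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define SLUB_MAX_ORDER	3			/* slub_max_order default */

/* kernel-style fls(): position of the most significant set bit */
static unsigned int fls_u(unsigned int x)
{
	unsigned int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

static unsigned int get_order(unsigned long size)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

static unsigned int calc_slab_order(unsigned int size, unsigned int min_objects,
				    unsigned int max_order, unsigned int fract)
{
	unsigned int order;

	for (order = get_order((unsigned long)min_objects * size);
	     order <= max_order; order++) {
		unsigned long slab_size = PAGE_SIZE << order;
		unsigned long rem = slab_size % size;

		/* accept this order once the leftover is small enough */
		if (rem <= slab_size / fract)
			break;
	}
	return order;
}

static unsigned int calculate_order(unsigned int size, unsigned int nr_cpus,
				    unsigned int start_fraction)
{
	unsigned int min_objects = 4 * (fls_u(nr_cpus) + 1);
	unsigned int max_objects = (PAGE_SIZE << SLUB_MAX_ORDER) / size;
	unsigned int order, fraction;

	if (min_objects > max_objects)
		min_objects = max_objects;

	while (min_objects > 1) {
		for (fraction = start_fraction; fraction >= 4; fraction /= 2) {
			order = calc_slab_order(size, min_objects,
						SLUB_MAX_ORDER, fraction);
			if (order <= SLUB_MAX_ORDER)
				return order;
		}
		min_objects--;
	}
	/* crude stand-in for the single-object fallback path */
	return get_order(size);
}

int main(void)
{
	unsigned int size, prev16 = -1U, prev32 = -1U;

	for (size = 8; size <= 524296; size += 8) {
		unsigned int o16 = calculate_order(size, 160, 16);
		unsigned int o32 = calculate_order(size, 160, 32);

		/* print only the sizes where either picked order changes */
		if (o16 != prev16 || o32 != prev32)
			printf("%8u   fraction 16 -> %u   fraction 32 -> %u\n",
			       size, o16, o32);
		prev16 = o16;
		prev32 = o32;
	}
	return 0;
}

For nr_cpus = 160 the two columns come out identical, matching the table
above; changing nr_cpus to 1 or 16 should roughly reproduce the differences
mentioned earlier.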

Going back to the idea behind your patch, I don't think it makes sense to
try to increase the fraction only for higher orders. Yes, with a 1/16
fraction, the waste with a 64kB page can be 4kB, while with 1/32 it will be
just 2kB, and with 4kB pages this is only 256 vs 128 bytes. However, the
object sizes and counts don't differ with page size, so with 4kB pages we'll
have more slabs to host the same number of objects, and the waste will
accumulate accordingly - i.e. the fraction metric should be independent of
page size wrt the resulting total kilobytes of waste.
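
To put rough numbers on that (the object size and count below are made-up
illustrative values, and order-0 slabs are assumed), the worst case that a
1/16 fraction permits works out to about the same total for both page sizes:

#include <stdio.h>

int main(void)
{
	unsigned long obj_size = 1000, nr_objs = 100000, fraction = 16;
	unsigned long page_sizes[] = { 4096, 65536 };

	for (int i = 0; i < 2; i++) {
		unsigned long slab = page_sizes[i];	/* order-0 slab */
		unsigned long cap = slab / fraction;	/* worst waste allowed per slab */
		unsigned long objs_per_slab = slab / obj_size;
		unsigned long nr_slabs = (nr_objs + objs_per_slab - 1) / objs_per_slab;

		printf("page %6lu: cap/slab %4lu, slabs %6lu, worst-case total ~%lu KB\n",
		       slab, cap, nr_slabs, cap * nr_slabs / 1024);
	}
	return 0;
}

That prints a worst-case total of roughly 6 MB in both cases, i.e. the
1/fraction limit bounds the accumulated waste about equally regardless of
page size.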

So maybe the only thing we need to do is to try setting its initial value
to 32 instead of 16 regardless of page size. That should hopefully again
show a good tradeoff for 4kB, as one of the earlier versions did, while on
64kB it shouldn't cause much difference (again, none at all with 160 cpus,
some difference with fewer than 128 cpus, if my simulations were correct).
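
Concretely, I think it would amount to just this in calculate_order() (an
untested sketch of the idea against the current fraction loop; the exact
placement may need adapting to other pending slub changes):

	while (min_objects > 1) {
		unsigned int fraction;

		/*
		 * Start with a tighter 1/32 waste limit for all page sizes
		 * and fall back to looser fractions as before.
		 */
		fraction = 32;	/* was 16 */
		while (fraction >= 4) {
			order = calc_slab_order(size, min_objects,
					slub_max_order, fraction);
			if (order <= slub_max_order)
				return order;
			fraction /= 2;
		}
		min_objects--;
	}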

>> 
>> Anyway my point here is that this evaluation approach might be
>> useful, even
>> if it's a non-upstreamable hack, and some postprocessing of the
>> output is
>> needed for easier comparison of before/after, so feel free to try
>> that out.
> 
> Thank you for this details test :) 
>> 
>> BTW I'll be away for 2 weeks from now, so further feedback will have
>> to come
>> from others in that time...
>> 
> Do we have any additional feedback from others on the same matter?
> 
> Thank
> 
> Jay Patel
>> > Thanks!
>> > --
>> > Hyeonggon
> 
> 



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v4] mm/slub: Optimize slub memory usage
  2023-09-07 13:42       ` Vlastimil Babka
@ 2023-09-14  5:40         ` Jay Patel
  2023-09-14  6:38           ` Vlastimil Babka
  0 siblings, 1 reply; 11+ messages in thread
From: Jay Patel @ 2023-09-14  5:40 UTC (permalink / raw)
  To: Vlastimil Babka, Hyeonggon Yoo
  Cc: linux-mm, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	aneesh.kumar, tsahu, piyushs

On Thu, 2023-09-07 at 15:42 +0200, Vlastimil Babka wrote:
> On 8/24/23 12:52, Jay Patel wrote:
> > On Fri, 2023-08-11 at 17:43 +0200, Vlastimil Babka wrote:
> > > On 8/10/23 19:54, Hyeonggon Yoo wrote:
> > > > >                         order = calc_slab_order(size,
> > > > > min_objects,
> > > > >                                         slub_max_order,
> > > > > fraction);
> > > > > @@ -4159,14 +4164,6 @@ static inline int
> > > > > calculate_order(unsigned
> > > > > int size)
> > > > >                 min_objects--;
> > > > >         }
> > > > > -       /*
> > > > > -        * We were unable to place multiple objects in a
> > > > > slab.
> > > > > Now
> > > > > -        * lets see if we can place a single object there.
> > > > > -        */
> > > > > -       order = calc_slab_order(size, 1, slub_max_order, 1);
> > > > > -       if (order <= slub_max_order)
> > > > > -               return order;
> > > > 
> > > > I'm not sure if it's okay to remove this?
> > > > It was fine in v2 because the least wasteful order was chosen
> > > > regardless of fraction but that's not true anymore.
> > > > 
> > > > Otherwise, everything looks fine to me. I'm too dumb to
> > > > anticipate
> > > > the outcome of increasing the slab order :P but this patch does
> > > > not
> > > > sound crazy to me.
> > > 
> > > I wanted to have a better idea how the orders change so I hacked
> > > up a
> > > patch
> > > to print them for all sizes up to 1MB (unnecessarily large I
> > > guess)
> > > and also
> > > for various page sizes and nr_cpus (that's however rather
> > > invasive
> > > and prone
> > > to me missing some helper being used that still relies on real
> > > PAGE_SHIFT),
> > > then I applied v4 (needed some conflict fixups with my hack) on
> > > top:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slab-orders
> > > 
> > > As expected, things didn't change with 4k PAGE_SIZE. With 64k
> > > PAGE_SIZE, I
> > > thought the patch in v4 form would result in lower orders, but
> > > seems
> > > not always?
> > > 
> > > I.e. I can see before the patch:
> > > 
> > >  Calculated slab orders for page_shift 16 nr_cpus 1:
> > >           8       0
> > >        4376       1
> > > 
> > > (so until 4368 bytes it keeps order at 0)
> > > 
> > > And after:
> > >           8       0
> > >        2264       1
> > >        2272       0
> > >        2344       1
> > >        2352       0
> > >        2432       1
> > > 
> > > Not sure this kind of "oscillation" is helpful with a small
> > > machine
> > > (1CPU),
> > > and 64kB pages so the unused part of page is quite small.
> > > 
> > Hi Vlastimil,
> >  
> > With patch. it will cause the fraction_size to rise to 32
> > when utilizing a 64k page size. As a result, the maximum wastage
> > cap
> > for each slab cache will be 2k (64k divided by 32). Any object size
> > exceeding this cap will be moved to order 1 or beyond due to which
> > this
> > oscillation is seen.
> 
> Hi, sorry for the late reply.
> 
> > > With 16 cpus, AFAICS the orders are also larger for some sizes.
> > > Hm but you reported reduction of total slab memory which suggests
> > > lower
> > > orders were selected somewhere, so maybe I did some mistake.
> > 
> > AFAIK total slab memory is reduce because of two reason (with this
> > patch for larger page size) 
> > 1) order for some slab cache is reduce (by increasing
> > fraction_size)
> 
> How can increased fraction_size ever result in a lower order? I think
> it can
> only result in increased order (or same order). And the simulations
> with my
> hack patch don't seem to counter example that. Note previously I did
> expect
> the order to be lower (or same) and was surprised by my results, but
> now I
> realized I misunderstood the v4 patch.

Hi, sorry for the late reply as I was on vacation :)

You're absolutely right. Increasing the fraction size won't reduce the
order, and I apologize for any confusion in my previous response.
> 
> > 2) Have also seen reduction in overall slab cache numbers as
> > because of
> > increasing page order
> 
> I think your results might be just due to randomness and could turn
> out
> different with repeating the test, or converge to be the same if you
> average
> multiple runs. You posted them for "160 CPUs with 64K Page size" and
> if I
> add that combination to my hack print, I see the same result before
> and
> after your patch:
> 
> Calculated slab orders for page_shift 16 nr_cpus 160:
>          8       0
>       1824       1
>       3648       2
>       7288       3
>     174768       2
>     196608       3
>     524296       4
> 
> Still, I might have a bug there. Can you confirm there are actual
> differences with a /proc/slabinfo before/after your patch? If there
> are
> none, any differences observed have to be due to randomness, not
> differences
> in order.

Indeed, to eliminate randomness, I've consistently gathered data from
/proc/slabinfo, and I can confirm a decrease in the total number of
slab caches.

Values on a 160 CPU system with 64K page size:
Without patch: 24892 slab caches
With patch:    23891 slab caches
> 
> Going back to the idea behind your patch, I don't think it makes
> sense to
> try increase the fraction only for higher-orders. Yes, with 1/16
> fraction,
> the waste with 64kB page can be 4kB, while with 1/32 it will be just
> 2kB,
> and with 4kB this is only 256 vs 128bytes. However the object sizes
> and
> counts don't differ with page size, so with 4kB pages we'll have more
> slabs
> to host the same number of objects, and the waste will accumulate
> accordingly - i.e. the fraction metric should be independent of page
> size
> wrt resulting total kilobytes of waste.
> 
> So maybe the only thing we need to do is to try setting it to 32
> initial
> value instead of 16 regardless of page size. That should hopefully
> again
> show a good tradeoff for 4kB as one of the earlier versions, while on
> 64kB
> it shouldn't cause much difference (again, none at all with 160 cpus,
> some
> difference with less than 128 cpus, if my simulations were correct).
> 
Yes, we can modify the default fraction size to 32 for all page sizes.
I've noticed that on a 160 CPU system with a 64K page size there's a
noticeable change in the total memory allocated for slabs: it decreases.

Alright, I'll make the necessary changes to the patch, setting the
fraction size default to 32, and I'll post v5 along with some
performance metrics.
>  
> > > Anyway my point here is that this evaluation approach might be
> > > useful, even
> > > if it's a non-upstreamable hack, and some postprocessing of the
> > > output is
> > > needed for easier comparison of before/after, so feel free to try
> > > that out.
> > 
> > Thank you for this details test :) 
> > > BTW I'll be away for 2 weeks from now, so further feedback will
> > > have
> > > to come
> > > from others in that time...
> > > 
> > Do we have any additional feedback from others on the same matter?
> > 
> > Thank
> > 
> > Jay Patel
> > > > Thanks!
> > > > --
> > > > Hyeonggon



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v4] mm/slub: Optimize slub memory usage
  2023-09-14  5:40         ` Jay Patel
@ 2023-09-14  6:38           ` Vlastimil Babka
  2023-09-14 12:43             ` Jay Patel
  0 siblings, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2023-09-14  6:38 UTC (permalink / raw)
  To: jaypatel, Hyeonggon Yoo
  Cc: linux-mm, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	aneesh.kumar, tsahu, piyushs

On 9/14/23 07:40, Jay Patel wrote:
> On Thu, 2023-09-07 at 15:42 +0200, Vlastimil Babka wrote:
>> On 8/24/23 12:52, Jay Patel wrote:
>> How can increased fraction_size ever result in a lower order? I think
>> it can
>> only result in increased order (or same order). And the simulations
>> with my
>> hack patch don't seem to counter example that. Note previously I did
>> expect
>> the order to be lower (or same) and was surprised by my results, but
>> now I
>> realized I misunderstood the v4 patch.
> 
> Hi, Sorry for late reply as i was on vacation :) 
> 
> You're absolutely
> right. Increasing the fraction size won't reduce the order, and I
> apologize for any confusion in my previous response.

No problem, glad that it's cleared :)

>> 
>> > 2) Have also seen reduction in overall slab cache numbers as
>> > because of
>> > increasing page order
>> 
>> I think your results might be just due to randomness and could turn
>> out
>> different with repeating the test, or converge to be the same if you
>> average
>> multiple runs. You posted them for "160 CPUs with 64K Page size" and
>> if I
>> add that combination to my hack print, I see the same result before
>> and
>> after your patch:
>> 
>> Calculated slab orders for page_shift 16 nr_cpus 160:
>>          8       0
>>       1824       1
>>       3648       2
>>       7288       3
>>     174768       2
>>     196608       3
>>     524296       4
>> 
>> Still, I might have a bug there. Can you confirm there are actual
>> differences with a /proc/slabinfo before/after your patch? If there
>> are
>> none, any differences observed have to be due to randomness, not
>> differences
>> in order.
> 
> Indeed, to eliminate randomness, I've consistently gathered data from
> /proc/slabinfo, and I can confirm a decrease in the total number of
> slab caches. 
> 
> Values as on 160 cpu system with 64k page size 
> Without
> patch 24892 slab caches
> with patch    23891 slab caches  

I would like to see why exactly they decreased; given what the patch does,
it has to be due to getting higher-order slab pages. So the values in the
"<objperslab> <pagesperslab>" columns should increase for some caches -
which ones, and what is their <objsize>?
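
Something like the quick dumper below (assuming the usual slabinfo 2.1
column layout; it needs root to read the file) makes it easy to diff just
those columns between a boot without and with the patch:

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/slabinfo", "r");
	char line[512], name[64];
	unsigned long active, num;
	unsigned int objsize, objperslab, pagesperslab;

	if (!f) {
		perror("/proc/slabinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* skip the "slabinfo - version" and "# name ..." header lines */
		if (line[0] == '#' || strncmp(line, "slabinfo", 8) == 0)
			continue;
		if (sscanf(line, "%63s %lu %lu %u %u %u", name, &active, &num,
			   &objsize, &objperslab, &pagesperslab) == 6)
			printf("%-24s objsize %-8u objperslab %-6u pagesperslab %u\n",
			       name, objsize, objperslab, pagesperslab);
	}
	fclose(f);
	return 0;
}

Whatever caches show a larger <pagesperslab> with the patch should also
explain the change in the totals you measured.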

>> 
>> Going back to the idea behind your patch, I don't think it makes
>> sense to
>> try increase the fraction only for higher-orders. Yes, with 1/16
>> fraction,
>> the waste with 64kB page can be 4kB, while with 1/32 it will be just
>> 2kB,
>> and with 4kB this is only 256 vs 128bytes. However the object sizes
>> and
>> counts don't differ with page size, so with 4kB pages we'll have more
>> slabs
>> to host the same number of objects, and the waste will accumulate
>> accordingly - i.e. the fraction metric should be independent of page
>> size
>> wrt resulting total kilobytes of waste.
>> 
>> So maybe the only thing we need to do is to try setting it to 32
>> initial
>> value instead of 16 regardless of page size. That should hopefully
>> again
>> show a good tradeoff for 4kB as one of the earlier versions, while on
>> 64kB
>> it shouldn't cause much difference (again, none at all with 160 cpus,
>> some
>> difference with less than 128 cpus, if my simulations were correct).
>> 
> Yes, We can modify the default fraction size to 32 for all page sizes.
> I've noticed that on a 160 CPU system with a 64K page size, there's a
> noticeable change in the total memory allocated for slabs – it
> decreases.
> 
> Alright, I'll make the necessary changes to the patch, setting the
> fraction size default to 32, and I'll post v5 along with some
> performance metrics.

Could you please also check my cleanup series at

https://lore.kernel.org/all/20230908145302.30320-6-vbabka@suse.cz/

(I did Cc you there). If it makes sense, I'd like to apply the further
optimization on top of those cleanups, not the other way around.

Thanks!

>>  
>> > > Anyway my point here is that this evaluation approach might be
>> > > useful, even
>> > > if it's a non-upstreamable hack, and some postprocessing of the
>> > > output is
>> > > needed for easier comparison of before/after, so feel free to try
>> > > that out.
>> > 
>> > Thank you for this details test :) 
>> > > BTW I'll be away for 2 weeks from now, so further feedback will
>> > > have
>> > > to come
>> > > from others in that time...
>> > > 
>> > Do we have any additional feedback from others on the same matter?
>> > 
>> > Thank
>> > 
>> > Jay Patel
>> > > > Thanks!
>> > > > --
>> > > > Hyeonggon
> 



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v4] mm/slub: Optimize slub memory usage
  2023-09-14  6:38           ` Vlastimil Babka
@ 2023-09-14 12:43             ` Jay Patel
  0 siblings, 0 replies; 11+ messages in thread
From: Jay Patel @ 2023-09-14 12:43 UTC (permalink / raw)
  To: Vlastimil Babka, Hyeonggon Yoo
  Cc: linux-mm, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	aneesh.kumar, tsahu, piyushs

On Thu, 2023-09-14 at 08:38 +0200, Vlastimil Babka wrote:
> On 9/14/23 07:40, Jay Patel wrote:
> > On Thu, 2023-09-07 at 15:42 +0200, Vlastimil Babka wrote:
> > > On 8/24/23 12:52, Jay Patel wrote:
> > > How can increased fraction_size ever result in a lower order? I
> > > think
> > > it can
> > > only result in increased order (or same order). And the
> > > simulations
> > > with my
> > > hack patch don't seem to counter example that. Note previously I
> > > did
> > > expect
> > > the order to be lower (or same) and was surprised by my results,
> > > but
> > > now I
> > > realized I misunderstood the v4 patch.
> > 
> > Hi, Sorry for late reply as i was on vacation :) 
> > 
> > You're absolutely
> > right. Increasing the fraction size won't reduce the order, and I
> > apologize for any confusion in my previous response.
> 
> No problem, glad that it's cleared :)
> 
> > > > 2) Have also seen reduction in overall slab cache numbers as
> > > > because of
> > > > increasing page order
> > > 
> > > I think your results might be just due to randomness and could
> > > turn
> > > out
> > > different with repeating the test, or converge to be the same if
> > > you
> > > average
> > > multiple runs. You posted them for "160 CPUs with 64K Page size"
> > > and
> > > if I
> > > add that combination to my hack print, I see the same result
> > > before
> > > and
> > > after your patch:
> > > 
> > > Calculated slab orders for page_shift 16 nr_cpus 160:
> > >          8       0
> > >       1824       1
> > >       3648       2
> > >       7288       3
> > >     174768       2
> > >     196608       3
> > >     524296       4
> > > 
> > > Still, I might have a bug there. Can you confirm there are actual
> > > differences with a /proc/slabinfo before/after your patch? If
> > > there
> > > are
> > > none, any differences observed have to be due to randomness, not
> > > differences
> > > in order.
> > 
> > Indeed, to eliminate randomness, I've consistently gathered data
> > from
> > /proc/slabinfo, and I can confirm a decrease in the total number of
> > slab caches. 
> > 
> > Values as on 160 cpu system with 64k page size 
> > Without
> > patch 24892 slab caches
> > with patch    23891 slab caches  
> 
> I would like to see why exactly they decreased, given what the patch
> does it
> has to be due to getting a higher order slab pages. So the values of
> "<objperslab> <pagesperslab>" columns should increase for some caches
> -
> which ones and what is their <objsize>?

Yes, correct: an increase in the page order for a slab cache will result
in increased "<objperslab> <pagesperslab>" values.

I only checked the total number of slab caches, so let me check these
values in detail and will get back with the <objsize> :)


> 
> > > Going back to the idea behind your patch, I don't think it makes
> > > sense to
> > > try increase the fraction only for higher-orders. Yes, with 1/16
> > > fraction,
> > > the waste with 64kB page can be 4kB, while with 1/32 it will be
> > > just
> > > 2kB,
> > > and with 4kB this is only 256 vs 128bytes. However the object
> > > sizes
> > > and
> > > counts don't differ with page size, so with 4kB pages we'll have
> > > more
> > > slabs
> > > to host the same number of objects, and the waste will accumulate
> > > accordingly - i.e. the fraction metric should be independent of
> > > page
> > > size
> > > wrt resulting total kilobytes of waste.
> > > 
> > > So maybe the only thing we need to do is to try setting it to 32
> > > initial
> > > value instead of 16 regardless of page size. That should
> > > hopefully
> > > again
> > > show a good tradeoff for 4kB as one of the earlier versions,
> > > while on
> > > 64kB
> > > it shouldn't cause much difference (again, none at all with 160
> > > cpus,
> > > some
> > > difference with less than 128 cpus, if my simulations were
> > > correct).
> > > 
> > Yes, We can modify the default fraction size to 32 for all page
> > sizes.
> > I've noticed that on a 160 CPU system with a 64K page size, there's
> > a
> > noticeable change in the total memory allocated for slabs – it
> > decreases.
> > 
> > Alright, I'll make the necessary changes to the patch, setting the
> > fraction size default to 32, and I'll post v5 along with some
> > performance metrics.
> 
> Could you please also check my cleanup series at
> 
> https://lore.kernel.org/all/20230908145302.30320-6-vbabka@suse.cz/
> 
> (I did Cc you there). If it makes sense, I'd like to apply the
> further
> optimization on top of those cleanups, not the other way around.
> 
> Thanks!
> 
I've just gone through that patch series, and yes, we can adjust the
fraction size change on top of that series :)
> > >  
> > > > > Anyway my point here is that this evaluation approach might
> > > > > be
> > > > > useful, even
> > > > > if it's a non-upstreamable hack, and some postprocessing of
> > > > > the
> > > > > output is
> > > > > needed for easier comparison of before/after, so feel free to
> > > > > try
> > > > > that out.
> > > > 
> > > > Thank you for this details test :) 
> > > > > BTW I'll be away for 2 weeks from now, so further feedback
> > > > > will
> > > > > have
> > > > > to come
> > > > > from others in that time...
> > > > > 
> > > > Do we have any additional feedback from others on the same
> > > > matter?
> > > > 
> > > > Thank
> > > > 
> > > > Jay Patel
> > > > > > Thanks!
> > > > > > --
> > > > > > Hyeonggon



^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-09-14 12:43 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-20 10:23 [RFC PATCH v4] mm/slub: Optimize slub memory usage Jay Patel
2023-08-10 17:54 ` Hyeonggon Yoo
2023-08-11  6:52   ` Jay Patel
2023-08-18  5:11     ` Hyeonggon Yoo
2023-08-18  6:41       ` Jay Patel
2023-08-11 15:43   ` Vlastimil Babka
2023-08-24 10:52     ` Jay Patel
2023-09-07 13:42       ` Vlastimil Babka
2023-09-14  5:40         ` Jay Patel
2023-09-14  6:38           ` Vlastimil Babka
2023-09-14 12:43             ` Jay Patel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).