* [PATCH v2 0/1] slub: limit number of slabs to scan in count_partial()
@ 2024-04-17 18:59 Jianfeng Wang
  2024-04-17 18:59 ` [PATCH v2 1/1] " Jianfeng Wang
  0 siblings, 1 reply; 4+ messages in thread
From: Jianfeng Wang @ 2024-04-17 18:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, cl, vbabka, penberg, rientjes, iamjoonsoo.kim, junxiao.bi

This patch fixes a known issue in get_slabinfo(), which relies on
count_partial() to get the exact count of free objects on a
kmem_cache_node's partial list. For some slub caches, the per-node
partial list can be extremely long. The current version of
count_partial() traverses the whole partial list to get the exact
object count while holding the kmem_cache_node's spinlock. This can
take a long time, during which slab allocations are blocked and IRQs
are disabled. In production workloads, even the NMI watchdog can be
triggered because of this. Moreover, the exact count is of limited
use anyway: it can change right after the spinlock is released.

The proposed fix is to limit the number of slabs to scan and to
output an approximate object count for a long partial list. The v1
patch counted N slabs from the list's head and used that sample to
approximate the total object count of the list. As suggested by
Vlastimil, an alternative method, counting N/2 slabs from the head
and N/2 from the tail, produces a more accurate approximation after
the partial list has been sorted by kmem_cache_shrink(): sampling
both ends averages out the bias introduced when the slabs are
ordered by how full they are.

---
Changes since v1 [1]
 - Update the approximation method by counting from the list's head and tail
 - Cap the approximation by the total object count
 - Update the commit message to add benchmark results and explain the choice

[1] https://lore.kernel.org/linux-mm/20240411164023.99368-1-jianfeng.w.wang@oracle.com/

Thanks,
--Jianfeng

Jianfeng Wang (1):
  slub: limit number of slabs to scan in count_partial()

 mm/slub.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

-- 
2.42.1



* [PATCH v2 1/1] slub: limit number of slabs to scan in count_partial()
  2024-04-17 18:59 [PATCH v2 0/1] slub: limit number of slabs to scan in count_partial() Jianfeng Wang
@ 2024-04-17 18:59 ` Jianfeng Wang
  2024-04-18 10:01   ` Vlastimil Babka
  0 siblings, 1 reply; 4+ messages in thread
From: Jianfeng Wang @ 2024-04-17 18:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, cl, vbabka, penberg, rientjes, iamjoonsoo.kim, junxiao.bi

When reading "/proc/slabinfo", the kernel needs to report the number
of free objects for each kmem_cache. The current implementation uses
count_partial() to count the number of free objects by scanning each
kmem_cache_node's list of partial slabs and summing the free objects
of every partial slab in the list. This process must hold the
kmem_cache_node's spinlock with IRQs disabled and may take a long
time. Consequently, when the partial list is long, it can block slab
allocations on other CPU cores and cause timeouts for network devices
and so on. In production, even the NMI watchdog can be triggered
because of this: e.g., for "buffer_head", the number of partial slabs
was observed to be ~1M in one kmem_cache_node. This problem has also
been confirmed by several others [1-3].

Iterating over a partial list to get the exact object count can cause
soft lockups for a long list, with or without the lock held (e.g., if
preemption is disabled), and is not very useful either: the object
count can change right after the lock is released. The alternative of
maintaining free-object counters requires atomic operations on the
fast path [3].

So, the fix is to limit the number of slabs to scan in count_partial().
Suppose the limit is N. If the list's length is not greater than N,
output the exact count by traversing the whole list; if its length is
greater than N, then output an approximated count by traversing a
subset of the list. The proposed method is to scan N/2 slabs from the
list's head and the other N/2 slabs from the tail. For a partial list
with ~280K slabs, benchmarks show that this approach performs better
than just counting from the list's head, after slabs get sorted by
kmem_cache_shrink(). Default the limit to 10000, as it produces an
approximation within 1% of the exact count for both scenarios.

Benchmarks: Diff = (exact - approximated) / exact
* Normal case (w/o kmem_cache_shrink()):
| MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)|
| 1000        |  0.43  %              |  1.09  %              |
| 5000        |  0.06  %              |  0.37  %              |
| 10000       |  0.02  %              |  0.16  %              |
| 20000       |  0.009 %              | -0.003 %              |

* Skewed case (w/ kmem_cache_shrink()):
| MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)|
| 1000        |  12.46 %              |  6.75  %              |
| 5000        |  5.38  %              |  1.27  %              |
| 10000       |  4.99  %              |  0.22  %              |
| 20000       |  4.86  %              | -0.06  %              |
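
To make the scaling step concrete with made-up numbers (not taken
from the benchmarks above): with the limit at 10000 and a partial
list of 280000 slabs, count_partial() sums the free objects of 5000
slabs from the head and 5000 from the tail. If those 10000 slabs hold
63000 free objects, the estimate is

	63000 * 280000 / 10000 = 1764000

free objects, which is then capped at node_nr_objs(n), the node's
total object count.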

[1] https://lore.kernel.org/linux-mm/alpine.DEB.2.21.2003031602460.1537@www.lameter.com/T/
[2] https://lore.kernel.org/lkml/alpine.DEB.2.22.394.2008071258020.55871@www.lameter.com/T/
[3] https://lore.kernel.org/lkml/1e01092b-140d-2bab-aeba-321a74a194ee@linux.com/T/

Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com>
---
 mm/slub.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1bb2a93cf7b6..7e34f2f0ba85 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3213,6 +3213,8 @@ static inline bool free_debug_processing(struct kmem_cache *s,
 #endif /* CONFIG_SLUB_DEBUG */
 
 #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
+#define MAX_PARTIAL_TO_SCAN 10000
+
 static unsigned long count_partial(struct kmem_cache_node *n,
 					int (*get_count)(struct slab *))
 {
@@ -3221,8 +3223,30 @@ static unsigned long count_partial(struct kmem_cache_node *n,
 	struct slab *slab;
 
 	spin_lock_irqsave(&n->list_lock, flags);
-	list_for_each_entry(slab, &n->partial, slab_list)
-		x += get_count(slab);
+	if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
+		list_for_each_entry(slab, &n->partial, slab_list)
+			x += get_count(slab);
+	} else {
+		/*
+		 * For a long list, approximate the total count of objects in
+		 * it to meet the limit on the number of slabs to scan.
+		 * Scan from both the list's head and tail for better accuracy.
+		 */
+		unsigned long scanned = 0;
+
+		list_for_each_entry(slab, &n->partial, slab_list) {
+			x += get_count(slab);
+			if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
+				break;
+		}
+		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
+			x += get_count(slab);
+			if (++scanned == MAX_PARTIAL_TO_SCAN)
+				break;
+		}
+		x = mult_frac(x, n->nr_partial, scanned);
+		x = min(x, node_nr_objs(n));
+	}
 	spin_unlock_irqrestore(&n->list_lock, flags);
 	return x;
 }
-- 
2.42.1



* Re: [PATCH v2 1/1] slub: limit number of slabs to scan in count_partial()
  2024-04-17 18:59 ` [PATCH v2 1/1] " Jianfeng Wang
@ 2024-04-18 10:01   ` Vlastimil Babka
  2024-04-18 22:44     ` Jianfeng Wang
  0 siblings, 1 reply; 4+ messages in thread
From: Vlastimil Babka @ 2024-04-18 10:01 UTC (permalink / raw)
  To: Jianfeng Wang, linux-mm, linux-kernel
  Cc: akpm, cl, penberg, rientjes, iamjoonsoo.kim, junxiao.bi

On 4/17/24 20:59, Jianfeng Wang wrote:
> When reading "/proc/slabinfo", the kernel needs to report the number
> of free objects for each kmem_cache. The current implementation uses
> count_partial() to count the number of free objects by scanning each

Hi,

thanks. I wanted to apply this patch, but then I realized we use the
same function not only for slabinfo but also for sysfs and
slab_out_of_memory(), and it's not always counting free objects. When
somebody is debugging with sysfs, they may expect the exact counts
and pay the price if needed, but we probably don't want to make
slab_out_of_memory() slow and precise, so that one is more like the
slabinfo case.

So what I propose is to create a new variant of count_partial(),
called e.g. count_partial_free_approx(), which has no get_count
parameter but hardcodes what count_free() does.
Then use this new function only for slabinfo and slab_out_of_memory(),
leaving the other count_partial() users unchanged.
Another benefit is that we remove the overhead of the indirect
get_count() call, which may be nontrivial with the current CPU
vulnerability mitigations, so it's good to avoid it for slabinfo and
OOM reports.
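
Just to illustrate the shape of it, an untested sketch (it reuses the
head+tail scan and MAX_PARTIAL_TO_SCAN from the patch above and
assumes count_free() stays equivalent to slab->objects - slab->inuse;
the real v3 may differ):

static unsigned long count_partial_free_approx(struct kmem_cache_node *n)
{
	unsigned long flags;
	unsigned long x = 0;
	struct slab *slab;

	spin_lock_irqsave(&n->list_lock, flags);
	if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
		/* Short list: count every partial slab exactly. */
		list_for_each_entry(slab, &n->partial, slab_list)
			x += slab->objects - slab->inuse;
	} else {
		/*
		 * Long list: sample MAX_PARTIAL_TO_SCAN/2 slabs from each
		 * end, scale by the list length, and cap the estimate at
		 * the node's total object count.
		 */
		unsigned long scanned = 0;

		list_for_each_entry(slab, &n->partial, slab_list) {
			x += slab->objects - slab->inuse;
			if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
				break;
		}
		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
			x += slab->objects - slab->inuse;
			if (++scanned == MAX_PARTIAL_TO_SCAN)
				break;
		}
		x = mult_frac(x, n->nr_partial, scanned);
		x = min(x, node_nr_objs(n));
	}
	spin_unlock_irqrestore(&n->list_lock, flags);
	return x;
}

get_slabinfo() and slab_out_of_memory() could then call
count_partial_free_approx(n) instead of count_partial(n, count_free),
while the sysfs paths keep using count_partial() unchanged.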

Thanks!

> kmem_cache_node's list of partial slabs and summing the free objects
> of every partial slab in the list. This process must hold the
> kmem_cache_node's spinlock with IRQs disabled and may take a long
> time. Consequently, when the partial list is long, it can block slab
> allocations on other CPU cores and cause timeouts for network devices
> and so on. In production, even the NMI watchdog can be triggered
> because of this: e.g., for "buffer_head", the number of partial slabs
> was observed to be ~1M in one kmem_cache_node. This problem has also
> been confirmed by several others [1-3].
> 
> Iterating over a partial list to get the exact object count can cause
> soft lockups for a long list, with or without the lock held (e.g., if
> preemption is disabled), and is not very useful either: the object
> count can change right after the lock is released. The alternative of
> maintaining free-object counters requires atomic operations on the
> fast path [3].
> 
> So, the fix is to limit the number of slabs to scan in count_partial().
> Suppose the limit is N. If the list's length is not greater than N,
> output the exact count by traversing the whole list; if its length is
> greater than N, then output an approximated count by traversing a
> subset of the list. The proposed method is to scan N/2 slabs from the
> list's head and the other N/2 slabs from the tail. For a partial list
> with ~280K slabs, benchmarks show that this approach performs better
> than just counting from the list's head, after slabs get sorted by
> kmem_cache_shrink(). Default the limit to 10000, as it produces an
> approximation within 1% of the exact count for both scenarios.
> 
> Benchmarks: Diff = (exact - approximated) / exact
> * Normal case (w/o kmem_cache_shrink()):
> | MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)|
> | 1000        |  0.43  %              |  1.09  %              |
> | 5000        |  0.06  %              |  0.37  %              |
> | 10000       |  0.02  %              |  0.16  %              |
> | 20000       |  0.009 %              | -0.003 %              |
> 
> * Skewed case (w/ kmem_cache_shrink()):
> | MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)|
> | 1000        |  12.46 %              |  6.75  %              |
> | 5000        |  5.38  %              |  1.27  %              |
> | 10000       |  4.99  %              |  0.22  %              |
> | 20000       |  4.86  %              | -0.06  %              |
> 
> [1] https://lore.kernel.org/linux-mm/alpine.DEB.2.21.2003031602460.1537@www.lameter.com/T/
> [2] https://lore.kernel.org/lkml/alpine.DEB.2.22.394.2008071258020.55871@www.lameter.com/T/
> [3] https://lore.kernel.org/lkml/1e01092b-140d-2bab-aeba-321a74a194ee@linux.com/T/
> 
> Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com>
> ---
>  mm/slub.c | 28 ++++++++++++++++++++++++++--
>  1 file changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 1bb2a93cf7b6..7e34f2f0ba85 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3213,6 +3213,8 @@ static inline bool free_debug_processing(struct kmem_cache *s,
>  #endif /* CONFIG_SLUB_DEBUG */
>  
>  #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
> +#define MAX_PARTIAL_TO_SCAN 10000
> +
>  static unsigned long count_partial(struct kmem_cache_node *n,
>  					int (*get_count)(struct slab *))
>  {
> @@ -3221,8 +3223,30 @@ static unsigned long count_partial(struct kmem_cache_node *n,
>  	struct slab *slab;
>  
>  	spin_lock_irqsave(&n->list_lock, flags);
> -	list_for_each_entry(slab, &n->partial, slab_list)
> -		x += get_count(slab);
> +	if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
> +		list_for_each_entry(slab, &n->partial, slab_list)
> +			x += get_count(slab);
> +	} else {
> +		/*
> +		 * For a long list, approximate the total count of objects in
> +		 * it to meet the limit on the number of slabs to scan.
> +		 * Scan from both the list's head and tail for better accuracy.
> +		 */
> +		unsigned long scanned = 0;
> +
> +		list_for_each_entry(slab, &n->partial, slab_list) {
> +			x += get_count(slab);
> +			if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
> +				break;
> +		}
> +		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
> +			x += get_count(slab);
> +			if (++scanned == MAX_PARTIAL_TO_SCAN)
> +				break;
> +		}
> +		x = mult_frac(x, n->nr_partial, scanned);
> +		x = min(x, node_nr_objs(n));
> +	}
>  	spin_unlock_irqrestore(&n->list_lock, flags);
>  	return x;
>  }


* Re: [PATCH v2 1/1] slub: limit number of slabs to scan in count_partial()
  2024-04-18 10:01   ` Vlastimil Babka
@ 2024-04-18 22:44     ` Jianfeng Wang
  0 siblings, 0 replies; 4+ messages in thread
From: Jianfeng Wang @ 2024-04-18 22:44 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm, linux-kernel
  Cc: akpm, cl, penberg, rientjes, iamjoonsoo.kim, junxiao.bi


On 4/18/24 3:01 AM, Vlastimil Babka wrote:
> On 4/17/24 20:59, Jianfeng Wang wrote:
>> When reading "/proc/slabinfo", the kernel needs to report the number
>> of free objects for each kmem_cache. The current implementation uses
>> count_partial() to count the number of free objects by scanning each
> 
> Hi,
> 
> thanks. I wanted to apply this patch, but then I realized we use the
> same function not only for slabinfo but also for sysfs and
> slab_out_of_memory(), and it's not always counting free objects. When
> somebody is debugging with sysfs, they may expect the exact counts
> and pay the price if needed, but we probably don't want to make
> slab_out_of_memory() slow and precise, so that one is more like the
> slabinfo case.
> 
> So what I propose is to create a new variant of count_partial(),
> called e.g. count_partial_free_approx(), which has no get_count
> parameter but hardcodes what count_free() does.
> Then use this new function only for slabinfo and slab_out_of_memory(),
> leaving the other count_partial() users unchanged.
> Another benefit is that we remove the overhead of the indirect
> get_count() call, which may be nontrivial with the current CPU
> vulnerability mitigations, so it's good to avoid it for slabinfo and
> OOM reports.
> 
> Thanks!
> 

Thank you both for review.

I assume that in most cases people will stop at /proc/slabinfo, and
that they must have a good reason if they dig into the exact object
count via sysfs. At that point, it is better that the kernel can
offer it. So, it sounds good to me to offer both capabilities, i.e.,
a fast approximation and an exact count.

I will send a v3.

>> kmem_cache_node's list of partial slabs and summing the free objects
>> of every partial slab in the list. This process must hold the
>> kmem_cache_node's spinlock with IRQs disabled and may take a long
>> time. Consequently, when the partial list is long, it can block slab
>> allocations on other CPU cores and cause timeouts for network devices
>> and so on. In production, even the NMI watchdog can be triggered
>> because of this: e.g., for "buffer_head", the number of partial slabs
>> was observed to be ~1M in one kmem_cache_node. This problem has also
>> been confirmed by several others [1-3].
>>
>> Iterating over a partial list to get the exact object count can cause
>> soft lockups for a long list, with or without the lock held (e.g., if
>> preemption is disabled), and is not very useful either: the object
>> count can change right after the lock is released. The alternative of
>> maintaining free-object counters requires atomic operations on the
>> fast path [3].
>>
>> So, the fix is to limit the number of slabs to scan in count_partial().
>> Suppose the limit is N. If the list's length is not greater than N,
>> output the exact count by traversing the whole list; if its length is
>> greater than N, then output an approximated count by traversing a
>> subset of the list. The proposed method is to scan N/2 slabs from the
>> list's head and the other N/2 slabs from the tail. For a partial list
>> with ~280K slabs, benchmarks show that this approach performs better
>> than just counting from the list's head, after slabs get sorted by
>> kmem_cache_shrink(). Default the limit to 10000, as it produces an
>> approximation within 1% of the exact count for both scenarios.
>>
>> Benchmarks: Diff = (exact - approximated) / exact
>> * Normal case (w/o kmem_cache_shrink()):
>> | MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)|
>> | 1000        |  0.43  %              |  1.09  %              |
>> | 5000        |  0.06  %              |  0.37  %              |
>> | 10000       |  0.02  %              |  0.16  %              |
>> | 20000       |  0.009 %              | -0.003 %              |
>>
>> * Skewed case (w/ kmem_cache_shrink()):
>> | MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)|
>> | 1000        |  12.46 %              |  6.75  %              |
>> | 5000        |  5.38  %              |  1.27  %              |
>> | 10000       |  4.99  %              |  0.22  %              |
>> | 20000       |  4.86  %              | -0.06  %              |
>>
>> [1] https://lore.kernel.org/linux-mm/alpine.DEB.2.21.2003031602460.1537@www.lameter.com/T/
>> [2] https://lore.kernel.org/lkml/alpine.DEB.2.22.394.2008071258020.55871@www.lameter.com/T/
>> [3] https://lore.kernel.org/lkml/1e01092b-140d-2bab-aeba-321a74a194ee@linux.com/T/
>>
>> Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com>
>> ---
>>  mm/slub.c | 28 ++++++++++++++++++++++++++--
>>  1 file changed, 26 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 1bb2a93cf7b6..7e34f2f0ba85 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3213,6 +3213,8 @@ static inline bool free_debug_processing(struct kmem_cache *s,
>>  #endif /* CONFIG_SLUB_DEBUG */
>>  
>>  #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
>> +#define MAX_PARTIAL_TO_SCAN 10000
>> +
>>  static unsigned long count_partial(struct kmem_cache_node *n,
>>  					int (*get_count)(struct slab *))
>>  {
>> @@ -3221,8 +3223,30 @@ static unsigned long count_partial(struct kmem_cache_node *n,
>>  	struct slab *slab;
>>  
>>  	spin_lock_irqsave(&n->list_lock, flags);
>> -	list_for_each_entry(slab, &n->partial, slab_list)
>> -		x += get_count(slab);
>> +	if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
>> +		list_for_each_entry(slab, &n->partial, slab_list)
>> +			x += get_count(slab);
>> +	} else {
>> +		/*
>> +		 * For a long list, approximate the total count of objects in
>> +		 * it to meet the limit on the number of slabs to scan.
>> +		 * Scan from both the list's head and tail for better accuracy.
>> +		 */
>> +		unsigned long scanned = 0;
>> +
>> +		list_for_each_entry(slab, &n->partial, slab_list) {
>> +			x += get_count(slab);
>> +			if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
>> +				break;
>> +		}
>> +		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
>> +			x += get_count(slab);
>> +			if (++scanned == MAX_PARTIAL_TO_SCAN)
>> +				break;
>> +		}
>> +		x = mult_frac(x, n->nr_partial, scanned);
>> +		x = min(x, node_nr_objs(n));
>> +	}
>>  	spin_unlock_irqrestore(&n->list_lock, flags);
>>  	return x;
>>  }

