linux-kernel.vger.kernel.org archive mirror
* [PATCH] slub: limit count of partial slabs scanned to gather statistics
@ 2020-05-04 16:07 Konstantin Khlebnikov
  2020-05-04 19:56 ` Andrew Morton
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-04 16:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm, Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim

To get an exact count of free and used objects, SLUB has to scan the list
of partial slabs. This may take a long time. Scanning holds the spinlock and
blocks allocations which move partial slabs to per-cpu lists and back.

Example found in the wild:

# cat /sys/kernel/slab/dentry/partial
14478538 N0=7329569 N1=7148969
# time cat /sys/kernel/slab/dentry/objects
286225471 N0=136967768 N1=149257703

real	0m1.722s
user	0m0.001s
sys	0m1.721s

The same problem in slab was addressed in commit f728b0a5d72a ("mm, slab:
faster active and free stats") by adding more kmem cache statistics.
For SLUB, the same approach would require an atomic op on the fast path
whenever an object is freed.

Let's simply limit the count of scanned slabs and print a warning.
The limit is set in /sys/module/slub/parameters/max_partial_to_count.
The default is 10000, which should be enough for most sane cases.

Return a linear approximation if the list of partials is longer than the
limit. Nobody should notice the difference.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 mm/slub.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 9bf44955c4f1..86a366f7acb6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2407,16 +2407,29 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 #endif /* CONFIG_SLUB_DEBUG */
 
 #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
+
+static unsigned long max_partial_to_count __read_mostly = 10000;
+module_param(max_partial_to_count, ulong, 0644);
+
 static unsigned long count_partial(struct kmem_cache_node *n,
 					int (*get_count)(struct page *))
 {
+	unsigned long counted = 0;
 	unsigned long flags;
 	unsigned long x = 0;
 	struct page *page;
 
 	spin_lock_irqsave(&n->list_lock, flags);
-	list_for_each_entry(page, &n->partial, slab_list)
+	list_for_each_entry(page, &n->partial, slab_list) {
 		x += get_count(page);
+
+		if (++counted > max_partial_to_count) {
+			pr_warn_once("SLUB: too much partial slabs to count all objects, increase max_partial_to_count.\n");
+			/* Approximate total count of objects */
+			x = mult_frac(x, n->nr_partial, counted);
+			break;
+		}
+	}
 	spin_unlock_irqrestore(&n->list_lock, flags);
 	return x;
 }
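
The capped scan and its linear approximation can be sketched in userspace C.
This is only an illustration under stated assumptions, not the kernel code:
scale_count() is a hypothetical stand-in that mirrors the quotient/remainder
form of the kernel's mult_frac(), and the partial list is modelled as a plain
array of per-slab counts. Note that, as in the patch, the loop visits
limit + 1 slabs before it stops.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for mult_frac(x, numer, denom): computes
 * x * numer / denom via quotient and remainder so that the
 * intermediate product is less likely to overflow.
 */
static unsigned long scale_count(unsigned long x, unsigned long numer,
				 unsigned long denom)
{
	unsigned long quot = x / denom;
	unsigned long rem = x % denom;

	return quot * numer + rem * numer / denom;
}

/*
 * Sum per-slab counts, but give up once more than `limit` slabs have
 * been visited and extrapolate the partial sum to the whole list.
 */
static unsigned long count_capped(const unsigned int *per_slab,
				  size_t nr_partial, unsigned long limit)
{
	unsigned long counted = 0;
	unsigned long x = 0;
	size_t i;

	for (i = 0; i < nr_partial; i++) {
		x += per_slab[i];
		if (++counted > limit) {
			/* Approximate the total from the scanned prefix. */
			x = scale_count(x, nr_partial, counted);
			break;
		}
	}
	return x;
}
```

For per-slab counts {3, 5, 2, 4, 6, 1}, a limit of 10 returns the exact sum 21,
while a limit of 2 scans three slabs (sum 10) and extrapolates to 20.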



* Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
  2020-05-04 16:07 [PATCH] slub: limit count of partial slabs scanned to gather statistics Konstantin Khlebnikov
@ 2020-05-04 19:56 ` Andrew Morton
  2020-05-05  5:46   ` Konstantin Khlebnikov
  2020-05-08  3:18   ` Christopher Lameter
  2020-05-04 21:19 ` David Rientjes
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2020-05-04 19:56 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-kernel, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim

On Mon, 04 May 2020 19:07:39 +0300 Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote:

> To get exact count of free and used objects slub have to scan list of
> partial slabs. This may take at long time. Scanning holds spinlock and
> blocks allocations which move partial slabs to per-cpu lists and back.
> 
> Example found in the wild:
> 
> # cat /sys/kernel/slab/dentry/partial
> 14478538 N0=7329569 N1=7148969
> # time cat /sys/kernel/slab/dentry/objects
> 286225471 N0=136967768 N1=149257703
> 
> real	0m1.722s
> user	0m0.001s
> sys	0m1.721s

I assume this could trigger the softlockup detector or even NMI
watchdog in some situations?

> The same problem in slab was addressed in commit f728b0a5d72a ("mm, slab:
> faster active and free stats") by adding more kmem cache statistics.
> For slub same approach requires atomic op on fast path when object frees.
> 
> Let's simply limit count of scanned slabs and print warning.
> Limit set in /sys/module/slub/parameters/max_partial_to_count.
> Default is 10000 which should be enough for most sane cases.
> 
> Return linear approximation if list of partials is longer than limit.
> Nobody should notice difference.

That's a pretty sad "solution" :(

But I guess it's better than nothing at all, unless there are
alternative ideas?



* Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
  2020-05-04 16:07 [PATCH] slub: limit count of partial slabs scanned to gather statistics Konstantin Khlebnikov
  2020-05-04 19:56 ` Andrew Morton
@ 2020-05-04 21:19 ` David Rientjes
  2020-05-05  6:20   ` Konstantin Khlebnikov
  2020-05-06 11:56 ` Vlastimil Babka
  2020-05-06 19:06 ` Qian Cai
  3 siblings, 1 reply; 14+ messages in thread
From: David Rientjes @ 2020-05-04 21:19 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-kernel, linux-mm, Andrew Morton, Christoph Lameter,
	Pekka Enberg, Joonsoo Kim

On Mon, 4 May 2020, Konstantin Khlebnikov wrote:

> To get exact count of free and used objects slub have to scan list of
> partial slabs. This may take at long time. Scanning holds spinlock and
> blocks allocations which move partial slabs to per-cpu lists and back.
> 
> Example found in the wild:
> 
> # cat /sys/kernel/slab/dentry/partial
> 14478538 N0=7329569 N1=7148969
> # time cat /sys/kernel/slab/dentry/objects
> 286225471 N0=136967768 N1=149257703
> 
> real	0m1.722s
> user	0m0.001s
> sys	0m1.721s
> 
> The same problem in slab was addressed in commit f728b0a5d72a ("mm, slab:
> faster active and free stats") by adding more kmem cache statistics.
> For slub same approach requires atomic op on fast path when object frees.
> 
> Let's simply limit count of scanned slabs and print warning.
> Limit set in /sys/module/slub/parameters/max_partial_to_count.
> Default is 10000 which should be enough for most sane cases.
> 
> Return linear approximation if list of partials is longer than limit.
> Nobody should notice difference.
> 

Hi Konstantin,

Do you only exhibit this on slub for SO_ALL|SO_OBJECTS?  I notice the 
timing in the changelog is only looking at "objects" and not "partial".

If so, it seems this is also a problem for get_slabinfo() since it also 
uses the count_free() callback for count_partial().

Concern would be that the kernel has now drastically changed a statistic 
that it exports to userspace.  There was some discussion about this back 
in 2016[*] and one idea was that slabinfo would truncate its scanning and 
append a '+' to the end of the value to indicate it exceeds the max, i.e. 
10000+.  I think that '+' actually caused the problem itself for userspace 
processes.

I think the patch is too far reaching, however, since it impacts all 
count_partial() counting and not only for the case cited in the changelog.  
Are there examples for things other than the count_free() callback?

 [*] https://lore.kernel.org/patchwork/patch/708427/



* Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
  2020-05-04 19:56 ` Andrew Morton
@ 2020-05-05  5:46   ` Konstantin Khlebnikov
  2020-05-08  3:18   ` Christopher Lameter
  1 sibling, 0 replies; 14+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-05  5:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim

On 04/05/2020 22.56, Andrew Morton wrote:
> On Mon, 04 May 2020 19:07:39 +0300 Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote:
> 
>> To get exact count of free and used objects slub have to scan list of
>> partial slabs. This may take at long time. Scanning holds spinlock and
>> blocks allocations which move partial slabs to per-cpu lists and back.
>>
>> Example found in the wild:
>>
>> # cat /sys/kernel/slab/dentry/partial
>> 14478538 N0=7329569 N1=7148969
>> # time cat /sys/kernel/slab/dentry/objects
>> 286225471 N0=136967768 N1=149257703
>>
>> real	0m1.722s
>> user	0m0.001s
>> sys	0m1.721s
> 
> I assume this could trigger the softlockup detector or even NMI
> watchdog in some situations?

Yes, IRQs are disabled here. But the loop itself is pretty fast.
It would take terabytes of RAM to reach the common watchdog thresholds.
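
A rough back-of-the-envelope check of that estimate, using the figures from
the changelog (14478538 partial slabs, 1.721 s of system time). The sketch
assumes the read walks each list exactly once and averages the cost over both
nodes. The 10 s threshold and the 4 KiB slab size used in the note below are
illustrative assumptions, not values taken from this thread.

```c
/* Average time per partial slab, in nanoseconds, for one full scan. */
static double ns_per_slab(double scan_seconds, unsigned long nr_slabs)
{
	return scan_seconds * 1e9 / (double)nr_slabs;
}

/*
 * Number of slabs one uninterrupted scan could cover before a watchdog
 * with the given threshold (in seconds) would fire.
 */
static double slabs_until_threshold(double per_slab_ns,
				    double threshold_seconds)
{
	return threshold_seconds * 1e9 / per_slab_ns;
}

/* Memory occupied by that many slabs, in GiB, for an assumed slab size. */
static double slabs_to_gib(double nr_slabs, unsigned long slab_bytes)
{
	return nr_slabs * (double)slab_bytes / (1024.0 * 1024.0 * 1024.0);
}
```

With the changelog numbers, ns_per_slab(1.721, 14478538) is about 119 ns. At
that rate, an assumed 10 s threshold corresponds to roughly 84 million partial
slabs on a single node, or about 320 GiB of partial slabs for one cache if each
slab were a single 4 KiB page.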

> 
>> The same problem in slab was addressed in commit f728b0a5d72a ("mm, slab:
>> faster active and free stats") by adding more kmem cache statistics.
>> For slub same approach requires atomic op on fast path when object frees.
>>
>> Let's simply limit count of scanned slabs and print warning.
>> Limit set in /sys/module/slub/parameters/max_partial_to_count.
>> Default is 10000 which should be enough for most sane cases.
>>
>> Return linear approximation if list of partials is longer than limit.
>> Nobody should notice difference.
> 
> That's a pretty sad "solution" :(
> 
> But I guess it's better than nothing at all, unless there are
> alternative ideas?

Running this loop to the end causes more problems than the information is worth.
Adding new percpu or atomic counters to the fast paths seems excessive even for debugging.

Actually, there is not much sense in accurate statistics for the count of objects
when there are millions of them.

Memory consumption here is defined by the count and size of slabs.



* Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
  2020-05-04 21:19 ` David Rientjes
@ 2020-05-05  6:20   ` Konstantin Khlebnikov
  0 siblings, 0 replies; 14+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-05  6:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-kernel, linux-mm, Andrew Morton, Christoph Lameter,
	Pekka Enberg, Joonsoo Kim

On 05/05/2020 00.19, David Rientjes wrote:
> On Mon, 4 May 2020, Konstantin Khlebnikov wrote:
> 
>> To get exact count of free and used objects slub have to scan list of
>> partial slabs. This may take at long time. Scanning holds spinlock and
>> blocks allocations which move partial slabs to per-cpu lists and back.
>>
>> Example found in the wild:
>>
>> # cat /sys/kernel/slab/dentry/partial
>> 14478538 N0=7329569 N1=7148969
>> # time cat /sys/kernel/slab/dentry/objects
>> 286225471 N0=136967768 N1=149257703
>>
>> real	0m1.722s
>> user	0m0.001s
>> sys	0m1.721s
>>
>> The same problem in slab was addressed in commit f728b0a5d72a ("mm, slab:
>> faster active and free stats") by adding more kmem cache statistics.
>> For slub same approach requires atomic op on fast path when object frees.
>>
>> Let's simply limit count of scanned slabs and print warning.
>> Limit set in /sys/module/slub/parameters/max_partial_to_count.
>> Default is 10000 which should be enough for most sane cases.
>>
>> Return linear approximation if list of partials is longer than limit.
>> Nobody should notice difference.
>>
> 
> Hi Konstantin,
> 
> Do you only exhibit this on slub for SO_ALL|SO_OBJECTS?  I notice the
> timing in the changelog is only looking at "objects" and not "partial".

"partial" is a count of partial slabs, which simply sums the per-NUMA-node counters.
Only "objects" and "objects_partial", which walk the list, are affected.

> 
> If so, it seems this is also a problem for get_slabinfo() since it also
> uses the count_free() callback for count_partial().

Yep, /proc/slabinfo is also affected.

Actually, it is more affected than sysfs: it walks all cgroups, while sysfs shows only the root.

> 
> Concern would be that the kernel has now drastically changed a statistic
> that it exports to userspace.  There was some discussion about this back
> in 2016[*] and one idea was that slabinfo would truncate its scanning and
> append a '+' to the end of the value to indicate it exceeds the max, i.e.
> 10000+.  I think that '+' actually caused the problem itself for userspace
> processes.

Yep, "+" will break everything for sure.
I thought about returning "-1" or INT_MAX without counting,
but the approximation gives an almost correct result without breaking anything.

Each partial slab has at least one used and one free object, thus the approximated
result will be somewhere between nr_partial_slabs and nr_partial_objects.
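
That bound can be written down directly. The sketch below is an illustration
only: struct count_bounds and estimate_bounds() are hypothetical names, and it
assumes every slab in the cache holds the same number of objects
(objs_per_slab, at least 2). Since each scanned slab contributes between 1 and
objs_per_slab - 1 to the sum, the extrapolated value stays within the same
per-slab range multiplied by nr_partial.

```c
/* Inclusive range that the extrapolated count must fall into. */
struct count_bounds {
	unsigned long lo;
	unsigned long hi;
};

/*
 * Bounds on the approximated free (or used) object count for a node
 * with nr_partial partial slabs of objs_per_slab objects each.
 * A partial slab has at least one free and at least one used object,
 * so each slab contributes between 1 and objs_per_slab - 1.
 */
static struct count_bounds estimate_bounds(unsigned long nr_partial,
					   unsigned int objs_per_slab)
{
	struct count_bounds b;

	b.lo = nr_partial;
	b.hi = nr_partial * (unsigned long)(objs_per_slab - 1);
	return b;
}
```

For example, with 1000 partial slabs of 21 objects each, the approximation
cannot fall below 1000 or exceed 20000.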

> 
> I think the patch is too far reaching, however, since it impacts all
> count_partial() counting and not only for the case cited in the changelog.
> Are there examples for things other than the count_free() callback?

Nope, these are just statistics for used/free objects.
The total counts of objects and slabs are computed precisely.

> 
>   [*] https://lore.kernel.org/patchwork/patch/708427/


* Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
  2020-05-04 16:07 [PATCH] slub: limit count of partial slabs scanned to gather statistics Konstantin Khlebnikov
  2020-05-04 19:56 ` Andrew Morton
  2020-05-04 21:19 ` David Rientjes
@ 2020-05-06 11:56 ` Vlastimil Babka
  2020-05-07  5:25   ` Konstantin Khlebnikov
  2020-05-06 19:06 ` Qian Cai
  3 siblings, 1 reply; 14+ messages in thread
From: Vlastimil Babka @ 2020-05-06 11:56 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-kernel, linux-mm, Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Roman Gushchin, Wen Yang

On 5/4/20 6:07 PM, Konstantin Khlebnikov wrote:
> To get exact count of free and used objects slub have to scan list of
> partial slabs. This may take at long time. Scanning holds spinlock and
> blocks allocations which move partial slabs to per-cpu lists and back.
> 
> Example found in the wild:
> 
> # cat /sys/kernel/slab/dentry/partial
> 14478538 N0=7329569 N1=7148969
> # time cat /sys/kernel/slab/dentry/objects
> 286225471 N0=136967768 N1=149257703
> 
> real	0m1.722s
> user	0m0.001s
> sys	0m1.721s
> 
> The same problem in slab was addressed in commit f728b0a5d72a ("mm, slab:
> faster active and free stats") by adding more kmem cache statistics.
> For slub same approach requires atomic op on fast path when object frees.

In general yeah, but are you sure about this one? AFAICS this is about pages in
the n->partial list, where manipulations happen under n->list_lock and shouldn't
be fast path. It should be feasible to add a counter under the same lock, so it
wouldn't even need to be atomic?

> Let's simply limit count of scanned slabs and print warning.
> Limit set in /sys/module/slub/parameters/max_partial_to_count.
> Default is 10000 which should be enough for most sane cases.
> 
> Return linear approximation if list of partials is longer than limit.
> Nobody should notice difference.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

BTW there was a different patch in that area proposed recently [1] for slabinfo.
Christopher argued that we can do that for slabinfo but leave /sys stats
precise. Guess not then?

[1]
https://lore.kernel.org/linux-mm/20200222092428.99488-1-wenyang@linux.alibaba.com/




* Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
  2020-05-04 16:07 [PATCH] slub: limit count of partial slabs scanned to gather statistics Konstantin Khlebnikov
                   ` (2 preceding siblings ...)
  2020-05-06 11:56 ` Vlastimil Babka
@ 2020-05-06 19:06 ` Qian Cai
  2020-05-07  3:01   ` Qian Cai
  3 siblings, 1 reply; 14+ messages in thread
From: Qian Cai @ 2020-05-06 19:06 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: LKML, Linux-MM, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim



> On May 4, 2020, at 12:07 PM, Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote:
> 
> To get exact count of free and used objects slub have to scan list of
> partial slabs. This may take at long time. Scanning holds spinlock and
> blocks allocations which move partial slabs to per-cpu lists and back.
> 
> Example found in the wild:
> 
> # cat /sys/kernel/slab/dentry/partial
> 14478538 N0=7329569 N1=7148969
> # time cat /sys/kernel/slab/dentry/objects
> 286225471 N0=136967768 N1=149257703
> 
> real	0m1.722s
> user	0m0.001s
> sys	0m1.721s
> 
> The same problem in slab was addressed in commit f728b0a5d72a ("mm, slab:
> faster active and free stats") by adding more kmem cache statistics.
> For slub same approach requires atomic op on fast path when object frees.
> 
> Let's simply limit count of scanned slabs and print warning.
> Limit set in /sys/module/slub/parameters/max_partial_to_count.
> Default is 10000 which should be enough for most sane cases.
> 
> Return linear approximation if list of partials is longer than limit.
> Nobody should notice difference.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

This patch will trigger the warning under memory pressure, which then makes lockdep unhappy. Also, it is almost impossible to tell what value of max_partial_to_count is sufficient from a user's perspective.

[ 6371.600511] SLUB: too much partial slabs to count all objects, increase max_partial_to_count.
[ 6371.601399] irq event stamp: 8132599

[ 6371.611415] ======================================================
[ 6371.611417] WARNING: possible circular locking dependency detected
[ 6371.611419] 5.7.0-rc4-mm1+ #1 Not tainted
[ 6371.611421] ------------------------------------------------------
[ 6371.611423] oom02/43515 is trying to acquire lock:
[ 6371.611425] ffffffff893b8980 (console_owner){-.-.}-{0:0}, at: console_unlock+0x240/0x750

[ 6371.611433] but task is already holding lock:
[ 6371.611434] ffff8886456fcb98 (&n->list_lock){-.-.}-{2:2}, at: count_partial+0x29/0xe0

[ 6371.611441] which lock already depends on the new lock.


[ 6371.611445] the existing dependency chain (in reverse order) is:

[ 6371.611446] -> #3 (&n->list_lock){-.-.}-{2:2}:
[ 6371.611452]        _raw_spin_lock+0x2f/0x40
[ 6371.611453]        deactivate_slab+0x37a/0x690
[ 6371.611455]        ___slab_alloc+0x65d/0x810
[ 6371.611456]        __slab_alloc+0x43/0x70
[ 6371.611457]        __kmalloc+0x2b2/0x430
[ 6371.611459]        __tty_buffer_request_room+0x100/0x250
[ 6371.611460]        tty_insert_flip_string_fixed_flag+0x67/0x130
[ 6371.611462]        pty_write+0xa2/0xf0
[ 6371.611463]        n_tty_write+0x36b/0x7c0
[ 6371.611464]        tty_write+0x275/0x500
[ 6371.611466]        __vfs_write+0x50/0xa0
[ 6371.611467]        vfs_write+0x10b/0x290
[ 6371.611468]        redirected_tty_write+0x6a/0xc0
[ 6371.611470]        do_iter_write+0x253/0x2b0
[ 6371.611471]        vfs_writev+0x152/0x1f0
[ 6371.611472]        do_writev+0xda/0x180
[ 6371.611474]        __x64_sys_writev+0x45/0x50
[ 6371.611475]        do_syscall_64+0xcc/0xaf0
[ 6371.611477]        entry_SYSCALL_64_after_hwframe+0x49/0xb3

[ 6371.611478] -> #2 (&port->lock#2){-.-.}-{2:2}:
[ 6371.611484]        _raw_spin_lock_irqsave+0x3a/0x50
[ 6371.611486]        tty_port_tty_get+0x22/0xa0
[ 6371.611487]        tty_port_default_wakeup+0xf/0x30
[ 6371.611489]        tty_port_tty_wakeup+0x39/0x40
[ 6371.611490]        uart_write_wakeup+0x2a/0x40
[ 6371.611492]        serial8250_tx_chars+0x22e/0x410
[ 6371.611493]        serial8250_handle_irq.part.21+0x17c/0x180
[ 6371.611495]        serial8250_default_handle_irq+0x5c/0x90
[ 6371.611496]        serial8250_interrupt+0xa6/0x130
[ 6371.611498]        __handle_irq_event_percpu+0x81/0x550
[ 6371.611499]        handle_irq_event_percpu+0x70/0x100
[ 6371.611501]        handle_irq_event+0x5a/0x8b
[ 6371.611502]        handle_edge_irq+0x10c/0x370
[ 6371.611503]        do_IRQ+0x9e/0x1d0
[ 6371.611505]        ret_from_intr+0x0/0x37
[ 6371.611506]        cpuidle_enter_state+0x148/0x910
[ 6371.611507]        cpuidle_enter+0x41/0x70
[ 6371.611509]        do_idle+0x3cf/0x440
[ 6371.611510]        cpu_startup_entry+0x1d/0x1f
[ 6371.611511]        start_secondary+0x29a/0x340
[ 6371.611513]        secondary_startup_64+0xb6/0xc0

[ 6371.611516] -> #1 (&port->lock){-.-.}-{2:2}:
[ 6371.611522]        _raw_spin_lock_irqsave+0x3a/0x50
[ 6371.611525]        serial8250_console_write+0x113/0x560
[ 6371.611527]        univ8250_console_write+0x4b/0x60
[ 6371.611529]        console_unlock+0x4e3/0x750
[ 6371.611530]        vprintk_emit+0x10d/0x340
[ 6371.611532]        vprintk_default+0x1f/0x30
[ 6371.611533]        vprintk_func+0x44/0xd4
[ 6371.611535]        printk+0x9f/0xc5
[ 6371.611537]        register_console+0x262/0x3e0
[ 6371.611538]        univ8250_console_init+0x23/0x2d
[ 6371.611540]        console_init+0x268/0x395
[ 6371.611542]        start_kernel+0x6c3/0x8b9
[ 6371.611544]        x86_64_start_reservations+0x24/0x26
[ 6371.611546]        x86_64_start_kernel+0xf4/0xfb
[ 6371.611548]        secondary_startup_64+0xb6/0xc0

[ 6371.611551] -> #0 (console_owner){-.-.}-{0:0}:
[ 6371.611558]        __lock_acquire+0x21f8/0x3260
[ 6371.611560]        lock_acquire+0x1a2/0x680
[ 6371.611562]        console_unlock+0x2a2/0x750
[ 6371.611564]        vprintk_emit+0x10d/0x340
[ 6371.611566]        vprintk_default+0x1f/0x30
[ 6371.611568]        vprintk_func+0x44/0xd4
[ 6371.611569]        printk+0x9f/0xc5
[ 6371.611571]        count_partial.cold.50+0x4d/0x52
[ 6371.611573]        get_slabinfo+0x5c/0xb0
[ 6371.611575]        dump_unreclaimable_slab.cold.35+0x97/0xe2
[ 6371.611577]        dump_header+0x45a/0x510
[ 6371.611579]        oom_kill_process+0xd0/0x280
[ 6371.611581]        out_of_memory+0x478/0xa50
[ 6371.611583]        __alloc_pages_slowpath.constprop.61+0x1680/0x1850
[ 6371.611585]        __alloc_pages_nodemask+0x57c/0x6f0
[ 6371.611587]        alloc_pages_vma+0x81/0x310
[ 6371.611589]        do_anonymous_page+0x1bb/0x7a0
[ 6371.611591]        __handle_mm_fault+0xbb0/0xbe0
[ 6371.611593]        handle_mm_fault+0xdc/0x2e0
[ 6371.611595]        do_page_fault+0x2cb/0x9d7
[ 6371.611597]        page_fault+0x34/0x40

[ 6371.611600] other info that might help us debug this:

[ 6371.611603] Chain exists of:
[ 6371.611604]   console_owner --> &port->lock#2 --> &n->list_lock

[ 6371.611615]  Possible unsafe locking scenario:

[ 6371.611618]        CPU0                    CPU1
[ 6371.611619]        ----                    ----
[ 6371.611621]   lock(&n->list_lock);
[ 6371.611625]                                lock(&port->lock#2);
[ 6371.611630]                                lock(&n->list_lock);
[ 6371.611634]   lock(console_owner);

[ 6371.611639]  *** DEADLOCK ***

[ 6371.611641] 5 locks held by oom02/43515:
[ 6371.611642]  #0: ffff888ef72b4158 (&mm->mmap_sem#2){++++}-{3:3}, at: do_page_fault+0x1d6/0x9d7
[ 6371.611649]  #1: ffffffff894dd268 (oom_lock){+.+.}-{3:3}, at: __alloc_pages_slowpath.constprop.61+0x90a/0x1850
[ 6371.611656]  #2: ffffffff89520aa8 (slab_mutex){+.+.}-{3:3}, at: dump_unreclaimable_slab+0x2b/0x40
[ 6371.611661]  #3: ffff8886456fcb98 (&n->list_lock){-.-.}-{2:2}, at: count_partial+0x29/0xe0
[ 6371.611668]  #4: ffffffff893b8e60 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x100/0x340

[ 6371.611675] stack backtrace:
[ 6371.611676] CPU: 1 PID: 43515 Comm: oom02 Not tainted 5.7.0-rc4-mm1+ #1
[ 6371.611679] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[ 6371.611680] Call Trace:
[ 6371.611681]  dump_stack+0xa7/0xea
[ 6371.611682]  print_circular_bug.cold.54+0x147/0x14c
[ 6371.611684]  check_noncircular+0x295/0x2d0
[ 6371.611685]  ? print_circular_bug+0x1d0/0x1d0
[ 6371.611686]  ? __kasan_check_read+0x11/0x20
[ 6371.611688]  ? mark_lock+0x160/0xfe0
[ 6371.611689]  __lock_acquire+0x21f8/0x3260
[ 6371.611690]  ? register_lock_class+0xb90/0xb90
[ 6371.611691]  ? snprintf+0xc0/0xc0
[ 6371.611693]  ? __kasan_check_read+0x11/0x20
[ 6371.611694]  ? check_chain_key+0x1df/0x2e0
[ 6371.611695]  lock_acquire+0x1a2/0x680
[ 6371.611697]  ? console_unlock+0x240/0x750
[ 6371.611698]  ? lock_downgrade+0x3e0/0x3e0
[ 6371.611699]  ? check_flags.part.28+0x220/0x220
[ 6371.611701]  ? rwlock_bug.part.1+0x60/0x60
[ 6371.611702]  ? __kasan_check_read+0x11/0x20
[ 6371.611703]  console_unlock+0x2a2/0x750
[ 6371.611705]  ? console_unlock+0x240/0x750
[ 6371.611706]  vprintk_emit+0x10d/0x340
[ 6371.611707]  ? kernel_poison_pages.cold.3+0x86/0x86
[ 6371.611709]  vprintk_default+0x1f/0x30
[ 6371.611710]  vprintk_func+0x44/0xd4
[ 6371.611711]  ? do_raw_spin_lock+0x11e/0x1e0
[ 6371.611712]  printk+0x9f/0xc5
[ 6371.611714]  ? log_store.cold.31+0x11/0x11
[ 6371.611715]  ? count_partial+0x29/0xe0
[ 6371.611717]  ? do_raw_spin_lock+0x11e/0x1e0
[ 6371.611718]  count_partial.cold.50+0x4d/0x52
[ 6371.611719]  get_slabinfo+0x5c/0xb0
[ 6371.611721]  dump_unreclaimable_slab.cold.35+0x97/0xe2
[ 6371.611722]  ? show_mem+0x10b/0x11c
[ 6371.611723]  dump_header+0x45a/0x510
[ 6371.611724]  oom_kill_process+0xd0/0x280
[ 6371.611726]  out_of_memory+0x478/0xa50
[ 6371.611727]  ? oom_killer_disable+0x230/0x230
[ 6371.611728]  ? mutex_trylock+0x17a/0x190
[ 6371.611730]  __alloc_pages_slowpath.constprop.61+0x1680/0x1850
[ 6371.611731]  ? warn_alloc+0x120/0x120
[ 6371.611733]  ? check_flags.part.28+0x220/0x220
[ 6371.611734]  ? ___might_sleep+0x178/0x210
[ 6371.611735]  ? __kasan_check_read+0x11/0x20
[ 6371.611737]  __alloc_pages_nodemask+0x57c/0x6f0
[ 6371.611738]  ? __alloc_pages_slowpath.constprop.61+0x1850/0x1850
[ 6371.611740]  alloc_pages_vma+0x81/0x310
[ 6371.611741]  do_anonymous_page+0x1bb/0x7a0
[ 6371.611742]  ? __pte_alloc+0x170/0x170
[ 6371.611743]  ? match_held_lock+0x35/0x270
[ 6371.611745]  __handle_mm_fault+0xbb0/0xbe0
[ 6371.611746]  ? copy_page_range+0x420/0x420
[ 6371.611747]  ? sync_mm_rss+0x7f/0x190
[ 6371.611749]  handle_mm_fault+0xdc/0x2e0
[ 6371.611750]  do_page_fault+0x2cb/0x9d7
[ 6371.611751]  page_fault+0x34/0x40


> ---
> mm/slub.c |   15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 9bf44955c4f1..86a366f7acb6 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2407,16 +2407,29 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
> #endif /* CONFIG_SLUB_DEBUG */
> 
> #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
> +
> +static unsigned long max_partial_to_count __read_mostly = 10000;
> +module_param(max_partial_to_count, ulong, 0644);
> +
> static unsigned long count_partial(struct kmem_cache_node *n,
> 					int (*get_count)(struct page *))
> {
> +	unsigned long counted = 0;
> 	unsigned long flags;
> 	unsigned long x = 0;
> 	struct page *page;
> 
> 	spin_lock_irqsave(&n->list_lock, flags);
> -	list_for_each_entry(page, &n->partial, slab_list)
> +	list_for_each_entry(page, &n->partial, slab_list) {
> 		x += get_count(page);
> +
> +		if (++counted > max_partial_to_count) {
> +			pr_warn_once("SLUB: too much partial slabs to count all objects, increase max_partial_to_count.\n");
> +			/* Approximate total count of objects */
> +			x = mult_frac(x, n->nr_partial, counted);
> +			break;
> +		}
> +	}
> 	spin_unlock_irqrestore(&n->list_lock, flags);
> 	return x;
> }
> 
> 
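For reference, the linear approximation in the quoted patch scales the partial sum by n->nr_partial / counted through mult_frac(). Below is a minimal userspace sketch of that arithmetic. The helper names scale_count and show_estimate are made up for illustration, and the body mirrors the quotient/remainder split used by the kernel macro so that the intermediate product stays small:

```c
#include <stdio.h>

/*
 * Hypothetical helper: computes x * numer / denom the way mult_frac()
 * does, splitting x into quotient and remainder first so that the
 * intermediate product is less likely to overflow.
 */
static unsigned long scale_count(unsigned long x, unsigned long numer,
				 unsigned long denom)
{
	unsigned long quot = x / denom;
	unsigned long rem = x % denom;

	return quot * numer + (rem * numer) / denom;
}

/* Print an estimate for a truncated scan (inputs are illustrative only). */
static void show_estimate(unsigned long sampled_objs, unsigned long nr_partial,
			  unsigned long counted)
{
	printf("estimated objects: %lu\n",
	       scale_count(sampled_objs, nr_partial, counted));
}
```

With the default limit, the scan stops after 10001 slabs, so on a node with 7329569 partial slabs the sampled object count is scaled up by a factor of about 733. The estimate is only as good as the assumption that the scanned slabs are representative of the rest of the list.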



* Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
  2020-05-06 19:06 ` Qian Cai
@ 2020-05-07  3:01   ` Qian Cai
  2020-05-07  3:20     ` Stephen Rothwell
  2020-05-07  5:15     ` Konstantin Khlebnikov
  0 siblings, 2 replies; 14+ messages in thread
From: Qian Cai @ 2020-05-07  3:01 UTC (permalink / raw)
  To: Andrew Morton, Stephen Rothwell
  Cc: LKML, Linux-MM, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Konstantin Khlebnikov



> On May 6, 2020, at 3:06 PM, Qian Cai <cai@lca.pw> wrote:
> 
> 
> 
>> On May 4, 2020, at 12:07 PM, Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote:
>> 
>> To get exact count of free and used objects slub have to scan list of
>> partial slabs. This may take at long time. Scanning holds spinlock and
>> blocks allocations which move partial slabs to per-cpu lists and back.
>> 
>> Example found in the wild:
>> 
>> # cat /sys/kernel/slab/dentry/partial
>> 14478538 N0=7329569 N1=7148969
>> # time cat /sys/kernel/slab/dentry/objects
>> 286225471 N0=136967768 N1=149257703
>> 
>> real	0m1.722s
>> user	0m0.001s
>> sys	0m1.721s
>> 
>> The same problem in slab was addressed in commit f728b0a5d72a ("mm, slab:
>> faster active and free stats") by adding more kmem cache statistics.
>> For slub same approach requires atomic op on fast path when object frees.
>> 
>> Let's simply limit count of scanned slabs and print warning.
>> Limit set in /sys/module/slub/parameters/max_partial_to_count.
>> Default is 10000 which should be enough for most sane cases.
>> 
>> Return linear approximation if list of partials is longer than limit.
>> Nobody should notice difference.
>> 
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> 
> This patch will trigger the warning under memory pressure, and then makes lockdep unhappy. Also,  it is almost impossible tell how many max_partial_to_count is sufficient from user perspective.

Andrew, Stephen, can you remove this patch from linux-next?

Even reading some procfs files would trigger the warning and a lockdep splat on a debug kernel, probably because kmemleak and debugobjects require more partial slab objects. Keeping the patch would therefore break testing bots on linux-next.




* Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
  2020-05-07  3:01   ` Qian Cai
@ 2020-05-07  3:20     ` Stephen Rothwell
  2020-05-07  5:15     ` Konstantin Khlebnikov
  1 sibling, 0 replies; 14+ messages in thread
From: Stephen Rothwell @ 2020-05-07  3:20 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, LKML, Linux-MM, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Konstantin Khlebnikov

Hi Qian,

On Wed, 6 May 2020 23:01:54 -0400 Qian Cai <cai@lca.pw> wrote:
>
> Andrew, Stephen, can you remove this patch from linux-next?

Removed from linux-next.

-- 
Cheers,
Stephen Rothwell


* Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
  2020-05-07  3:01   ` Qian Cai
  2020-05-07  3:20     ` Stephen Rothwell
@ 2020-05-07  5:15     ` Konstantin Khlebnikov
  2020-05-07 18:24       ` David Rientjes
  1 sibling, 1 reply; 14+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-07  5:15 UTC (permalink / raw)
  To: Qian Cai, Andrew Morton, Stephen Rothwell
  Cc: LKML, Linux-MM, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Roman Gushchin, Wen Yang

On 07/05/2020 06.01, Qian Cai wrote:
> 
> 
>> On May 6, 2020, at 3:06 PM, Qian Cai <cai@lca.pw> wrote:
>>
>>
>>
>>> On May 4, 2020, at 12:07 PM, Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote:
>>>
>>> To get exact count of free and used objects slub have to scan list of
>>> partial slabs. This may take at long time. Scanning holds spinlock and
>>> blocks allocations which move partial slabs to per-cpu lists and back.
>>>
>>> Example found in the wild:
>>>
>>> # cat /sys/kernel/slab/dentry/partial
>>> 14478538 N0=7329569 N1=7148969
>>> # time cat /sys/kernel/slab/dentry/objects
>>> 286225471 N0=136967768 N1=149257703
>>>
>>> real	0m1.722s
>>> user	0m0.001s
>>> sys	0m1.721s
>>>
>>> The same problem in slab was addressed in commit f728b0a5d72a ("mm, slab:
>>> faster active and free stats") by adding more kmem cache statistics.
>>> For slub same approach requires atomic op on fast path when object frees.
>>>
>>> Let's simply limit count of scanned slabs and print warning.
>>> Limit set in /sys/module/slub/parameters/max_partial_to_count.
>>> Default is 10000 which should be enough for most sane cases.
>>>
>>> Return linear approximation if list of partials is longer than limit.
>>> Nobody should notice difference.
>>>
>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>
>> This patch will trigger the warning under memory pressure, and then makes lockdep unhappy. Also,  it is almost impossible tell how many max_partial_to_count is sufficient from user perspective.

Oops, my bad. Printing under this lock is indeed a bad idea.

It's probably better to simply remove this message.
I cannot imagine a situation where the precise object count matters at such a scale.
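
One way to keep a notice without printing under n->list_lock is to record the truncation while the lock is held and report it only after the lock is dropped. The following is a userspace sketch of that pattern, not kernel code: struct node, struct slab, the atomic_flag spinlock and the function name count_partial_capped are all stand-ins invented for illustration. Whether printing after the unlock would be safe in every kernel call path is not established here.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct slab {
	struct slab *next;
	unsigned long objects;	/* what get_count() would return */
};

struct node {
	atomic_flag list_lock;	/* stand-in for n->list_lock */
	struct slab *partial;	/* singly linked partial list */
	unsigned long nr_partial;
};

static unsigned long max_partial_to_count = 10000;

static void node_lock(struct node *n)
{
	while (atomic_flag_test_and_set_explicit(&n->list_lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
}

static void node_unlock(struct node *n)
{
	atomic_flag_clear_explicit(&n->list_lock, memory_order_release);
}

static unsigned long count_partial_capped(struct node *n, bool *truncated)
{
	static atomic_flag warned = ATOMIC_FLAG_INIT;
	unsigned long counted = 0;
	unsigned long x = 0;
	struct slab *s;

	*truncated = false;

	node_lock(n);
	for (s = n->partial; s; s = s->next) {
		x += s->objects;
		if (++counted > max_partial_to_count) {
			/* Only record the event here; no output under the lock. */
			*truncated = true;
			/* Linear approximation, same arithmetic as mult_frac(). */
			x = (x / counted) * n->nr_partial +
			    ((x % counted) * n->nr_partial) / counted;
			break;
		}
	}
	node_unlock(n);

	/* Report at most once, after the lock has been released. */
	if (*truncated && !atomic_flag_test_and_set(&warned))
		fprintf(stderr, "partial list truncated, count is approximate\n");

	return x;
}
```

Returning the truncated flag to the caller also lets the caller decide whether an approximate count deserves any message at all, which fits the suggestion above that the message may simply be dropped.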

> 
> Andrew, Stephen, can you remove this patch from linux-next?
> 
> Even reading some procfs files would trigger the warning and a lockdep splat on a debug kernel, probably because kmemleak and debugobjects require more partial slab objects. Keeping the patch would therefore break testing bots on linux-next.
> 


* Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
  2020-05-06 11:56 ` Vlastimil Babka
@ 2020-05-07  5:25   ` Konstantin Khlebnikov
  2020-05-07 14:12     ` Vlastimil Babka
  0 siblings, 1 reply; 14+ messages in thread
From: Konstantin Khlebnikov @ 2020-05-07  5:25 UTC (permalink / raw)
  To: Vlastimil Babka, linux-kernel, linux-mm, Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Roman Gushchin, Wen Yang

On 06/05/2020 14.56, Vlastimil Babka wrote:
> On 5/4/20 6:07 PM, Konstantin Khlebnikov wrote:
>> To get exact count of free and used objects slub have to scan list of
>> partial slabs. This may take at long time. Scanning holds spinlock and
>> blocks allocations which move partial slabs to per-cpu lists and back.
>>
>> Example found in the wild:
>>
>> # cat /sys/kernel/slab/dentry/partial
>> 14478538 N0=7329569 N1=7148969
>> # time cat /sys/kernel/slab/dentry/objects
>> 286225471 N0=136967768 N1=149257703
>>
>> real	0m1.722s
>> user	0m0.001s
>> sys	0m1.721s
>>
>> The same problem in slab was addressed in commit f728b0a5d72a ("mm, slab:
>> faster active and free stats") by adding more kmem cache statistics.
>> For slub same approach requires atomic op on fast path when object frees.
> 
> In general yeah, but are you sure about this one? AFAICS this is about pages in
> the n->partial list, where manipulations happen under n->list_lock and shouldn't
> be fast path. It should be feasible to add a counter under the same lock, so it
> wouldn't even need to be atomic?

SLUB allocates objects from prepared per-cpu slabs, so they could be subtracted from
the count of free objects in advance, under this lock, when a slab is moved off this list.

But on the freeing path an object might belong to any slab, including global partials.

> 
>> Let's simply limit count of scanned slabs and print warning.
>> Limit set in /sys/module/slub/parameters/max_partial_to_count.
>> Default is 10000 which should be enough for most sane cases.
>>
>> Return linear approximation if list of partials is longer than limit.
>> Nobody should notice difference.
>>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> 
> BTW there was a different patch in that area proposed recently [1] for slabinfo.
> Christopher argued that we can do that for slabinfo but leave /sys stats
> precise. Guess not then?
> 
> [1]
> https://lore.kernel.org/linux-mm/20200222092428.99488-1-wenyang@linux.alibaba.com/


* Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
  2020-05-07  5:25   ` Konstantin Khlebnikov
@ 2020-05-07 14:12     ` Vlastimil Babka
  0 siblings, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2020-05-07 14:12 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-kernel, linux-mm, Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Roman Gushchin, Wen Yang

On 5/7/20 7:25 AM, Konstantin Khlebnikov wrote:
> On 06/05/2020 14.56, Vlastimil Babka wrote:
>> On 5/4/20 6:07 PM, Konstantin Khlebnikov wrote:
>>> To get exact count of free and used objects slub have to scan list of
>>> partial slabs. This may take at long time. Scanning holds spinlock and
>>> blocks allocations which move partial slabs to per-cpu lists and back.
>>>
>>> Example found in the wild:
>>>
>>> # cat /sys/kernel/slab/dentry/partial
>>> 14478538 N0=7329569 N1=7148969
>>> # time cat /sys/kernel/slab/dentry/objects
>>> 286225471 N0=136967768 N1=149257703
>>>
>>> real	0m1.722s
>>> user	0m0.001s
>>> sys	0m1.721s
>>>
>>> The same problem in slab was addressed in commit f728b0a5d72a ("mm, slab:
>>> faster active and free stats") by adding more kmem cache statistics.
>>> For slub same approach requires atomic op on fast path when object frees.
>> 
>> In general yeah, but are you sure about this one? AFAICS this is about pages in
>> the n->partial list, where manipulations happen under n->list_lock and shouldn't
>> be fast path. It should be feasible to add a counter under the same lock, so it
>> wouldn't even need to be atomic?
> 
> SLUB allocates objects from prepared per-cpu slabs, they could be subtracted from
> count of free object under this lock in advance when slab moved out of this list.
> 
> But at freeing path object might belong to any slab, including global partials.

Right, freeing can indeed modify a global partial slab without taking the lock.
Never mind then.
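
The point conceded above can be illustrated with a small userspace model. This is a sketch only, not kernel code: `struct node_model` and the helper names are hypothetical, and C11 atomics stand in for whatever primitive the kernel would use. It assumes a per-node counter of free objects that must stay exact.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-node counter of free objects in partial slabs. */
struct node_model {
	atomic_long free_objects;
};

/*
 * A slab leaves the node partial list for a per-cpu slot. In the kernel
 * this happens under n->list_lock, so all of the slab's free objects can
 * be subtracted in one step. The update is still atomic here because the
 * free path below touches the same counter without that lock.
 */
static void slab_taken_from_partial(struct node_model *n, long free_in_slab)
{
	assert(free_in_slab >= 0);
	atomic_fetch_sub(&n->free_objects, free_in_slab);
}

/*
 * An object is freed into a slab that still sits on the node partial
 * list. This path does not take the list lock, so keeping the counter
 * exact costs one extra atomic operation per free.
 */
static void object_freed_to_partial(struct node_model *n)
{
	atomic_fetch_add(&n->free_objects, 1);
}

/* Reading the statistic becomes a single load instead of a list walk. */
static long free_objects_read(struct node_model *n)
{
	return atomic_load(&n->free_objects);
}
```

The read side becomes cheap in this model, but every free into a partial slab pays for an atomic update, which is the fast-path cost the patch description wants to avoid.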

* Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
  2020-05-07  5:15     ` Konstantin Khlebnikov
@ 2020-05-07 18:24       ` David Rientjes
  0 siblings, 0 replies; 14+ messages in thread
From: David Rientjes @ 2020-05-07 18:24 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Qian Cai, Andrew Morton, Stephen Rothwell, LKML, Linux-MM,
	Christoph Lameter, Pekka Enberg, Joonsoo Kim, Roman Gushchin,
	Wen Yang

On Thu, 7 May 2020, Konstantin Khlebnikov wrote:

> > > > To get exact count of free and used objects slub have to scan list of
> > > > partial slabs. This may take at long time. Scanning holds spinlock and
> > > > blocks allocations which move partial slabs to per-cpu lists and back.
> > > > 
> > > > Example found in the wild:
> > > > 
> > > > # cat /sys/kernel/slab/dentry/partial
> > > > 14478538 N0=7329569 N1=7148969
> > > > # time cat /sys/kernel/slab/dentry/objects
> > > > 286225471 N0=136967768 N1=149257703
> > > > 
> > > > real	0m1.722s
> > > > user	0m0.001s
> > > > sys	0m1.721s
> > > > 
> > > > The same problem in slab was addressed in commit f728b0a5d72a ("mm,
> > > > slab:
> > > > faster active and free stats") by adding more kmem cache statistics.
> > > > For slub same approach requires atomic op on fast path when object
> > > > frees.
> > > > 
> > > > Let's simply limit count of scanned slabs and print warning.
> > > > Limit set in /sys/module/slub/parameters/max_partial_to_count.
> > > > Default is 10000 which should be enough for most sane cases.
> > > > 
> > > > Return linear approximation if list of partials is longer than limit.
> > > > Nobody should notice difference.
> > > > 
> > > > Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> > > 
> > > This patch will trigger the warning under memory pressure, and then makes
> > > lockdep unhappy. Also,  it is almost impossible tell how many
> > > max_partial_to_count is sufficient from user perspective.
> 
> Oops, my bad. Printing under this lock indeed a bad idea.
> 
> Probably it's better to simply remove this message.
> I cannot imagine situation when precise count of object matters at such scale.
> 

If the printk is removed, then probably better to remove the 
max_partial_to_count param as well?  I doubt it would ever be used since 
nothing points to it other than the kernel code now.  If somebody 
complains about the approximation, we can (a) convince them the 
approximation is better than precise calculation to prevent irqs from 
being disabled for several seconds and (b) add it later if absolutely 
necessary.  I notice the absence of other module_param()'s in mm/slub.c, 
so likely better to avoid adding special tunables like this unless 
absolutely necessary.

* Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
  2020-05-04 19:56 ` Andrew Morton
  2020-05-05  5:46   ` Konstantin Khlebnikov
@ 2020-05-08  3:18   ` Christopher Lameter
  1 sibling, 0 replies; 14+ messages in thread
From: Christopher Lameter @ 2020-05-08  3:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, linux-kernel, linux-mm, Pekka Enberg,
	David Rientjes, Joonsoo Kim

On Mon, 4 May 2020, Andrew Morton wrote:

> But I guess it's better than nothing at all, unless there are
> alternative ideas?

It is highly unusual to have such large partial lists. In a typical case
allocations would reduce the size of the lists. 1000s? That is scary.

Are there inodes or dentries by chance?

The defrag stuff that I had been trying to do for a long time would solve
that issue but then objects would need to be made movable....


Thread overview: 14+ messages
2020-05-04 16:07 [PATCH] slub: limit count of partial slabs scanned to gather statistics Konstantin Khlebnikov
2020-05-04 19:56 ` Andrew Morton
2020-05-05  5:46   ` Konstantin Khlebnikov
2020-05-08  3:18   ` Christopher Lameter
2020-05-04 21:19 ` David Rientjes
2020-05-05  6:20   ` Konstantin Khlebnikov
2020-05-06 11:56 ` Vlastimil Babka
2020-05-07  5:25   ` Konstantin Khlebnikov
2020-05-07 14:12     ` Vlastimil Babka
2020-05-06 19:06 ` Qian Cai
2020-05-07  3:01   ` Qian Cai
2020-05-07  3:20     ` Stephen Rothwell
2020-05-07  5:15     ` Konstantin Khlebnikov
2020-05-07 18:24       ` David Rientjes
