* slabinfo shows incorrect active_objs ???
@ 2022-02-22  9:22 Vasily Averin
  2022-02-22 10:23 ` Hyeonggon Yoo
  2022-02-22 11:10 ` Vlastimil Babka
  0 siblings, 2 replies; 24+ messages in thread
From: Vasily Averin @ 2022-02-22  9:22 UTC (permalink / raw)
  To: Linux MM, Andrew Morton; +Cc: kernel

Dear all,

I've found that /proc/slabinfo shows inaccurate numbers of in-use slab objects:
it assumes that all objects stored in cpu caches are always 100% in use.

for example:
  slabinfo shows that all 20 objects are in use.

[root@fc34-vvs linux]# uname -a
Linux fc34-vvs.sw.ru 5.17.0-rc3+ #42 SMP PREEMPT Mon Feb 21 20:14:54 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

[root@fc34-vvs linux]# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
...
kmalloc-cg-8k         20     20   8192    4    8 : tunables    0    0    0 : slabdata      5      5      0

At the same time, crash says that only 2 objects are in use.

crash> kmem -s kmalloc-cg-8k
CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
ffff8f4840043b00     8192          2        20      5    32k  kmalloc-cg-8k

And this looks to be correct; see the kmem -S output below.

Is it a bug or perhaps a well-known feature that I missed?

The numbers are computed in mm/slub.c (see below),
but count_partial() does not include the free objects sitting in cpu caches.

Moreover, accurate statistics are not exposed by any other interface either:
/sys/kernel/slab/ reads the cpu slab caches but does not output these numbers.

Thank you,
	Vasily Averin

#ifdef CONFIG_SLUB_DEBUG
void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
{
         unsigned long nr_slabs = 0;
         unsigned long nr_objs = 0;
         unsigned long nr_free = 0;
         int node;
         struct kmem_cache_node *n;

         for_each_kmem_cache_node(s, node, n) {
                 nr_slabs += node_nr_slabs(n);
                 nr_objs += node_nr_objs(n);
                 nr_free += count_partial(n, count_free);
         }

         sinfo->active_objs = nr_objs - nr_free;
         sinfo->num_objs = nr_objs;



crash> kmem -S kmalloc-cg-8k
CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
ffff8f4840043b00     8192          2        20      5    32k  kmalloc-cg-8k
CPU 0 KMEM_CACHE_CPU:
   ffff8f4b58236360
CPU 0 SLAB:
   (empty)
CPU 0 PARTIAL:
   (empty)
CPU 1 KMEM_CACHE_CPU:
   ffff8f4b58276360
CPU 1 SLAB:
   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
   ffffed3f842af400  ffff8f484abd0000     0      4          1     3
   FREE / [ALLOCATED]
    ffff8f484abd0000  (cpu 1 cache)
    ffff8f484abd2000  (cpu 1 cache)
    ffff8f484abd4000  (cpu 1 cache)
   [ffff8f484abd6000]
CPU 1 PARTIAL:
   (empty)
CPU 2 KMEM_CACHE_CPU:
   ffff8f4b582b6360
CPU 2 SLAB:
   (empty)
CPU 2 PARTIAL:
   (empty)
CPU 3 KMEM_CACHE_CPU:
   ffff8f4b582f6360
CPU 3 SLAB:
   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
   ffffed3f842ce600  ffff8f484b398000     0      4          0     4
   FREE / [ALLOCATED]
    ffff8f484b398000  (cpu 3 cache)
    ffff8f484b39a000  (cpu 3 cache)
    ffff8f484b39c000  (cpu 3 cache)
    ffff8f484b39e000  (cpu 3 cache)
CPU 3 PARTIAL:
   (empty)
CPU 4 KMEM_CACHE_CPU:
   ffff8f4b58336360
CPU 4 SLAB:
   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
   ffffed3f8418c200  ffff8f4846308000     0      4          0     4
   FREE / [ALLOCATED]
    ffff8f4846308000  (cpu 4 cache)
    ffff8f484630a000  (cpu 4 cache)
    ffff8f484630c000  (cpu 4 cache)
    ffff8f484630e000  (cpu 4 cache)
CPU 4 PARTIAL:
   (empty)
CPU 5 KMEM_CACHE_CPU:
   ffff8f4b58376360
CPU 5 SLAB:
   (empty)
CPU 5 PARTIAL:
   (empty)
CPU 6 KMEM_CACHE_CPU:
   ffff8f4b583b6360
CPU 6 SLAB:
   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
   ffffed3f8412d000  ffff8f4844b40000     0      4          0     4
   FREE / [ALLOCATED]
    ffff8f4844b40000  (cpu 6 cache)
    ffff8f4844b42000  (cpu 6 cache)
    ffff8f4844b44000  (cpu 6 cache)
    ffff8f4844b46000  (cpu 6 cache)
CPU 6 PARTIAL:
   (empty)
CPU 7 KMEM_CACHE_CPU:
   ffff8f4b583f6360
CPU 7 SLAB:
   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
   ffffed3f84124000  ffff8f4844900000     0      4          1     3
   FREE / [ALLOCATED]
    ffff8f4844900000  (cpu 7 cache)
    ffff8f4844902000  (cpu 7 cache)
   [ffff8f4844904000]
    ffff8f4844906000  (cpu 7 cache)
CPU 7 PARTIAL:
   (empty)
KMEM_CACHE_NODE   NODE  SLABS  PARTIAL  PER-CPU
ffff8f48400416c0     0      5        0        5
NODE 0 PARTIAL:
   (empty)
NODE 0 FULL:
   (not tracked)




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-22  9:22 slabinfo shows incorrect active_objs ??? Vasily Averin
@ 2022-02-22 10:23 ` Hyeonggon Yoo
  2022-02-22 12:10   ` Vasily Averin
  2022-02-22 11:10 ` Vlastimil Babka
  1 sibling, 1 reply; 24+ messages in thread
From: Hyeonggon Yoo @ 2022-02-22 10:23 UTC (permalink / raw)
  To: Vasily Averin; +Cc: Linux MM, Andrew Morton, kernel

On Tue, Feb 22, 2022 at 12:22:02PM +0300, Vasily Averin wrote:
> Dear all,
> 
> I've found that /proc/slabinfo shows inadequate numbers of in-use slab objects.
> it assumes that all objects stored in cpu caches are always 100% in use.
> 
> for example:
>  slabinfo shows that all 20 objects are in use.
> 
> [root@fc34-vvs linux]# uname -a
> Linux fc34-vvs.sw.ru 5.17.0-rc3+ #42 SMP PREEMPT Mon Feb 21 20:14:54 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
> 
> [root@fc34-vvs linux]# cat /proc/slabinfo
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> ...
> kmalloc-cg-8k         20     20   8192    4    8 : tunables    0    0    0 : slabdata      5      5      0
> 
> At the same time crash said that only 2 objects are in use.
> 
> crash> kmem -s kmalloc-cg-8k
> CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
> ffff8f4840043b00     8192          2        20      5    32k  kmalloc-cg-8k
> 
> And this looks like true, see kmem -S output below.
> 
> Is it a bug or perhaps a well-known feature that I missed?
> 

This is not a bug.

TL;DR: some slabs are locked by a CPU, and SLUB does not accurately account
the objects on them.
(I guess the purpose is to reduce cacheline usage in the fastpath?)

There are three states a slab can be in. A slab can be:
	1) the cpu slab, or 2) a cpu partial slab:
		These are locked (frozen) by a cpu to avoid per-node
		locking overhead.

	3) a node partial slab:
		These sit on the node's partial list. Any CPU can take them
		under the node's spinlock. SLUB usually takes a slab from the
		node partial list when there is no cpu slab and no cpu
		partial slabs.

Only free objects on node partial slabs are counted as free objects.
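
For reference, the node-partial walk looks roughly like this (paraphrased
from mm/slub.c around v5.17; treat it as a sketch, not the exact upstream code):

/*
 * Paraphrased sketch of count_partial(): it only walks the node's
 * partial list under the node's list_lock, so objects sitting on
 * frozen cpu slabs and cpu partial slabs are never visited.
 */
static unsigned long count_partial(struct kmem_cache_node *n,
				   int (*get_count)(struct slab *))
{
	unsigned long flags;
	unsigned long x = 0;
	struct slab *slab;

	spin_lock_irqsave(&n->list_lock, flags);
	list_for_each_entry(slab, &n->partial, slab_list)
		x += get_count(slab);
	spin_unlock_irqrestore(&n->list_lock, flags);

	return x;
}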

> Numbers are counted in mm/slub.c, see below,
> but count_partial() doe not includes free objects of cpu caches
> 
> Moreover adequate statistic is not showed in any other interfaces too
> /sys/kerenl/slab/ read cpu slab caches but does not output these numbers.
> 
> Thank you,
> 	Vasily Averin
> 
> #ifdef CONFIG_SLUB_DEBUG
> void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
> {
>         unsigned long nr_slabs = 0;
>         unsigned long nr_objs = 0;
>         unsigned long nr_free = 0;
>         int node;
>         struct kmem_cache_node *n;
> 
>         for_each_kmem_cache_node(s, node, n) {
>                 nr_slabs += node_nr_slabs(n);
>                 nr_objs += node_nr_objs(n);
>                 nr_free += count_partial(n, count_free);
>         }
> 
>         sinfo->active_objs = nr_objs - nr_free;

This code assumes that all objects on frozen slabs (the cpu slab and cpu
partial slabs) are in use.

With the current SLUB implementation there is no easy way to accurately
account objects on the cpu slab and cpu partial slabs, because SLUB sets
slab->inuse = slab->objects when it takes a slab from the node partial list.

They are accounted accurately again only when the slab is deactivated and
goes back to the node partial list.
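
To make that concrete, the freeze step works roughly like this (a simplified,
hypothetical helper, not the upstream code; the real logic lives in
acquire_slab() and friends in mm/slub.c, field names follow the ~v5.17 layout):

/*
 * Simplified sketch: when a slab becomes the cpu slab, its inuse
 * counter is set to the total object count and the freelist is handed
 * over to the cpu, so from the shared counters' point of view every
 * object on it looks allocated.
 */
static void *freeze_slab_for_cpu(struct slab *slab)
{
	void *freelist = slab->freelist;

	slab->inuse = slab->objects;	/* looks 100% in use from now on */
	slab->freelist = NULL;		/* freelist is now owned by the cpu */
	slab->frozen = 1;

	return freelist;		/* the cpu allocates from this list */
}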

Thanks,
Hyeonggon

>         sinfo->num_objs = nr_objs;
> 
> 
> 
> crash> kmem -S kmalloc-cg-8k
> CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
> ffff8f4840043b00     8192          2        20      5    32k  kmalloc-cg-8k
> CPU 0 KMEM_CACHE_CPU:
>   ffff8f4b58236360
> CPU 0 SLAB:
>   (empty)
> CPU 0 PARTIAL:
>   (empty)
> CPU 1 KMEM_CACHE_CPU:
>   ffff8f4b58276360
> CPU 1 SLAB:
>   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>   ffffed3f842af400  ffff8f484abd0000     0      4          1     3
>   FREE / [ALLOCATED]
>    ffff8f484abd0000  (cpu 1 cache)
>    ffff8f484abd2000  (cpu 1 cache)
>    ffff8f484abd4000  (cpu 1 cache)
>   [ffff8f484abd6000]
> CPU 1 PARTIAL:
>   (empty)
> CPU 2 KMEM_CACHE_CPU:
>   ffff8f4b582b6360
> CPU 2 SLAB:
>   (empty)
> CPU 2 PARTIAL:
>   (empty)
> CPU 3 KMEM_CACHE_CPU:
>   ffff8f4b582f6360
> CPU 3 SLAB:
>   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>   ffffed3f842ce600  ffff8f484b398000     0      4          0     4
>   FREE / [ALLOCATED]
>    ffff8f484b398000  (cpu 3 cache)
>    ffff8f484b39a000  (cpu 3 cache)
>    ffff8f484b39c000  (cpu 3 cache)
>    ffff8f484b39e000  (cpu 3 cache)
> CPU 3 PARTIAL:
>   (empty)
> CPU 4 KMEM_CACHE_CPU:
>   ffff8f4b58336360
> CPU 4 SLAB:
>   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>   ffffed3f8418c200  ffff8f4846308000     0      4          0     4
>   FREE / [ALLOCATED]
>    ffff8f4846308000  (cpu 4 cache)
>    ffff8f484630a000  (cpu 4 cache)
>    ffff8f484630c000  (cpu 4 cache)
>    ffff8f484630e000  (cpu 4 cache)
> CPU 4 PARTIAL:
>   (empty)
> CPU 5 KMEM_CACHE_CPU:
>   ffff8f4b58376360
> CPU 5 SLAB:
>   (empty)
> CPU 5 PARTIAL:
>   (empty)
> CPU 6 KMEM_CACHE_CPU:
>   ffff8f4b583b6360
> CPU 6 SLAB:
>   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>   ffffed3f8412d000  ffff8f4844b40000     0      4          0     4
>   FREE / [ALLOCATED]
>    ffff8f4844b40000  (cpu 6 cache)
>    ffff8f4844b42000  (cpu 6 cache)
>    ffff8f4844b44000  (cpu 6 cache)
>    ffff8f4844b46000  (cpu 6 cache)
> CPU 6 PARTIAL:
>   (empty)
> CPU 7 KMEM_CACHE_CPU:
>   ffff8f4b583f6360
> CPU 7 SLAB:
>   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>   ffffed3f84124000  ffff8f4844900000     0      4          1     3
>   FREE / [ALLOCATED]
>    ffff8f4844900000  (cpu 7 cache)
>    ffff8f4844902000  (cpu 7 cache)
>   [ffff8f4844904000]
>    ffff8f4844906000  (cpu 7 cache)
> CPU 7 PARTIAL:
>   (empty)
> KMEM_CACHE_NODE   NODE  SLABS  PARTIAL  PER-CPU
> ffff8f48400416c0     0      5        0        5
> NODE 0 PARTIAL:
>   (empty)
> NODE 0 FULL:
>   (not tracked)
> 
> 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-22  9:22 slabinfo shows incorrect active_objs ??? Vasily Averin
  2022-02-22 10:23 ` Hyeonggon Yoo
@ 2022-02-22 11:10 ` Vlastimil Babka
  1 sibling, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2022-02-22 11:10 UTC (permalink / raw)
  To: Vasily Averin, Linux MM, Andrew Morton, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim, Roman Gushchin
  Cc: kernel

On 2/22/22 10:22, Vasily Averin wrote:
> Dear all,
> 
> I've found that /proc/slabinfo shows inadequate numbers of in-use slab objects.
> it assumes that all objects stored in cpu caches are always 100% in use.
> 
> for example:
>  slabinfo shows that all 20 objects are in use.
> 
> [root@fc34-vvs linux]# uname -a
> Linux fc34-vvs.sw.ru 5.17.0-rc3+ #42 SMP PREEMPT Mon Feb 21 20:14:54 UTC
> 2022 x86_64 x86_64 x86_64 GNU/Linux
> 
> [root@fc34-vvs linux]# cat /proc/slabinfo
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab>
> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata
> <active_slabs> <num_slabs> <sharedavail>
> ...
> kmalloc-cg-8k         20     20   8192    4    8 : tunables    0    0    0 :
> slabdata      5      5      0
> 
> At the same time crash said that only 2 objects are in use.
> 
> crash> kmem -s kmalloc-cg-8k
> CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
> ffff8f4840043b00     8192          2        20      5    32k  kmalloc-cg-8k
> 
> And this looks like true, see kmem -S output below.
> 
> Is it a bug or perhaps a well-known feature that I missed?

It's a known tradeoff. Accounting the per-cpu slabs more accurately would
affect the allocation/free hot paths, so some precision of slabinfo is
sacrificed.

> Numbers are counted in mm/slub.c, see below,
> but count_partial() doe not includes free objects of cpu caches
> 
> Moreover adequate statistic is not showed in any other interfaces too
> /sys/kerenl/slab/ read cpu slab caches but does not output these numbers.
> 
> Thank you,
>     Vasily Averin
> 
> #ifdef CONFIG_SLUB_DEBUG
> void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
> {
>         unsigned long nr_slabs = 0;
>         unsigned long nr_objs = 0;
>         unsigned long nr_free = 0;
>         int node;
>         struct kmem_cache_node *n;
> 
>         for_each_kmem_cache_node(s, node, n) {
>                 nr_slabs += node_nr_slabs(n);
>                 nr_objs += node_nr_objs(n);
>                 nr_free += count_partial(n, count_free);
>         }
> 
>         sinfo->active_objs = nr_objs - nr_free;
>         sinfo->num_objs = nr_objs;
> 
> 
> 
> crash> kmem -S kmalloc-cg-8k
> CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
> ffff8f4840043b00     8192          2        20      5    32k  kmalloc-cg-8k
> CPU 0 KMEM_CACHE_CPU:
>   ffff8f4b58236360
> CPU 0 SLAB:
>   (empty)
> CPU 0 PARTIAL:
>   (empty)
> CPU 1 KMEM_CACHE_CPU:
>   ffff8f4b58276360
> CPU 1 SLAB:
>   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>   ffffed3f842af400  ffff8f484abd0000     0      4          1     3
>   FREE / [ALLOCATED]
>    ffff8f484abd0000  (cpu 1 cache)
>    ffff8f484abd2000  (cpu 1 cache)
>    ffff8f484abd4000  (cpu 1 cache)
>   [ffff8f484abd6000]
> CPU 1 PARTIAL:
>   (empty)
> CPU 2 KMEM_CACHE_CPU:
>   ffff8f4b582b6360
> CPU 2 SLAB:
>   (empty)
> CPU 2 PARTIAL:
>   (empty)
> CPU 3 KMEM_CACHE_CPU:
>   ffff8f4b582f6360
> CPU 3 SLAB:
>   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>   ffffed3f842ce600  ffff8f484b398000     0      4          0     4
>   FREE / [ALLOCATED]
>    ffff8f484b398000  (cpu 3 cache)
>    ffff8f484b39a000  (cpu 3 cache)
>    ffff8f484b39c000  (cpu 3 cache)
>    ffff8f484b39e000  (cpu 3 cache)
> CPU 3 PARTIAL:
>   (empty)
> CPU 4 KMEM_CACHE_CPU:
>   ffff8f4b58336360
> CPU 4 SLAB:
>   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>   ffffed3f8418c200  ffff8f4846308000     0      4          0     4
>   FREE / [ALLOCATED]
>    ffff8f4846308000  (cpu 4 cache)
>    ffff8f484630a000  (cpu 4 cache)
>    ffff8f484630c000  (cpu 4 cache)
>    ffff8f484630e000  (cpu 4 cache)
> CPU 4 PARTIAL:
>   (empty)
> CPU 5 KMEM_CACHE_CPU:
>   ffff8f4b58376360
> CPU 5 SLAB:
>   (empty)
> CPU 5 PARTIAL:
>   (empty)
> CPU 6 KMEM_CACHE_CPU:
>   ffff8f4b583b6360
> CPU 6 SLAB:
>   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>   ffffed3f8412d000  ffff8f4844b40000     0      4          0     4
>   FREE / [ALLOCATED]
>    ffff8f4844b40000  (cpu 6 cache)
>    ffff8f4844b42000  (cpu 6 cache)
>    ffff8f4844b44000  (cpu 6 cache)
>    ffff8f4844b46000  (cpu 6 cache)
> CPU 6 PARTIAL:
>   (empty)
> CPU 7 KMEM_CACHE_CPU:
>   ffff8f4b583f6360
> CPU 7 SLAB:
>   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>   ffffed3f84124000  ffff8f4844900000     0      4          1     3
>   FREE / [ALLOCATED]
>    ffff8f4844900000  (cpu 7 cache)
>    ffff8f4844902000  (cpu 7 cache)
>   [ffff8f4844904000]
>    ffff8f4844906000  (cpu 7 cache)
> CPU 7 PARTIAL:
>   (empty)
> KMEM_CACHE_NODE   NODE  SLABS  PARTIAL  PER-CPU
> ffff8f48400416c0     0      5        0        5
> NODE 0 PARTIAL:
>   (empty)
> NODE 0 FULL:
>   (not tracked)
> 
> 
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-22 10:23 ` Hyeonggon Yoo
@ 2022-02-22 12:10   ` Vasily Averin
  2022-02-22 16:32     ` Shakeel Butt
                       ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Vasily Averin @ 2022-02-22 12:10 UTC (permalink / raw)
  To: Hyeonggon Yoo, Vlastimil Babka; +Cc: Linux MM, Andrew Morton, kernel

On 22.02.2022 13:23, Hyeonggon Yoo wrote:
> On Tue, Feb 22, 2022 at 12:22:02PM +0300, Vasily Averin wrote:
>> Dear all,
>>
>> I've found that /proc/slabinfo shows inadequate numbers of in-use slab objects.
>> it assumes that all objects stored in cpu caches are always 100% in use.

>> Is it a bug or perhaps a well-known feature that I missed?
> 
> This is not a bug..

Thank you for the explanation.
I think it would be useful to document this somewhere (Documentation/? man slabinfo?).
Also, is there some (fast) way to get the real numbers from userspace?
crash is too heavyweight for this task.
Do you perhaps know of some other userspace utility, or maybe a systemtap/drgn script?

I'm preparing a new set of memcg accounting patches, with a repaired tools/cgroup/memcg_slabinfo.py.
I can get the numbers of accounted resources, but I also need to understand how many resources were
NOT accounted to the memcg but were allocated on the host. I expected to get these numbers from the
host's slabinfo, but it does not show correct numbers.

Thank you,
	Vasily Averin


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-22 12:10   ` Vasily Averin
@ 2022-02-22 16:32     ` Shakeel Butt
  2022-02-22 16:47     ` Roman Gushchin
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Shakeel Butt @ 2022-02-22 16:32 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Hyeonggon Yoo, Vlastimil Babka, Linux MM, Andrew Morton, kernel

On Tue, Feb 22, 2022 at 4:10 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 22.02.2022 13:23, Hyeonggon Yoo wrote:
> > On Tue, Feb 22, 2022 at 12:22:02PM +0300, Vasily Averin wrote:
> >> Dear all,
> >>
> >> I've found that /proc/slabinfo shows inadequate numbers of in-use slab objects.
> >> it assumes that all objects stored in cpu caches are always 100% in use.
>
> >> Is it a bug or perhaps a well-known feature that I missed?
> >
> > This is not a bug..
>
> Thank you for explanation,
> I think it would be useful to document this somewhere. (Documnetation? man slabinfo ?)
> Also I would like to know is it some (fast) way to get real numbers in userspace ?
> crash is too fat for this task.
> Do you know perhaps some other userspace utility or may be systemtap/drgn script?
>
> I'm preparing new set of memcg accounting patches, with reparired tools/cgroup/memcg_slapinfo.py
> I can get numbers of accounted resources, but I need to understand how may resources was NOT
> accounted to memcg but allocated on host. I expected get these numbers from host's slabinfo but
> it does not show correct numbers.
>

If you are just interested in the stats, you can use SLAB for your experiments.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-22 12:10   ` Vasily Averin
  2022-02-22 16:32     ` Shakeel Butt
@ 2022-02-22 16:47     ` Roman Gushchin
  2022-02-23  1:07       ` Vasily Averin
  2022-02-22 20:59     ` Roman Gushchin
  2022-03-04 16:29     ` Vlastimil Babka
  3 siblings, 1 reply; 24+ messages in thread
From: Roman Gushchin @ 2022-02-22 16:47 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Hyeonggon Yoo, Vlastimil Babka, Linux MM, Andrew Morton, kernel


> On Feb 22, 2022, at 4:10 AM, Vasily Averin <vvs@virtuozzo.com> wrote:
> 
> On 22.02.2022 13:23, Hyeonggon Yoo wrote:
>>> On Tue, Feb 22, 2022 at 12:22:02PM +0300, Vasily Averin wrote:
>>> Dear all,
>>> 
>>> I've found that /proc/slabinfo shows inadequate numbers of in-use slab objects.
>>> it assumes that all objects stored in cpu caches are always 100% in use.
> 
>>> Is it a bug or perhaps a well-known feature that I missed?
>> This is not a bug..
> 
> Thank you for explanation,
> I think it would be useful to document this somewhere. (Documnetation? man slabinfo ?)

Man page is the best place, IMO.

> Also I would like to know is it some (fast) way to get real numbers in userspace ?
> crash is too fat for this task.
> Do you know perhaps some other userspace utility or may be systemtap/drgn script?

You can hack something based on memcg_slabinfo.py for this purpose. It already contains the code to iterate over all slab pages; you just need to gather the necessary stats.
Of course the data will be racy, but it should be good enough for practical purposes.
I used something like this when I was looking for the real slab utilization numbers while starting the rework of the slab memory accounting.

> I'm preparing new set of memcg accounting patches, with reparired tools/cgroup/memcg_slapinfo.py
> I can get numbers of accounted resources, but I need to understand how may resources was NOT
> accounted to memcg but allocated on host. I expected get these numbers from host's slabinfo but
> it does not show correct numbers.

I’m really curious what these patches are. Are you looking to enable accounting for more slab caches?

Thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-22 12:10   ` Vasily Averin
  2022-02-22 16:32     ` Shakeel Butt
  2022-02-22 16:47     ` Roman Gushchin
@ 2022-02-22 20:59     ` Roman Gushchin
  2022-02-22 23:08       ` Vlastimil Babka
  2022-03-04 16:29     ` Vlastimil Babka
  3 siblings, 1 reply; 24+ messages in thread
From: Roman Gushchin @ 2022-02-22 20:59 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Hyeonggon Yoo, Vlastimil Babka, Linux MM, Andrew Morton, kernel


> On Feb 22, 2022, at 4:10 AM, Vasily Averin <vvs@virtuozzo.com> wrote:
> 
> On 22.02.2022 13:23, Hyeonggon Yoo wrote:
>>> On Tue, Feb 22, 2022 at 12:22:02PM +0300, Vasily Averin wrote:
>>> Dear all,
>>> 
>>> I've found that /proc/slabinfo shows inadequate numbers of in-use slab objects.
>>> it assumes that all objects stored in cpu caches are always 100% in use.
> 
>>> Is it a bug or perhaps a well-known feature that I missed?
>> This is not a bug..
> 
> Thank you for explanation,
> I think it would be useful to document this somewhere. (Documnetation? man slabinfo ?)
> Also I would like to know is it some (fast) way to get real numbers in userspace ?
> crash is too fat for this task.
> Do you know perhaps some other userspace utility or may be systemtap/drgn script?

Btw, implementing fast slab counters independent of the sl*b implementation and the physical layout of the data might be an interesting idea.
Currently /proc/slabinfo is often confusing because of slab merging. That is particularly true when someone tries to compare memory usage on two different kernel versions: the set of slab caches might look very different depending on subtle changes in object sizes and the outcome of cache merging.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-22 20:59     ` Roman Gushchin
@ 2022-02-22 23:08       ` Vlastimil Babka
  2022-02-23  0:07         ` Roman Gushchin
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2022-02-22 23:08 UTC (permalink / raw)
  To: Roman Gushchin, Vasily Averin, Christoph Lameter, David Rientjes,
	Joonsoo Kim, Pekka Enberg
  Cc: Hyeonggon Yoo, Linux MM, Andrew Morton, kernel

On 2/22/22 21:59, Roman Gushchin wrote:
> 
>> On Feb 22, 2022, at 4:10 AM, Vasily Averin <vvs@virtuozzo.com> wrote:

BTW please To/Cc directly all slab maintainers on future slab related
threads (added now).

>> On 22.02.2022 13:23, Hyeonggon Yoo wrote:
>>>> On Tue, Feb 22, 2022 at 12:22:02PM +0300, Vasily Averin wrote:
>>>> Dear all,
>>>> 
>>>> I've found that /proc/slabinfo shows inadequate numbers of in-use slab objects.
>>>> it assumes that all objects stored in cpu caches are always 100% in use.
>> 
>>>> Is it a bug or perhaps a well-known feature that I missed?
>>> This is not a bug..
>> 
>> Thank you for explanation,
>> I think it would be useful to document this somewhere. (Documnetation? man slabinfo ?)
>> Also I would like to know is it some (fast) way to get real numbers in userspace ?
>> crash is too fat for this task.
>> Do you know perhaps some other userspace utility or may be systemtap/drgn script?
> 
> Btw, implementing fast slab counters independent from the sl*b implementation and the physical layout of data might be an interesting idea.

Interesting idea, but merging will be an issue if we ever manage to
officially allow kfree() on objects allocated by kmem_cache_alloc() - which
is currently blocked by SLOB (there was a recent thread that stalled).

> Currently /proc/slabinfo is often confusing because of the slab merging. It’s particularly true when someone tries to compare memory usage on two different kernel versions, for example: the set of slab caches might look very different depending on subtle changes in object sizes and the caches merging outcome.
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-22 23:08       ` Vlastimil Babka
@ 2022-02-23  0:07         ` Roman Gushchin
  2022-02-23  0:32           ` Vlastimil Babka
  0 siblings, 1 reply; 24+ messages in thread
From: Roman Gushchin @ 2022-02-23  0:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Vasily Averin, Christoph Lameter, David Rientjes, Joonsoo Kim,
	Pekka Enberg, Hyeonggon Yoo, Linux MM, Andrew Morton, kernel


> On Feb 22, 2022, at 3:08 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> 
> On 2/22/22 21:59, Roman Gushchin wrote:
>> 
>>> On Feb 22, 2022, at 4:10 AM, Vasily Averin <vvs@virtuozzo.com> wrote:
> 
> BTW please To/Cc directly all slab maintainers on future slab related
> threads (added now).
> 
>>> On 22.02.2022 13:23, Hyeonggon Yoo wrote:
>>>>>> On Tue, Feb 22, 2022 at 12:22:02PM +0300, Vasily Averin wrote:
>>>>>> Dear all,
>>>>>> 
>>>>>> I've found that /proc/slabinfo shows inadequate numbers of in-use slab objects.
>>>>>> it assumes that all objects stored in cpu caches are always 100% in use.
>>> 
>>>>> Is it a bug or perhaps a well-known feature that I missed?
>>>> This is not a bug..
>>> 
>>> Thank you for explanation,
>>> I think it would be useful to document this somewhere. (Documnetation? man slabinfo ?)
>>> Also I would like to know is it some (fast) way to get real numbers in userspace ?
>>> crash is too fat for this task.
>>> Do you know perhaps some other userspace utility or may be systemtap/drgn script?
>> 
>> Btw, implementing fast slab counters independent from the sl*b implementation and the physical layout of data might be an interesting idea.
> 
> Interesting idea, but merging will be an issue if we ever manage to
> officially allow kfree() on object allocated by kmem_cache_alloc() - which
> is now blocked by SLOB (there was a recent thread that stalled).

Well, we can store an id somewhere (like right behind the object). Depending on the object size and padding it might not even take much extra space. It's maybe not a feature everybody needs (so it could be turned off by default), but something that can be really useful in some cases.
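
For illustration, a minimal sketch of that idea (all names here are invented,
this is not an existing kernel API):

/*
 * Hypothetical layout: a small footer stored right behind each object
 * records which logical (pre-merge) cache it belongs to, so frees and
 * counters can be attributed per logical cache even when the physical
 * caches are merged.
 */
struct obj_footer {
	unsigned int cache_id;		/* id of the logical cache */
};

/* object_size is the usable object size, as kept in struct kmem_cache */
static inline struct obj_footer *obj_to_footer(void *obj,
					       unsigned int object_size)
{
	return (struct obj_footer *)((char *)obj + object_size);
}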

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-23  0:07         ` Roman Gushchin
@ 2022-02-23  0:32           ` Vlastimil Babka
  2022-02-23  3:45             ` Hyeonggon Yoo
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2022-02-23  0:32 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vasily Averin, Christoph Lameter, David Rientjes, Joonsoo Kim,
	Pekka Enberg, Hyeonggon Yoo, Linux MM, Andrew Morton, kernel

On 2/23/22 01:07, Roman Gushchin wrote:
> 
>> On Feb 22, 2022, at 3:08 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>> 
>> On 2/22/22 21:59, Roman Gushchin wrote:
>>> 
>>>> On Feb 22, 2022, at 4:10 AM, Vasily Averin <vvs@virtuozzo.com> wrote:
>> 
>> BTW please To/Cc directly all slab maintainers on future slab related
>> threads (added now).
>> 
>>>> On 22.02.2022 13:23, Hyeonggon Yoo wrote:
>>>>>>> On Tue, Feb 22, 2022 at 12:22:02PM +0300, Vasily Averin wrote:
>>>>>>> Dear all,
>>>>>>> 
>>>>>>> I've found that /proc/slabinfo shows inadequate numbers of in-use slab objects.
>>>>>>> it assumes that all objects stored in cpu caches are always 100% in use.
>>>> 
>>>>>> Is it a bug or perhaps a well-known feature that I missed?
>>>>> This is not a bug..
>>>> 
>>>> Thank you for explanation,
>>>> I think it would be useful to document this somewhere. (Documnetation? man slabinfo ?)
>>>> Also I would like to know is it some (fast) way to get real numbers in userspace ?
>>>> crash is too fat for this task.
>>>> Do you know perhaps some other userspace utility or may be systemtap/drgn script?
>>> 
>>> Btw, implementing fast slab counters independent from the sl*b implementation and the physical layout of data might be an interesting idea.
>> 
>> Interesting idea, but merging will be an issue if we ever manage to
>> officially allow kfree() on object allocated by kmem_cache_alloc() - which
>> is now blocked by SLOB (there was a recent thread that stalled).
> 
> Well, we can store an id somewhere (like right behind the object).
> Depending on the object size and padding it might even take not so much
> extra space. Maybe not the feature everybody needs (so it can be turned
> off by default), but something that can be really useful in some cases.
Hm, it would be easier to just disable merging when the precise counters are
enabled. I assume it would be a config option (possibly a boot-time option with
static keys) anyway, so those who don't need them can avoid the overhead.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-22 16:47     ` Roman Gushchin
@ 2022-02-23  1:07       ` Vasily Averin
  0 siblings, 0 replies; 24+ messages in thread
From: Vasily Averin @ 2022-02-23  1:07 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Hyeonggon Yoo, Vlastimil Babka, Linux MM, Andrew Morton, kernel

On 22.02.2022 19:47, Roman Gushchin wrote:
>> On Feb 22, 2022, at 4:10 AM, Vasily Averin <vvs@virtuozzo.com> wrote:
>> I'm preparing new set of memcg accounting patches, with reparired tools/cgroup/memcg_slapinfo.py
>> I can get numbers of accounted resources, but I need to understand how may resources was NOT
>> accounted to memcg but allocated on host. I expected get these numbers from host's slabinfo but
>> it does not show correct numbers.
> 
> I’m really curious what these patches are. Are you looking to enable accounting for more slab caches?

I think I can announce it right now:

- The terminal accounting patch was lost in a previous iteration.
- nft has replaced iptables but still lacks accounting.
- In OpenVZ we have a per-container limit on network interfaces, but upstream lacks one. As a result,
  you can create many network interfaces, allocate a lot of non-memcg-accounted memory, and easily
  trigger an OOM from a memcg-limited container. When a network device is created, various objects are
  allocated: queues, sysctl tables, kernfs_node, hash tables with dynamically resizable size via
  hashtable_init(), and some others. I expect accounting for some of them can be approved quickly, but
  others may meet resistance. Moreover, I have only tested veth devices; others may consume other
  specific resources.
  In any case, I'm going to pay attention to this problem and find an acceptable solution.

Thank you,
	Vasily Averin


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-23  0:32           ` Vlastimil Babka
@ 2022-02-23  3:45             ` Hyeonggon Yoo
  2022-02-23 17:31               ` Vlastimil Babka
  0 siblings, 1 reply; 24+ messages in thread
From: Hyeonggon Yoo @ 2022-02-23  3:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Roman Gushchin, Vasily Averin, Christoph Lameter, David Rientjes,
	Joonsoo Kim, Pekka Enberg, Linux MM, Andrew Morton, kernel

On Wed, Feb 23, 2022 at 01:32:36AM +0100, Vlastimil Babka wrote:
> On 2/23/22 01:07, Roman Gushchin wrote:
> > 
> >> On Feb 22, 2022, at 3:08 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> >> 
> >> On 2/22/22 21:59, Roman Gushchin wrote:
> >>> 
> >>>> On Feb 22, 2022, at 4:10 AM, Vasily Averin <vvs@virtuozzo.com> wrote:
> >> 
> >> BTW please To/Cc directly all slab maintainers on future slab related
> >> threads (added now).
> >> 
> >>>> On 22.02.2022 13:23, Hyeonggon Yoo wrote:
> >>>>>>> On Tue, Feb 22, 2022 at 12:22:02PM +0300, Vasily Averin wrote:
> >>>>>>> Dear all,
> >>>>>>> 
> >>>>>>> I've found that /proc/slabinfo shows inadequate numbers of in-use slab objects.
> >>>>>>> it assumes that all objects stored in cpu caches are always 100% in use.
> >>>> 
> >>>>>> Is it a bug or perhaps a well-known feature that I missed?
> >>>>> This is not a bug..
> >>>> 
> >>>> Thank you for explanation,
> >>>> I think it would be useful to document this somewhere. (Documnetation? man slabinfo ?)
> >>>> Also I would like to know is it some (fast) way to get real numbers in userspace ?
> >>>> crash is too fat for this task.
> >>>> Do you know perhaps some other userspace utility or may be systemtap/drgn script?
> >>> 
> >>> Btw, implementing fast slab counters independent from the sl*b implementation and the physical layout of data might be an interesting idea.
> >> 
> >> Interesting idea, but merging will be an issue if we ever manage to
> >> officially allow kfree() on object allocated by kmem_cache_alloc() - which
> >> is now blocked by SLOB (there was a recent thread that stalled).
> > 
> > Well, we can store an id somewhere (like right behind the object).
> > Depending on the object size and padding it might even take not so much
> > extra space. Maybe not the feature everybody needs (so it can be turned
> > off by default), but something that can be really useful in some cases.

> Hm it would be easier just to disable merging when the precise counters are
> enabled. Assume it would be a config option (possibly boot-time option with
> static keys) anyway so those who don't need them can avoid the overhead.

Is it possible to accurately account objects in SLUB? I think it's not
easy, because a CPU can free objects to a remote cpu's partial slabs using
cmpxchg_double()...

Or would you count them by iterating over all cpu partial slabs and node
partial slabs? (Something like a drgn script, or a new procfs file?)

-- 
Hyeonggon


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-23  3:45             ` Hyeonggon Yoo
@ 2022-02-23 17:31               ` Vlastimil Babka
  2022-02-23 18:15                 ` Roman Gushchin
  2022-02-24 13:16                 ` Vasily Averin
  0 siblings, 2 replies; 24+ messages in thread
From: Vlastimil Babka @ 2022-02-23 17:31 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Roman Gushchin, Vasily Averin, Christoph Lameter, David Rientjes,
	Joonsoo Kim, Pekka Enberg, Linux MM, Andrew Morton, kernel

On 2/23/22 04:45, Hyeonggon Yoo wrote:
> On Wed, Feb 23, 2022 at 01:32:36AM +0100, Vlastimil Babka wrote:
>> Hm it would be easier just to disable merging when the precise counters are
>> enabled. Assume it would be a config option (possibly boot-time option with
>> static keys) anyway so those who don't need them can avoid the overhead.
> 
> Is it possible to accurately account objects in SLUB? I think it's not
> easy because a CPU can free objects to remote cpu's partial slabs using
> cmpxchg_double()...

AFAIU Roman's idea is that each alloc/free would simply inc/dec an
object counter that's disconnected from the physical handling in the particular
sl*b implementation. It would provide an exact count of objects from the
perspective of slab users.
I assume that, for reduced overhead, the counters would be implemented in a
percpu fashion, like e.g. the vmstats. Gathering slabinfo would then have to
sum up those percpu counters.
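
A minimal sketch of what such disconnected counters could look like (all names
here are invented for illustration; this is just the shape of the idea, not a
proposed patch):

/*
 * Hypothetical per-cache object counter, independent of the sl*b
 * internals: incremented on alloc, decremented on free, summed up when
 * slabinfo is read.
 */
struct cache_obj_counter {
	long __percpu *nr;	/* alloc_percpu(long) at cache creation */
};

static inline void cache_obj_alloc_note(struct cache_obj_counter *c)
{
	this_cpu_inc(*c->nr);
}

static inline void cache_obj_free_note(struct cache_obj_counter *c)
{
	this_cpu_dec(*c->nr);
}

static long cache_obj_count(struct cache_obj_counter *c)
{
	long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(c->nr, cpu);

	/* individual per-cpu parts can be transiently negative */
	return sum > 0 ? sum : 0;
}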

> Or would you count them by iterating all of cpu partial slabs and node
> partial slabs? (Something like drgn script or new procfs file?)
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-23 17:31               ` Vlastimil Babka
@ 2022-02-23 18:15                 ` Roman Gushchin
  2022-02-24 13:16                 ` Vasily Averin
  1 sibling, 0 replies; 24+ messages in thread
From: Roman Gushchin @ 2022-02-23 18:15 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hyeonggon Yoo, Vasily Averin, Christoph Lameter, David Rientjes,
	Joonsoo Kim, Pekka Enberg, Linux MM, Andrew Morton, kernel


> On Feb 23, 2022, at 9:31 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
> 
> On 2/23/22 04:45, Hyeonggon Yoo wrote:
>>> On Wed, Feb 23, 2022 at 01:32:36AM +0100, Vlastimil Babka wrote:
>>> Hm it would be easier just to disable merging when the precise counters are
>>> enabled. Assume it would be a config option (possibly boot-time option with
>>> static keys) anyway so those who don't need them can avoid the overhead.
>> 
>> Is it possible to accurately account objects in SLUB? I think it's not
>> easy because a CPU can free objects to remote cpu's partial slabs using
>> cmpxchg_double()...
> 
> AFAIU Roman's idea would be that each alloc/free would simply inc/dec an
> object counter that's disconnected from physical handling of particular sl*b
> implementation. It would provide exact count of objects from the perspective
> of slab users.
> I assume for reduced overhead the counters would be implemented in a percpu
> fashion as e.g. vmstats. Slabinfo gathering would thus have to e.g. sum up
> those percpu counters.

Right, something like this.

A bigger picture here: since we moved to per-object accounting, accounted allocations became a bit slower. It seems to be a non-issue in real life (at least I haven't seen any reports, maybe because it's counter-weighted by memory savings and lower fragmentation), however benchmarks clearly show it and it's something that would be nice to fix at some point. The obvious solution is to maintain some sort of per-memcg cache of pre-charged objects for particularly hot slab caches. But to do this we need to understand dynamically which slab caches are hot (and which cgroups are using them). This will require a per-memcg per-slab accounting mechanism, so maybe it will replace the drgn script in the long run.

Thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-23 17:31               ` Vlastimil Babka
  2022-02-23 18:15                 ` Roman Gushchin
@ 2022-02-24 13:16                 ` Vasily Averin
  2022-02-25  0:08                   ` Roman Gushchin
  2022-03-03  8:39                   ` Christoph Lameter
  1 sibling, 2 replies; 24+ messages in thread
From: Vasily Averin @ 2022-02-24 13:16 UTC (permalink / raw)
  To: Vlastimil Babka, Hyeonggon Yoo
  Cc: Roman Gushchin, Christoph Lameter, David Rientjes, Joonsoo Kim,
	Pekka Enberg, Linux MM, Andrew Morton, kernel

On 22.02.2022 19:32, Shakeel Butt wrote:
> If you are just interested in the stats, you can use SLAB for your experiments.

Unfortunately memcg_slabinfo.py does not support SLAB right now.

On 23.02.2022 20:31, Vlastimil Babka wrote:
> On 2/23/22 04:45, Hyeonggon Yoo wrote:
>> On Wed, Feb 23, 2022 at 01:32:36AM +0100, Vlastimil Babka wrote:
>>> Hm it would be easier just to disable merging when the precise counters are
>>> enabled. Assume it would be a config option (possibly boot-time option with
>>> static keys) anyway so those who don't need them can avoid the overhead.
>>
>> Is it possible to accurately account objects in SLUB? I think it's not
>> easy because a CPU can free objects to remote cpu's partial slabs using
>> cmpxchg_double()...
> 
> AFAIU Roman's idea would be that each alloc/free would simply inc/dec an
> object counter that's disconnected from physical handling of particular sl*b
> implementation. It would provide exact count of objects from the perspective
> of slab users.
> I assume for reduced overhead the counters would be implemented in a percpu
> fashion as e.g. vmstats. Slabinfo gathering would thus have to e.g. sum up
> those percpu counters.

I like this idea too, and I'm going to spend some time on its implementation.


Thank you,
	Vasily Averin


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-24 13:16                 ` Vasily Averin
@ 2022-02-25  0:08                   ` Roman Gushchin
  2022-02-25  4:37                     ` Vasily Averin
  2022-03-03  8:39                   ` Christoph Lameter
  1 sibling, 1 reply; 24+ messages in thread
From: Roman Gushchin @ 2022-02-25  0:08 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Vlastimil Babka, Hyeonggon Yoo, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Pekka Enberg, Linux MM,
	Andrew Morton, kernel


> On Feb 24, 2022, at 5:17 AM, Vasily Averin <vvs@virtuozzo.com> wrote:
> 
> On 22.02.2022 19:32, Shakeel Butt wrote:
>> If you are just interested in the stats, you can use SLAB for your experiments.
> 
> Unfortunately memcg_slabino.py does not support SLAB right now.
> 
>> On 23.02.2022 20:31, Vlastimil Babka wrote:
>>> On 2/23/22 04:45, Hyeonggon Yoo wrote:
>>> On Wed, Feb 23, 2022 at 01:32:36AM +0100, Vlastimil Babka wrote:
>>>> Hm it would be easier just to disable merging when the precise counters are
>>>> enabled. Assume it would be a config option (possibly boot-time option with
>>>> static keys) anyway so those who don't need them can avoid the overhead.
>>> 
>>> Is it possible to accurately account objects in SLUB? I think it's not
>>> easy because a CPU can free objects to remote cpu's partial slabs using
>>> cmpxchg_double()...
>> AFAIU Roman's idea would be that each alloc/free would simply inc/dec an
>> object counter that's disconnected from physical handling of particular sl*b
>> implementation. It would provide exact count of objects from the perspective
>> of slab users.
>> I assume for reduced overhead the counters would be implemented in a percpu
>> fashion as e.g. vmstats. Slabinfo gathering would thus have to e.g. sum up
>> those percpu counters.
> 
> I like this idea too and I'm going to spend some time for its implementation.

Sounds good!

Unfortunately it's quite tricky: the problem is that there is potentially a large and dynamic set of cgroups and also a large and dynamic set of slab caches. Given the performance considerations, it's also unlikely that we can avoid using percpu variables.
So we end up with on the order of (nr_slab_caches * nr_cgroups * nr_cpus) counters. If we create them proactively, we're likely wasting a lot of memory. Creating them on demand is tricky too (especially without losing some accounting accuracy).

Thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-25  0:08                   ` Roman Gushchin
@ 2022-02-25  4:37                     ` Vasily Averin
  2022-02-28  6:17                       ` Vasily Averin
  0 siblings, 1 reply; 24+ messages in thread
From: Vasily Averin @ 2022-02-25  4:37 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vlastimil Babka, Hyeonggon Yoo, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Pekka Enberg, Linux MM,
	Andrew Morton, kernel

On 25.02.2022 03:08, Roman Gushchin wrote:
> 
>> On Feb 24, 2022, at 5:17 AM, Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> On 22.02.2022 19:32, Shakeel Butt wrote:
>>> If you are just interested in the stats, you can use SLAB for your experiments.
>>
>> Unfortunately memcg_slabino.py does not support SLAB right now.
>>
>>> On 23.02.2022 20:31, Vlastimil Babka wrote:
>>>> On 2/23/22 04:45, Hyeonggon Yoo wrote:
>>>> On Wed, Feb 23, 2022 at 01:32:36AM +0100, Vlastimil Babka wrote:
>>>>> Hm it would be easier just to disable merging when the precise counters are
>>>>> enabled. Assume it would be a config option (possibly boot-time option with
>>>>> static keys) anyway so those who don't need them can avoid the overhead.
>>>>
>>>> Is it possible to accurately account objects in SLUB? I think it's not
>>>> easy because a CPU can free objects to remote cpu's partial slabs using
>>>> cmpxchg_double()...
>>> AFAIU Roman's idea would be that each alloc/free would simply inc/dec an
>>> object counter that's disconnected from physical handling of particular sl*b
>>> implementation. It would provide exact count of objects from the perspective
>>> of slab users.
>>> I assume for reduced overhead the counters would be implemented in a percpu
>>> fashion as e.g. vmstats. Slabinfo gathering would thus have to e.g. sum up
>>> those percpu counters.
>>
>> I like this idea too and I'm going to spend some time for its implementation.
> 
> Sounds good!
> 
> Unfortunately it’s quite tricky: the problem is that there is potentially a large and dynamic set of cgroups and also large and dynamic set of slab caches. Given the performance considerations, it’s also unlikely to avoid using percpu variables.
> So we come to the (nr_slab_caches * nr_cgroups * nr_cpus) number of “objects”. If we create them proactively, we’re likely wasting lot of memory. Creating them on demand is tricky too (especially without losing some accounting accuracy).

I was talking about global (i.e. non-memcg) precise slab counters only.
I expect they can be implemented under a new config option and/or static key, and if enabled, used for the /proc/slabinfo output.

For now I'm still going to extract the memcg counters via your memcg_slabinfo script.

Thank you,
	Vasily Averin



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-25  4:37                     ` Vasily Averin
@ 2022-02-28  6:17                       ` Vasily Averin
  2022-02-28 10:22                         ` Hyeonggon Yoo
                                           ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Vasily Averin @ 2022-02-28  6:17 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vlastimil Babka, Hyeonggon Yoo, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Pekka Enberg, Linux MM,
	Andrew Morton, kernel

On 25.02.2022 07:37, Vasily Averin wrote:
> On 25.02.2022 03:08, Roman Gushchin wrote:
>>
>>> On Feb 24, 2022, at 5:17 AM, Vasily Averin <vvs@virtuozzo.com> wrote:
>>>
>>> On 22.02.2022 19:32, Shakeel Butt wrote:
>>>> If you are just interested in the stats, you can use SLAB for your experiments.
>>>
>>> Unfortunately memcg_slabino.py does not support SLAB right now.
>>>
>>>> On 23.02.2022 20:31, Vlastimil Babka wrote:
>>>>> On 2/23/22 04:45, Hyeonggon Yoo wrote:
>>>>> On Wed, Feb 23, 2022 at 01:32:36AM +0100, Vlastimil Babka wrote:
>>>>>> Hm it would be easier just to disable merging when the precise counters are
>>>>>> enabled. Assume it would be a config option (possibly boot-time option with
>>>>>> static keys) anyway so those who don't need them can avoid the overhead.
>>>>>
>>>>> Is it possible to accurately account objects in SLUB? I think it's not
>>>>> easy because a CPU can free objects to remote cpu's partial slabs using
>>>>> cmpxchg_double()...
>>>> AFAIU Roman's idea would be that each alloc/free would simply inc/dec an
>>>> object counter that's disconnected from physical handling of particular sl*b
>>>> implementation. It would provide exact count of objects from the perspective
>>>> of slab users.
>>>> I assume for reduced overhead the counters would be implemented in a percpu
>>>> fashion as e.g. vmstats. Slabinfo gathering would thus have to e.g. sum up
>>>> those percpu counters.
>>>
>>> I like this idea too and I'm going to spend some time for its implementation.
>>
>> Sounds good!
>>
>> Unfortunately it’s quite tricky: the problem is that there is potentially a large and dynamic set of cgroups and also large and dynamic set of slab caches. Given the performance considerations, it’s also unlikely to avoid using percpu variables.
>> So we come to the (nr_slab_caches * nr_cgroups * nr_cpus) number of “objects”. If we create them proactively, we’re likely wasting lot of memory. Creating them on demand is tricky too (especially without losing some accounting accuracy).
> 
> I told about global (i.e. non-memcg) precise slab counters only.
> I'm expect it can done under new config option and/or static key, and if present use them in /proc/slabinfo output.
> 
> At present I'm still going to extract memcg counters via your memcg_slabinfo script.

I'm not sure I'll be able to debug this patch properly, so I decided to submit it as is.
I hope it can be useful.

In general it works and /proc/slabinfo shows reasonable numbers;
however, in some cases they differ from crash's "kmem -s" output by either +1 or -1.
Obviously I missed something.

---[cut here]---
[PATCH RFC] slub: precise in-use counter for /proc/slabinfo output

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
  include/linux/slub_def.h |  3 +++
  init/Kconfig             |  7 +++++++
  mm/slub.c                | 20 +++++++++++++++++++-
  3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 33c5c0e3bd8d..d22e18dfe905 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -56,6 +56,9 @@ struct kmem_cache_cpu {
  #ifdef CONFIG_SLUB_STATS
  	unsigned stat[NR_SLUB_STAT_ITEMS];
  #endif
+#ifdef CONFIG_SLUB_PRECISE_INUSE
+	unsigned inuse;		/* Precise in-use counter */
+#endif
  };
  
  #ifdef CONFIG_SLUB_CPU_PARTIAL
diff --git a/init/Kconfig b/init/Kconfig
index e9119bf54b1f..5c57bdbb8938 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1995,6 +1995,13 @@ config SLUB_CPU_PARTIAL
  	  which requires the taking of locks that may cause latency spikes.
  	  Typically one would choose no for a realtime system.
  
+config SLUB_PRECISE_INUSE
+	default n
+	depends on SLUB && SMP
+	bool "SLUB precise in-use counter"
+	help
+	  Per cpu in-use counter shows precise statistic in slabinfo.
+
  config MMAP_ALLOW_UNINITIALIZED
  	bool "Allow mmapped anonymous memory to be uninitialized"
  	depends on EXPERT && !MMU
diff --git a/mm/slub.c b/mm/slub.c
index 261474092e43..90750cae0af9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3228,6 +3228,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
  
  out:
  	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
+#ifdef CONFIG_SLUB_PRECISE_INUSE
+	raw_cpu_inc(s->cpu_slab->inuse);
+#endif
  
  	return object;
  }
@@ -3506,8 +3509,12 @@ static __always_inline void slab_free(struct kmem_cache *s, struct slab *slab,
  	 * With KASAN enabled slab_free_freelist_hook modifies the freelist
  	 * to remove objects, whose reuse must be delayed.
  	 */
-	if (slab_free_freelist_hook(s, &head, &tail, &cnt))
+	if (slab_free_freelist_hook(s, &head, &tail, &cnt)) {
  		do_slab_free(s, slab, head, tail, cnt, addr);
+#ifdef CONFIG_SLUB_PRECISE_INUSE
+		raw_cpu_sub(s->cpu_slab->inuse, cnt);
+#endif
+	}
  }
  
  #ifdef CONFIG_KASAN_GENERIC
@@ -6253,6 +6260,17 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
  		nr_free += count_partial(n, count_free);
  	}
  
+#ifdef CONFIG_SLUB_PRECISE_INUSE
+	{
+		unsigned int cpu, nr_inuse = 0;
+
+		for_each_possible_cpu(cpu)
+			nr_inuse += per_cpu_ptr((s)->cpu_slab, cpu)->inuse;
+
+		if (nr_inuse <= nr_objs)
+			nr_free = nr_objs - nr_inuse;
+	}
+#endif
  	sinfo->active_objs = nr_objs - nr_free;
  	sinfo->num_objs = nr_objs;
  	sinfo->active_slabs = nr_slabs;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: slabinfo shows incorrect active_objs ???
  2022-02-28  6:17                       ` Vasily Averin
@ 2022-02-28 10:22                         ` Hyeonggon Yoo
  2022-02-28 10:28                           ` Hyeonggon Yoo
  2022-02-28 10:43                         ` Hyeonggon Yoo
  2022-02-28 12:09                         ` Hyeonggon Yoo
  2 siblings, 1 reply; 24+ messages in thread
From: Hyeonggon Yoo @ 2022-02-28 10:22 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Pekka Enberg, Linux MM,
	Andrew Morton, kernel

On Mon, Feb 28, 2022 at 09:17:27AM +0300, Vasily Averin wrote:
> On 25.02.2022 07:37, Vasily Averin wrote:
> > On 25.02.2022 03:08, Roman Gushchin wrote:
> > > 
> > > > On Feb 24, 2022, at 5:17 AM, Vasily Averin <vvs@virtuozzo.com> wrote:
> > > > 
> > > > On 22.02.2022 19:32, Shakeel Butt wrote:
> > > > > If you are just interested in the stats, you can use SLAB for your experiments.
> > > > 
> > > > Unfortunately memcg_slabino.py does not support SLAB right now.
> > > > 
> > > > > On 23.02.2022 20:31, Vlastimil Babka wrote:
> > > > > > On 2/23/22 04:45, Hyeonggon Yoo wrote:
> > > > > > On Wed, Feb 23, 2022 at 01:32:36AM +0100, Vlastimil Babka wrote:
> > > > > > > Hm it would be easier just to disable merging when the precise counters are
> > > > > > > enabled. Assume it would be a config option (possibly boot-time option with
> > > > > > > static keys) anyway so those who don't need them can avoid the overhead.
> > > > > > 
> > > > > > Is it possible to accurately account objects in SLUB? I think it's not
> > > > > > easy because a CPU can free objects to remote cpu's partial slabs using
> > > > > > cmpxchg_double()...
> > > > > AFAIU Roman's idea would be that each alloc/free would simply inc/dec an
> > > > > object counter that's disconnected from physical handling of particular sl*b
> > > > > implementation. It would provide exact count of objects from the perspective
> > > > > of slab users.
> > > > > I assume for reduced overhead the counters would be implemented in a percpu
> > > > > fashion as e.g. vmstats. Slabinfo gathering would thus have to e.g. sum up
> > > > > those percpu counters.
> > > > 
> > > > I like this idea too and I'm going to spend some time for its implementation.
> > > 
> > > Sounds good!
> > > 
> > > Unfortunately it’s quite tricky: the problem is that there is potentially a large and dynamic set of cgroups and also large and dynamic set of slab caches. Given the performance considerations, it’s also unlikely to avoid using percpu variables.
> > > So we come to the (nr_slab_caches * nr_cgroups * nr_cpus) number of “objects”. If we create them proactively, we’re likely wasting lot of memory. Creating them on demand is tricky too (especially without losing some accounting accuracy).
> > 
> > I told about global (i.e. non-memcg) precise slab counters only.
> > I'm expect it can done under new config option and/or static key, and if present use them in /proc/slabinfo output.
> > 
> > At present I'm still going to extract memcg counters via your memcg_slabinfo script.
> 
> I'm not sure I'll be able to debug this patch properly and decided to submit it as is.
> I hope it can be useful.
> 
> In general it works and /proc/slabinfo shows reasonable numbers,
> however in some cases they differs from crash' "kmem -s" output, either +1 or -1.
> Obviously I missed something.
> 
> ---[cut here]---
> [PATCH RFC] slub: precise in-use counter for /proc/slabinfo output
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  include/linux/slub_def.h |  3 +++
>  init/Kconfig             |  7 +++++++
>  mm/slub.c                | 20 +++++++++++++++++++-
>  3 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index 33c5c0e3bd8d..d22e18dfe905 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -56,6 +56,9 @@ struct kmem_cache_cpu {
>  #ifdef CONFIG_SLUB_STATS
>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif
> +#ifdef CONFIG_SLUB_PRECISE_INUSE
> +	unsigned inuse;		/* Precise in-use counter */
> +#endif
>  };
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
> diff --git a/init/Kconfig b/init/Kconfig
> index e9119bf54b1f..5c57bdbb8938 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1995,6 +1995,13 @@ config SLUB_CPU_PARTIAL
>  	  which requires the taking of locks that may cause latency spikes.
>  	  Typically one would choose no for a realtime system.
> +config SLUB_PRECISE_INUSE
> +	default n
> +	depends on SLUB && SMP
> +	bool "SLUB precise in-use counter"
> +	help
> +	  Per cpu in-use counter shows precise statistic in slabinfo.
> +
>  config MMAP_ALLOW_UNINITIALIZED
>  	bool "Allow mmapped anonymous memory to be uninitialized"
>  	depends on EXPERT && !MMU
> diff --git a/mm/slub.c b/mm/slub.c
> index 261474092e43..90750cae0af9 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3228,6 +3228,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
>  out:
>  	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
> +#ifdef CONFIG_SLUB_PRECISE_INUSE
> +	raw_cpu_inc(s->cpu_slab->inuse);
> +#endif
>  	return object;
>  }
> @@ -3506,8 +3509,12 @@ static __always_inline void slab_free(struct kmem_cache *s, struct slab *slab,
>  	 * With KASAN enabled slab_free_freelist_hook modifies the freelist
>  	 * to remove objects, whose reuse must be delayed.
>  	 */
> -	if (slab_free_freelist_hook(s, &head, &tail, &cnt))
> +	if (slab_free_freelist_hook(s, &head, &tail, &cnt)) {
>  		do_slab_free(s, slab, head, tail, cnt, addr);
> +#ifdef CONFIG_SLUB_PRECISE_INUSE
> +		raw_cpu_sub(s->cpu_slab->inuse, cnt);
> +#endif
> +	}
>  }
>  #ifdef CONFIG_KASAN_GENERIC
> @@ -6253,6 +6260,17 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
>  		nr_free += count_partial(n, count_free);
>  	}
> +#ifdef CONFIG_SLUB_PRECISE_INUSE
> +	{
> +		unsigned int cpu, nr_inuse = 0;
> +
> +		for_each_possible_cpu(cpu)
> +			nr_inuse += per_cpu_ptr((s)->cpu_slab, cpu)->inuse;
> +
> +		if (nr_inuse <= nr_objs)
> +			nr_free = nr_objs - nr_inuse;
> +	}
> +#endif
>  	sinfo->active_objs = nr_objs - nr_free;
>  	sinfo->num_objs = nr_objs;
>  	sinfo->active_slabs = nr_slabs;

Hi Vasily, thank you for this patch.
This looks nice, but I see a few things we can improve:

1) Using raw_cpu_{inc,sub}(), s->cpu_slab->inuse will be racy if the kernel
can be preempted. SLUB does not disable preemption/interrupts at all in the fastpath.

And yeah, we can accept being racy to some degree, but the error will keep
growing the longer the system stays up. So I think an atomic integer is the
right choice if correctness is important?
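
For example, something like this untested sketch would keep your per-cpu
layout but make each individual update safe against preemption and
interrupts, without an explicit preempt_disable():

	/* alloc side, instead of raw_cpu_inc() */
	this_cpu_inc(s->cpu_slab->inuse);

	/* free side, instead of raw_cpu_sub() */
	this_cpu_sub(s->cpu_slab->inuse, cnt);

raw_cpu_*() gives no guarantee against preemption or interrupts (on
architectures without dedicated per-cpu instructions it is a plain
load/modify/store), so concurrent updates on the same cpu can be lost.
this_cpu_*() makes every single update atomic with respect to preemption
and interrupts on the local cpu.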

2) This code is not aware of cpu partials. There is a list of slabs for
each kmem_cache_cpu; you can iterate it via:

	kmem_cache_cpu->partial->next->next->next->... and so on until it hits NULL.

So we need to count the cpu partials' in-use objects too.
Then we need per-slab counters... I think we can use struct slab's
__unused field for this?
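
Roughly something like this untested sketch is what I have in mind (field
names assume the current struct kmem_cache_cpu / struct slab layout, and
the list walk races with put_cpu_partial() and unfreeze_partials(), so it
can only ever be an approximation):

#ifdef CONFIG_SLUB_CPU_PARTIAL
static unsigned long count_cpu_partial_inuse(struct kmem_cache *s)
{
	unsigned long inuse = 0;
	int cpu;

	for_each_online_cpu(cpu) {
		struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
		struct slab *slab;

		/* the per-cpu partial list is a NULL-terminated ->next chain */
		for (slab = READ_ONCE(c->partial); slab; slab = slab->next)
			inuse += slab->inuse;
	}
	return inuse;
}
#endif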

Thanks :)

-- 
Thank you, You are awesome!
Hyeonggon :-)



* Re: slabinfo shows incorrect active_objs ???
  2022-02-28 10:22                         ` Hyeonggon Yoo
@ 2022-02-28 10:28                           ` Hyeonggon Yoo
  0 siblings, 0 replies; 24+ messages in thread
From: Hyeonggon Yoo @ 2022-02-28 10:28 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Pekka Enberg, Linux MM,
	Andrew Morton, kernel

On Mon, Feb 28, 2022 at 10:22:42AM +0000, Hyeonggon Yoo wrote:
> On Mon, Feb 28, 2022 at 09:17:27AM +0300, Vasily Averin wrote:
> > On 25.02.2022 07:37, Vasily Averin wrote:
> > > On 25.02.2022 03:08, Roman Gushchin wrote:
> > > > 
> > > > > On Feb 24, 2022, at 5:17 AM, Vasily Averin <vvs@virtuozzo.com> wrote:
> > > > > 
> > > > > On 22.02.2022 19:32, Shakeel Butt wrote:
> > > > > > If you are just interested in the stats, you can use SLAB for your experiments.
> > > > > 
> > > > > Unfortunately memcg_slabino.py does not support SLAB right now.
> > > > > 
> > > > > > On 23.02.2022 20:31, Vlastimil Babka wrote:
> > > > > > > On 2/23/22 04:45, Hyeonggon Yoo wrote:
> > > > > > > On Wed, Feb 23, 2022 at 01:32:36AM +0100, Vlastimil Babka wrote:
> > > > > > > > Hm it would be easier just to disable merging when the precise counters are
> > > > > > > > enabled. Assume it would be a config option (possibly boot-time option with
> > > > > > > > static keys) anyway so those who don't need them can avoid the overhead.
> > > > > > > 
> > > > > > > Is it possible to accurately account objects in SLUB? I think it's not
> > > > > > > easy because a CPU can free objects to remote cpu's partial slabs using
> > > > > > > cmpxchg_double()...
> > > > > > AFAIU Roman's idea would be that each alloc/free would simply inc/dec an
> > > > > > object counter that's disconnected from physical handling of particular sl*b
> > > > > > implementation. It would provide exact count of objects from the perspective
> > > > > > of slab users.
> > > > > > I assume for reduced overhead the counters would be implemented in a percpu
> > > > > > fashion as e.g. vmstats. Slabinfo gathering would thus have to e.g. sum up
> > > > > > those percpu counters.
> > > > > 
> > > > > I like this idea too and I'm going to spend some time for its implementation.
> > > > 
> > > > Sounds good!
> > > > 
> > > > Unfortunately it’s quite tricky: the problem is that there is potentially a large and dynamic set of cgroups and also large and dynamic set of slab caches. Given the performance considerations, it’s also unlikely to avoid using percpu variables.
> > > > So we come to the (nr_slab_caches * nr_cgroups * nr_cpus) number of “objects”. If we create them proactively, we’re likely wasting lot of memory. Creating them on demand is tricky too (especially without losing some accounting accuracy).
> > > 
> > > I told about global (i.e. non-memcg) precise slab counters only.
> > > I'm expect it can done under new config option and/or static key, and if present use them in /proc/slabinfo output.
> > > 
> > > At present I'm still going to extract memcg counters via your memcg_slabinfo script.
> > 
> > I'm not sure I'll be able to debug this patch properly and decided to submit it as is.
> > I hope it can be useful.
> > 
> > In general it works and /proc/slabinfo shows reasonable numbers,
> > however in some cases they differs from crash' "kmem -s" output, either +1 or -1.
> > Obviously I missed something.
> > 
> > ---[cut here]---
> > [PATCH RFC] slub: precise in-use counter for /proc/slabinfo output
> > 
> > Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> > ---
> >  include/linux/slub_def.h |  3 +++
> >  init/Kconfig             |  7 +++++++
> >  mm/slub.c                | 20 +++++++++++++++++++-
> >  3 files changed, 29 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> > index 33c5c0e3bd8d..d22e18dfe905 100644
> > --- a/include/linux/slub_def.h
> > +++ b/include/linux/slub_def.h
> > @@ -56,6 +56,9 @@ struct kmem_cache_cpu {
> >  #ifdef CONFIG_SLUB_STATS
> >  	unsigned stat[NR_SLUB_STAT_ITEMS];
> >  #endif
> > +#ifdef CONFIG_SLUB_PRECISE_INUSE
> > +	unsigned inuse;		/* Precise in-use counter */
> > +#endif
> >  };
> >  #ifdef CONFIG_SLUB_CPU_PARTIAL
> > diff --git a/init/Kconfig b/init/Kconfig
> > index e9119bf54b1f..5c57bdbb8938 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1995,6 +1995,13 @@ config SLUB_CPU_PARTIAL
> >  	  which requires the taking of locks that may cause latency spikes.
> >  	  Typically one would choose no for a realtime system.
> > +config SLUB_PRECISE_INUSE
> > +	default n
> > +	depends on SLUB && SMP
> > +	bool "SLUB precise in-use counter"
> > +	help
> > +	  Per cpu in-use counter shows precise statistic in slabinfo.
> > +
> >  config MMAP_ALLOW_UNINITIALIZED
> >  	bool "Allow mmapped anonymous memory to be uninitialized"
> >  	depends on EXPERT && !MMU
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 261474092e43..90750cae0af9 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -3228,6 +3228,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
> >  out:
> >  	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
> > +#ifdef CONFIG_SLUB_PRECISE_INUSE
> > +	raw_cpu_inc(s->cpu_slab->inuse);
> > +#endif
> >  	return object;
> >  }
> > @@ -3506,8 +3509,12 @@ static __always_inline void slab_free(struct kmem_cache *s, struct slab *slab,
> >  	 * With KASAN enabled slab_free_freelist_hook modifies the freelist
> >  	 * to remove objects, whose reuse must be delayed.
> >  	 */
> > -	if (slab_free_freelist_hook(s, &head, &tail, &cnt))
> > +	if (slab_free_freelist_hook(s, &head, &tail, &cnt)) {
> >  		do_slab_free(s, slab, head, tail, cnt, addr);
> > +#ifdef CONFIG_SLUB_PRECISE_INUSE
> > +		raw_cpu_sub(s->cpu_slab->inuse, cnt);
> > +#endif
> > +	}
> >  }
> >  #ifdef CONFIG_KASAN_GENERIC
> > @@ -6253,6 +6260,17 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
> >  		nr_free += count_partial(n, count_free);
> >  	}
> > +#ifdef CONFIG_SLUB_PRECISE_INUSE
> > +	{
> > +		unsigned int cpu, nr_inuse = 0;
> > +
> > +		for_each_possible_cpu(cpu)
> > +			nr_inuse += per_cpu_ptr((s)->cpu_slab, cpu)->inuse;
> > +
> > +		if (nr_inuse <= nr_objs)
> > +			nr_free = nr_objs - nr_inuse;
> > +	}
> > +#endif
> >  	sinfo->active_objs = nr_objs - nr_free;
> >  	sinfo->num_objs = nr_objs;
> >  	sinfo->active_slabs = nr_slabs;
> 
> Hi Vasily, thank you for this patch.
> This looks nice, but I see things we can improve:
> 
> 1) using raw_cpu_{inc,sub}(), s->cpu_slab->inuse will be racy if kernel
> can be preempted. slub does not disable preemption/interrupts at all in fastpath.
> 
> And yeah, we can accept being racy to some degree. but it will be incorrect
> more and more if system is up for long time. So I think atomic integer
> is right choice if correctness is important?
> 
> 2) This code is not aware of cpu partials. there is list of slab for
> each kmem_cache_cpu. you can iterate them by: 
> 

And replying to this, I realized again that we need to consider disabling
preemption when freeing to a remote cpu's partial slabs if CONFIG_SLUB_PRECISE_INUSE=y.

Hmm, do we need another approach?
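
For example, one heavier alternative (just a sketch, the nr_inuse field is
hypothetical) would be a single shared counter per cache. It stays correct
under any preemption model and no matter which cpu frees the object, at the
cost of bouncing a cache line on every alloc/free:

/* hypothetical new field in struct kmem_cache */
#ifdef CONFIG_SLUB_PRECISE_INUSE
	atomic_long_t nr_inuse;
#endif

/* slab_alloc_node(), after slab_post_alloc_hook() */
atomic_long_inc(&s->nr_inuse);

/* slab_free(), only when do_slab_free() was actually called */
atomic_long_sub(cnt, &s->nr_inuse);

/* get_slabinfo() */
nr_free = nr_objs - atomic_long_read(&s->nr_inuse);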

> 	kmem_cache_cpu->partial->next->next->next->... and so on until it enters NULL.
> 
> So we need to count cpu partials' inuse too.
> Then we need per-slab counters... I think we can use struct slab's
> __unused field for this?
> 
> Thanks :)
> 
> -- 
> Thank you, You are awesome!
> Hyeonggon :-)

-- 
Thank you, You are awesome!
Hyeonggon :-)



* Re: slabinfo shows incorrect active_objs ???
  2022-02-28  6:17                       ` Vasily Averin
  2022-02-28 10:22                         ` Hyeonggon Yoo
@ 2022-02-28 10:43                         ` Hyeonggon Yoo
  2022-02-28 12:09                         ` Hyeonggon Yoo
  2 siblings, 0 replies; 24+ messages in thread
From: Hyeonggon Yoo @ 2022-02-28 10:43 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Pekka Enberg, Linux MM,
	Andrew Morton, kernel

On Mon, Feb 28, 2022 at 09:17:27AM +0300, Vasily Averin wrote:
> On 25.02.2022 07:37, Vasily Averin wrote:
> > On 25.02.2022 03:08, Roman Gushchin wrote:
> > > 
> > > > On Feb 24, 2022, at 5:17 AM, Vasily Averin <vvs@virtuozzo.com> wrote:
> > > > 
> > > > On 22.02.2022 19:32, Shakeel Butt wrote:
> > > > > If you are just interested in the stats, you can use SLAB for your experiments.
> > > > 
> > > > Unfortunately memcg_slabino.py does not support SLAB right now.
> > > > 
> > > > > On 23.02.2022 20:31, Vlastimil Babka wrote:
> > > > > > On 2/23/22 04:45, Hyeonggon Yoo wrote:
> > > > > > On Wed, Feb 23, 2022 at 01:32:36AM +0100, Vlastimil Babka wrote:
> > > > > > > Hm it would be easier just to disable merging when the precise counters are
> > > > > > > enabled. Assume it would be a config option (possibly boot-time option with
> > > > > > > static keys) anyway so those who don't need them can avoid the overhead.
> > > > > > 
> > > > > > Is it possible to accurately account objects in SLUB? I think it's not
> > > > > > easy because a CPU can free objects to remote cpu's partial slabs using
> > > > > > cmpxchg_double()...
> > > > > AFAIU Roman's idea would be that each alloc/free would simply inc/dec an
> > > > > object counter that's disconnected from physical handling of particular sl*b
> > > > > implementation. It would provide exact count of objects from the perspective
> > > > > of slab users.
> > > > > I assume for reduced overhead the counters would be implemented in a percpu
> > > > > fashion as e.g. vmstats. Slabinfo gathering would thus have to e.g. sum up
> > > > > those percpu counters.
> > > > 
> > > > I like this idea too and I'm going to spend some time for its implementation.
> > > 
> > > Sounds good!
> > > 
> > > Unfortunately it’s quite tricky: the problem is that there is potentially a large and dynamic set of cgroups and also large and dynamic set of slab caches. Given the performance considerations, it’s also unlikely to avoid using percpu variables.
> > > So we come to the (nr_slab_caches * nr_cgroups * nr_cpus) number of “objects”. If we create them proactively, we’re likely wasting lot of memory. Creating them on demand is tricky too (especially without losing some accounting accuracy).
> > 
> > I told about global (i.e. non-memcg) precise slab counters only.
> > I'm expect it can done under new config option and/or static key, and if present use them in /proc/slabinfo output.
> > 
> > At present I'm still going to extract memcg counters via your memcg_slabinfo script.
> 
> I'm not sure I'll be able to debug this patch properly and decided to submit it as is.
> I hope it can be useful.
> 
> In general it works and /proc/slabinfo shows reasonable numbers,
> however in some cases they differs from crash' "kmem -s" output, either +1 or -1.
> Obviously I missed something.
> 
> ---[cut here]---
> [PATCH RFC] slub: precise in-use counter for /proc/slabinfo output
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  include/linux/slub_def.h |  3 +++
>  init/Kconfig             |  7 +++++++
>  mm/slub.c                | 20 +++++++++++++++++++-
>  3 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index 33c5c0e3bd8d..d22e18dfe905 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -56,6 +56,9 @@ struct kmem_cache_cpu {
>  #ifdef CONFIG_SLUB_STATS
>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif
> +#ifdef CONFIG_SLUB_PRECISE_INUSE
> +	unsigned inuse;		/* Precise in-use counter */
> +#endif
>  };
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
> diff --git a/init/Kconfig b/init/Kconfig
> index e9119bf54b1f..5c57bdbb8938 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1995,6 +1995,13 @@ config SLUB_CPU_PARTIAL
>  	  which requires the taking of locks that may cause latency spikes.
>  	  Typically one would choose no for a realtime system.
> +config SLUB_PRECISE_INUSE
> +	default n
> +	depends on SLUB && SMP
> +	bool "SLUB precise in-use counter"
> +	help
> +	  Per cpu in-use counter shows precise statistic in slabinfo.
> +
>  config MMAP_ALLOW_UNINITIALIZED
>  	bool "Allow mmapped anonymous memory to be uninitialized"
>  	depends on EXPERT && !MMU
> diff --git a/mm/slub.c b/mm/slub.c
> index 261474092e43..90750cae0af9 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3228,6 +3228,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
>  out:
>  	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
> +#ifdef CONFIG_SLUB_PRECISE_INUSE
> +	raw_cpu_inc(s->cpu_slab->inuse);
> +#endif

I think this is the wrong place to increase s->cpu_slab->inuse.
I thought s->cpu_slab->inuse was meant to count the in-use objects of the current cpu slab, wasn't it?

If so, you need to be sure that the allocation was actually satisfied from the cpu slab.
Let me know if I'm missing something...

>  	return object;
>  }
> @@ -3506,8 +3509,12 @@ static __always_inline void slab_free(struct kmem_cache *s, struct slab *slab,
>  	 * With KASAN enabled slab_free_freelist_hook modifies the freelist
>  	 * to remove objects, whose reuse must be delayed.
>  	 */
> -	if (slab_free_freelist_hook(s, &head, &tail, &cnt))
> +	if (slab_free_freelist_hook(s, &head, &tail, &cnt)) {
>  		do_slab_free(s, slab, head, tail, cnt, addr);
> +#ifdef CONFIG_SLUB_PRECISE_INUSE
> +		raw_cpu_sub(s->cpu_slab->inuse, cnt);
> +#endif
> +	}

Same here.

>  }
>  #ifdef CONFIG_KASAN_GENERIC
> @@ -6253,6 +6260,17 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
>  		nr_free += count_partial(n, count_free);
>  	}
> +#ifdef CONFIG_SLUB_PRECISE_INUSE
> +	{
> +		unsigned int cpu, nr_inuse = 0;
> +
> +		for_each_possible_cpu(cpu)
> +			nr_inuse += per_cpu_ptr((s)->cpu_slab, cpu)->inuse;
> +
> +		if (nr_inuse <= nr_objs)
> +			nr_free = nr_objs - nr_inuse;
> +	}
> +#endif
>  	sinfo->active_objs = nr_objs - nr_free;
>  	sinfo->num_objs = nr_objs;
>  	sinfo->active_slabs = nr_slabs;
> -- 
> 2.25.1

-- 
Thank you, You are awesome!
Hyeonggon :-)



* Re: slabinfo shows incorrect active_objs ???
  2022-02-28  6:17                       ` Vasily Averin
  2022-02-28 10:22                         ` Hyeonggon Yoo
  2022-02-28 10:43                         ` Hyeonggon Yoo
@ 2022-02-28 12:09                         ` Hyeonggon Yoo
  2 siblings, 0 replies; 24+ messages in thread
From: Hyeonggon Yoo @ 2022-02-28 12:09 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Pekka Enberg, Linux MM,
	Andrew Morton, kernel

On Mon, Feb 28, 2022 at 09:17:27AM +0300, Vasily Averin wrote:
> On 25.02.2022 07:37, Vasily Averin wrote:
> > On 25.02.2022 03:08, Roman Gushchin wrote:
> > > 
> > > > On Feb 24, 2022, at 5:17 AM, Vasily Averin <vvs@virtuozzo.com> wrote:
> > > > 
> > > > On 22.02.2022 19:32, Shakeel Butt wrote:
> > > > > If you are just interested in the stats, you can use SLAB for your experiments.
> > > > 
> > > > Unfortunately memcg_slabino.py does not support SLAB right now.
> > > > 
> > > > > On 23.02.2022 20:31, Vlastimil Babka wrote:
> > > > > > On 2/23/22 04:45, Hyeonggon Yoo wrote:
> > > > > > On Wed, Feb 23, 2022 at 01:32:36AM +0100, Vlastimil Babka wrote:
> > > > > > > Hm it would be easier just to disable merging when the precise counters are
> > > > > > > enabled. Assume it would be a config option (possibly boot-time option with
> > > > > > > static keys) anyway so those who don't need them can avoid the overhead.
> > > > > > 
> > > > > > Is it possible to accurately account objects in SLUB? I think it's not
> > > > > > easy because a CPU can free objects to remote cpu's partial slabs using
> > > > > > cmpxchg_double()...
> > > > > AFAIU Roman's idea would be that each alloc/free would simply inc/dec an
> > > > > object counter that's disconnected from physical handling of particular sl*b
> > > > > implementation. It would provide exact count of objects from the perspective
> > > > > of slab users.
> > > > > I assume for reduced overhead the counters would be implemented in a percpu
> > > > > fashion as e.g. vmstats. Slabinfo gathering would thus have to e.g. sum up
> > > > > those percpu counters.
> > > > 
> > > > I like this idea too and I'm going to spend some time for its implementation.
> > > 
> > > Sounds good!
> > > 
> > > Unfortunately it’s quite tricky: the problem is that there is potentially a large and dynamic set of cgroups and also large and dynamic set of slab caches. Given the performance considerations, it’s also unlikely to avoid using percpu variables.
> > > So we come to the (nr_slab_caches * nr_cgroups * nr_cpus) number of “objects”. If we create them proactively, we’re likely wasting lot of memory. Creating them on demand is tricky too (especially without losing some accounting accuracy).
> > 
> > I told about global (i.e. non-memcg) precise slab counters only.
> > I'm expect it can done under new config option and/or static key, and if present use them in /proc/slabinfo output.
> > 
> > At present I'm still going to extract memcg counters via your memcg_slabinfo script.
> 
> I'm not sure I'll be able to debug this patch properly and decided to submit it as is.
> I hope it can be useful.
> 
> In general it works and /proc/slabinfo shows reasonable numbers,
> however in some cases they differs from crash' "kmem -s" output, either +1 or -1.
> Obviously I missed something.
>

Oh, sorry for the noise. You implemented what Roman suggested.
So s->cpu_slab->inuse is just a per-cpu counter covering every object of the cache,
not only the cpu slab. Please ignore my last feedback.

Anyway, I think the +1 or -1 difference is due to a race?
What was your preemption model?

> ---[cut here]---
> [PATCH RFC] slub: precise in-use counter for /proc/slabinfo output
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  include/linux/slub_def.h |  3 +++
>  init/Kconfig             |  7 +++++++
>  mm/slub.c                | 20 +++++++++++++++++++-
>  3 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index 33c5c0e3bd8d..d22e18dfe905 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -56,6 +56,9 @@ struct kmem_cache_cpu {
>  #ifdef CONFIG_SLUB_STATS
>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif
> +#ifdef CONFIG_SLUB_PRECISE_INUSE
> +	unsigned inuse;		/* Precise in-use counter */
> +#endif
>  };
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
> diff --git a/init/Kconfig b/init/Kconfig
> index e9119bf54b1f..5c57bdbb8938 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1995,6 +1995,13 @@ config SLUB_CPU_PARTIAL
>  	  which requires the taking of locks that may cause latency spikes.
>  	  Typically one would choose no for a realtime system.
> +config SLUB_PRECISE_INUSE
> +	default n
> +	depends on SLUB && SMP
> +	bool "SLUB precise in-use counter"
> +	help
> +	  Per cpu in-use counter shows precise statistic in slabinfo.
> +
>  config MMAP_ALLOW_UNINITIALIZED
>  	bool "Allow mmapped anonymous memory to be uninitialized"
>  	depends on EXPERT && !MMU
> diff --git a/mm/slub.c b/mm/slub.c
> index 261474092e43..90750cae0af9 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3228,6 +3228,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
>  out:
>  	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
> +#ifdef CONFIG_SLUB_PRECISE_INUSE
> +	raw_cpu_inc(s->cpu_slab->inuse);
> +#endif
>  	return object;
>  }
> @@ -3506,8 +3509,12 @@ static __always_inline void slab_free(struct kmem_cache *s, struct slab *slab,
>  	 * With KASAN enabled slab_free_freelist_hook modifies the freelist
>  	 * to remove objects, whose reuse must be delayed.
>  	 */
> -	if (slab_free_freelist_hook(s, &head, &tail, &cnt))
> +	if (slab_free_freelist_hook(s, &head, &tail, &cnt)) {
>  		do_slab_free(s, slab, head, tail, cnt, addr);
> +#ifdef CONFIG_SLUB_PRECISE_INUSE
> +		raw_cpu_sub(s->cpu_slab->inuse, cnt);
> +#endif
> +	}
>  }
>  #ifdef CONFIG_KASAN_GENERIC
> @@ -6253,6 +6260,17 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
>  		nr_free += count_partial(n, count_free);
>  	}
> +#ifdef CONFIG_SLUB_PRECISE_INUSE
> +	{
> +		unsigned int cpu, nr_inuse = 0;
> +
> +		for_each_possible_cpu(cpu)
> +			nr_inuse += per_cpu_ptr((s)->cpu_slab, cpu)->inuse;
> +
> +		if (nr_inuse <= nr_objs)
> +			nr_free = nr_objs - nr_inuse;
> +	}
> +#endif
>  	sinfo->active_objs = nr_objs - nr_free;
>  	sinfo->num_objs = nr_objs;
>  	sinfo->active_slabs = nr_slabs;
> -- 
> 2.25.1

-- 
Thank you, You are awesome!
Hyeonggon :-)



* Re: slabinfo shows incorrect active_objs ???
  2022-02-24 13:16                 ` Vasily Averin
  2022-02-25  0:08                   ` Roman Gushchin
@ 2022-03-03  8:39                   ` Christoph Lameter
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Lameter @ 2022-03-03  8:39 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Vlastimil Babka, Hyeonggon Yoo, Roman Gushchin, David Rientjes,
	Joonsoo Kim, Pekka Enberg, Linux MM, Andrew Morton, kernel

On Thu, 24 Feb 2022, Vasily Averin wrote:

> > I assume for reduced overhead the counters would be implemented in a percpu
> > fashion as e.g. vmstats. Slabinfo gathering would thus have to e.g. sum up
> > those percpu counters.
>
> I like this idea too and I'm going to spend some time for its implementation.

Well, vmstats are also not entirely accurate. Maintaining counters is
expensive, and it gets more expensive if you want them to always be accurate.

VM stats were created in order to decrease the overhead of counter
maintenance and the explicit aim was to sacrifice accuracy for
performance.
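
The scheme is roughly the following (a simplified sketch, not the actual
mm/vmstat.c code): each cpu batches a small signed delta and folds it into
the shared counter only when it crosses a threshold, so a reader of the
global value can be off by up to nr_cpus * threshold:

/* simplified vmstat-style counter: cheap updates, bounded error */
struct approx_counter {
	atomic_long_t	value;		/* global, folded value */
	s8 __percpu	*delta;		/* per-cpu pending delta */
	int		threshold;	/* fold once |delta| exceeds this */
};

static void approx_counter_add(struct approx_counter *c, int v)
{
	s8 *d;
	int x;

	preempt_disable();
	d = this_cpu_ptr(c->delta);
	x = *d + v;
	if (unlikely(abs(x) > c->threshold)) {
		atomic_long_add(x, &c->value);
		x = 0;
	}
	*d = x;
	preempt_enable();
}

/* readers see only the folded value; error is bounded by nr_cpus * threshold */
static long approx_counter_read(struct approx_counter *c)
{
	return atomic_long_read(&c->value);
}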



* Re: slabinfo shows incorrect active_objs ???
  2022-02-22 12:10   ` Vasily Averin
                       ` (2 preceding siblings ...)
  2022-02-22 20:59     ` Roman Gushchin
@ 2022-03-04 16:29     ` Vlastimil Babka
  3 siblings, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2022-03-04 16:29 UTC (permalink / raw)
  To: Vasily Averin, Hyeonggon Yoo, Christoph Lameter, Roman Gushchin
  Cc: Linux MM, Andrew Morton, kernel

On 2/22/22 13:10, Vasily Averin wrote:
> On 22.02.2022 13:23, Hyeonggon Yoo wrote:
>> On Tue, Feb 22, 2022 at 12:22:02PM +0300, Vasily Averin wrote:
>>> Dear all,
>>>
>>> I've found that /proc/slabinfo shows inadequate numbers of in-use slab
>>> objects.
>>> it assumes that all objects stored in cpu caches are always 100% in use.
> 
>>> Is it a bug or perhaps a well-known feature that I missed?
>>
>> This is not a bug..
> 
> Thank you for explanation,
> I think it would be useful to document this somewhere. (Documnetation? man
> slabinfo ?)
> Also I would like to know is it some (fast) way to get real numbers in
> userspace ?
> crash is too fat for this task.
> Do you know perhaps some other userspace utility or may be systemtap/drgn
> script?

Oh, I realized you can get much closer to the real numbers by doing
echo 1 > /sys/kernel/slab/<cache>/shrink
and then reading /proc/slabinfo immediately afterwards.

Although it will be racy if the flushed slabs are immediately refilled by
allocation activity, and the flush will affect performance. But it may be
useful in some situations.

> I'm preparing new set of memcg accounting patches, with reparired
> tools/cgroup/memcg_slapinfo.py
> I can get numbers of accounted resources, but I need to understand how may
> resources was NOT
> accounted to memcg but allocated on host. I expected get these numbers from
> host's slabinfo but
> it does not show correct numbers.
> 
> Thank you,
>     Vasily Averin
> 





Thread overview: 24+ messages
2022-02-22  9:22 slabinfo shows incorrect active_objs ??? Vasily Averin
2022-02-22 10:23 ` Hyeonggon Yoo
2022-02-22 12:10   ` Vasily Averin
2022-02-22 16:32     ` Shakeel Butt
2022-02-22 16:47     ` Roman Gushchin
2022-02-23  1:07       ` Vasily Averin
2022-02-22 20:59     ` Roman Gushchin
2022-02-22 23:08       ` Vlastimil Babka
2022-02-23  0:07         ` Roman Gushchin
2022-02-23  0:32           ` Vlastimil Babka
2022-02-23  3:45             ` Hyeonggon Yoo
2022-02-23 17:31               ` Vlastimil Babka
2022-02-23 18:15                 ` Roman Gushchin
2022-02-24 13:16                 ` Vasily Averin
2022-02-25  0:08                   ` Roman Gushchin
2022-02-25  4:37                     ` Vasily Averin
2022-02-28  6:17                       ` Vasily Averin
2022-02-28 10:22                         ` Hyeonggon Yoo
2022-02-28 10:28                           ` Hyeonggon Yoo
2022-02-28 10:43                         ` Hyeonggon Yoo
2022-02-28 12:09                         ` Hyeonggon Yoo
2022-03-03  8:39                   ` Christoph Lameter
2022-03-04 16:29     ` Vlastimil Babka
2022-02-22 11:10 ` Vlastimil Babka
