linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [RFC] mm/vmscan: add periodic slab shrinker
       [not found] <20220402072103.5140-1-hdanton@sina.com>
@ 2022-04-02 17:54 ` Roman Gushchin
       [not found] ` <20220403005618.5263-1-hdanton@sina.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Roman Gushchin @ 2022-04-02 17:54 UTC (permalink / raw)
  To: Hillf Danton
  Cc: MM, Matthew Wilcox, Dave Chinner, Mel Gorman, Stephen Brennan,
	Yu Zhao, David Hildenbrand, LKML

Hello Hillf!

Thank you for sharing it, really interesting! I’m actually working on the same problem. 

No code to share yet, but here are some of my thoughts:
1) If there is a “natural” memory pressure, no additional slab scanning is needed.
2) From a power perspective it’s better to scan more at once, but less often.
3) Maybe we need a feedback loop with the slab allocator: e.g. if slabs are almost full there is more sense to do a proactive scanning and free up some memory, otherwise we’ll end up allocating more slabs. But it’s tricky.
4) If the scanning is not resulting in any memory reclaim, maybe we should (temporarily) exclude the corresponding shrinker from the scanning.

Thanks!

> On Apr 2, 2022, at 12:21 AM, Hillf Danton <hdanton@sina.com> wrote:
> 
> To mitigate the pain of having "several millions" of negative dentries in
> a single directory [1] for example, add the periodic slab shrinker that
> runs independent of direct and background reclaimers in bid to recycle the
> slab objects that haven been cold for more than 30 seconds.
> 
> Q, Why is it needed?
> A, Kswapd may take a nap as long as 30 minutes.
> 
> Add periodic flag to shrink control to let cache owners know this is the
> periodic shrinker that equals to the regular one running at the lowest
> recalim priority, and feel free to take no action without one-off objects
> piling up.
> 
> Only for thoughts now.
> 
> Hillf
> 
> [1] https://lore.kernel.org/linux-fsdevel/20220209231406.187668-1-stephen.s.brennan@oracle.com/
> 
> --- x/include/linux/shrinker.h
> +++ y/include/linux/shrinker.h
> @@ -14,6 +14,7 @@ struct shrink_control {
> 
>    /* current node being shrunk (for NUMA aware shrinkers) */
>    int nid;
> +    int periodic;
> 
>    /*
>     * How many objects scan_objects should scan and try to reclaim.
> --- x/mm/vmscan.c
> +++ y/mm/vmscan.c
> @@ -781,6 +781,8 @@ static unsigned long do_shrink_slab(stru
>        scanned += shrinkctl->nr_scanned;
> 
>        cond_resched();
> +        if (shrinkctl->periodic)
> +            break;
>    }
> 
>    /*
> @@ -906,7 +908,8 @@ static unsigned long shrink_slab_memcg(g
>  */
> static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>                 struct mem_cgroup *memcg,
> -                 int priority)
> +                 int priority,
> +                 int periodic)
> {
>    unsigned long ret, freed = 0;
>    struct shrinker *shrinker;
> @@ -929,6 +932,7 @@ static unsigned long shrink_slab(gfp_t g
>            .gfp_mask = gfp_mask,
>            .nid = nid,
>            .memcg = memcg,
> +            .periodic = periodic,
>        };
> 
>        ret = do_shrink_slab(&sc, shrinker, priority);
> @@ -952,7 +956,7 @@ out:
>    return freed;
> }
> 
> -static void drop_slab_node(int nid)
> +static void drop_slab_node(int nid, int periodic)
> {
>    unsigned long freed;
>    int shift = 0;
> @@ -966,19 +970,31 @@ static void drop_slab_node(int nid)
>        freed = 0;
>        memcg = mem_cgroup_iter(NULL, NULL, NULL);
>        do {
> -            freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
> +            freed += shrink_slab(GFP_KERNEL, nid, memcg, 0, periodic);
>        } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>    } while ((freed >> shift++) > 1);
> }
> 
> -void drop_slab(void)
> +static void __drop_slab(int periodic)
> {
>    int nid;
> 
>    for_each_online_node(nid)
> -        drop_slab_node(nid);
> +        drop_slab_node(nid, periodic);
> +}
> +
> +void drop_slab(void)
> +{
> +    __drop_slab(0);
> }
> 
> +static void periodic_slab_shrinker_workfn(struct work_struct *work)
> +{
> +    __drop_slab(1);
> +    queue_delayed_work(system_unbound_wq, to_delayed_work(work), 30*HZ);
> +}
> +static DECLARE_DELAYED_WORK(periodic_slab_shrinker, periodic_slab_shrinker_workfn);
> +
> static inline int is_page_cache_freeable(struct folio *folio)
> {
>    /*
> @@ -3098,7 +3114,7 @@ static void shrink_node_memcgs(pg_data_t
>        shrink_lruvec(lruvec, sc);
> 
>        shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> -                sc->priority);
> +                sc->priority, 0);
> 
>        /* Record the group's reclaim efficiency */
>        vmpressure(sc->gfp_mask, memcg, false,
> @@ -4354,8 +4370,11 @@ static void kswapd_try_to_sleep(pg_data_
>         */
>        set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
> 
> -        if (!kthread_should_stop())
> +        if (!kthread_should_stop()) {
> +            queue_delayed_work(system_unbound_wq,
> +                        &periodic_slab_shrinker, 60*HZ);
>            schedule();
> +        }
> 
>        set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
>    } else {
> --
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
       [not found] ` <20220403005618.5263-1-hdanton@sina.com>
@ 2022-04-04  1:09   ` Dave Chinner
  2022-04-04 19:08     ` Roman Gushchin
  2022-04-05 17:22     ` Stephen Brennan
       [not found]   ` <20220404051442.5419-1-hdanton@sina.com>
  1 sibling, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2022-04-04  1:09 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Roman Gushchin, MM, Matthew Wilcox, Mel Gorman, Stephen Brennan,
	Yu Zhao, David Hildenbrand, LKML

On Sun, Apr 03, 2022 at 08:56:18AM +0800, Hillf Danton wrote:
> On Sat, 2 Apr 2022 10:54:36 -0700 Roman Gushchin wrote:
> > Hello Hillf!
> > 
> Hello Roman,
> 
> > Thank you for sharing it, really interesting! I=E2=80=99m actually working o=
> > n the same problem.=20
> 
> Good to know you have some interest in it.
> Feel free to let me know you would like to take it over to avoid
> repeated works on both sides.
> 
> > 
> > No code to share yet, but here are some of my thoughts:
> > 1) If there is a =E2=80=9Cnatural=E2=80=9D memory pressure, no additional sl=
> > ab scanning is needed.
> 
> Agree - the periodic shrinker can be canceled once kswapd wakes up.

I think we should be waking up per-node kswapd to do the periodic
shrinking, not adding yet another way of executing (thousands of)
shrinkers (across hundreds of nodes) from a single threaded context.

Indeed, the problem of "no reclaim when there is no memory
pressure" also affects the page cache, not just shrinkable caches
and we really don't want periodic reclaim to have a compeltely
different behaviour to general memory reclaim.

i.e. the amount of work that shrinkers need to do in a periodic scan
is largerly determined by the rate of shrinkable cache memory usage
growth rather than memory reclaim priority as it is now. Hence there
needs to be different high level "shrinker needs to do X amount of
work" calculation for periodic reclaim than there is now.

e.g. we calculate a rolling average of the size of the cache and a
rate of change over a series of polling operations (i.e. calling
->scan_count) and then when sustained growth is detected we start
trying to shrink the cache to limit the rate of growth of the cache.

If the cache keeps growing, then it's objects are being repeatedly
referenced and it *should* keep growing. If it's one-off objects
that are causing the growth of the cache and so objects are being
reclaimed by the shrinker, then matching the periodic shrink scan to
the growth rate will substantially reduce the rate of growth of that
cache.

And if it's integrated into the existing do_shrink_slab
calculations, the moment actual memory reclaim calls the shrinker
the periodic scan calculations can be reset back to zero and it
starts again...

> > 2) =46rom a power perspective it=E2=80=99s better to scan more at once, but l=
> > ess often.
> 
> The shrinker proposed is a catapult on the vmscan side without knowing
> where the cold slab objects are piling up in Dave's backyard but he is
> free to take different actions than the regular shrinker - IOW this
> shrinker alone does not make much sense wrt shooting six birds without
> the stone on the slab owner side.
> 
> It is currently scanning *every* slab cache at an arbitrary frequency,
> once 30 seconds - I am open to a minute or whatever.

Sorry, I don't understand what "Dave's backyard" is or why it would
ever need to be special cased?

> > 3) Maybe we need a feedback loop with the slab allocator: e.g. if slabs are a=
> > lmost full there is more sense to do a proactive scanning and free up some m=
> > emory, otherwise we=E2=80=99ll end up allocating more slabs. But it=E2=80=99=
> > s tricky.
> 
> There are 31 bits available in the periodic flag added to shrink control.
> 
> > 4) If the scanning is not resulting in any memory reclaim, maybe we should (=
> > temporarily) exclude the corresponding shrinker from the scanning.
> 
> Given the periodic flag, Dave is free to ignore the scan request and the
> scan result is currently dropped on the vmscan side because what is
> considered is the cold slab objects that for instance have been inactive
> for more than 30 seconds in every slab cache, rather than kswapd's cake.

I don't understand how passing a "periodic" flag to individual
shrinkers is really useful here. How does the shrinker
implementation use this to determine how much work it needs to do?

i.e. The amount of work a shrinker needs to perform is calculated by
the high level slab scanning code based on relative cache size and
reclaim priority.  If there's a periodic scanner, it should be
calculating a specific amount of work for the shrinkers to do way up
in do_shrink_slab() and then asking the shrinker to perform that
work in exactly the way it does now - the shrinker itself doesn't
need to know anything about whether it's a periodic memory reclaim
scan or whether there's actual memory pressure - it just needs to
scan the oldest objects in it's cache to try to reclaim them.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
       [not found]   ` <20220404051442.5419-1-hdanton@sina.com>
@ 2022-04-04 18:32     ` Roman Gushchin
  0 siblings, 0 replies; 18+ messages in thread
From: Roman Gushchin @ 2022-04-04 18:32 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Dave Chinner, MM, Matthew Wilcox, Mel Gorman, Stephen Brennan,
	Yu Zhao, David Hildenbrand, LKML

On Mon, Apr 04, 2022 at 01:14:42PM +0800, Hillf Danton wrote:
> On Mon, 4 Apr 2022 11:09:48 +1000 Dave Chinner wrote:
> > On Sun, Apr 03, 2022 at 08:56:18AM +0800, Hillf Danton wrote:
> > > On Sat, 2 Apr 2022 10:54:36 -0700 Roman Gushchin wrote:
> > > > Hello Hillf!
> > > > 
> > > Hello Roman,
> > > 
> > > > Thank you for sharing it, really interesting! I=E2=80=99m actually working o=
> > > > n the same problem.=20
> > > 
> > > Good to know you have some interest in it.
> > > Feel free to let me know you would like to take it over to avoid
> > > repeated works on both sides.

Only if you've something more exciting to work on. It seems like at this
point it's not really clear what exactly we need to do and how to approach it,
so I don't think we're doing any repeated work. The more
ideas/opinions/suggestions, then better.

> > > 
> > > > 
> > > > No code to share yet, but here are some of my thoughts:
> > > > 1) If there is a =E2=80=9Cnatural=E2=80=9D memory pressure, no additional sl=
> > > > ab scanning is needed.
> > > 
> > > Agree - the periodic shrinker can be canceled once kswapd wakes up.
> > 
> > I think we should be waking up per-node kswapd to do the periodic
> > shrinking, not adding yet another way of executing (thousands of)
> > shrinkers (across hundreds of nodes) from a single threaded context.
> 
> Kswapd is majorly responsible for keeping the high water mark, a
> different target from cold slab objects - I am inclined to staying a
> safe distance from any code churn in that area.

The problem is that kswapd is also doing slab shrinking and we can't
simple ignore it, so it has to interact in some way.

Thanks!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-04  1:09   ` Dave Chinner
@ 2022-04-04 19:08     ` Roman Gushchin
  2022-04-05  5:17       ` Dave Chinner
  2022-04-05 17:22     ` Stephen Brennan
  1 sibling, 1 reply; 18+ messages in thread
From: Roman Gushchin @ 2022-04-04 19:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hillf Danton, MM, Matthew Wilcox, Mel Gorman, Stephen Brennan,
	Yu Zhao, David Hildenbrand, LKML

On Mon, Apr 04, 2022 at 11:09:48AM +1000, Dave Chinner wrote:
> On Sun, Apr 03, 2022 at 08:56:18AM +0800, Hillf Danton wrote:
> > On Sat, 2 Apr 2022 10:54:36 -0700 Roman Gushchin wrote:
> > > Hello Hillf!
> > > 
> > Hello Roman,
> > 
> > > Thank you for sharing it, really interesting! I=E2=80=99m actually working o=
> > > n the same problem.=20
> > 
> > Good to know you have some interest in it.
> > Feel free to let me know you would like to take it over to avoid
> > repeated works on both sides.
> > 
> > > 
> > > No code to share yet, but here are some of my thoughts:
> > > 1) If there is a =E2=80=9Cnatural=E2=80=9D memory pressure, no additional sl=
> > > ab scanning is needed.
> > 
> > Agree - the periodic shrinker can be canceled once kswapd wakes up.
> 
> I think we should be waking up per-node kswapd to do the periodic
> shrinking, not adding yet another way of executing (thousands of)
> shrinkers (across hundreds of nodes) from a single threaded context.

I'm on the same page here. Ideally we just need to set somewhere an
"extra" slab scanning target and wake kswapd.

> 
> Indeed, the problem of "no reclaim when there is no memory
> pressure" also affects the page cache, not just shrinkable caches
> and we really don't want periodic reclaim to have a compeltely
> different behaviour to general memory reclaim.

I agree. Slabs are a worse because they fragment the memory with unmovable
pages and are generally slower/more expensive to reclaim. But in general
it's the same problem.

> 
> i.e. the amount of work that shrinkers need to do in a periodic scan
> is largerly determined by the rate of shrinkable cache memory usage
> growth rather than memory reclaim priority as it is now. Hence there
> needs to be different high level "shrinker needs to do X amount of
> work" calculation for periodic reclaim than there is now.
> 
> e.g. we calculate a rolling average of the size of the cache and a
> rate of change over a series of polling operations (i.e. calling
> ->scan_count) and then when sustained growth is detected we start
> trying to shrink the cache to limit the rate of growth of the cache.
> 
> If the cache keeps growing, then it's objects are being repeatedly
> referenced and it *should* keep growing. If it's one-off objects
> that are causing the growth of the cache and so objects are being
> reclaimed by the shrinker, then matching the periodic shrink scan to
> the growth rate will substantially reduce the rate of growth of that
> cache.

A clever idea!

It seems like we need to add some stats to the list_lru API or maybe to
the shrinker API (and let list_lru to use it).
E.g. total/scanned/reclaimed, maybe with a time decay

I'm also thinking about:
1) adding a sysfs/debugfs interface to expose shrinkers current size and
   statistics, with an ability to call into the reclaim manually.
2) formalizing a reference bit/counter API on the shrinkers level, so that
   shrinker users can explicitly mark objects (re)-"activation".
3) _maybe_ we need to change the shrinkers API from a number of objects to
   bytes, so that releasing a small number of large objects can compete with
   a releasing on many small objects. But I'm not sure.
4) also some shrinkers (e.g. shadow nodes) lying about the total size of
   objects and I have an uneasy feeling about this approach.

Dave, what are your thoughts on these ideas/issues?

> 
> And if it's integrated into the existing do_shrink_slab
> calculations, the moment actual memory reclaim calls the shrinker
> the periodic scan calculations can be reset back to zero and it
> starts again...
> 
> > > 2) =46rom a power perspective it=E2=80=99s better to scan more at once, but l=
> > > ess often.
> > 
> > The shrinker proposed is a catapult on the vmscan side without knowing
> > where the cold slab objects are piling up in Dave's backyard but he is
> > free to take different actions than the regular shrinker - IOW this
> > shrinker alone does not make much sense wrt shooting six birds without
> > the stone on the slab owner side.
> > 
> > It is currently scanning *every* slab cache at an arbitrary frequency,
> > once 30 seconds - I am open to a minute or whatever.
> 
> Sorry, I don't understand what "Dave's backyard" is or why it would
> ever need to be special cased?

Me neither.

Also 30s seconds sound like a perfect way to drain the laptop's battery,
I'd say 1h is a more reasonable scale, but of course it's a pure speculation.

> 
> > > 3) Maybe we need a feedback loop with the slab allocator: e.g. if slabs are a=
> > > lmost full there is more sense to do a proactive scanning and free up some m=
> > > emory, otherwise we=E2=80=99ll end up allocating more slabs. But it=E2=80=99=
> > > s tricky.
> > 
> > There are 31 bits available in the periodic flag added to shrink control.
> > 
> > > 4) If the scanning is not resulting in any memory reclaim, maybe we should (=
> > > temporarily) exclude the corresponding shrinker from the scanning.
> > 
> > Given the periodic flag, Dave is free to ignore the scan request and the
> > scan result is currently dropped on the vmscan side because what is
> > considered is the cold slab objects that for instance have been inactive
> > for more than 30 seconds in every slab cache, rather than kswapd's cake.
> 
> I don't understand how passing a "periodic" flag to individual
> shrinkers is really useful here. How does the shrinker
> implementation use this to determine how much work it needs to do?
> 
> i.e. The amount of work a shrinker needs to perform is calculated by
> the high level slab scanning code based on relative cache size and
> reclaim priority.  If there's a periodic scanner, it should be
> calculating a specific amount of work for the shrinkers to do way up
> in do_shrink_slab() and then asking the shrinker to perform that
> work in exactly the way it does now - the shrinker itself doesn't
> need to know anything about whether it's a periodic memory reclaim
> scan or whether there's actual memory pressure - it just needs to
> scan the oldest objects in it's cache to try to reclaim them.

Again I'm on the same page here. We have a mechanism for a deferred work,
it or something similar can be reused to schedule an extra work.

Thanks!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-04 19:08     ` Roman Gushchin
@ 2022-04-05  5:17       ` Dave Chinner
  2022-04-05 16:35         ` Roman Gushchin
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2022-04-05  5:17 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Hillf Danton, MM, Matthew Wilcox, Mel Gorman, Stephen Brennan,
	Yu Zhao, David Hildenbrand, LKML

On Mon, Apr 04, 2022 at 12:08:25PM -0700, Roman Gushchin wrote:
> On Mon, Apr 04, 2022 at 11:09:48AM +1000, Dave Chinner wrote:
> > i.e. the amount of work that shrinkers need to do in a periodic scan
> > is largerly determined by the rate of shrinkable cache memory usage
> > growth rather than memory reclaim priority as it is now. Hence there
> > needs to be different high level "shrinker needs to do X amount of
> > work" calculation for periodic reclaim than there is now.
> > 
> > e.g. we calculate a rolling average of the size of the cache and a
> > rate of change over a series of polling operations (i.e. calling
> > ->scan_count) and then when sustained growth is detected we start
> > trying to shrink the cache to limit the rate of growth of the cache.
> > 
> > If the cache keeps growing, then it's objects are being repeatedly
> > referenced and it *should* keep growing. If it's one-off objects
> > that are causing the growth of the cache and so objects are being
> > reclaimed by the shrinker, then matching the periodic shrink scan to
> > the growth rate will substantially reduce the rate of growth of that
> > cache.
> 
> A clever idea!
> 
> It seems like we need to add some stats to the list_lru API or maybe to
> the shrinker API (and let list_lru to use it).
> E.g. total/scanned/reclaimed, maybe with a time decay
> 
> I'm also thinking about:
> 1) adding a sysfs/debugfs interface to expose shrinkers current size and
>    statistics, with an ability to call into the reclaim manually.

I've thought about it, too, and can see where it could be useful.
However, when I consider the list_lru memcg integration, I suspect
it becomes a "can't see the forest for the trees" problem. We're
going to end up with millions of sysfs objects with no obvious way
to navigate, iterate or search them if we just take the naive "sysfs
object + stats per list_lru instance" approach.

Also, if you look at commit 6a6b7b77cc0f mm: ("list_lru: transpose the
array of per-node per-memcg lru lists") that went into 5.18-rc1,
you'll get an idea of the amount of memory overhead just tracking
the list_lru x memcg infrastructure consumes at scale:

    I had done a easy test to show the optimization.  I create 10k memory
    cgroups and mount 10k filesystems in the systems.  We use free command to
    show how many memory does the systems comsumes after this operation (There
    are 2 numa nodes in the system).

            +-----------------------+------------------------+
            |      condition        |   memory consumption   |
            +-----------------------+------------------------+
            | without this patchset |        24464 MB        |
            +-----------------------+------------------------+
            |     after patch 1     |        21957 MB        | <--------+
            +-----------------------+------------------------+          |
            |     after patch 10    |         6895 MB        |          |
            +-----------------------+------------------------+          |
            |     after patch 12    |         4367 MB        |          |
            +-----------------------+------------------------+          |
                                                                        |
            The more the number of nodes, the more obvious the effect---+

If we now add sysfs objects and stats arrays to each of the
list_lrus that we initiate even now on 5.18-rc1, we're going to
massively blow out the memory footprint again. 

So, nice idea, but I'm not sure we can make it useful and not
consume huge amounts of memory....

What might be more useful is a way of getting the kernel to tell us
what the, say, 20 biggest slab caches are in the system and then
provide a way to selectively shrink them. We generally don't really
care about tiny caches, just the ones consuming all the memory. This
isn't my idea - Kent has been looking at this because of how useless
OOM kill output is for debugging slab cache based OOM triggers. See
this branch for an example:

https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text

Even a sysfs entry that you echo a number into and it returns
the "top N" largest LRU lists. echo -1 into it and it returns every
single one if you want all the information.

The whole "one sysfs file, one value" architecture falls completely
apart when we might have to indexing *millions* of internal
structures with many parameters per structure...

> 2) formalizing a reference bit/counter API on the shrinkers level, so that
>    shrinker users can explicitly mark objects (re)-"activation".

Not 100% certain what you are refering to here - something to do
with active object rotation? Or an active/inactive list split with period
demotion like we have for the page LRUs? Or workingset refault
detection? Can you explain in more detail?

> 3) _maybe_ we need to change the shrinkers API from a number of objects to
>    bytes, so that releasing a small number of large objects can compete with
>    a releasing on many small objects. But I'm not sure.

I think I suggested something similar a long time ago. We have
shrinkers that track things other than slab objects. e.g. IIRC the
ttm graphics allocator shrinker tracks sets of pages and frees
pages, not slab objects. The XFS buffer cache tracks variable sized
objects, from 512 bytes to 64KB in length, so the amount of memory
it frees is variable even if the number of handles it scans and
reclaims is fixed and consistent. Other subsystems have different
"non object" shrinker needs as well.

The long and short of it is that two shrinkers might have the same
object count, but one might free 10x the amount of memory than the
other for the same amount of shrinking work. Being able to focus
reclaim work on caches that can free a lot of memory much more
quickly would be a great idea.

It also means that a shrinker that scans a fragmented slab can keep
going until a set number of slab pages have been freed, rather than
a set number of slab objects. We can push reclaim of fragmented slab
caches much harder when necessary if we are reclaiming by freed byte
counts...

So, yeah, byte-count based reclaim definitely has merit compared to
what we currently do. It's more generic and more flexible...

> 4) also some shrinkers (e.g. shadow nodes) lying about the total size of
>    objects and I have an uneasy feeling about this approach.

Yeah, never been a fan about use cases like that, but IIRC it was really
the only option at the time to only trigger reclaim when the kernel
was absolutely desperate because we really need the working set
shadow nodes to do their job well when memory is very low...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-05  5:17       ` Dave Chinner
@ 2022-04-05 16:35         ` Roman Gushchin
  2022-04-05 20:58           ` Yang Shi
  0 siblings, 1 reply; 18+ messages in thread
From: Roman Gushchin @ 2022-04-05 16:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hillf Danton, MM, Matthew Wilcox, Mel Gorman, Stephen Brennan,
	Yu Zhao, David Hildenbrand, LKML

On Tue, Apr 05, 2022 at 03:17:10PM +1000, Dave Chinner wrote:
> On Mon, Apr 04, 2022 at 12:08:25PM -0700, Roman Gushchin wrote:
> > On Mon, Apr 04, 2022 at 11:09:48AM +1000, Dave Chinner wrote:
> > > i.e. the amount of work that shrinkers need to do in a periodic scan
> > > is largerly determined by the rate of shrinkable cache memory usage
> > > growth rather than memory reclaim priority as it is now. Hence there
> > > needs to be different high level "shrinker needs to do X amount of
> > > work" calculation for periodic reclaim than there is now.
> > > 
> > > e.g. we calculate a rolling average of the size of the cache and a
> > > rate of change over a series of polling operations (i.e. calling
> > > ->scan_count) and then when sustained growth is detected we start
> > > trying to shrink the cache to limit the rate of growth of the cache.
> > > 
> > > If the cache keeps growing, then it's objects are being repeatedly
> > > referenced and it *should* keep growing. If it's one-off objects
> > > that are causing the growth of the cache and so objects are being
> > > reclaimed by the shrinker, then matching the periodic shrink scan to
> > > the growth rate will substantially reduce the rate of growth of that
> > > cache.
> > 
> > A clever idea!
> > 
> > It seems like we need to add some stats to the list_lru API or maybe to
> > the shrinker API (and let list_lru to use it).
> > E.g. total/scanned/reclaimed, maybe with a time decay
> > 
> > I'm also thinking about:
> > 1) adding a sysfs/debugfs interface to expose shrinkers current size and
> >    statistics, with an ability to call into the reclaim manually.
> 
> I've thought about it, too, and can see where it could be useful.
> However, when I consider the list_lru memcg integration, I suspect
> it becomes a "can't see the forest for the trees" problem. We're
> going to end up with millions of sysfs objects with no obvious way
> to navigate, iterate or search them if we just take the naive "sysfs
> object + stats per list_lru instance" approach.

Ok, I'll try to master something and share patches. I assume it would be useful
to have statistics and be able to trigger a reclaim of a particular shrinker
without creating a real memory pressure. I anticipate many interesting
findings in the implementation of individual shrinkers...

> 
> Also, if you look at commit 6a6b7b77cc0f mm: ("list_lru: transpose the
> array of per-node per-memcg lru lists") that went into 5.18-rc1,
> you'll get an idea of the amount of memory overhead just tracking
> the list_lru x memcg infrastructure consumes at scale:
> 
>     I had done a easy test to show the optimization.  I create 10k memory
>     cgroups and mount 10k filesystems in the systems.  We use free command to
>     show how many memory does the systems comsumes after this operation (There
>     are 2 numa nodes in the system).
> 
>             +-----------------------+------------------------+
>             |      condition        |   memory consumption   |
>             +-----------------------+------------------------+
>             | without this patchset |        24464 MB        |
>             +-----------------------+------------------------+
>             |     after patch 1     |        21957 MB        | <--------+
>             +-----------------------+------------------------+          |
>             |     after patch 10    |         6895 MB        |          |
>             +-----------------------+------------------------+          |
>             |     after patch 12    |         4367 MB        |          |
>             +-----------------------+------------------------+          |
>                                                                         |
>             The more the number of nodes, the more obvious the effect---+

Yes, I know, I've reviewed this patchset. However, 10k cgroups _and_ 10k
mounts look a bit artificial. It's hard to believe that there are 100M
LRUs containing hot objects. If most of them are cold, it's actually
a question how to efficiently reclaim them and free the wasted memory.

> 
> If we now add sysfs objects and stats arrays to each of the
> list_lrus that we initiate even now on 5.18-rc1, we're going to
> massively blow out the memory footprint again.

sysfs can be a config option, it doesn't have to be enabled in prod.
But I agree, because of memory constraints we're very limited in the size
of statistics which can be used for reclaim decisions.

> 
> So, nice idea, but I'm not sure we can make it useful and not
> consume huge amounts of memory....
> 
> What might be more useful is a way of getting the kernel to tell us
> what the, say, 20 biggest slab caches are in the system and then
> provide a way to selectively shrink them. We generally don't really
> care about tiny caches, just the ones consuming all the memory. This
> isn't my idea - Kent has been looking at this because of how useless
> OOM kill output is for debugging slab cache based OOM triggers. See
> this branch for an example:
> 
> https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text
> 
> Even a sysfs entry that you echo a number into and it returns
> the "top N" largest LRU lists. echo -1 into it and it returns every
> single one if you want all the information.
> 
> The whole "one sysfs file, one value" architecture falls completely
> apart when we might have to indexing *millions* of internal
> structures with many parameters per structure...

It's a good point, I need to think what can we do here.

Actually, if it's so hard and inefficient for a sysfs interface,
it also means it's inefficient for the reclaim path. So if we have
an ability to quickly find beefy LRU lists worth shrinking, it might
improve the reclaim efficiency too.

Actually, thinking of reducing the memory footprint, it would be great
to combine LRU's belonging to different superblocks (of the same type).
Pagecache analogy: we have an LRU for all pagecache pages, it's not
per-file or per-sb. Not sure how easy/viable it is.

> 
> > 2) formalizing a reference bit/counter API on the shrinkers level, so that
> >    shrinker users can explicitly mark objects (re)-"activation".
> 
> Not 100% certain what you are refering to here - something to do
> with active object rotation? Or an active/inactive list split with period
> demotion like we have for the page LRUs? Or workingset refault
> detection? Can you explain in more detail?

It's a vague thinking at this point, but you got me right, I was thinking about
all the listed ideas above. In general I'm trying to understand whether
the existing model is sufficient for a more or less effective management
of hot/cold objects and if we can improve it borrowing some ideas from
the page reclaim code.
Shadow entries would be great but likely not possible because of the memory
footprint.

> 
> > 3) _maybe_ we need to change the shrinkers API from a number of objects to
> >    bytes, so that releasing a small number of large objects can compete with
> >    a releasing on many small objects. But I'm not sure.
> 
> I think I suggested something similar a long time ago. We have
> shrinkers that track things other than slab objects. e.g. IIRC the
> ttm graphics allocator shrinker tracks sets of pages and frees
> pages, not slab objects. The XFS buffer cache tracks variable sized
> objects, from 512 bytes to 64KB in length, so the amount of memory
> it frees is variable even if the number of handles it scans and
> reclaims is fixed and consistent. Other subsystems have different
> "non object" shrinker needs as well.

It's true for many objects, because it's not unusual for kernel objects
to have attached kmallocs or pin other objects in the memory. Unlikely
we can be 100% accurate, but we might improve the "smoothness" of the
reclaim process.

> 
> The long and short of it is that two shrinkers might have the same
> object count, but one might free 10x the amount of memory than the
> other for the same amount of shrinking work. Being able to focus
> reclaim work on caches that can free a lot of memory much more
> quickly would be a great idea.

Right. It's also about "bytes freed/seeks" ratio. Why would we reclaim
expensive small objects if there are big and cheap.

> 
> It also means that a shrinker that scans a fragmented slab can keep
> going until a set number of slab pages have been freed, rather than
> a set number of slab objects. We can push reclaim of fragmented slab
> caches much harder when necessary if we are reclaiming by freed byte
> counts...

I thought about it, but it's tricky. If all slabs are almost full (fragmentation
is low), we want to shrink aggressively, otherwise we end up allocating
more slabs and fragmenting more memory. If slabs are almost empty (fragmentation
is very high), we also want to shrink more aggressively with a hope to release
physical memory, as you said. But setting a target in pages is dangerous: slab caches
can be merged or slab pages can be pinned by objects belonging to a different
memory cgroup, so it might be not possible to reclaim pages no matter how hard
we're trying.

> 
> So, yeah, byte-count based reclaim definitely has merit compared to
> what we currently do. It's more generic and more flexible...

Great, I'm glad we're on the same page here.

Thank you for the answers, I think it's a really useful discussion!

Roman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-04  1:09   ` Dave Chinner
  2022-04-04 19:08     ` Roman Gushchin
@ 2022-04-05 17:22     ` Stephen Brennan
  2022-04-05 21:18       ` Matthew Wilcox
  1 sibling, 1 reply; 18+ messages in thread
From: Stephen Brennan @ 2022-04-05 17:22 UTC (permalink / raw)
  To: Dave Chinner, Hillf Danton
  Cc: Roman Gushchin, MM, Matthew Wilcox, Mel Gorman, Yu Zhao,
	David Hildenbrand, LKML

Dave Chinner <david@fromorbit.com> writes:
[snip]
> If the cache keeps growing, then it's objects are being repeatedly
> referenced and it *should* keep growing. If it's one-off objects
> that are causing the growth of the cache and so objects are being
> reclaimed by the shrinker, then matching the periodic shrink scan to
> the growth rate will substantially reduce the rate of growth of that
> cache.

I can't speak for every slab cache, but I've been coming to the same
conclusion myself regarding the dentry cache. I think that the rate of
stepping through the LRU should be tied to the rate of allocations.
Truly in-use objects shouldn't be harmed by this, as they should get
referenced and rotated to the beginning of the LRU. But the one-offs
which are bloating the cache will be found and removed.

My dentry-related patch here [1] does tie the reclaim to the rate of
allocations. In that patch, I looked for sibling negative dentries to
reclaim, which is just silly in hindsight :)

I've implemented a version of this patch which just takes one step
through the LRU on each d_alloc. It's quite interesting to watch it
work. You can create 5 million negative dentries in directory /A via
stat(), and then create 5 million negative dentries in directory /B. The
total dentry slab size reaches 5 million but never goes past it, since
the old negative dentries from /A aren't really in use, and they get
pruned at the same rate as negative dentries from /B get created. On the
other hand, if you *continue* to stat() on the dentries of /A while you
create negative dentries in /B, then the cache grows to 10 million,
since the /A dentries are actually in use.

Maybe a solution could involve some generic list_lru machinery that can
let you do these sorts of incremental scans? Maybe batching them up so
instead of doing one every allocation, you do them every 100 or 1000?
It would still be up to the individual user to put this to good use in
the object allocation path.

Thanks,
Stephen

[1] https://lore.kernel.org/linux-fsdevel/20220209231406.187668-1-stephen.s.brennan@oracle.com/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-05 16:35         ` Roman Gushchin
@ 2022-04-05 20:58           ` Yang Shi
  2022-04-05 21:21             ` Matthew Wilcox
  2022-04-05 21:31             ` Roman Gushchin
  0 siblings, 2 replies; 18+ messages in thread
From: Yang Shi @ 2022-04-05 20:58 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Dave Chinner, Hillf Danton, MM, Matthew Wilcox, Mel Gorman,
	Stephen Brennan, Yu Zhao, David Hildenbrand, LKML

On Tue, Apr 5, 2022 at 9:36 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Tue, Apr 05, 2022 at 03:17:10PM +1000, Dave Chinner wrote:
> > On Mon, Apr 04, 2022 at 12:08:25PM -0700, Roman Gushchin wrote:
> > > On Mon, Apr 04, 2022 at 11:09:48AM +1000, Dave Chinner wrote:
> > > > i.e. the amount of work that shrinkers need to do in a periodic scan
> > > > is largerly determined by the rate of shrinkable cache memory usage
> > > > growth rather than memory reclaim priority as it is now. Hence there
> > > > needs to be different high level "shrinker needs to do X amount of
> > > > work" calculation for periodic reclaim than there is now.
> > > >
> > > > e.g. we calculate a rolling average of the size of the cache and a
> > > > rate of change over a series of polling operations (i.e. calling
> > > > ->scan_count) and then when sustained growth is detected we start
> > > > trying to shrink the cache to limit the rate of growth of the cache.
> > > >
> > > > If the cache keeps growing, then it's objects are being repeatedly
> > > > referenced and it *should* keep growing. If it's one-off objects
> > > > that are causing the growth of the cache and so objects are being
> > > > reclaimed by the shrinker, then matching the periodic shrink scan to
> > > > the growth rate will substantially reduce the rate of growth of that
> > > > cache.
> > >
> > > A clever idea!
> > >
> > > It seems like we need to add some stats to the list_lru API or maybe to
> > > the shrinker API (and let list_lru to use it).
> > > E.g. total/scanned/reclaimed, maybe with a time decay
> > >
> > > I'm also thinking about:
> > > 1) adding a sysfs/debugfs interface to expose shrinkers current size and
> > >    statistics, with an ability to call into the reclaim manually.
> >
> > I've thought about it, too, and can see where it could be useful.
> > However, when I consider the list_lru memcg integration, I suspect
> > it becomes a "can't see the forest for the trees" problem. We're
> > going to end up with millions of sysfs objects with no obvious way
> > to navigate, iterate or search them if we just take the naive "sysfs
> > object + stats per list_lru instance" approach.
>
> Ok, I'll try to master something and share patches. I assume it would be useful
> to have statistics and be able to trigger a reclaim of a particular shrinker
> without creating a real memory pressure. I anticipate many interesting
> findings in the implementation of individual shrinkers...
>
> >
> > Also, if you look at commit 6a6b7b77cc0f mm: ("list_lru: transpose the
> > array of per-node per-memcg lru lists") that went into 5.18-rc1,
> > you'll get an idea of the amount of memory overhead just tracking
> > the list_lru x memcg infrastructure consumes at scale:
> >
> >     I had done a easy test to show the optimization.  I create 10k memory
> >     cgroups and mount 10k filesystems in the systems.  We use free command to
> >     show how many memory does the systems comsumes after this operation (There
> >     are 2 numa nodes in the system).
> >
> >             +-----------------------+------------------------+
> >             |      condition        |   memory consumption   |
> >             +-----------------------+------------------------+
> >             | without this patchset |        24464 MB        |
> >             +-----------------------+------------------------+
> >             |     after patch 1     |        21957 MB        | <--------+
> >             +-----------------------+------------------------+          |
> >             |     after patch 10    |         6895 MB        |          |
> >             +-----------------------+------------------------+          |
> >             |     after patch 12    |         4367 MB        |          |
> >             +-----------------------+------------------------+          |
> >                                                                         |
> >             The more the number of nodes, the more obvious the effect---+
>
> Yes, I know, I've reviewed this patchset. However, 10k cgroups _and_ 10k
> mounts look a bit artificial. It's hard to believe that there are 100M
> LRUs containing hot objects. If most of them are cold, it's actually
> a question how to efficiently reclaim them and free the wasted memory.
>
> >
> > If we now add sysfs objects and stats arrays to each of the
> > list_lrus that we initiate even now on 5.18-rc1, we're going to
> > massively blow out the memory footprint again.
>
> sysfs can be a config option, it doesn't have to be enabled in prod.
> But I agree, because of memory constraints we're very limited in the size
> of statistics which can be used for reclaim decisions.
>
> >
> > So, nice idea, but I'm not sure we can make it useful and not
> > consume huge amounts of memory....
> >
> > What might be more useful is a way of getting the kernel to tell us
> > what the, say, 20 biggest slab caches are in the system and then
> > provide a way to selectively shrink them. We generally don't really
> > care about tiny caches, just the ones consuming all the memory. This
> > isn't my idea - Kent has been looking at this because of how useless
> > OOM kill output is for debugging slab cache based OOM triggers. See
> > this branch for an example:
> >
> > https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text
> >
> > Even a sysfs entry that you echo a number into and it returns
> > the "top N" largest LRU lists. echo -1 into it and it returns every
> > single one if you want all the information.
> >
> > The whole "one sysfs file, one value" architecture falls completely
> > apart when we might have to indexing *millions* of internal
> > structures with many parameters per structure...
>
> It's a good point, I need to think what can we do here.
>
> Actually, if it's so hard and inefficient for a sysfs interface,
> it also means it's inefficient for the reclaim path. So if we have
> an ability to quickly find beefy LRU lists worth shrinking, it might
> improve the reclaim efficiency too.
>
> Actually, thinking of reducing the memory footprint, it would be great
> to combine LRU's belonging to different superblocks (of the same type).
> Pagecache analogy: we have an LRU for all pagecache pages, it's not
> per-file or per-sb. Not sure how easy/viable it is.
>
> >
> > > 2) formalizing a reference bit/counter API on the shrinkers level, so that
> > >    shrinker users can explicitly mark objects (re)-"activation".
> >
> > Not 100% certain what you are refering to here - something to do
> > with active object rotation? Or an active/inactive list split with period
> > demotion like we have for the page LRUs? Or workingset refault
> > detection? Can you explain in more detail?
>
> It's a vague thinking at this point, but you got me right, I was thinking about
> all the listed ideas above. In general I'm trying to understand whether
> the existing model is sufficient for a more or less effective management
> of hot/cold objects and if we can improve it borrowing some ideas from
> the page reclaim code.
> Shadow entries would be great but likely not possible because of the memory
> footprint.
>
> >
> > > 3) _maybe_ we need to change the shrinkers API from a number of objects to
> > >    bytes, so that releasing a small number of large objects can compete with
> > >    a releasing on many small objects. But I'm not sure.
> >
> > I think I suggested something similar a long time ago. We have
> > shrinkers that track things other than slab objects. e.g. IIRC the
> > ttm graphics allocator shrinker tracks sets of pages and frees
> > pages, not slab objects. The XFS buffer cache tracks variable sized
> > objects, from 512 bytes to 64KB in length, so the amount of memory
> > it frees is variable even if the number of handles it scans and
> > reclaims is fixed and consistent. Other subsystems have different
> > "non object" shrinker needs as well.
>
> It's true for many objects, because it's not unusual for kernel objects
> to have attached kmallocs or pin other objects in the memory. Unlikely
> we can be 100% accurate, but we might improve the "smoothness" of the
> reclaim process.
>
> >
> > The long and short of it is that two shrinkers might have the same
> > object count, but one might free 10x the amount of memory than the
> > other for the same amount of shrinking work. Being able to focus
> > reclaim work on caches that can free a lot of memory much more
> > quickly would be a great idea.
>
> Right. It's also about "bytes freed/seeks" ratio. Why would we reclaim
> expensive small objects if there are big and cheap.
>
> >
> > It also means that a shrinker that scans a fragmented slab can keep
> > going until a set number of slab pages have been freed, rather than
> > a set number of slab objects. We can push reclaim of fragmented slab
> > caches much harder when necessary if we are reclaiming by freed byte
> > counts...
>
> I thought about it, but it's tricky. If all slabs are almost full (fragmentation
> is low), we want to shrink aggressively, otherwise we end up allocating
> more slabs and fragmenting more memory. If slabs are almost empty (fragmentation
> is very high), we also want to shrink more aggressively with a hope to release
> physical memory, as you said. But setting a target in pages is dangerous: slab caches
> can be merged or slab pages can be pinned by objects belonging to a different
> memory cgroup, so it might be not possible to reclaim pages no matter how hard
> we're trying.

Yeah, I agree it actually doesn't make too much sense to return the
number of reclaimed objects. Other part of vmscan returns the number
of base pages, the sizes of slab objects are varied, it may be much
smaller than a page, for example, dentry may be 192 bytes.

Another problem which doesn't help the feedback loop is even though
bytes is returned instead of the number of objects, it doesn't mean
that much memory is actually freed and available for allocation. IMHO
the number of really freed pages should be returned (I do understand
it is not that easy for now), and returning 0 should be fine. The
current logic (returning the number of objects) may feed up something
over-optimistic. I, at least, experienced once or twice that a
significant amount of slab caches were shrunk, but actually 0 pages
were freed actually. TBH the new slab controller may make it worse
since the page may be pinned by the objects from other memcgs.

>
> >
> > So, yeah, byte-count based reclaim definitely has merit compared to
> > what we currently do. It's more generic and more flexible...
>
> Great, I'm glad we're on the same page here.
>
> Thank you for the answers, I think it's a really useful discussion!
>
> Roman
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-05 17:22     ` Stephen Brennan
@ 2022-04-05 21:18       ` Matthew Wilcox
  2022-04-05 23:54         ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2022-04-05 21:18 UTC (permalink / raw)
  To: Stephen Brennan
  Cc: Dave Chinner, Hillf Danton, Roman Gushchin, MM, Mel Gorman,
	Yu Zhao, David Hildenbrand, LKML

On Tue, Apr 05, 2022 at 10:22:14AM -0700, Stephen Brennan wrote:
> I can't speak for every slab cache, but I've been coming to the same
> conclusion myself regarding the dentry cache. I think that the rate of
> stepping through the LRU should be tied to the rate of allocations.
> Truly in-use objects shouldn't be harmed by this, as they should get
> referenced and rotated to the beginning of the LRU. But the one-offs
> which are bloating the cache will be found and removed.

I agree with all this.

> I've implemented a version of this patch which just takes one step
> through the LRU on each d_alloc. It's quite interesting to watch it
> work. You can create 5 million negative dentries in directory /A via
> stat(), and then create 5 million negative dentries in directory /B. The
> total dentry slab size reaches 5 million but never goes past it, since
> the old negative dentries from /A aren't really in use, and they get
> pruned at the same rate as negative dentries from /B get created. On the
> other hand, if you *continue* to stat() on the dentries of /A while you
> create negative dentries in /B, then the cache grows to 10 million,
> since the /A dentries are actually in use.
> 
> Maybe a solution could involve some generic list_lru machinery that can
> let you do these sorts of incremental scans? Maybe batching them up so
> instead of doing one every allocation, you do them every 100 or 1000?
> It would still be up to the individual user to put this to good use in
> the object allocation path.

I feel like we need to allow the list to both shrink and grow, depending
on how useful the entries in it are.  So one counter per LRU, incremented
on every add.  When that counter gets to 100, reset it to 0 and scan
110 entries.  Maybe 0 of them can be reclaimed; maybe 110 of them can be.
But the list can shrink over time instead of being a "one in, one out"
scenario.

Clearly 110 is a magic number, but intuitively, attempting to shrink
by 10% feels reasonable.  Need to strike a balance between shrinking
quickly enough and giving the cache time to figure out which entries
are actually useful.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-05 20:58           ` Yang Shi
@ 2022-04-05 21:21             ` Matthew Wilcox
  2022-04-06  0:01               ` Dave Chinner
  2022-04-05 21:31             ` Roman Gushchin
  1 sibling, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2022-04-05 21:21 UTC (permalink / raw)
  To: Yang Shi
  Cc: Roman Gushchin, Dave Chinner, Hillf Danton, MM, Mel Gorman,
	Stephen Brennan, Yu Zhao, David Hildenbrand, LKML

On Tue, Apr 05, 2022 at 01:58:59PM -0700, Yang Shi wrote:
> Yeah, I agree it actually doesn't make too much sense to return the
> number of reclaimed objects. Other part of vmscan returns the number
> of base pages, the sizes of slab objects are varied, it may be much
> smaller than a page, for example, dentry may be 192 bytes.

From the point of view of vmscan, it only cares about the number of pages
freed because it's trying to free pages.  But from the point of view of
trying to keep the number of non-useful objects in check, the number of
objects freed is more important, and it doesn't matter whether we ended
up freeing any pages because we made memory available for this slab cache.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-05 20:58           ` Yang Shi
  2022-04-05 21:21             ` Matthew Wilcox
@ 2022-04-05 21:31             ` Roman Gushchin
  2022-04-06  0:11               ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Roman Gushchin @ 2022-04-05 21:31 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dave Chinner, Hillf Danton, MM, Matthew Wilcox, Mel Gorman,
	Stephen Brennan, Yu Zhao, David Hildenbrand, LKML

On Tue, Apr 05, 2022 at 01:58:59PM -0700, Yang Shi wrote:
> On Tue, Apr 5, 2022 at 9:36 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Tue, Apr 05, 2022 at 03:17:10PM +1000, Dave Chinner wrote:
> > > On Mon, Apr 04, 2022 at 12:08:25PM -0700, Roman Gushchin wrote:
> > > > On Mon, Apr 04, 2022 at 11:09:48AM +1000, Dave Chinner wrote:
> > > > > i.e. the amount of work that shrinkers need to do in a periodic scan
> > > > > is largerly determined by the rate of shrinkable cache memory usage
> > > > > growth rather than memory reclaim priority as it is now. Hence there
> > > > > needs to be different high level "shrinker needs to do X amount of
> > > > > work" calculation for periodic reclaim than there is now.
> > > > >
> > > > > e.g. we calculate a rolling average of the size of the cache and a
> > > > > rate of change over a series of polling operations (i.e. calling
> > > > > ->scan_count) and then when sustained growth is detected we start
> > > > > trying to shrink the cache to limit the rate of growth of the cache.
> > > > >
> > > > > If the cache keeps growing, then it's objects are being repeatedly
> > > > > referenced and it *should* keep growing. If it's one-off objects
> > > > > that are causing the growth of the cache and so objects are being
> > > > > reclaimed by the shrinker, then matching the periodic shrink scan to
> > > > > the growth rate will substantially reduce the rate of growth of that
> > > > > cache.
> > > >
> > > > A clever idea!
> > > >
> > > > It seems like we need to add some stats to the list_lru API or maybe to
> > > > the shrinker API (and let list_lru to use it).
> > > > E.g. total/scanned/reclaimed, maybe with a time decay
> > > >
> > > > I'm also thinking about:
> > > > 1) adding a sysfs/debugfs interface to expose shrinkers current size and
> > > >    statistics, with an ability to call into the reclaim manually.
> > >
> > > I've thought about it, too, and can see where it could be useful.
> > > However, when I consider the list_lru memcg integration, I suspect
> > > it becomes a "can't see the forest for the trees" problem. We're
> > > going to end up with millions of sysfs objects with no obvious way
> > > to navigate, iterate or search them if we just take the naive "sysfs
> > > object + stats per list_lru instance" approach.
> >
> > Ok, I'll try to master something and share patches. I assume it would be useful
> > to have statistics and be able to trigger a reclaim of a particular shrinker
> > without creating a real memory pressure. I anticipate many interesting
> > findings in the implementation of individual shrinkers...
> >
> > >
> > > Also, if you look at commit 6a6b7b77cc0f mm: ("list_lru: transpose the
> > > array of per-node per-memcg lru lists") that went into 5.18-rc1,
> > > you'll get an idea of the amount of memory overhead just tracking
> > > the list_lru x memcg infrastructure consumes at scale:
> > >
> > >     I had done a easy test to show the optimization.  I create 10k memory
> > >     cgroups and mount 10k filesystems in the systems.  We use free command to
> > >     show how many memory does the systems comsumes after this operation (There
> > >     are 2 numa nodes in the system).
> > >
> > >             +-----------------------+------------------------+
> > >             |      condition        |   memory consumption   |
> > >             +-----------------------+------------------------+
> > >             | without this patchset |        24464 MB        |
> > >             +-----------------------+------------------------+
> > >             |     after patch 1     |        21957 MB        | <--------+
> > >             +-----------------------+------------------------+          |
> > >             |     after patch 10    |         6895 MB        |          |
> > >             +-----------------------+------------------------+          |
> > >             |     after patch 12    |         4367 MB        |          |
> > >             +-----------------------+------------------------+          |
> > >                                                                         |
> > >             The more the number of nodes, the more obvious the effect---+
> >
> > Yes, I know, I've reviewed this patchset. However, 10k cgroups _and_ 10k
> > mounts look a bit artificial. It's hard to believe that there are 100M
> > LRUs containing hot objects. If most of them are cold, it's actually
> > a question how to efficiently reclaim them and free the wasted memory.
> >
> > >
> > > If we now add sysfs objects and stats arrays to each of the
> > > list_lrus that we initiate even now on 5.18-rc1, we're going to
> > > massively blow out the memory footprint again.
> >
> > sysfs can be a config option, it doesn't have to be enabled in prod.
> > But I agree, because of memory constraints we're very limited in the size
> > of statistics which can be used for reclaim decisions.
> >
> > >
> > > So, nice idea, but I'm not sure we can make it useful and not
> > > consume huge amounts of memory....
> > >
> > > What might be more useful is a way of getting the kernel to tell us
> > > what the, say, 20 biggest slab caches are in the system and then
> > > provide a way to selectively shrink them. We generally don't really
> > > care about tiny caches, just the ones consuming all the memory. This
> > > isn't my idea - Kent has been looking at this because of how useless
> > > OOM kill output is for debugging slab cache based OOM triggers. See
> > > this branch for an example:
> > >
> > > https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text
> > >
> > > Even a sysfs entry that you echo a number into and it returns
> > > the "top N" largest LRU lists. echo -1 into it and it returns every
> > > single one if you want all the information.
> > >
> > > The whole "one sysfs file, one value" architecture falls completely
> > > apart when we might have to indexing *millions* of internal
> > > structures with many parameters per structure...
> >
> > It's a good point, I need to think what can we do here.
> >
> > Actually, if it's so hard and inefficient for a sysfs interface,
> > it also means it's inefficient for the reclaim path. So if we have
> > an ability to quickly find beefy LRU lists worth shrinking, it might
> > improve the reclaim efficiency too.
> >
> > Actually, thinking of reducing the memory footprint, it would be great
> > to combine LRU's belonging to different superblocks (of the same type).
> > Pagecache analogy: we have an LRU for all pagecache pages, it's not
> > per-file or per-sb. Not sure how easy/viable it is.
> >
> > >
> > > > 2) formalizing a reference bit/counter API on the shrinkers level, so that
> > > >    shrinker users can explicitly mark objects (re)-"activation".
> > >
> > > Not 100% certain what you are refering to here - something to do
> > > with active object rotation? Or an active/inactive list split with period
> > > demotion like we have for the page LRUs? Or workingset refault
> > > detection? Can you explain in more detail?
> >
> > It's a vague thinking at this point, but you got me right, I was thinking about
> > all the listed ideas above. In general I'm trying to understand whether
> > the existing model is sufficient for a more or less effective management
> > of hot/cold objects and if we can improve it borrowing some ideas from
> > the page reclaim code.
> > Shadow entries would be great but likely not possible because of the memory
> > footprint.
> >
> > >
> > > > 3) _maybe_ we need to change the shrinkers API from a number of objects to
> > > >    bytes, so that releasing a small number of large objects can compete with
> > > >    a releasing on many small objects. But I'm not sure.
> > >
> > > I think I suggested something similar a long time ago. We have
> > > shrinkers that track things other than slab objects. e.g. IIRC the
> > > ttm graphics allocator shrinker tracks sets of pages and frees
> > > pages, not slab objects. The XFS buffer cache tracks variable sized
> > > objects, from 512 bytes to 64KB in length, so the amount of memory
> > > it frees is variable even if the number of handles it scans and
> > > reclaims is fixed and consistent. Other subsystems have different
> > > "non object" shrinker needs as well.
> >
> > It's true for many objects, because it's not unusual for kernel objects
> > to have attached kmallocs or pin other objects in the memory. Unlikely
> > we can be 100% accurate, but we might improve the "smoothness" of the
> > reclaim process.
> >
> > >
> > > The long and short of it is that two shrinkers might have the same
> > > object count, but one might free 10x the amount of memory than the
> > > other for the same amount of shrinking work. Being able to focus
> > > reclaim work on caches that can free a lot of memory much more
> > > quickly would be a great idea.
> >
> > Right. It's also about "bytes freed/seeks" ratio. Why would we reclaim
> > expensive small objects if there are big and cheap.
> >
> > >
> > > It also means that a shrinker that scans a fragmented slab can keep
> > > going until a set number of slab pages have been freed, rather than
> > > a set number of slab objects. We can push reclaim of fragmented slab
> > > caches much harder when necessary if we are reclaiming by freed byte
> > > counts...
> >
> > I thought about it, but it's tricky. If all slabs are almost full (fragmentation
> > is low), we want to shrink aggressively, otherwise we end up allocating
> > more slabs and fragmenting more memory. If slabs are almost empty (fragmentation
> > is very high), we also want to shrink more aggressively with a hope to release
> > physical memory, as you said. But setting a target in pages is dangerous: slab caches
> > can be merged or slab pages can be pinned by objects belonging to a different
> > memory cgroup, so it might be not possible to reclaim pages no matter how hard
> > we're trying.
> 
> Yeah, I agree it actually doesn't make too much sense to return the
> number of reclaimed objects. Other part of vmscan returns the number
> of base pages, the sizes of slab objects are varied, it may be much
> smaller than a page, for example, dentry may be 192 bytes.
> 
> Another problem which doesn't help the feedback loop is even though
> bytes is returned instead of the number of objects, it doesn't mean
> that much memory is actually freed and available for allocation.

It might be not available for non-slab allocations, but at least new
slab objects can be allocated without polluting more pages. So it
might not be completely useless even if 0 pages were reclaimed.

> IMHO
> the number of really freed pages should be returned (I do understand
> it is not that easy for now), and returning 0 should be fine.

It's doable, there is already a mechanism in place which hooks into
the slub/slab/slob release path and stops the slab reclaim as a whole
if enough memory was freed.

> The
> current logic (returning the number of objects) may feed up something
> over-optimistic. I, at least, experienced once or twice that a
> significant amount of slab caches were shrunk, but actually 0 pages
> were freed actually. TBH the new slab controller may make it worse
> since the page may be pinned by the objects from other memcgs.

Of course, the more dense the placement of objects is, the harder is to get
the physical pages back. But usually it pays off by having a dramatically
lower total number of slab pages.

If the total size of slab memory is not very high and more or less constant,
we don't have to move large number of pages back and forth between slab- and
non-slab users. So if we can satisfy new allocations using "old" pages, it's
totally fine.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-05 21:18       ` Matthew Wilcox
@ 2022-04-05 23:54         ` Dave Chinner
  2022-04-06  1:06           ` Stephen Brennan
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2022-04-05 23:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Stephen Brennan, Hillf Danton, Roman Gushchin, MM, Mel Gorman,
	Yu Zhao, David Hildenbrand, LKML

On Tue, Apr 05, 2022 at 10:18:01PM +0100, Matthew Wilcox wrote:
> On Tue, Apr 05, 2022 at 10:22:14AM -0700, Stephen Brennan wrote:
> > I can't speak for every slab cache, but I've been coming to the same
> > conclusion myself regarding the dentry cache. I think that the rate of
> > stepping through the LRU should be tied to the rate of allocations.
> > Truly in-use objects shouldn't be harmed by this, as they should get
> > referenced and rotated to the beginning of the LRU. But the one-offs
> > which are bloating the cache will be found and removed.
> 
> I agree with all this.

Same here.

> > I've implemented a version of this patch which just takes one step
> > through the LRU on each d_alloc. It's quite interesting to watch it
> > work. You can create 5 million negative dentries in directory /A via
> > stat(), and then create 5 million negative dentries in directory /B. The
> > total dentry slab size reaches 5 million but never goes past it, since
> > the old negative dentries from /A aren't really in use, and they get
> > pruned at the same rate as negative dentries from /B get created. On the
> > other hand, if you *continue* to stat() on the dentries of /A while you
> > create negative dentries in /B, then the cache grows to 10 million,
> > since the /A dentries are actually in use.
> > 
> > Maybe a solution could involve some generic list_lru machinery that can
> > let you do these sorts of incremental scans? Maybe batching them up so
> > instead of doing one every allocation, you do them every 100 or 1000?
> > It would still be up to the individual user to put this to good use in
> > the object allocation path.
> 
> I feel like we need to allow the list to both shrink and grow, depending
> on how useful the entries in it are.  So one counter per LRU, incremented
> on every add.  When that counter gets to 100, reset it to 0 and scan
> 110 entries.  Maybe 0 of them can be reclaimed; maybe 110 of them can be.
> But the list can shrink over time instead of being a "one in, one out"
> scenario.

Yes, this is pretty much what I've been saying we should be using
the list-lru for since .... Well, let's just say it was one of the
things I wanted to be able to do when I first created the list-lru
infrastructure.

But it is much more complex than this. One of the issues with purely
list-lru add-time accounting is that we cannot make reclaim
decisions from list-add context because the list-add can occur in
reclaim context.  e.g.  dentry reclaim will drop the last reference
to an inode, which then gets inserted into the the inode list-lru in
reclaim context.  The very next thing the superblock shrinker does
is scan the inode cache list-lru and remove a pre-calculated number
of objects from the list. Hence in reclaim context we can be both
increasing and decreasing the size of the list-lru by significant
percentages in a very short time window. This means it will be quite
challenging to make clear decisions based purely on lru list add
operations.

Hence I think we want to connect more than just the unused entries
to periodic reclaim - we really need to know both the number of free
objects on the list-lru as well as the total number of objects
allocated that could be on the list_lru. This would give us some
comparitive measure of free objects vs active referenced objects
and that would allow better decisions to be made.

As it is, we've recently made a necessary connection between
allocation and the list-lru via kmem_cache_alloc_lru(). This was
done as part of the list-lru/memcg rework patchset I referenced
earlier in the thread.

This means that operations that slab objects that are kept
on list_lrus for caching are now supplied with the list_lru at
allocation time. We already use this API for inode caches (via
inode_alloc_sb()) and the dentry cache (via __d_alloc()), so we
already have the infrastructure in place to do per-allocation
list-lru accounting for inodes and dentries, not just "per list-lru
add/remove" accounting.

Extending that to other slab-based list-lru users should be pretty
easy, and in doing so would remove another difference between memcg
and non-memcg aware list-lrus....

> Clearly 110 is a magic number, but intuitively, attempting to shrink
> by 10% feels reasonable.  Need to strike a balance between shrinking
> quickly enough and giving the cache time to figure out which entries
> are actually useful.

Testing will teach us where the thresholds need to be pretty
quickly. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-05 21:21             ` Matthew Wilcox
@ 2022-04-06  0:01               ` Dave Chinner
  2022-04-21 19:03                 ` Kent Overstreet
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2022-04-06  0:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yang Shi, Roman Gushchin, Hillf Danton, MM, Mel Gorman,
	Stephen Brennan, Yu Zhao, David Hildenbrand, LKML

On Tue, Apr 05, 2022 at 10:21:53PM +0100, Matthew Wilcox wrote:
> On Tue, Apr 05, 2022 at 01:58:59PM -0700, Yang Shi wrote:
> > Yeah, I agree it actually doesn't make too much sense to return the
> > number of reclaimed objects. Other part of vmscan returns the number
> > of base pages, the sizes of slab objects are varied, it may be much
> > smaller than a page, for example, dentry may be 192 bytes.
> 
> From the point of view of vmscan, it only cares about the number of pages
> freed because it's trying to free pages.  But from the point of view of
> trying to keep the number of non-useful objects in check, the number of
> objects freed is more important, and it doesn't matter whether we ended
> up freeing any pages because we made memory available for this slab cache.

Yes and no. If the memory pressure is being placed on this cache,
then freeing any number of objects is a win-win situation - reclaim
makes progress and new allocations don't need to wait for reclaim.

However, if there is no pressure on this slab cache, then freeing
objects but no actual memory pages is largely wasted reclaim effort.
Freeing those objects does nothing to alleviate the memory shortage,
and the memory freed is not going to be consumed any time soon so
all we've done is fragment the slab cache and require the subsystem
to spend more resources re-populating it. That's a lose-lose.

We want to select the shrinkers that will result in the former
occurring, not the latter.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-05 21:31             ` Roman Gushchin
@ 2022-04-06  0:11               ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2022-04-06  0:11 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Yang Shi, Hillf Danton, MM, Matthew Wilcox, Mel Gorman,
	Stephen Brennan, Yu Zhao, David Hildenbrand, LKML

On Tue, Apr 05, 2022 at 02:31:02PM -0700, Roman Gushchin wrote:
> On Tue, Apr 05, 2022 at 01:58:59PM -0700, Yang Shi wrote:
> > On Tue, Apr 5, 2022 at 9:36 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > On Tue, Apr 05, 2022 at 03:17:10PM +1000, Dave Chinner wrote:
> > > > On Mon, Apr 04, 2022 at 12:08:25PM -0700, Roman Gushchin wrote:
> > > > > On Mon, Apr 04, 2022 at 11:09:48AM +1000, Dave Chinner wrote:
> > IMHO
> > the number of really freed pages should be returned (I do understand
> > it is not that easy for now), and returning 0 should be fine.
> 
> It's doable, there is already a mechanism in place which hooks into
> the slub/slab/slob release path and stops the slab reclaim as a whole
> if enough memory was freed.

The reclaim state that accounts for slab pages freed really
needs to be first class shrinker state that is aggregated at the
do_shrink_slab() level and passed back to the vmscan code. The
shrinker infrastructure itself should be aware of the progress each
shrinker is making - not just objects reclaimed but also pages
reclaimed - so it can make better decisions about how much work
should be done by each shrinker.

e.g. lots of objects in cache, lots of objects reclaimed, no pages
reclaimed is indicative of a fragmented slab cache. If this keeps
happening, we should be trying to apply extra pressure to this
specific cache because the only method we have for correcting a
fragmented cache to return some memory is to reclaim lots more
objects from it. 

> > The
> > current logic (returning the number of objects) may feed up something
> > over-optimistic. I, at least, experienced once or twice that a
> > significant amount of slab caches were shrunk, but actually 0 pages
> > were freed actually. TBH the new slab controller may make it worse
> > since the page may be pinned by the objects from other memcgs.
> 
> Of course, the more dense the placement of objects is, the harder is to get
> the physical pages back. But usually it pays off by having a dramatically
> lower total number of slab pages.

Unless you have tens of millions of objects in the cache. The dentry
cache is a prime example of this "lots of tiny cached objects" where
we have tens of objects per slab page and so can suffer badly from
internal fragmentation....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-05 23:54         ` Dave Chinner
@ 2022-04-06  1:06           ` Stephen Brennan
  2022-04-06  3:52             ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Stephen Brennan @ 2022-04-06  1:06 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox
  Cc: Hillf Danton, Roman Gushchin, MM, Mel Gorman, Yu Zhao,
	David Hildenbrand, LKML

Dave Chinner <david@fromorbit.com> writes:
> On Tue, Apr 05, 2022 at 10:18:01PM +0100, Matthew Wilcox wrote:
>> On Tue, Apr 05, 2022 at 10:22:14AM -0700, Stephen Brennan wrote:
>> > I can't speak for every slab cache, but I've been coming to the same
>> > conclusion myself regarding the dentry cache. I think that the rate of
>> > stepping through the LRU should be tied to the rate of allocations.
>> > Truly in-use objects shouldn't be harmed by this, as they should get
>> > referenced and rotated to the beginning of the LRU. But the one-offs
>> > which are bloating the cache will be found and removed.
>> 
>> I agree with all this.
>
> Same here.
>
>> > I've implemented a version of this patch which just takes one step
>> > through the LRU on each d_alloc. It's quite interesting to watch it
>> > work. You can create 5 million negative dentries in directory /A via
>> > stat(), and then create 5 million negative dentries in directory /B. The
>> > total dentry slab size reaches 5 million but never goes past it, since
>> > the old negative dentries from /A aren't really in use, and they get
>> > pruned at the same rate as negative dentries from /B get created. On the
>> > other hand, if you *continue* to stat() on the dentries of /A while you
>> > create negative dentries in /B, then the cache grows to 10 million,
>> > since the /A dentries are actually in use.
>> > 
>> > Maybe a solution could involve some generic list_lru machinery that can
>> > let you do these sorts of incremental scans? Maybe batching them up so
>> > instead of doing one every allocation, you do them every 100 or 1000?
>> > It would still be up to the individual user to put this to good use in
>> > the object allocation path.
>> 
>> I feel like we need to allow the list to both shrink and grow, depending
>> on how useful the entries in it are.  So one counter per LRU, incremented
>> on every add.  When that counter gets to 100, reset it to 0 and scan
>> 110 entries.  Maybe 0 of them can be reclaimed; maybe 110 of them can be.
>> But the list can shrink over time instead of being a "one in, one out"
>> scenario.
>
> Yes, this is pretty much what I've been saying we should be using
> the list-lru for since .... Well, let's just say it was one of the
> things I wanted to be able to do when I first created the list-lru
> infrastructure.
>
> But it is much more complex than this. One of the issues with purely
> list-lru add-time accounting is that we cannot make reclaim
> decisions from list-add context because the list-add can occur in
> reclaim context.  e.g.  dentry reclaim will drop the last reference
> to an inode, which then gets inserted into the the inode list-lru in
> reclaim context.  The very next thing the superblock shrinker does
> is scan the inode cache list-lru and remove a pre-calculated number
> of objects from the list. Hence in reclaim context we can be both
> increasing and decreasing the size of the list-lru by significant
> percentages in a very short time window. This means it will be quite
> challenging to make clear decisions based purely on lru list add
> operations.

Plus, for the dcache, dentries are added to the LRU the first time their
reference count drops to zero, but they're not removed if they gain a
new reference. So at least for the dentry case, it's not clear that
list_lru_add/del would be good indicators of free/in-use object counts.

> Hence I think we want to connect more than just the unused entries
> to periodic reclaim - we really need to know both the number of free
> objects on the list-lru as well as the total number of objects
> allocated that could be on the list_lru. This would give us some
> comparitive measure of free objects vs active referenced objects
> and that would allow better decisions to be made.

The dentry branch I have works purely based on total allocated objects:
no differentiation between free and active referenced objects. I could
hook into dput() where reference counts drop to zero for the other part
of the equation, because like I said, list_lru_{add,del} aren't reliable
for the dentry cache.

I have opinions about whether or not it would be helpful to add in the
dput() signal, but I'd rather just try it and see. I'm learning that my
opinion and intuition are not all that reliable when it comes to caching
and LRU algorithms; trial and error is key.

> As it is, we've recently made a necessary connection between
> allocation and the list-lru via kmem_cache_alloc_lru(). This was
> done as part of the list-lru/memcg rework patchset I referenced
> earlier in the thread.
>
> This means that operations that slab objects that are kept
> on list_lrus for caching are now supplied with the list_lru at
> allocation time. We already use this API for inode caches (via
> inode_alloc_sb()) and the dentry cache (via __d_alloc()), so we
> already have the infrastructure in place to do per-allocation
> list-lru accounting for inodes and dentries, not just "per list-lru
> add/remove" accounting.
>
> Extending that to other slab-based list-lru users should be pretty
> easy, and in doing so would remove another difference between memcg
> and non-memcg aware list-lrus....
>
>> Clearly 110 is a magic number, but intuitively, attempting to shrink
>> by 10% feels reasonable.  Need to strike a balance between shrinking
>> quickly enough and giving the cache time to figure out which entries
>> are actually useful.
>
> Testing will teach us where the thresholds need to be pretty
> quickly. :)

100% (that is, 2 per allocation) is too high in my very cursory trial,
the dentry cache didn't seem to grow much during filesystem workloads.
0% (that is, 1 per allocation) worked well enough but like Matthew says,
won't tend to shrink the cache when it is necessary.

Stephen

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-06  1:06           ` Stephen Brennan
@ 2022-04-06  3:52             ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2022-04-06  3:52 UTC (permalink / raw)
  To: Stephen Brennan
  Cc: Matthew Wilcox, Hillf Danton, Roman Gushchin, MM, Mel Gorman,
	Yu Zhao, David Hildenbrand, LKML

On Tue, Apr 05, 2022 at 06:06:38PM -0700, Stephen Brennan wrote:
> Dave Chinner <david@fromorbit.com> writes:
> > On Tue, Apr 05, 2022 at 10:18:01PM +0100, Matthew Wilcox wrote:
> >> On Tue, Apr 05, 2022 at 10:22:14AM -0700, Stephen Brennan wrote:
> >> > I can't speak for every slab cache, but I've been coming to the same
> >> > conclusion myself regarding the dentry cache. I think that the rate of
> >> > stepping through the LRU should be tied to the rate of allocations.
> >> > Truly in-use objects shouldn't be harmed by this, as they should get
> >> > referenced and rotated to the beginning of the LRU. But the one-offs
> >> > which are bloating the cache will be found and removed.
> >> 
> >> I agree with all this.
> >
> > Same here.
> >
> >> > I've implemented a version of this patch which just takes one step
> >> > through the LRU on each d_alloc. It's quite interesting to watch it
> >> > work. You can create 5 million negative dentries in directory /A via
> >> > stat(), and then create 5 million negative dentries in directory /B. The
> >> > total dentry slab size reaches 5 million but never goes past it, since
> >> > the old negative dentries from /A aren't really in use, and they get
> >> > pruned at the same rate as negative dentries from /B get created. On the
> >> > other hand, if you *continue* to stat() on the dentries of /A while you
> >> > create negative dentries in /B, then the cache grows to 10 million,
> >> > since the /A dentries are actually in use.
> >> > 
> >> > Maybe a solution could involve some generic list_lru machinery that can
> >> > let you do these sorts of incremental scans? Maybe batching them up so
> >> > instead of doing one every allocation, you do them every 100 or 1000?
> >> > It would still be up to the individual user to put this to good use in
> >> > the object allocation path.
> >> 
> >> I feel like we need to allow the list to both shrink and grow, depending
> >> on how useful the entries in it are.  So one counter per LRU, incremented
> >> on every add.  When that counter gets to 100, reset it to 0 and scan
> >> 110 entries.  Maybe 0 of them can be reclaimed; maybe 110 of them can be.
> >> But the list can shrink over time instead of being a "one in, one out"
> >> scenario.
> >
> > Yes, this is pretty much what I've been saying we should be using
> > the list-lru for since .... Well, let's just say it was one of the
> > things I wanted to be able to do when I first created the list-lru
> > infrastructure.
> >
> > But it is much more complex than this. One of the issues with purely
> > list-lru add-time accounting is that we cannot make reclaim
> > decisions from list-add context because the list-add can occur in
> > reclaim context.  e.g.  dentry reclaim will drop the last reference
> > to an inode, which then gets inserted into the the inode list-lru in
> > reclaim context.  The very next thing the superblock shrinker does
> > is scan the inode cache list-lru and remove a pre-calculated number
> > of objects from the list. Hence in reclaim context we can be both
> > increasing and decreasing the size of the list-lru by significant
> > percentages in a very short time window. This means it will be quite
> > challenging to make clear decisions based purely on lru list add
> > operations.
> 
> Plus, for the dcache, dentries are added to the LRU the first time their
> reference count drops to zero, but they're not removed if they gain a
> new reference. So at least for the dentry case, it's not clear that
> list_lru_add/del would be good indicators of free/in-use object counts.

Same for several other list-lru caches - the lazy removal trick is
used by (at least) the dentry, inode, XFS buffer and XFS dquot
caches. It's simply a way of reducing lock contention on the LRU
list - generally only memory reclaim or object freeing needs list
removal. So why remove it on every new reference to an object only
to have to put it back on the list a short time later?

> > Hence I think we want to connect more than just the unused entries
> > to periodic reclaim - we really need to know both the number of free
> > objects on the list-lru as well as the total number of objects
> > allocated that could be on the list_lru. This would give us some
> > comparitive measure of free objects vs active referenced objects
> > and that would allow better decisions to be made.
> 
> The dentry branch I have works purely based on total allocated objects:
> no differentiation between free and active referenced objects. I could
> hook into dput() where reference counts drop to zero for the other part
> of the equation, because like I said, list_lru_{add,del} aren't reliable
> for the dentry cache.

Right, but shrinkers work from the number of objects that are
potentially freeable in a cache, not from the number that are
currently allocated. i.e. it uses the number of objects on the LRU
as a proxy for size of the allocated cache.

My point is that we actually need both sets of stats to make decent
reclaim decisions - scanning large caches is not useful if there is
almost zero potentially reclaimable objects in the cache...

That, however, brings into focus another issue: we conflate "reclaim
scanning" with list rotation (removing referenced bits and giving an
object another pass through the list). This forces reclaim to try to
reclaim a significant portion of any cache regardless of whether
there is nothing we can free immediately. If we don't then we never
strip the referenced bits that will allows the objects to eventually
be reclaimed.that will allows the objects to eventually be
reclaimed...

> I have opinions about whether or not it would be helpful to add in the
> dput() signal, but I'd rather just try it and see. I'm learning that my
> opinion and intuition are not all that reliable when it comes to caching
> and LRU algorithms; trial and error is key.

*nod*

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-06  0:01               ` Dave Chinner
@ 2022-04-21 19:03                 ` Kent Overstreet
  2022-04-21 23:55                   ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Kent Overstreet @ 2022-04-21 19:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Yang Shi, Roman Gushchin, Hillf Danton, MM,
	Mel Gorman, Stephen Brennan, Yu Zhao, David Hildenbrand, LKML

On Wed, Apr 06, 2022 at 10:01:30AM +1000, Dave Chinner wrote:
> On Tue, Apr 05, 2022 at 10:21:53PM +0100, Matthew Wilcox wrote:
> > On Tue, Apr 05, 2022 at 01:58:59PM -0700, Yang Shi wrote:
> > > Yeah, I agree it actually doesn't make too much sense to return the
> > > number of reclaimed objects. Other part of vmscan returns the number
> > > of base pages, the sizes of slab objects are varied, it may be much
> > > smaller than a page, for example, dentry may be 192 bytes.
> > 
> > From the point of view of vmscan, it only cares about the number of pages
> > freed because it's trying to free pages.  But from the point of view of
> > trying to keep the number of non-useful objects in check, the number of
> > objects freed is more important, and it doesn't matter whether we ended
> > up freeing any pages because we made memory available for this slab cache.
> 
> Yes and no. If the memory pressure is being placed on this cache,
> then freeing any number of objects is a win-win situation - reclaim
> makes progress and new allocations don't need to wait for reclaim.
> 
> However, if there is no pressure on this slab cache, then freeing
> objects but no actual memory pages is largely wasted reclaim effort.
> Freeing those objects does nothing to alleviate the memory shortage,
> and the memory freed is not going to be consumed any time soon so
> all we've done is fragment the slab cache and require the subsystem
> to spend more resources re-populating it. That's a lose-lose.
> 
> We want to select the shrinkers that will result in the former
> occurring, not the latter.

Do we have any existing shrinkers that preferentially free from mostly empty
slab pages though? And do we want them to?

You're talking about memory fragmentation, and I'm not sure that should be the
shrinker's concern (on the other hand, I'm not sure it shouldn't - just freeing
the objects on mostly empty slab pages is pretty reasonable for cached objects).

We could also plumb compaction down to the slab level, and just request the
subsystem move those objects. Might be easier than making that an additional
responsibility of the shrinkers, which really should be more concerned with
implementing cache replacement policy and whatnot - e.g. shrinkers were doing
something more LFU-ish that would also help with the one-off object problem.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] mm/vmscan: add periodic slab shrinker
  2022-04-21 19:03                 ` Kent Overstreet
@ 2022-04-21 23:55                   ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2022-04-21 23:55 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Yang Shi, Roman Gushchin, Hillf Danton, MM,
	Mel Gorman, Stephen Brennan, Yu Zhao, David Hildenbrand, LKML

On Thu, Apr 21, 2022 at 03:03:39PM -0400, Kent Overstreet wrote:
> On Wed, Apr 06, 2022 at 10:01:30AM +1000, Dave Chinner wrote:
> > On Tue, Apr 05, 2022 at 10:21:53PM +0100, Matthew Wilcox wrote:
> > > On Tue, Apr 05, 2022 at 01:58:59PM -0700, Yang Shi wrote:
> > > > Yeah, I agree it actually doesn't make too much sense to return the
> > > > number of reclaimed objects. Other part of vmscan returns the number
> > > > of base pages, the sizes of slab objects are varied, it may be much
> > > > smaller than a page, for example, dentry may be 192 bytes.
> > > 
> > > From the point of view of vmscan, it only cares about the number of pages
> > > freed because it's trying to free pages.  But from the point of view of
> > > trying to keep the number of non-useful objects in check, the number of
> > > objects freed is more important, and it doesn't matter whether we ended
> > > up freeing any pages because we made memory available for this slab cache.
> > 
> > Yes and no. If the memory pressure is being placed on this cache,
> > then freeing any number of objects is a win-win situation - reclaim
> > makes progress and new allocations don't need to wait for reclaim.
> > 
> > However, if there is no pressure on this slab cache, then freeing
> > objects but no actual memory pages is largely wasted reclaim effort.
> > Freeing those objects does nothing to alleviate the memory shortage,
> > and the memory freed is not going to be consumed any time soon so
> > all we've done is fragment the slab cache and require the subsystem
> > to spend more resources re-populating it. That's a lose-lose.
> > 
> > We want to select the shrinkers that will result in the former
> > occurring, not the latter.
> 
> Do we have any existing shrinkers that preferentially free from mostly empty
> slab pages though?

No, because shrinkers have no visbility into slab cache layout.

> And do we want them to?

There have been attempts in the past to do this, which started from
selecting partial slab pages and then trying free/reclaim the
objects on those pages.

The problems these direct defrag attempts face is that partial slabs
can be a mix of referenced (in use) and unreferenced (reclaimable)
objects. Freeing/relocating an in use object is largely impossible
because of all the (unknown) external references to the object would
need to be updated and that's an intractable problem.

Hence attempts to directly target partial slab pages for reclaim
have largely ended up not having very good results because partial
slab pages containing only unreferenced objects are much rarer than
partial slab pages....

> You're talking about memory fragmentation, and I'm not sure that should be the
> shrinker's concern (on the other hand, I'm not sure it shouldn't - just freeing
> the objects on mostly empty slab pages is pretty reasonable for cached objects).

Yeah, but as per above, having mostly empty slab pages does not mean
the remaining active objects on those pages can actually be
reclaimed...

> We could also plumb compaction down to the slab level, and just request the
> subsystem move those objects.

Been there, tried that, not feasible. How do you find all the
external active references to a dentry or inode at any given point
in time and then prevent all of them from being actively used while
you switch all the external pointers to the old object to the new
object?

> Might be easier than making that an additional
> responsibility of the shrinkers, which really should be more concerned with
> implementing cache replacement policy and whatnot - e.g. shrinkers were doing
> something more LFU-ish that would also help with the one-off object problem.

*nod*

I've said this many times in the past when people have wanted to
hack special cases into inode and dentry cache reference bit
management.  We need to make the list_lru rotation implementation
smarter, not hack special cases into individual shrinker algorithms

One of the original goals of the list_lru was to unify all the LRU
mechanisms used by shrinkable slab caches so we could then do
something smarter with the list_lru than a plain LRU + reference bit
and everything would benefit.  e.g. Being able to bias reclaim
priority of objects based on their known access patterns (which I
implemented for the xfs buffer cache via b_lru_ref so it holds onto
tree roots and nodes harder than it does leaves) or moving the list
ordering towards LFU or clock-pro style active/inactive lists page
reclaim uses to allow single use objects to stream through the cache
instead of turn it entirely over were all potential improvements I
as thinking of.

So, yes, we should be looking at improving the list_lru
implementation so taht it handles streaming single use objects a
whole lot better. Fix it once, fix it properly, and everyone who
uses list-lru and shrinkers benefits...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-04-21 23:55 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20220402072103.5140-1-hdanton@sina.com>
2022-04-02 17:54 ` [RFC] mm/vmscan: add periodic slab shrinker Roman Gushchin
     [not found] ` <20220403005618.5263-1-hdanton@sina.com>
2022-04-04  1:09   ` Dave Chinner
2022-04-04 19:08     ` Roman Gushchin
2022-04-05  5:17       ` Dave Chinner
2022-04-05 16:35         ` Roman Gushchin
2022-04-05 20:58           ` Yang Shi
2022-04-05 21:21             ` Matthew Wilcox
2022-04-06  0:01               ` Dave Chinner
2022-04-21 19:03                 ` Kent Overstreet
2022-04-21 23:55                   ` Dave Chinner
2022-04-05 21:31             ` Roman Gushchin
2022-04-06  0:11               ` Dave Chinner
2022-04-05 17:22     ` Stephen Brennan
2022-04-05 21:18       ` Matthew Wilcox
2022-04-05 23:54         ` Dave Chinner
2022-04-06  1:06           ` Stephen Brennan
2022-04-06  3:52             ` Dave Chinner
     [not found]   ` <20220404051442.5419-1-hdanton@sina.com>
2022-04-04 18:32     ` Roman Gushchin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).