linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Yang Shi <shy828301@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Linux MM <linux-mm@kvack.org>,
	Linux FS-devel Mailing List <linux-fsdevel@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 0/2] Remove shrinker's nr_deferred
Date: Wed, 30 Sep 2020 10:23:07 -0700	[thread overview]
Message-ID: <CAHbLzkp2L-g7ms7_ddpA+LpEmKkk6a+9KebBYWkaLV5EXSPu=Q@mail.gmail.com> (raw)
In-Reply-To: <20200930073152.GH12096@dread.disaster.area>

On Wed, Sep 30, 2020 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sat, Sep 26, 2020 at 01:31:36PM -0700, Yang Shi wrote:
> > Hi Dave,
> >
> > I was exploring to make the "nr_deferred" per memcg. I looked into and
> > had some prototypes for two approaches so far:
> > 1. Have per memcg data structure for each memcg aware shrinker, just
> > like what shrinker_map does.
> > 2. Have "nr_deferred" on list_lru_one for memcg aware lists.
> >
> > Both seem feasible, however the latter one looks much cleaner, I just
> > need to add two new APIs for shrinker which gets and sets
> > "nr_deferred" respectively. And, just memcg aware shrinkers need
> > define those two callouts. We just need to care about memcg aware
> > shrinkers, and the most memcg aware shrinkers (inode/dentry, nfs and
> > workingset) use list_lru, so I'd prefer the latter one.
>
> The list_lru is completely separate from the shrinker context. The
> structure that tracks objects in a subsystem is not visible or aware
> of how the high level shrinker scanning algorithms work. Not to
> mention that subsystem shrinkers can be memcg aware without using
> list_lru structures to index objects owned by a given memcg. Hence I
> really don't like the idea of tying the shrinker control data deeply
> into subsystem cache indexing....

I see your points. Yes, makes sense to me. The list_lru is a common
data structure and could be used by other subsystems, not only memcg
aware shrinkers.

Putting shrinker control data in list_lru seems break the layer. So,
option #1 might be more appropriate. The change looks like:

struct mem_cgroup_per_node {
...
        struct memcg_shrinker_map __rcu *shrinker_map;
+       struct memcg_shrinker_deferred __rcu    *shrinker_deferred;
...
}

>
>
> > But there are two memcg aware shrinkers are not that straightforward
> > to deal with:
> > 1. The deferred split THP. It doesn't use list_lru, but actually I
> > don't worry about this one, since it is not cache just some partial
> > unmapped THPs. I could try to convert it to use list_lru later on or
> > just kill deferred split by making vmscan split partial unmapped THPs.
> > So TBH I don't think it is a blocker.
>
> What a fantastic abuse of the reclaim infrastructure. :/
>
> First it was just defered work. Then it became NUMA_AWARE. THen it
> became MEMCG_AWARE and....
>
> Oh, man what a nasty hack that SHRINKER_NONSLAB flag is so that it
> runs through shrink_slab_memcg() even when memcgs are configured in
> but kmem tracking disabled. We have heaps of shrinkers that reclaim
> from things that aren't slab caches, but this one is just nuts.
>
> > 2. The fs_objects. This one looks weird. It shares the same shrinker
> > with inode/dentry. The only user is XFS currently. But it looks it is
> > not really memcg aware at all, right?
>
> It most definitely is.
>
> The VFS dentry and inode cache reclaim are memcg aware. The
> fs_objects callout is for filesystem level object garbage collection
> that can be done as a result of the dentry and inode caches being
> reclaimed.
>
> i.e. once the VFS has reclaimed the inode attached to the memcg, it
> is no longer attached and accounted to the memcg anymore. It is
> owned by the filesystem at this point, and it is entirely up to the
> filesytem to when it can then be freed. Most filesystems do it in
> the inode cache reclaim via the ->destroy method. XFS, OTOH, tags
> freeable inodes in it's internal radix trees rather than freeing
> them immediately because it still may have to clean the inode before
> it can be freed. Hence we defer freeing of inodes until the
> ->fs_objects pass....

Aha, thanks for elaborating. Now I see what it is doing for...

>
> > They are managed by radix tree
> > which is not per memcg by looking into xfs code, so the "per list_lru
> > nr_deferred" can't work for it.  I thought of a couple of ways to
> > tackle it off the top of my head:
> >     A. Just ignore it. If the amount of fs_objects are negligible
> > comparing to inode/dentry, then I think it can be just ignored and
> > kept it as is.
>
> Ah, no, they are not negliable. Under memory pressure, the number of
> objects is typically 1/3rd dentries, 1/3rd VFS inodes, 1/3rd fs
> objects to be reclaimed. The dentries and VFS inodes are owned by
> VFS level caches and associated with memcgs, the fs_objects are only
> visible to the filesystem.
>
> >     B. Move it out of inode/dentry shrinker. Add a dedicated shrinker
> > for it, for example, sb->s_fs_obj_shrink.
>
> No, they are there because the reclaim has to be kept in exact
> proportion to the dentry and inode reclaim quantities. That's the
> reason I put that code there in the first place: a separate inode
> filesystem cache shrinker just didn't work well at all.
>
> >     C. Make it really memcg aware and use list_lru.
>
> Two things. Firstly, objects are owned by the filesystem at this
> point, not memcgs. Memcgs were detatched at the VFS inode reclaim
> layer.
>
> Secondly, list-lru does not scale well enough for the use needed by
> XFS. We use radix trees so we can do lockless batch lookups and
> IO-efficient inode-order reclaim passes. We also have concurrent
> reclaim capabilities because of the lockless tag lookup walks.
> Using a list_lru for this substantially reduces reclaim performance
> and greatly increases CPU usage of reclaim because of contention on
> the internal list lru locks. Been there, measured that....
>
> > I don't have any experience on XFS code, #C seems the most optimal,
> > but should be the most time consuming, I'm not sure if it is worth it
> > or not. So, #B sounds more preferred IMHO.
>
> I think you're going completely in the wrong direction. The problem
> that needs solving is integrating shrinker scanning control state
> with memcgs more tightly, not force every memcg aware shrinker to
> use list_lru for their subsystem shrinker implementations....

Thanks a lot for all the elaboration and advice. Integrating shrinker
scanning control state with memcgs more tightly makes sense to me.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

      reply	other threads:[~2020-09-30 17:23 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-16 18:58 [RFC PATCH 0/2] Remove shrinker's nr_deferred Yang Shi
2020-09-16 18:58 ` [RFC PATCH 1/2] mm: vmscan: remove shrinker's nr_deferred from tracepoint Yang Shi
2020-09-16 18:58 ` [RFC PATCH 2/2] mm: vmscan: remove shrinker's nr_deferred Yang Shi
2020-09-17  2:37 ` [RFC PATCH 0/2] Remove " Dave Chinner
2020-09-18  0:12   ` Yang Shi
2020-09-21  0:32     ` Dave Chinner
2020-09-22 23:45       ` Yang Shi
2020-09-26 20:31         ` Yang Shi
2020-09-30  7:31           ` Dave Chinner
2020-09-30 17:23             ` Yang Shi [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHbLzkp2L-g7ms7_ddpA+LpEmKkk6a+9KebBYWkaLV5EXSPu=Q@mail.gmail.com' \
    --to=shy828301@gmail.com \
    --cc=david@fromorbit.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).