From: Yang Shi
Date: Wed, 30 Sep 2020 10:23:07 -0700
Subject: Re: [RFC PATCH 0/2] Remove shrinker's nr_deferred
To: Dave Chinner
Cc: Linux MM, Linux FS-devel Mailing List, Linux Kernel Mailing List
X-Mailing-List: linux-fsdevel@vger.kernel.org
In-Reply-To: <20200930073152.GH12096@dread.disaster.area>
References: <20200916185823.5347-1-shy828301@gmail.com>
 <20200917023742.GT12096@dread.disaster.area>
 <20200921003231.GZ12096@dread.disaster.area>
 <20200930073152.GH12096@dread.disaster.area>
Content-Type: text/plain; charset="UTF-8"

On Wed, Sep 30, 2020 at 12:31 AM Dave Chinner wrote:
>
> On Sat, Sep 26, 2020 at 01:31:36PM -0700, Yang Shi wrote:
> > Hi Dave,
> >
> > I was exploring how to make the "nr_deferred" per memcg. I looked into
> > it and have prototyped two approaches so far:
> > 1. Have a per memcg data structure for each memcg aware shrinker, just
> > like what shrinker_map does.
> > 2. Have "nr_deferred" on list_lru_one for memcg aware lists.
> >
> > Both seem feasible; however, the latter one looks much cleaner. I just
> > need to add two new APIs for shrinkers which get and set "nr_deferred"
> > respectively, and only memcg aware shrinkers need to define those two
> > callouts. We just need to care about memcg aware shrinkers, and most
> > memcg aware shrinkers (inode/dentry, nfs and workingset) use list_lru,
> > so I'd prefer the latter one.
>
> The list_lru is completely separate from the shrinker context. The
> structure that tracks objects in a subsystem is not visible to or aware
> of how the high level shrinker scanning algorithms work. Not to
> mention that subsystem shrinkers can be memcg aware without using
> list_lru structures to index objects owned by a given memcg. Hence I
> really don't like the idea of tying the shrinker control data deeply
> into subsystem cache indexing....

I see your points. Yes, that makes sense to me. The list_lru is a common
data structure and could be used by other subsystems, not only by memcg
aware shrinkers, so putting shrinker control data in list_lru seems to
break the layering. So, option #1 might be more appropriate. The change
would look like:

struct mem_cgroup_per_node {
	...
	struct memcg_shrinker_map __rcu *shrinker_map;
+	struct memcg_shrinker_deferred __rcu *shrinker_deferred;
	...
}
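Fleshing that out a little, the per-memcg deferred counts could be laid
out and accessed the same way the existing shrinker_map is, indexed by
shrinker->id. A rough sketch below; the array layout and the helper
names shrinker_defer_read()/shrinker_defer_add() are placeholders I
made up for illustration, not existing kernel API:

	struct memcg_shrinker_deferred {
		struct rcu_head rcu;
		atomic_long_t nr_deferred[];	/* one slot per shrinker id */
	};

	/* Steal the deferred work for this shrinker/memcg/node pair. */
	static long shrinker_defer_read(struct shrinker *shrinker,
					struct shrink_control *sc)
	{
		struct memcg_shrinker_deferred *deferred;
		long nr;

		rcu_read_lock();
		deferred = rcu_dereference(
			mem_cgroup_nodeinfo(sc->memcg, sc->nid)->shrinker_deferred);
		nr = atomic_long_xchg(&deferred->nr_deferred[shrinker->id], 0);
		rcu_read_unlock();

		return nr;
	}

	/* Return the work we could not complete in this pass. */
	static void shrinker_defer_add(struct shrinker *shrinker,
				       struct shrink_control *sc, long nr)
	{
		struct memcg_shrinker_deferred *deferred;

		rcu_read_lock();
		deferred = rcu_dereference(
			mem_cgroup_nodeinfo(sc->memcg, sc->nid)->shrinker_deferred);
		atomic_long_add(nr, &deferred->nr_deferred[shrinker->id]);
		rcu_read_unlock();
	}

The arrays would presumably need to be reallocated when a new shrinker
id is assigned, piggybacking on the same expand path the shrinker_map
code already uses.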
> > But there are two memcg aware shrinkers that are not that
> > straightforward to deal with:
> >
> > 1. The deferred split THP shrinker. It doesn't use list_lru, but
> > actually I don't worry about this one, since it is not a cache, just
> > some partially unmapped THPs. I could try to convert it to use
> > list_lru later on, or just kill the deferred split by making vmscan
> > split partially unmapped THPs. So TBH I don't think it is a blocker.
>
> What a fantastic abuse of the reclaim infrastructure. :/
>
> First it was just deferred work. Then it became NUMA_AWARE. Then it
> became MEMCG_AWARE and....
>
> Oh man, what a nasty hack that SHRINKER_NONSLAB flag is, so that it
> runs through shrink_slab_memcg() even when memcgs are configured in
> but kmem tracking is disabled. We have heaps of shrinkers that reclaim
> from things that aren't slab caches, but this one is just nuts.
>
> > 2. The fs_objects. This one looks weird. It shares the same shrinker
> > with inode/dentry. The only user is XFS currently. But it looks like
> > it is not really memcg aware at all, right?
>
> It most definitely is.
>
> The VFS dentry and inode cache reclaim are memcg aware. The
> fs_objects callout is for filesystem level object garbage collection
> that can be done as a result of the dentry and inode caches being
> reclaimed.
>
> i.e. once the VFS has reclaimed the inode attached to the memcg, it
> is no longer attached to or accounted to the memcg anymore. It is
> owned by the filesystem at this point, and it is entirely up to the
> filesystem when it can then be freed. Most filesystems do it in
> the inode cache reclaim via the ->destroy method. XFS, OTOH, tags
> freeable inodes in its internal radix trees rather than freeing
> them immediately, because it still may have to clean the inode before
> it can be freed. Hence we defer freeing of inodes until the
> ->fs_objects pass....

Aha, thanks for elaborating. Now I see what it is there for...

> > They are managed by a radix tree, which is not per memcg from my
> > reading of the XFS code, so the "per list_lru nr_deferred" can't work
> > for it. I thought of a couple of ways to tackle it off the top of my
> > head:
> >
> > A. Just ignore it. If the amount of fs_objects is negligible compared
> > to inodes/dentries, then I think it can just be ignored and kept as
> > is.
>
> Ah, no, they are not negligible. Under memory pressure, the objects to
> be reclaimed are typically 1/3rd dentries, 1/3rd VFS inodes, and 1/3rd
> fs objects. The dentries and VFS inodes are owned by VFS level caches
> and associated with memcgs; the fs_objects are only visible to the
> filesystem.
>
> > B. Move it out of the inode/dentry shrinker. Add a dedicated shrinker
> > for it, for example, sb->s_fs_obj_shrink.
>
> No, they are there because the reclaim has to be kept in exact
> proportion to the dentry and inode reclaim quantities. That's the
> reason I put that code there in the first place: a separate
> filesystem inode cache shrinker just didn't work well at all.
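Right, that makes the coupling clear: super_cache_scan() splits
sc->nr_to_scan across the three caches in proportion to their object
counts, so the ->free_cached_objects() pass is driven in lockstep with
the dcache/icache pruning. Roughly like this (a simplified paraphrase
of fs/super.c from memory, with the locking and GFP checks omitted,
not a verbatim copy):

	static unsigned long super_cache_scan(struct shrinker *shrink,
					      struct shrink_control *sc)
	{
		struct super_block *sb =
			container_of(shrink, struct super_block, s_shrink);
		long fs_objects = 0, dentries, inodes, total_objects, freed;

		if (sb->s_op->nr_cached_objects)
			fs_objects = sb->s_op->nr_cached_objects(sb, sc);

		inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
		dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
		total_objects = dentries + inodes + fs_objects + 1;

		/* proportion the scan between the caches */
		dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
		inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
		fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);

		/* prune the dcache first as the icache is pinned by it */
		sc->nr_to_scan = dentries + 1;
		freed = prune_dcache_sb(sb, sc);
		sc->nr_to_scan = inodes + 1;
		freed += prune_icache_sb(sb, sc);

		if (fs_objects) {
			sc->nr_to_scan = fs_objects + 1;
			freed += sb->s_op->free_cached_objects(sb, sc);
		}

		return freed;
	}

So a dedicated fs_objects shrinker would lose exactly that 1:1:1
proportioning.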
> > C. Make it really memcg aware and use list_lru.
>
> Two things. Firstly, objects are owned by the filesystem at this
> point, not memcgs. Memcgs were detached at the VFS inode reclaim
> layer.
>
> Secondly, list_lru does not scale well enough for the use needed by
> XFS. We use radix trees so we can do lockless batched lookups and
> IO-efficient inode-order reclaim passes. We also have concurrent
> reclaim capabilities because of the lockless tag lookup walks.
> Using a list_lru for this substantially reduces reclaim performance
> and greatly increases the CPU usage of reclaim because of contention
> on the internal list_lru locks. Been there, measured that....
>
> > I don't have any experience with the XFS code. #C seems the most
> > optimal, but it would be the most time consuming, and I'm not sure
> > whether it is worth it. So #B sounds preferable IMHO.
>
> I think you're going completely in the wrong direction. The problem
> that needs solving is integrating shrinker scanning control state
> with memcgs more tightly, not forcing every memcg aware shrinker to
> use list_lru for its subsystem shrinker implementation....

Thanks a lot for all the elaboration and advice. Integrating shrinker
scanning control state with memcgs more tightly makes sense to me.

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com