From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-3.7 required=3.0 tests=BAYES_00,DKIM_SIGNED,
	DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,
	SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id ADDC0C433ED
	for <linux-kernel@archiver.kernel.org>; Mon,  3 May 2021 06:34:08 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 880F161244
	for <linux-kernel@archiver.kernel.org>; Mon,  3 May 2021 06:34:08 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S232752AbhECGe7 (ORCPT <rfc822;linux-kernel@archiver.kernel.org>);
        Mon, 3 May 2021 02:34:59 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57746 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S229933AbhECGe5 (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 3 May 2021 02:34:57 -0400
Received: from mail-pj1-x102c.google.com (mail-pj1-x102c.google.com [IPv6:2607:f8b0:4864:20::102c])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E2F84C06174A
        for <linux-kernel@vger.kernel.org>; Sun,  2 May 2021 23:33:58 -0700 (PDT)
Received: by mail-pj1-x102c.google.com with SMTP id p17so2488989pjz.3
        for <linux-kernel@vger.kernel.org>; Sun, 02 May 2021 23:33:58 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=bytedance-com.20150623.gappssmtp.com; s=20150623;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=YAfjaggDazXvQtYp2FS5y03jgZVhcBFsjn4+g9RTkfc=;
        b=xVJ7R69O2ECyU0IHuW1rDWvr0VqTU50aIrgzPU9MNFjG8trH7bMeywNS5jLq2b1IWp
         X2yPhy6OQINzdEYHRqLd9uCiHVslCu4qd5YrwV4j7EdBS094XPIhUQpku9O7nzPMuWXc
         ZvAHnnFNRwElWf8eCLM0VcMOr9hg5CCU0sC9mWBnTw91VcdpjRU7v+v5qe2TioP5oYxD
         xn+Yix1jrTA7hER1/DGHC8zmKoU1PfLv+2XuyW5xjTXUU9LP93uSz8AS/kz7Pcg5AVAi
         lyJfj18aUtanwex7L2CWXn3M1gpeWEM/VtFmEijMqRLUzUIBcDH7fquyDzainBL0VlLA
         sjAg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=YAfjaggDazXvQtYp2FS5y03jgZVhcBFsjn4+g9RTkfc=;
        b=qvldpcXAW/aDKCVwJmZ65I1FmEDGijuDQWGrpV0/0q+biIj4j1k4eoQaMbqVj+cbVS
         +5V9LtLWgqSKH+8hOxpYpZw1m4jMPsXqTIrgM9A426l01UJPPRHKd4+ysFnHbTJcew0L
         u4PgsKK7eQhq9LyfGIstKcSrJnQJSD05FnLaQMBC5F4vntg2BHp4IXCfQLlIhfuXozLf
         FAm6AwUsvCZW7YAz9xX1G0IZ/9hGuk/gJemGBbkj3vShZwJbU/8X4n7f9CyPn3dr6aPk
         QCenm1ATZcM2rdYORgNpR67BX9U2LPCtb+LAh/vMTMA2hbULeQTsZCBcbZIgsEcV4che
         Jebw==
X-Gm-Message-State: AOAM5308cnBWr94WeKro6M0jfwhahkK6jrVIzfK1ZMvuUjVDgp2qj19s
        W8R/DNlCjSEjttTO2lvCBvnm6c6VPVnA4BurNblvNg==
X-Google-Smtp-Source: ABdhPJxGflL8gqcfGI3qnihVxhFS3o1OhYuGJJNiad6v/bHj/QMwqly1oymqd0EqExsT0FHNjife4SK934iFvznUWNo=
X-Received: by 2002:a17:90a:644b:: with SMTP id y11mr19284123pjm.229.1620023638171;
 Sun, 02 May 2021 23:33:58 -0700 (PDT)
MIME-Version: 1.0
References: <20210428094949.43579-1-songmuchun@bytedance.com>
 <20210430004903.GF1872259@dread.disaster.area> <YItf3GIUs2skeuyi@carbon.dhcp.thefacebook.com>
 <20210430032739.GG1872259@dread.disaster.area> <CAMZfGtXawtMT4JfBtDLZ+hES4iEHFboe2UgJee_s-NhZR5faAw@mail.gmail.com>
 <20210502235843.GJ1872259@dread.disaster.area>
In-Reply-To: <20210502235843.GJ1872259@dread.disaster.area>
From:   Muchun Song <songmuchun@bytedance.com>
Date:   Mon, 3 May 2021 14:33:21 +0800
Message-ID: <CAMZfGtVK2Sracf=ongpNJqacafmC2ZsNy-KxEL67fVCAGXz3xA@mail.gmail.com>
Subject: Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory
 cgroup removal
To:     Dave Chinner <david@fromorbit.com>
Cc:     Roman Gushchin <guro@fb.com>, Matthew Wilcox <willy@infradead.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Michal Hocko <mhocko@kernel.org>,
        Vladimir Davydov <vdavydov.dev@gmail.com>,
        Shakeel Butt <shakeelb@google.com>,
        Yang Shi <shy828301@gmail.com>, alexs@kernel.org,
        Alexander Duyck <alexander.h.duyck@linux.intel.com>,
        Wei Yang <richard.weiyang@gmail.com>,
        linux-fsdevel <linux-fsdevel@vger.kernel.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Linux Memory Management List <linux-mm@kvack.org>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, May 3, 2021 at 7:58 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Apr 30, 2021 at 04:32:39PM +0800, Muchun Song wrote:
> > On Fri, Apr 30, 2021 at 11:27 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Thu, Apr 29, 2021 at 06:39:40PM -0700, Roman Gushchin wrote:
> > > > On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> > > > > On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > > > > > > On our server, we found a suspected memory leak problem. The kmalloc-32
> > > > > > > cache consumes more than 6GB of memory. Other kmem_caches consume less
> > > > > > > than 2GB of memory.
> > > > > > >
> > > > > > > After our in-depth analysis, the memory consumption of the kmalloc-32 slab
> > > > > > > cache is caused by list_lru_one allocations.
> > > > > >
> > > > > >   crash> p memcg_nr_cache_ids
> > > > > >   memcg_nr_cache_ids = $2 = 24574
> > > > > >
> > > > > > memcg_nr_cache_ids is very large and memory consumption of each list_lru
> > > > > > can be calculated with the following formula.
> > > > > >
> > > > > >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> > > > > >
> > > > > > There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> > > > > >
> > > > > >   crash> list super_blocks | wc -l
> > > > > >   952
> > > > >
> > > > > The more I see people trying to work around this, the more I think
> > > > > that the way memcgs have been grafted into the list_lru is back to
> > > > > front.
> > > > >
> > > > > > We currently allocate scope for every memcg to be able to be tracked on
> > > > > > every node on every superblock instantiated in the system, regardless
> > > > > > of whether that superblock is even accessible to that memcg.
> > > > > >
> > > > > > These huge memcg counts come from container hosts where memcgs are
> > > > > > confined to just a small subset of the total number of superblocks
> > > > > > that are instantiated at any given point in time.
> > > > >
> > > > > IOWs, for these systems with huge container counts, list_lru does
> > > > > not need the capability of tracking every memcg on every superblock.
> > > > >
> > > > > > What it comes down to is that the list_lru is only needed for a
> > > > > > given memcg if that memcg is instantiating and freeing objects on a
> > > > > > given list_lru.
> > > > > >
> > > > > > Which makes me think we should be moving more towards "add the memcg
> > > > > > to the list_lru at the first insert" model rather than "instantiate
> > > > > > all at memcg init time just in case". The model we originally came
> > > > > > up with for supporting memcgs is really starting to show its limits,
> > > > > > and we should address those limitations rather than hack more
> > > > > > complexity into the system that does nothing to remove the
> > > > > > limitations that are causing the problems in the first place.
> > > >
> > > > I totally agree.
> > > >
> > > > It looks like the initial implementation of the whole kernel memory accounting
> > > > and memcg-aware shrinkers was based on the idea that the number of memory
> > > > cgroups is relatively small and stable.
> > >
> > > > Yes, that was one of the original assumptions - tens to maybe low
> > > > hundreds of memcgs at most. The other was that memcgs weren't NUMA
> > > > aware, and so would only need a single LRU list per memcg. Hence,
> > > > even with "lots" of memcgs and superblocks, the total overhead
> > > > wasn't that great.
> > > >
> > > > Then came "memcgs need to be NUMA aware" because of the size of the
> > > > machines they were being used for resource management in, and that
> > > > greatly increased the per-memcg, per LRU overhead. Now we're talking
> > > > about needing to support a couple of orders of magnitude more memcgs
> > > > and superblocks than were originally designed for.
> > >
> > > So, really, we're way beyond the original design scope of this
> > > subsystem now.
> >
> > Got it. So it is better to allocate the structure of the list_lru_node
> > dynamically. We should only allocate it when it is really demanded.
> > > But allocating memory by using GFP_ATOMIC in list_lru_add() is
> > > not a good idea. So we should allocate the memory outside of
> > > list_lru_add(). I can propose an approach that may work.
> > >
> > > Before we start, we should know about the following rules of list lrus.
> >
> > - Only objects allocated with __GFP_ACCOUNT need to allocate
> >   the struct list_lru_node.
>
> This seems .... misguided. inode and dentry caches are already
> marked as accounted, so individual calls to allocate from these
> slabs do not need this annotation.

Sorry for the confusion. You are right.

>
> > - The caller allocating the memory must know which list_lru the
> >   object will be inserted into.
> >
> > So we can allocate struct list_lru_node when allocating the
> > object instead of allocating it in list_lru_add().  It is easy, because
> > we already know the list_lru and memcg which the object belongs
> > to. So we can introduce a new helper to allocate the object and
> > list_lru_node. Like below.
> >
> > void *list_lru_kmem_cache_alloc(struct list_lru *lru, struct kmem_cache *s,
> >                                 gfp_t gfpflags)
> > {
> >         void *ret = kmem_cache_alloc(s, gfpflags);
> >
> >         if (ret && (gfpflags & __GFP_ACCOUNT)) {
> >                 struct mem_cgroup *memcg = mem_cgroup_from_obj(ret);
> >
> >                 if (mem_cgroup_is_root(memcg))
> >                         return ret;
> >
> >                 /* Allocate per-memcg list_lru_node; do nothing if it is
> >                  * already allocated. */
> >                 memcg_list_lru_node_alloc(lru, memcg,
> >                                           page_to_nid(virt_to_page(ret)),
> >                                           gfpflags);
>
> If we are allowing kmem_cache_alloc() to fail, then we can allow
> memcg_list_lru_node_alloc() to fail, too.
>
> Also, why put this outside kmem_cache_alloc()? Node id and memcg are
> already known internally to kmem_cache_alloc() when allocating from
> a slab, so why not associate the slab allocation with the LRU
> directly when doing the memcg accounting and so avoid doing costly
> duplicate work on every allocation?
>
> i.e. the list-lru was moved inside the mm/ dir because "it's a mm
> specific construct only", so why not actually make use of that
> designation to internalise this entire memcg management issue into
> the slab allocation routines? i.e.  an API like

Yeah, we can.

> kmem_cache_alloc_lru(cache, lru, gfpflags) allows this to be
> completely internalised and efficiently implemented with minimal
> change to callers. It also means that memory allocation callers
> don't need to know anything about memcg management, which is always
> a win....

Great idea. It's efficient. I'll give it a try.
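
To make sure we mean the same thing, here is a minimal userspace sketch of
the "allocate the per-memcg list when the object is allocated" idea. It is
only a model, not kernel code: every name prefixed with model_ or MODEL_ is
hypothetical, malloc()/calloc() stand in for the slab allocator, and a
negative memcg index is used here to stand for the root memcg.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_NR_NODES  4       /* hypothetical NUMA node count */
#define MODEL_NR_MEMCGS 64      /* hypothetical upper bound on memcg ids */

/* Stand-in for a per-memcg, per-node LRU list head. */
struct model_lru_one {
        long nr_items;
};

/* Stand-in for struct list_lru: slots stay NULL until first use. */
struct model_list_lru {
        struct model_lru_one *one[MODEL_NR_NODES][MODEL_NR_MEMCGS];
        unsigned long nr_allocated;     /* number of populated slots */
};

/* Populate the (nid, memcg) slot if it is still empty. Returns 0 or -1. */
int model_lru_one_get(struct model_list_lru *lru, int nid, int memcg)
{
        assert(nid >= 0 && nid < MODEL_NR_NODES);
        assert(memcg >= 0 && memcg < MODEL_NR_MEMCGS);

        if (lru->one[nid][memcg])
                return 0;               /* already there, nothing to do */

        lru->one[nid][memcg] = calloc(1, sizeof(struct model_lru_one));
        if (!lru->one[nid][memcg])
                return -1;
        lru->nr_allocated++;
        return 0;
}

/*
 * Model of an "allocate an object and bind it to an LRU" call.
 * memcg < 0 stands for the root memcg, which needs no per-memcg list.
 */
void *model_cache_alloc_lru(struct model_list_lru *lru, size_t size,
                            int nid, int memcg)
{
        void *obj = malloc(size);

        if (!obj)
                return NULL;
        if (memcg >= 0 && model_lru_one_get(lru, nid, memcg) < 0) {
                free(obj);              /* the list allocation may fail too */
                return NULL;
        }
        return obj;
}

/* Release every populated slot and reset the counter. */
void model_list_lru_destroy(struct model_list_lru *lru)
{
        for (int nid = 0; nid < MODEL_NR_NODES; nid++) {
                for (int memcg = 0; memcg < MODEL_NR_MEMCGS; memcg++) {
                        free(lru->one[nid][memcg]);
                        lru->one[nid][memcg] = NULL;
                }
        }
        lru->nr_allocated = 0;
}
```

In this model only the (node, memcg) pairs that actually allocate objects
ever get a list, and a failed list allocation is reported to the caller
the same way a failed object allocation is.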

>
> >         }
> >
> >         return ret;
> > }
> >
> > If the user wants to insert the allocated object into its lru list in
> > the future, the user should use list_lru_kmem_cache_alloc() instead of
> > kmem_cache_alloc().
> > I have looked at the code closely. There are 3 different kmem_caches that
> > need to use this new API to allocate memory. They are inode_cachep,
> > dentry_cache and radix_tree_node_cachep. I think that it is easy to migrate.
>
> It might work, but I think you may have overlooked the complexity
> of inode allocation for filesystems. i.e.  alloc_inode() calls out
> to filesystem allocation functions more often than it allocates
> directly from the inode_cachep.  i.e.  Most filesystems provide
> their own ->alloc_inode superblock operation, and they allocate
> inodes out of their own specific slab caches, not the inode_cachep.

I didn't realize this before. You are right. Most filesystems
have their own kmem_cache instead of using inode_cachep, so a lot
of filesystem-specific allocation code would need to be changed.
Thanks for the reminder.
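
The dispatch you describe can be modelled in userspace roughly as below.
This is only a sketch to illustrate why a wrapper around the generic cache
misses most inodes; all model_* names are hypothetical, and calloc() stands
in for both the generic inode cache and the filesystem-private cache.

```c
#include <assert.h>
#include <stdlib.h>

struct model_inode {
        int from_fs_cache;      /* 1 if the filesystem allocated it itself */
};

struct model_super_block;

struct model_super_ops {
        /* Optional hook; NULL means "use the generic inode cache". */
        struct model_inode *(*alloc_inode)(struct model_super_block *sb);
};

struct model_super_block {
        const struct model_super_ops *s_op;
};

/* Generic path: the only allocation a generic-cache wrapper would see. */
struct model_inode *model_generic_alloc(void)
{
        return calloc(1, sizeof(struct model_inode));
}

/*
 * Filesystem-private inode with the VFS inode embedded. It is kept as the
 * first member so that the returned pointer can be passed to free().
 */
struct model_fs_inode {
        struct model_inode vfs_inode;
        long fs_private;
};

/* Filesystem-private path: allocates from the filesystem's own "cache". */
struct model_inode *model_fs_alloc_inode(struct model_super_block *sb)
{
        struct model_fs_inode *fi = calloc(1, sizeof(*fi));

        (void)sb;
        if (!fi)
                return NULL;
        fi->vfs_inode.from_fs_cache = 1;
        return &fi->vfs_inode;
}

/* Use the filesystem hook when present, else fall back to the generic path. */
struct model_inode *model_alloc_inode(struct model_super_block *sb)
{
        assert(sb && sb->s_op);
        if (sb->s_op->alloc_inode)
                return sb->s_op->alloc_inode(sb);
        return model_generic_alloc();
}
```

Any filesystem that sets the hook bypasses the generic path entirely, so
each such hook would have to be converted individually.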

>
> And then you have filesystems like XFS, where alloc_inode() will
> never be called, and which implement ->alloc_inode as:
>
> /* Catch misguided souls that try to use this interface on XFS */
> STATIC struct inode *
> xfs_fs_alloc_inode(
>         struct super_block      *sb)
> {
>         BUG();
>         return NULL;
> }
>
> Because all the inode caching and allocation is internal to XFS and
> VFS inode management interfaces are not used.
>
> So I suspect that an external wrapper function is not the way to go
> here - either internalising the LRU management into the slab
> allocation or adding the memcg code to alloc_inode() and filesystem
> specific routines would make a lot more sense to me.

Sure. If we introduce kmem_cache_alloc_lru(), all filesystems
will need to migrate to it. I cannot figure out an approach
that does not require changing filesystem code.

Thanks.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
