From mboxrd@z Thu Jan 1 00:00:00 1970
References: <20210428094949.43579-1-songmuchun@bytedance.com>
 <20210430004903.GF1872259@dread.disaster.area>
 <20210430032739.GG1872259@dread.disaster.area>
 <20210502235843.GJ1872259@dread.disaster.area>
In-Reply-To: <20210502235843.GJ1872259@dread.disaster.area>
From: Muchun Song
Date: Mon, 3 May 2021 14:33:21 +0800
Subject: Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
To: Dave Chinner
Cc: Roman Gushchin, Matthew Wilcox, Andrew Morton, Johannes Weiner,
 Michal Hocko, Vladimir Davydov, Shakeel Butt, Yang Shi, alexs@kernel.org,
 Alexander Duyck, Wei Yang, linux-fsdevel, LKML,
 Linux Memory Management List
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, May 3, 2021 at 7:58 AM Dave Chinner wrote:
>
> On Fri, Apr 30, 2021 at 04:32:39PM +0800, Muchun Song wrote:
> > On Fri, Apr 30, 2021 at 11:27 AM Dave Chinner wrote:
> > >
> > > On Thu, Apr 29, 2021 at 06:39:40PM -0700, Roman Gushchin wrote:
> > > > On Fri, Apr 30, 2021 at
10:49:03AM +1000, Dave Chinner wrote:
> > > > > On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > > > > > On our server, we found a suspected memory leak problem. The kmalloc-32
> > > > > > slab cache consumes more than 6GB of memory. Other kmem_caches consume
> > > > > > less than 2GB of memory.
> > > > > >
> > > > > > After our in-depth analysis, we found that the memory consumption of the
> > > > > > kmalloc-32 slab cache is caused by list_lru_one allocations.
> > > > > >
> > > > > >   crash> p memcg_nr_cache_ids
> > > > > >   memcg_nr_cache_ids = $2 = 24574
> > > > > >
> > > > > > memcg_nr_cache_ids is very large and the memory consumption of each list_lru
> > > > > > can be calculated with the following formula.
> > > > > >
> > > > > >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> > > > > >
> > > > > > There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> > > > > >
> > > > > >   crash> list super_blocks | wc -l
> > > > > >   952
> > > > >
> > > > > The more I see people trying to work around this, the more I think
> > > > > that the way memcgs have been grafted into the list_lru is back to
> > > > > front.
> > > > >
> > > > > We currently allocate scope for every memcg to be able to be tracked on
> > > > > every node on every superblock instantiated in the system, regardless
> > > > > of whether that superblock is even accessible to that memcg.
> > > > >
> > > > > These huge memcg counts come from container hosts where memcgs are
> > > > > confined to just a small subset of the total number of superblocks
> > > > > that are instantiated at any given point in time.
> > > > >
> > > > > IOWs, for these systems with huge container counts, list_lru does
> > > > > not need the capability of tracking every memcg on every superblock.
> > > > >
> > > > > What it comes down to is that the list_lru is only needed for a
> > > > > given memcg if that memcg is instantiating and freeing objects on a
> > > > > given list_lru.
> > > > >
> > > > > Which makes me think we should be moving more towards "add the memcg
> > > > > to the list_lru at the first insert" model rather than "instantiate
> > > > > all at memcg init time just in case". The model we originally came
> > > > > up with for supporting memcgs is really starting to show its limits,
> > > > > and we should address those limitations rather than hack more
> > > > > complexity into the system that does nothing to remove the
> > > > > limitations that are causing the problems in the first place.
> > > >
> > > > I totally agree.
> > > >
> > > > It looks like the initial implementation of the whole kernel memory accounting
> > > > and memcg-aware shrinkers was based on the idea that the number of memory
> > > > cgroups is relatively small and stable.
> > >
> > > Yes, that was one of the original assumptions - tens to maybe low
> > > hundreds of memcgs at most. The other was that memcgs weren't NUMA
> > > aware, and so would only need a single LRU list per memcg. Hence,
> > > even with "lots" of memcgs and superblocks, the total overhead
> > > wasn't that great.
> > >
> > > Then came "memcgs need to be NUMA aware" because of the size of the
> > > machines they were being used for resource management in, and that
> > > greatly increased the per-memcg, per LRU overhead. Now we're talking
> > > about needing to support a couple of orders of magnitude more memcgs
> > > and superblocks than were originally designed for.
> > >
> > > So, really, we're way beyond the original design scope of this
> > > subsystem now.
> >
> > Got it. So it is better to allocate the structure of the list_lru_node
> > dynamically. We should only allocate it when it is really needed.
> > But allocating memory by using GFP_ATOMIC in list_lru_add() is
> > not a good idea. So we should allocate the memory outside of
> > list_lru_add(). I can propose an approach that may work.
> >
> > Before we start, we should know about the following rules of list lrus.
> >
> > - Only objects allocated with __GFP_ACCOUNT need to allocate
> >   the struct list_lru_node.
>
> This seems .... misguided. inode and dentry caches are already
> marked as accounted, so individual calls to allocate from these
> slabs do not need this annotation.

Sorry for the confusion. You are right.

>
> > - The caller allocating memory must know which list_lru the
> >   object will be inserted into.
> >
> > So we can allocate struct list_lru_node when allocating the
> > object instead of allocating it in list_lru_add(). It is easy, because
> > we already know the list_lru and memcg which the object belongs
> > to. So we can introduce a new helper to allocate the object and
> > list_lru_node. Like below.
> >
> > void *list_lru_kmem_cache_alloc(struct list_lru *lru, struct kmem_cache *s,
> >                                 gfp_t gfpflags)
> > {
> >         void *ret = kmem_cache_alloc(s, gfpflags);
> >
> >         if (ret && (gfpflags & __GFP_ACCOUNT)) {
> >                 struct mem_cgroup *memcg = mem_cgroup_from_obj(ret);
> >
> >                 if (mem_cgroup_is_root(memcg))
> >                         return ret;
> >
> >                 /* Allocate per-memcg list_lru_node, if it is already
> >                    allocated, do nothing. */
> >                 memcg_list_lru_node_alloc(lru, memcg,
> >                         page_to_nid(virt_to_page(ret)), gfpflags);
>
> If we are allowing kmem_cache_alloc() to fail, then we can allow
> memcg_list_lru_node_alloc() to fail, too.
>
> Also, why put this outside kmem_cache_alloc()? Node id and memcg are
> already known internally to kmem_cache_alloc() when allocating from
> a slab, so why not associate the slab allocation with the LRU
> directly when doing the memcg accounting and so avoid doing costly
> duplicate work on every allocation?
>
> i.e. the list-lru was moved inside the mm/ dir because "it's a mm
> specific construct only", so why not actually make use of that
> designation to internalise this entire memcg management issue into
> the slab allocation routines? i.e. an API like

Yeah, we can.

> kmem_cache_alloc_lru(cache, lru, gfpflags) allows this to be
> completely internalised and efficiently implemented with minimal
> change to callers. It also means that memory allocation callers
> don't need to know anything about memcg management, which is always
> a win....

Great idea. It's efficient. I'll give it a try.

>
> >         }
> >
> >         return ret;
> > }
> >
> > If the user wants to insert the allocated object into its lru list in
> > the future, the user should use list_lru_kmem_cache_alloc() instead of
> > kmem_cache_alloc().
> > I have looked at the code closely. There are 3 different kmem_caches that
> > need to use this new API to allocate memory. They are inode_cachep,
> > dentry_cache and radix_tree_node_cachep. I think that it is easy to migrate.
>
> It might work, but I think you may have overlooked the complexity
> of inode allocation for filesystems. i.e. alloc_inode() calls out
> to filesystem allocation functions more often than it allocates
> directly from the inode_cachep. i.e. Most filesystems provide
> their own ->alloc_inode superblock operation, and they allocate
> inodes out of their own specific slab caches, not the inode_cachep.

I didn't realize this before. You are right. Most filesystems
have their own kmem_cache instead of inode_cachep.
A lot of filesystem-specific code would need to be changed.
Thanks for your reminder.

>
> And then you have filesystems like XFS, where alloc_inode() will
> never be called, and implement ->alloc_inode as:
>
> /* Catch misguided souls that try to use this interface on XFS */
> STATIC struct inode *
> xfs_fs_alloc_inode(
>         struct super_block *sb)
> {
>         BUG();
>         return NULL;
> }
>
> Because all the inode caching and allocation is internal to XFS and
> VFS inode management interfaces are not used.
>
> So I suspect that an external wrapper function is not the way to go
> here - either internalising the LRU management into the slab
> allocation or adding the memcg code to alloc_inode() and filesystem
> specific routines would make a lot more sense to me.

Sure. If we introduce kmem_cache_alloc_lru, all filesystems
need to migrate to kmem_cache_alloc_lru. I cannot figure out
an approach that does not need to change filesystems code.

Thanks.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
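
For reference, the footprint formula quoted at the top of the thread can be
checked with a few lines of userspace C. This is only a back-of-the-envelope
sketch, not kernel code: both function names are made up for illustration,
the 32-byte slot size is the kmalloc-32 object size from the report, and the
factor of two list_lrus per superblock (one for dentries, one for inodes) is
an assumption that is not stated in the quoted text.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel API: bytes used by one memcg-aware
 * list_lru under the formula quoted in the thread,
 *   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32).
 */
static uint64_t list_lru_bytes(uint64_t nr_nodes, uint64_t nr_cache_ids,
                               uint64_t slot_size)
{
        return nr_nodes * nr_cache_ids * slot_size;
}

/*
 * Hypothetical helper: total across all superblocks, assuming every
 * superblock registers lrus_per_sb list_lrus (assumed to be 2 here,
 * one for dentries and one for inodes).
 */
static uint64_t total_list_lru_bytes(uint64_t nr_sb, uint64_t lrus_per_sb,
                                     uint64_t per_lru_bytes)
{
        return nr_sb * lrus_per_sb * per_lru_bytes;
}
```

With the numbers from the report, list_lru_bytes(4, 24574, 32) gives
3145472 bytes (~3MB per list_lru), and total_list_lru_bytes(952, 2, 3145472)
gives 5988978688 bytes (roughly 5.6GiB), which is in the same range as the
reported kmalloc-32 usage if the two-lrus-per-superblock assumption holds.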