From mboxrd@z Thu Jan 1 00:00:00 1970
From: Muchun Song
Date: Sat, 18 Sep 2021 15:59:23 +0800
Subject: Re: [PATCH v3 00/76] Optimize list lru memory consumption
To: Kari Argillander
Cc: Matthew Wilcox, Andrew Morton, Johannes Weiner, Michal Hocko,
 Vladimir Davydov, Shakeel Butt, Roman Gushchin, Yang Shi, Alex Shi,
 Wei Yang, Dave Chinner, trond.myklebust@hammerspace.com,
 anna.schumaker@netapp.com, linux-fsdevel, LKML,
 Linux Memory Management List, linux-nfs@vger.kernel.org, Qi Zheng,
 Xiongchun duan, fam.zheng@bytedance.com, Muchun Song
In-Reply-To: <20210918065624.dbaar4lss5olrfhu@kari-VirtualBox>
References: <20210914072938.6440-1-songmuchun@bytedance.com>
 <20210918065624.dbaar4lss5olrfhu@kari-VirtualBox>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Sep 18, 2021 at 2:56 PM Kari Argillander wrote:
>
> On Tue, Sep 14, 2021 at 03:28:22PM +0800, Muchun Song wrote:
> > We introduced alloc_inode_sb() in the previous version 2, which sets
> > up the inode reclaim context properly, to allocate filesystem-specific
> > inodes. So we have to convert all filesystems to the new API, which
> > was done in one patch. Some filesystems are easy to convert (just
> > replace kmem_cache_alloc() with alloc_inode_sb()), while other
> > filesystems need to do more work. In order to make it easy for the
> > maintainers of different filesystems to review their own part, I
> > split that patch into per-filesystem patches in this version. I am
> > not sure if this is a good idea, because there are going to be more
> > commits.
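
As an illustration of the conversion described above, a minimal sketch
of how a filesystem's ->alloc_inode() changes. The foo_* names are made
up, and the alloc_inode_sb(sb, cache, gfp) signature is assumed from the
description of patch 9, so treat this as a sketch rather than the actual
patch:

/* Before: the allocation knows nothing about the superblock. */
static struct inode *foo_alloc_inode(struct super_block *sb)
{
        struct foo_inode_info *fi;

        fi = kmem_cache_alloc(foo_inode_cachep, GFP_NOFS);
        if (!fi)
                return NULL;
        return &fi->vfs_inode;
}

/*
 * After: alloc_inode_sb() takes the superblock, so the per-memcg list
 * of sb->s_inode_lru can be set up when the first object is allocated
 * instead of for every memcg up front.
 */
static struct inode *foo_alloc_inode(struct super_block *sb)
{
        struct foo_inode_info *fi;

        fi = alloc_inode_sb(sb, foo_inode_cachep, GFP_NOFS);
        if (!fi)
                return NULL;
        return &fi->vfs_inode;
}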
> >
> > On one of our servers, we found a suspected memory leak: the
> > kmalloc-32 slab cache consumes more than 6GB of memory, while the
> > other kmem_caches consume less than 2GB of memory.
> >
> > After an in-depth analysis, it turned out that the kmalloc-32
> > consumption comes from list_lru_one allocations.
> >
> > crash> p memcg_nr_cache_ids
> > memcg_nr_cache_ids = $2 = 24574
> >
> > memcg_nr_cache_ids is very large, and the memory consumption of each
> > list_lru can be calculated with the following formula:
> >
> >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> >
> > There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.
> >
> > crash> list super_blocks | wc -l
> > 952
> >
> > Every mount registers 2 list_lrus, one for inodes and one for
> > dentries. There are 952 super_blocks, so the total memory is
> > 952 * 2 * 3 MB (~5.6GB). But the number of memory cgroups is now
> > less than 500, so I guess more than 12286 memory cgroups have been
> > created on this machine (I do not know why there are so many; it may
> > be a user bug, or the user may really want to do that). Because
> > memcg_nr_cache_ids has not been reduced to a suitable value, a lot
> > of memory is wasted. If we want to reduce memcg_nr_cache_ids, we
> > have to *reboot* the server, which is not what we want.
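
For reference, the arithmetic above can be checked with a trivial
userspace snippet; the constants are simply the values reported above,
nothing here is measured or taken from the kernel:

#include <stdio.h>

int main(void)
{
        /* Values quoted in the report above. */
        unsigned long long nr_nodes = 4;        /* NUMA nodes */
        unsigned long long nr_ids = 24574;      /* memcg_nr_cache_ids */
        unsigned long long obj_size = 32;       /* kmalloc-32 object */
        unsigned long long nr_sb = 952;         /* super_blocks */
        unsigned long long lrus_per_sb = 2;     /* inode lru + dentry lru */

        unsigned long long per_lru = nr_nodes * nr_ids * obj_size;
        unsigned long long total = nr_sb * lrus_per_sb * per_lru;

        printf("per list_lru: %.1f MB\n", per_lru / (1024.0 * 1024));
        printf("total:        %.1f GB\n", total / (1024.0 * 1024 * 1024));
        return 0;
}

This reproduces the ~3MB per list_lru and ~5.6GB total figures quoted
above.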
> >
> > In order to reduce memcg_nr_cache_ids, I had posted a patchset [1]
> > to do this, but it did not fundamentally solve the problem.
> >
> > We currently allocate scope for every memcg to be able to be tracked
> > on every superblock instantiated in the system, regardless of
> > whether that superblock is even accessible to that memcg.
> >
> > These huge memcg counts come from container hosts where memcgs are
> > confined to just a small subset of the total number of superblocks
> > that are instantiated at any given point in time.
> >
> > For these systems with huge container counts, list_lru does not need
> > the capability of tracking every memcg on every superblock.
> >
> > What it comes down to is that the list_lru is only needed for a
> > given memcg if that memcg is instantiating and freeing objects on a
> > given list_lru.
> >
> > As Dave said, "Which makes me think we should be moving more towards
> > 'add the memcg to the list_lru at the first insert' model rather
> > than 'instantiate all at memcg init time just in case'."
> >
> > This patchset aims to optimize the list_lru memory consumption from
> > different aspects.
> >
> > Patches 1-6 are code simplifications.
> > Patch 7 converts the array from per-memcg per-node to per-memcg.
> > Patch 8 introduces kmem_cache_alloc_lru().
> > Patch 9 introduces alloc_inode_sb().
> > Patches 10-66 convert all filesystems to alloc_inode_sb().
>
> Nowadays there is also ntfs3. If you do not plan to convert it, please
> CC me at least so that I can do it when this lands.
>
>   Argillander
>

Wow, a new filesystem. I didn't notice it before. I'll cover it in the
next version and Cc you so that you can do a review. Thanks for the
reminder.
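
For readers not following the series: based on the cover letter above,
the helpers introduced in patches 8 and 9 presumably have roughly the
following shape. This is an inference from the description, not the
actual patches, so names and signatures may differ:

/*
 * Sketch of the assumed interface: allocate from @cache and make sure
 * the current memcg has a list_lru_one in @lru, so per-memcg lists are
 * created at first use rather than for every memcg at memcg init time.
 */
void *kmem_cache_alloc_lru(struct kmem_cache *cache,
                           struct list_lru *lru, gfp_t gfp);

/*
 * Inode-specific wrapper that ties the allocation to the superblock's
 * inode LRU.
 */
static inline void *alloc_inode_sb(struct super_block *sb,
                                   struct kmem_cache *cache, gfp_t gfp)
{
        return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
}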