From: Muchun Song
Date: Sat, 18 Sep 2021 15:59:23 +0800
Subject: Re: [PATCH v3 00/76] Optimize list lru memory consumption
To: Kari Argillander
Cc: Matthew Wilcox, Andrew Morton, Johannes Weiner, Michal Hocko,
    Vladimir Davydov, Shakeel Butt, Roman Gushchin, Yang Shi, Alex Shi,
    Wei Yang, Dave Chinner, trond.myklebust@hammerspace.com,
    anna.schumaker@netapp.com, linux-fsdevel, LKML,
    Linux Memory Management List, linux-nfs@vger.kernel.org, Qi Zheng,
    Xiongchun duan, fam.zheng@bytedance.com, Muchun Song

On Sat, Sep 18, 2021 at 2:56 PM Kari Argillander wrote:
>
> On Tue, Sep 14, 2021 at 03:28:22PM +0800, Muchun Song wrote:
> > We introduced alloc_inode_sb() in the previous version (v2), which sets
> > up the inode reclaim context
> > properly, to allocate filesystem-specific inodes.
> > So we have to convert all filesystems to the new API, which was done in
> > one patch. Some filesystems are easy to convert (just replace
> > kmem_cache_alloc() with alloc_inode_sb()), while other filesystems need
> > more work. To make it easy for the maintainers of different
> > filesystems to review their own parts, I split the patch into
> > per-filesystem patches in this version. I am not sure if this
> > is a good idea, because there are going to be more commits.
> >
> > On our server, we found a suspected memory leak. The kmalloc-32 cache
> > consumes more than 6GB of memory. Other kmem_caches consume less than
> > 2GB of memory.
> >
> > After in-depth analysis, we found that the memory consumption of the
> > kmalloc-32 slab cache is caused by list_lru_one allocations.
> >
> > crash> p memcg_nr_cache_ids
> > memcg_nr_cache_ids = $2 = 24574
> >
> > memcg_nr_cache_ids is very large, and the memory consumption of each
> > list_lru can be calculated with the following formula.
> >
> > num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> >
> > There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.
> >
> > crash> list super_blocks | wc -l
> > 952
> >
> > Every mount registers 2 list lrus: one for inodes, another for
> > dentries. There are 952 super_blocks, so the total memory is 952 * 2 * 3
> > MB (~5.6GB). But the number of memory cgroups is now less than 500. So I
> > guess more than 12286 memory cgroups have been created on this machine (I
> > do not know why there are so many cgroups; it may be a user's bug, or
> > the user really wants to do that). Because memcg_nr_cache_ids has not
> > been reduced to a suitable value, a lot of memory is wasted. If we want
> > to reduce memcg_nr_cache_ids, we have to *reboot* the server. This is not
> > what we want.
> >
> > In order to reduce memcg_nr_cache_ids, I had posted a patchset [1] to do
> > this.
> > But this did not fundamentally solve the problem.
> >
> > We currently allocate scope for every memcg to be able to be tracked on
> > every superblock instantiated in the system, regardless of whether that
> > superblock is even accessible to that memcg.
> >
> > These huge memcg counts come from container hosts where memcgs are
> > confined to just a small subset of the total number of superblocks that
> > are instantiated at any given point in time.
> >
> > For these systems with huge container counts, list_lru does not need the
> > capability of tracking every memcg on every superblock.
> >
> > What it comes down to is that the list_lru is only needed for a given
> > memcg if that memcg is instantiating and freeing objects on a given
> > list_lru.
> >
> > As Dave said, "Which makes me think we should be moving more towards
> > 'add the memcg to the list_lru at the first insert' model rather than
> > 'instantiate all at memcg init time just in case'."
> >
> > This patchset aims to optimize the list lru memory consumption from
> > different aspects.
> >
> > Patch 1-6 are code simplification.
> > Patch 7 converts the array from per-memcg per-node to per-memcg
> > Patch 8 introduces kmem_cache_alloc_lru()
> > Patch 9 introduces alloc_inode_sb()
> > Patch 10-66 convert all filesystems to alloc_inode_sb() respectively.
>
> There is nowadays also ntfs3. If you do not plan to convert it, please
> CC me at least so that I can do it when these land.
>
> Argillander
>

Wow, a new filesystem. I didn't notice it before. I'll cover it
in the next version and Cc you if you can do a review.

Thanks for the reminder.
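
For anyone who wants to double-check the figures in the quoted cover letter,
here is a small sketch of the arithmetic. It only plugs in the numbers quoted
above (4 NUMA nodes, memcg_nr_cache_ids = 24574, 32-byte kmalloc-32 objects,
952 super_blocks, 2 list lrus per mount). The function and constant names are
made up for illustration and do not correspond to any kernel symbol.

```python
# Back-of-the-envelope check of the list_lru memory figures quoted above.
# All names here are hypothetical; only the numeric inputs come from the thread.

NUM_NUMA_NODES = 4           # "There are 4 NUMA nodes in our system"
MEMCG_NR_CACHE_IDS = 24574   # value printed by crash
SLOT_BYTES = 32              # one kmalloc-32 object per (node, memcg id) slot
NUM_SUPER_BLOCKS = 952       # "list super_blocks | wc -l"
LRUS_PER_SUPER_BLOCK = 2     # one list_lru for inodes, one for dentries


def list_lru_bytes(nodes: int, cache_ids: int, slot_bytes: int = SLOT_BYTES) -> int:
    """num_numa_node * memcg_nr_cache_ids * 32, as in the quoted formula."""
    return nodes * cache_ids * slot_bytes


per_lru = list_lru_bytes(NUM_NUMA_NODES, MEMCG_NR_CACHE_IDS)
total = NUM_SUPER_BLOCKS * LRUS_PER_SUPER_BLOCK * per_lru

MIB = 1024 ** 2
GIB = 1024 ** 3

if __name__ == "__main__":
    print(f"per list_lru: {per_lru} bytes ({per_lru / MIB:.2f} MiB)")
    print(f"all list_lrus: {total} bytes ({total / GIB:.2f} GiB)")
```

The result lands at roughly 3 MiB per list_lru and a bit under 5.6 GiB in
total, which is consistent with the "~3MB" and "~5.6GB" estimates in the
cover letter.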