linux-fsdevel.vger.kernel.org archive mirror
From: Muchun Song <songmuchun@bytedance.com>
To: willy@infradead.org, akpm@linux-foundation.org,
	hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com,
	shakeelb@google.com, guro@fb.com, shy828301@gmail.com,
	alexs@kernel.org, richard.weiyang@gmail.com, david@fromorbit.com,
	trond.myklebust@hammerspace.com, anna.schumaker@netapp.com
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-nfs@vger.kernel.org,
	zhengqi.arch@bytedance.com, duanxiongchun@bytedance.com,
	fam.zheng@bytedance.com, Muchun Song <songmuchun@bytedance.com>
Subject: [PATCH 00/17] Optimize list lru memory consumption
Date: Tue, 11 May 2021 18:46:30 +0800
Message-ID: <20210511104647.604-1-songmuchun@bytedance.com>

On one of our servers, we found a suspected memory leak. The kmalloc-32
slab cache consumes more than 6GB of memory. Other kmem_caches consume less
than 2GB of memory.

After in-depth analysis, we found that the memory consumption of the
kmalloc-32 slab cache is caused by list_lru_one allocations.

  crash> p memcg_nr_cache_ids
  memcg_nr_cache_ids = $2 = 24574

memcg_nr_cache_ids is very large, and the memory consumption of each
list_lru can be calculated with the following formula.

  num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)

There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.

  crash> list super_blocks | wc -l
  952

Every mount registers 2 list_lrus, one for inodes and one for dentries.
There are 952 super_blocks, so the total memory is 952 * 2 * 3 MB
(~5.6GB). However, there are currently fewer than 500 memory cgroups, so I
guess that more than 12286 memory cgroups were created on this machine at
some point (I do not know why there were so many cgroups; it may be a
user's bug, or the user may really want to do that). Because
memcg_nr_cache_ids has not been reduced to a suitable value since then, a
lot of memory can be wasted. If we want to reduce memcg_nr_cache_ids, we
have to reboot the server, which is not what we want.
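
As a rough cross-check of the numbers above, the formula can be written as
two small userspace helpers. This is only a sketch of the arithmetic, not
kernel code; the function names are made up for illustration, and the
inputs are the values reported by crash above.

```c
#include <stdint.h>

/* Object size of the kmalloc-32 slab cache, in bytes. */
#define KMALLOC_32_BYTES 32ULL

/*
 * Bytes consumed by one list_lru according to the formula
 * num_numa_node * memcg_nr_cache_ids * 32.
 * (Hypothetical helper name, for illustration only.)
 */
uint64_t list_lru_bytes(uint64_t num_numa_nodes, uint64_t memcg_nr_cache_ids)
{
	return num_numa_nodes * memcg_nr_cache_ids * KMALLOC_32_BYTES;
}

/*
 * Every mount registers two list_lrus (inode and dentry), so the
 * total is twice the per-list_lru cost for each super_block.
 */
uint64_t total_list_lru_bytes(uint64_t nr_super_blocks,
			      uint64_t num_numa_nodes,
			      uint64_t memcg_nr_cache_ids)
{
	return nr_super_blocks * 2 *
	       list_lru_bytes(num_numa_nodes, memcg_nr_cache_ids);
}
```

With 4 nodes and memcg_nr_cache_ids = 24574, list_lru_bytes() gives
3145472 bytes (~3MB), and total_list_lru_bytes() for 952 super_blocks
gives 5988978688 bytes (~5.6GB), which matches the estimate above.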

I posted a patchset [1] to reduce memcg_nr_cache_ids, but it did not
fundamentally solve the problem.

We currently allocate scope for every memcg to be able to be tracked on
every superblock instantiated in the system, regardless of whether that
superblock is even accessible to that memcg.

These huge memcg counts come from container hosts where memcgs are confined
to just a small subset of the total number of superblocks that are
instantiated at any given point in time.

For these systems with huge container counts, list_lru does not need the
capability of tracking every memcg on every superblock.

What it comes down to is that the list_lru is only needed for a given memcg
if that memcg is instantiating and freeing objects on a given list_lru.

As Dave said, "Which makes me think we should be moving more towards 'add the
memcg to the list_lru at the first insert' model rather than 'instantiate
all at memcg init time just in case'."
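
To make the "add at the first insert" model concrete, here is a minimal
userspace sketch. It is not the code from this series: struct lazy_lru,
struct lru_one, lazy_lru_add() and MAX_MEMCG_IDS are all hypothetical
names, and locking, NUMA nodes and reparenting are left out. It only shows
the idea that a per-memcg list is allocated when that memcg first inserts
an object, instead of for every memcg up front.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_MEMCG_IDS 1024	/* hypothetical size of the ID space */

/* Stand-in for a per-memcg list; only counts items here. */
struct lru_one {
	long nr_items;
};

struct lazy_lru {
	/* One slot per memcg ID; stays NULL until the first insert. */
	struct lru_one *per_memcg[MAX_MEMCG_IDS];
	size_t nr_allocated;	/* number of per-memcg lists in use */
};

/* Add one object for memcg_id, allocating its list on first use. */
int lazy_lru_add(struct lazy_lru *lru, unsigned int memcg_id)
{
	if (memcg_id >= MAX_MEMCG_IDS)
		return -1;

	if (!lru->per_memcg[memcg_id]) {
		struct lru_one *one = calloc(1, sizeof(*one));

		if (!one)
			return -1;
		lru->per_memcg[memcg_id] = one;
		lru->nr_allocated++;
	}

	lru->per_memcg[memcg_id]->nr_items++;
	return 0;
}

/* Free every per-memcg list that was ever allocated. */
void lazy_lru_destroy(struct lazy_lru *lru)
{
	for (size_t i = 0; i < MAX_MEMCG_IDS; i++) {
		free(lru->per_memcg[i]);
		lru->per_memcg[i] = NULL;
	}
	lru->nr_allocated = 0;
}
```

In this sketch, a memcg that never touches a given lazy_lru costs one NULL
pointer slot rather than a full list. The fixed pointer array still scales
with the size of the ID space, which is a separate cost.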

This patchset aims to optimize the list lru memory consumption from different
aspects.

Patches 1-6 are code simplifications.
Patch 7 converts the array from per-memcg per-node to per-memcg.
Patches 9-15 make list_lru allocation dynamic.
Patch 17 uses an xarray to optimize the size of the per-memcg pointer array.

I did a simple test to show the optimization. I created 10k memory cgroups
and mounted 10k filesystems on the system, then used the free command to
show how much memory the system consumed after this operation.

        +------------------------------------------------+
        |      condition        |   memory consumption   |
        +-----------------------+------------------------+
        | without this patchset |        24464 MB        |
        +-----------------------+------------------------+
        |     after patch 7     |        21957 MB        | <--------+
        +-----------------------+------------------------+          |
        |     after patch 15    |         6895 MB        |          |
        +-----------------------+------------------------+          |
        |     after patch 17    |         4367 MB        |          |
        +-----------------------+------------------------+          |
                                                                    |
        The more the number of nodes, the more obvious the effect---+

BTW, there was a recent discussion [2] on the same issue.

[1] https://lore.kernel.org/linux-fsdevel/20210428094949.43579-1-songmuchun@bytedance.com/
[2] https://lore.kernel.org/linux-fsdevel/20210405054848.GA1077931@in.ibm.com/

Muchun Song (17):
  mm: list_lru: fix list_lru_count_one() return value
  mm: memcontrol: remove kmemcg_id reparenting
  mm: memcontrol: remove the kmem states
  mm: memcontrol: move memcg_online_kmem() to mem_cgroup_css_online()
  mm: list_lru: remove holding lru node lock
  mm: list_lru: only add the memcg aware lrus to the list
  mm: list_lru: optimize the array of per memcg lists
  mm: list_lru: remove memcg_aware from struct list_lru
  mm: introduce kmem_cache_alloc_lru
  fs: introduce alloc_inode_sb() to allocate filesystems specific inode
  mm: dcache: use kmem_cache_alloc_lru() to allocate dentry
  xarray: replace kmem_cache_alloc with kmem_cache_alloc_lru
  mm: workingset: allocate list_lru on xa_node allocation
  nfs42: use a specific kmem_cache to allocate nfs4_xattr_entry
  mm: list_lru: allocate list_lru_one only when needed
  mm: list_lru: rename memcg_drain_all_list_lrus to
    memcg_reparent_list_lrus
  mm: list_lru: replace linear array with xarray

 drivers/dax/super.c        |   2 +-
 fs/9p/vfs_inode.c          |   2 +-
 fs/adfs/super.c            |   2 +-
 fs/affs/super.c            |   2 +-
 fs/afs/super.c             |   2 +-
 fs/befs/linuxvfs.c         |   2 +-
 fs/bfs/inode.c             |   2 +-
 fs/block_dev.c             |   2 +-
 fs/btrfs/inode.c           |   2 +-
 fs/ceph/inode.c            |   2 +-
 fs/cifs/cifsfs.c           |   2 +-
 fs/coda/inode.c            |   2 +-
 fs/dcache.c                |   3 +-
 fs/ecryptfs/super.c        |   2 +-
 fs/efs/super.c             |   2 +-
 fs/erofs/super.c           |   2 +-
 fs/exfat/super.c           |   2 +-
 fs/ext2/super.c            |   2 +-
 fs/ext4/super.c            |   2 +-
 fs/f2fs/super.c            |   2 +-
 fs/fat/inode.c             |   2 +-
 fs/freevxfs/vxfs_super.c   |   2 +-
 fs/fuse/inode.c            |   2 +-
 fs/gfs2/super.c            |   2 +-
 fs/hfs/super.c             |   2 +-
 fs/hfsplus/super.c         |   2 +-
 fs/hostfs/hostfs_kern.c    |   2 +-
 fs/hpfs/super.c            |   2 +-
 fs/hugetlbfs/inode.c       |   2 +-
 fs/inode.c                 |   2 +-
 fs/isofs/inode.c           |   2 +-
 fs/jffs2/super.c           |   2 +-
 fs/jfs/super.c             |   2 +-
 fs/minix/inode.c           |   2 +-
 fs/nfs/inode.c             |   2 +-
 fs/nfs/nfs42xattr.c        |  95 +++++-----
 fs/nilfs2/super.c          |   2 +-
 fs/ntfs/inode.c            |   2 +-
 fs/ocfs2/dlmfs/dlmfs.c     |   2 +-
 fs/ocfs2/super.c           |   2 +-
 fs/openpromfs/inode.c      |   2 +-
 fs/orangefs/super.c        |   2 +-
 fs/overlayfs/super.c       |   2 +-
 fs/proc/inode.c            |   2 +-
 fs/qnx4/inode.c            |   2 +-
 fs/qnx6/inode.c            |   2 +-
 fs/reiserfs/super.c        |   2 +-
 fs/romfs/super.c           |   2 +-
 fs/squashfs/super.c        |   2 +-
 fs/sysv/inode.c            |   2 +-
 fs/ubifs/super.c           |   2 +-
 fs/udf/super.c             |   2 +-
 fs/ufs/super.c             |   2 +-
 fs/vboxsf/super.c          |   2 +-
 fs/xfs/xfs_icache.c        |   3 +-
 fs/zonefs/super.c          |   2 +-
 include/linux/fs.h         |   7 +
 include/linux/list_lru.h   |  24 +--
 include/linux/memcontrol.h |  31 ++--
 include/linux/slab.h       |   4 +
 include/linux/swap.h       |   5 +-
 include/linux/xarray.h     |   9 +-
 ipc/mqueue.c               |   2 +-
 lib/xarray.c               |  10 +-
 mm/list_lru.c              | 430 +++++++++++++++++++++------------------------
 mm/memcontrol.c            | 154 +++-------------
 mm/shmem.c                 |   2 +-
 mm/slab.c                  |  39 ++--
 mm/slab.h                  |  17 +-
 mm/slub.c                  |  42 +++--
 mm/workingset.c            |   2 +-
 net/socket.c               |   2 +-
 net/sunrpc/rpc_pipe.c      |   2 +-
 73 files changed, 460 insertions(+), 529 deletions(-)

-- 
2.11.0

