From: Qi Zheng <zhengqi.arch@bytedance.com>
To: akpm@linux-foundation.org, hannes@cmpxchg.org,
	shakeelb@google.com, mhocko@kernel.org, roman.gushchin@linux.dev,
	muchun.song@linux.dev, david@redhat.com, shy828301@gmail.com
Cc: tkhai@ya.ru, sultan@kerneltoast.com, dave@stgolabs.net,
	penguin-kernel@I-love.SAKURA.ne.jp, paulmck@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Qi Zheng <zhengqi.arch@bytedance.com>
Subject: [PATCH 0/5] make slab shrink lockless
Date: Mon, 20 Feb 2023 17:16:32 +0800
Message-ID: <20230220091637.64865-1-zhengqi.arch@bytedance.com>

Hi all,

This patch series aims to make slab shrink lockless.

1. Background
=============

On our servers, we often find the following system cpu hotspots:

  44.16%  [kernel]  [k] down_read_trylock
  14.12%  [kernel]  [k] up_read
  13.43%  [kernel]  [k] shrink_slab
   5.25%  [kernel]  [k] count_shadow_nodes
   3.42%  [kernel]  [k] idr_find

We then used bpftrace to capture the corresponding call trace:

@[
    down_read_trylock+5
    shrink_slab+292
    shrink_node+640
    do_try_to_free_pages+211
    try_to_free_mem_cgroup_pages+266
    try_charge_memcg+386
    charge_memcg+51
    __mem_cgroup_charge+44
    __handle_mm_fault+1416
    handle_mm_fault+260
    do_user_addr_fault+459
    exc_page_fault+104
    asm_exc_page_fault+38
    clear_user_rep_good+18
    read_zero+100
    vfs_read+176
    ksys_read+93
    do_syscall_64+62
    entry_SYSCALL_64_after_hwframe+114
]: 1868979

It is easy to see that the hotspot is caused by frequently failing to take the
read lock of shrinker_rwsem while reclaiming slab memory.

shrinker_rwsem is currently a global lock, and the following cases can produce
the hotspots above:

a. The write lock of shrinker_rwsem is held for too long. For example, when
   there are many memcgs in the system, some paths (e.g. expand_shrinker_info())
   hold the write lock while traversing all of them, which takes a long time.
b. The read lock of shrinker_rwsem is held for too long and a writer arrives in
   the meantime. The writer is forced to wait, and it in turn blocks all
   subsequent readers.
   For example:
   - the lock holder is scheduled out while the read lock of shrinker_rwsem is
     held in do_shrink_slab()
   - some shrinkers are blocked for too long, like the case described in the
     patchset at [1].

[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/

All of the down_read_trylock() hotspots caused by the above cases can be
eliminated by replacing the shrinker_rwsem trylocks with SRCU.
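
To illustrate the idea, here is a minimal, simplified sketch of the read side
(not the actual diff in this series; the srcu_struct name "shrinker_srcu" and
the function names are assumptions made for illustration only). The trylock
around the shrinker list walk is replaced by an SRCU read-side critical
section, so concurrent reclaimers neither fail nor block on a global lock:

```
/*
 * Simplified sketch only -- not the actual patch. "shrinker_srcu" and the
 * function names below are illustrative assumptions.
 */
DEFINE_SRCU(shrinker_srcu);

/* Before: slab reclaim gives up whenever the trylock fails. */
static unsigned long shrink_slab_rwsem(gfp_t gfp_mask, int nid,
				       struct mem_cgroup *memcg, int priority)
{
	struct shrink_control sc = {
		.gfp_mask = gfp_mask, .nid = nid, .memcg = memcg,
	};
	struct shrinker *shrinker;
	unsigned long freed = 0;

	if (!down_read_trylock(&shrinker_rwsem))
		return 0;	/* contended: skip slab reclaim entirely */

	list_for_each_entry(shrinker, &shrinker_list, list)
		freed += do_shrink_slab(&sc, shrinker, priority);

	up_read(&shrinker_rwsem);
	return freed;
}

/* After: readers only enter an SRCU read-side critical section. */
static unsigned long shrink_slab_srcu(gfp_t gfp_mask, int nid,
				      struct mem_cgroup *memcg, int priority)
{
	struct shrink_control sc = {
		.gfp_mask = gfp_mask, .nid = nid, .memcg = memcg,
	};
	struct shrinker *shrinker;
	unsigned long freed = 0;
	int idx;

	idx = srcu_read_lock(&shrinker_srcu);

	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
				 srcu_read_lock_held(&shrinker_srcu))
		freed += do_shrink_slab(&sc, shrinker, priority);

	srcu_read_unlock(&shrinker_srcu, idx);
	return freed;
}
```

The trade-off is that the writer side (shrinker registration/unregistration)
now has to wait for in-flight readers via synchronize_srcu() instead of
excluding them with the rwsem.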

2. Survey
=========

Before writing the implementation, I found that there had been several similar
submissions in the community:

a. Davidlohr Bueso submitted a patch in 2015.
   Subject: [PATCH -next v2] mm: srcu-ify shrinkers
   Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
   Result: It was merged into the linux-next branch, but broke the arm
           allnoconfig build (which has no CONFIG_SRCU).

b. Tetsuo Handa submitted a patchset in 2017.
   Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
   Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
   Result: The current simple approach (break out when rwsem_is_contended()) was
           chosen instead. Christoph Hellwig suggested using SRCU, but SRCU was
           not unconditionally enabled at the time.

c. Kirill Tkhai submitted a patchset in 2018.
   Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
   Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
   Result: At that time, SRCU was not unconditionally enabled, and there were
           some objections to enabling it. Kirill's focus later moved to other
           things, so the patchset was not updated further.

d. Sultan Alsawaf submitted a patch in 2021.
   Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
   Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
   Result: Rejected because SRCU was not unconditionally enabled.

Almost all of these historical submissions were abandoned because SRCU was not
unconditionally enabled. Now that SRCU has been unconditionally enabled by
Paul E. McKenney in 2023 [2], it is time to replace the shrinker_rwsem trylocks
with SRCU.

[2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/
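
With SRCU available in every configuration, the writer side no longer needs to
exclude readers at all. For completeness, here is a rough sketch of what
unregistration can look like (illustrative only; "shrinker_mutex" and
"shrinker_srcu" are assumed names, and the real patches may differ in detail):

```
/* Illustrative sketch only; names are assumptions, not the actual patch. */
static DEFINE_MUTEX(shrinker_mutex);	/* serializes writers against writers */

void unregister_shrinker_sketch(struct shrinker *shrinker)
{
	mutex_lock(&shrinker_mutex);
	list_del_rcu(&shrinker->list);	/* unlink; readers may still see it */
	mutex_unlock(&shrinker_mutex);

	/* Wait for every SRCU reader that might still be walking this entry. */
	synchronize_srcu(&shrinker_srcu);

	/* No reader can reference the shrinker now; free its deferred counters. */
	kfree(shrinker->nr_deferred);
	shrinker->nr_deferred = NULL;
}
```

This is also why the unconditional availability of SRCU matters: the
synchronize_srcu() call has to exist in every configuration that builds the
shrinker code.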

3. Reproduction and testing
===========================

The down_read_trylock() hotspot can be reproduced with the following script:

```
#!/bin/bash
DIR="/root/shrinker/memcg/mnt"

# Create a 200M-limited parent memcg, child memcgs 0..$1 and a mount point
# for each child.
do_create()
{
        mkdir /sys/fs/cgroup/memory/test
        echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
        for i in `seq 0 $1`;
        do
                mkdir /sys/fs/cgroup/memory/test/$i;
                echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
                mkdir -p $DIR/$i;
        done
}

# Mount a separate tmpfs instance (each superblock registers its own shrinker)
# on mount points $1..$2.
do_mount()
{
        for i in `seq $1 $2`;
        do
                mount -t tmpfs $i $DIR/$i;
        done
}

# Move into memcg $i and write a 1M file into its tmpfs, so each child memcg
# gets charged.
do_touch()
{
        for i in `seq $1 $2`;
        do
                echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
                dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
        done
}

# 2001 memcgs and tmpfs mounts; touching the first 1001 drives memcg reclaim
# against the 200M limit.
do_create 2000
do_mount 0 2000
do_touch 0 1000
```

Running the above script produces the following perf hotspots:

  46.60%  [kernel]  [k] down_read_trylock
  18.70%  [kernel]  [k] up_read
  15.44%  [kernel]  [k] shrink_slab
   4.37%  [kernel]  [k] _find_next_bit
   2.75%  [kernel]  [k] xa_load
   2.07%  [kernel]  [k] idr_find
   1.73%  [kernel]  [k] do_shrink_slab
   1.42%  [kernel]  [k] shrink_lruvec
   0.74%  [kernel]  [k] shrink_node
   0.60%  [kernel]  [k] list_lru_count_one

After applying this patchset, the hotspots become the following:

  19.53%  [kernel]  [k] _find_next_bit
  14.63%  [kernel]  [k] do_shrink_slab
  14.58%  [kernel]  [k] shrink_slab
  11.83%  [kernel]  [k] shrink_lruvec
   9.33%  [kernel]  [k] __blk_flush_plug
   6.67%  [kernel]  [k] mem_cgroup_iter
   3.73%  [kernel]  [k] list_lru_count_one
   2.43%  [kernel]  [k] shrink_node
   1.96%  [kernel]  [k] super_cache_count
   1.78%  [kernel]  [k] __rcu_read_unlock
   1.38%  [kernel]  [k] __srcu_read_lock
   1.30%  [kernel]  [k] xas_descend

We can see that slab reclaim is no longer blocked by the shrinker_rwsem trylock,
which makes slab reclaim lockless.

This series is based on next-20230217.

Comments and suggestions are welcome.

Thanks,
Qi.

Qi Zheng (5):
  mm: vmscan: make global slab shrink lockless
  mm: vmscan: make memcg slab shrink lockless
  mm: shrinkers: make count and scan in shrinker debugfs lockless
  mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  mm: shrinkers: convert shrinker_rwsem to mutex

 drivers/md/dm-cache-metadata.c |   2 +-
 drivers/md/dm-thin-metadata.c  |   2 +-
 fs/super.c                     |   2 +-
 mm/shrinker_debug.c            |  38 ++++-------
 mm/vmscan.c                    | 119 ++++++++++++++++-----------------
 5 files changed, 76 insertions(+), 87 deletions(-)

-- 
2.20.1


