From: Kirill Tkhai <ktkhai@virtuozzo.com>
To: vdavydov.dev@gmail.com, shakeelb@google.com,
	viro@zeniv.linux.org.uk, hannes@cmpxchg.org, mhocko@kernel.org,
	tglx@linutronix.de, pombredanne@nexb.com,
	stummala@codeaurora.org, gregkh@linuxfoundation.org,
	sfr@canb.auug.org.au, guro@fb.com, mka@chromium.org,
	penguin-kernel@I-love.SAKURA.ne.jp, chris@chris-wilson.co.uk,
	longman@redhat.com, minchan@kernel.org, ying.huang@intel.com,
	mgorman@techsingularity.net, jbacik@fb.com, linux@roeck-us.net,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	willy@infradead.org, lirongqing@baidu.com,
	aryabinin@virtuozzo.com, akpm@linux-foundation.org,
	ktkhai@virtuozzo.com
Subject: [PATCH v8 17/17] mm: Clear shrinker bit if there are no objects related to memcg
Date: Tue, 03 Jul 2018 18:11:48 +0300
Message-ID: <153063070859.1818.11870882950920963480.stgit@localhost.localdomain>
In-Reply-To: <153063036670.1818.16010062622751502.stgit@localhost.localdomain>

To avoid further unneeded calls of do_shrink_slab()
for shrinkers that no longer have any charged
objects in a memcg, their bits have to be cleared.

This patch introduces a lockless mechanism to do that
without racing with a parallel list_lru add. After
do_shrink_slab() returns SHRINK_EMPTY the first time,
we clear the bit and call it once again. Then we restore
the bit if the new return value is different.

Note that a single smp_mb__after_atomic() in shrink_slab_memcg()
covers two situations:

1)list_lru_add()     shrink_slab_memcg
    list_add_tail()    for_each_set_bit() <--- read bit
                         do_shrink_slab() <--- missed list update (no barrier)
    <MB>                 <MB>
    set_bit()            do_shrink_slab() <--- seen list update

This situation, when the first do_shrink_slab() sees the set bit
but doesn't see the list update (i.e., a race with the first element
being queued), is rare. So we don't add <MB> before the first call
of do_shrink_slab(), in order not to slow down the generic case.
Also, the second call is needed anyway, as seen below in (2).

2)list_lru_add()      shrink_slab_memcg()
    list_add_tail()     ...
    set_bit()           ...
  ...                   for_each_set_bit()
  do_shrink_slab()        do_shrink_slab()
    clear_bit()           ...
  ...                     ...
  list_lru_add()          ...
    list_add_tail()       clear_bit()
    <MB>                  <MB>
    set_bit()             do_shrink_slab()

The barriers guarantee that the second do_shrink_slab()
in the right-side task sees the list update if it really
cleared the bit. This case is drawn in the code comment.
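
As an illustration only (not part of the patch), the clear/re-check/restore
sequence can be modelled in userspace C11. The names struct model,
model_add(), model_do_shrink(), model_shrink(), model_race_demo() and the
inject_add flag are hypothetical; atomic_thread_fence() stands in for the
kernel's smp_mb__before_atomic()/smp_mb__after_atomic(), and a plain counter
stands in for the per-memcg list_lru. The sketch is single-threaded: it
replays the interleaving of case (2) deterministically and does not prove
anything about real concurrency.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sentinel for "no objects at all"; value chosen for this model. */
#define SHRINK_EMPTY (~0UL - 1)

struct model {
	atomic_ulong nr_items;	/* stands in for the memcg's list_lru length */
	atomic_bool bit;	/* stands in for the shrinker's bit in the map */
};

/* Producer side: list_add_tail(), <MB>, set_bit(). */
static void model_add(struct model *m)
{
	atomic_fetch_add_explicit(&m->nr_items, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&m->bit, true, memory_order_relaxed);
}

/* Frees everything it finds; reports SHRINK_EMPTY if there was nothing. */
static unsigned long model_do_shrink(struct model *m)
{
	unsigned long n = atomic_exchange_explicit(&m->nr_items, 0,
						   memory_order_relaxed);
	return n ? n : SHRINK_EMPTY;
}

/*
 * Consumer side. If inject_add is true, an add is replayed in the window
 * between the empty scan and clear_bit(), i.e. the race of case (2).
 */
static unsigned long model_shrink(struct model *m, bool inject_add)
{
	unsigned long ret;

	if (!atomic_load_explicit(&m->bit, memory_order_relaxed))
		return 0;	/* bit clear: the shrinker is skipped */

	ret = model_do_shrink(m);
	if (ret == SHRINK_EMPTY) {
		if (inject_add)
			model_add(m);
		/* clear_bit(), <MB>, then scan once more */
		atomic_store_explicit(&m->bit, false, memory_order_relaxed);
		atomic_thread_fence(memory_order_seq_cst);
		ret = model_do_shrink(m);
		if (ret == SHRINK_EMPTY)
			ret = 0;
		else	/* something was added meanwhile: restore the bit */
			atomic_store_explicit(&m->bit, true,
					      memory_order_relaxed);
	}
	return ret;
}

/* Replays case (2) and returns the final state of the bit. */
static bool model_race_demo(void)
{
	struct model m = { 0 };

	model_add(&m);
	assert(model_shrink(&m, false) == 1);	/* drains the list, bit stays set */
	assert(model_shrink(&m, true) == 1);	/* racing add is caught by the rescan */
	return atomic_load(&m.bit);
}
```

In this model, dropping the second model_do_shrink() call would leave the
injected item on the list with its bit cleared, so later passes would skip
the shrinker; that lost update is what the rescan is there to prevent.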

[Results/performance of the patchset]

After the whole patchset is applied, the test below shows a significant
increase in performance:

$echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
$mkdir /sys/fs/cgroup/memory/ct
$echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
$for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
			    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
			    mkdir -p s/$i; mount -t tmpfs $i s/$i;
			    touch s/$i/file; done

Then, 5 sequential calls of drop caches:
$time echo 3 > /proc/sys/vm/drop_caches

1)Before:
0.00user 13.78system 0:13.78elapsed 99%CPU
0.00user 5.59system 0:05.60elapsed 99%CPU
0.00user 5.48system 0:05.48elapsed 99%CPU
0.00user 8.35system 0:08.35elapsed 99%CPU
0.00user 8.34system 0:08.35elapsed 99%CPU

2)After:
0.00user 1.10system 0:01.10elapsed 99%CPU
0.00user 0.00system 0:00.01elapsed 64%CPU
0.00user 0.01system 0:00.01elapsed 82%CPU
0.00user 0.00system 0:00.01elapsed 64%CPU
0.00user 0.01system 0:00.01elapsed 82%CPU

The results show the performance increases by at least 548 times
(5.48s vs. 0.01s elapsed, comparing the fastest run before with the
steady-state runs after).

Shakeel Butt tested this patchset with a fork-bomb on his configuration:

 > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
 > file containing few KiBs on corresponding mount. Then in a separate
 > memcg of 200 MiB limit ran a fork-bomb.
 >
 > I ran the "perf record -ag -- sleep 60" and below are the results:
 >
 > Without the patch series:
 > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
 > +  36.40%            fb.sh  [kernel.kallsyms]    [k] shrink_slab
 > +  18.97%            fb.sh  [kernel.kallsyms]    [k] list_lru_count_one
 > +   6.75%            fb.sh  [kernel.kallsyms]    [k] super_cache_count
 > +   0.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
 > +   0.44%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
 > +   0.27%            fb.sh  [kernel.kallsyms]    [k] up_read
 > +   0.21%            fb.sh  [kernel.kallsyms]    [k] osq_lock
 > +   0.13%            fb.sh  [kernel.kallsyms]    [k] shmem_unused_huge_count
 > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
 > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node
 >
 > With the patch series:
 > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
 > +  47.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
 > +  30.72%            fb.sh  [kernel.kallsyms]    [k] up_read
 > +   9.51%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
 > +   1.69%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
 > +   1.35%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_protected
 > +   1.05%            fb.sh  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
 > +   0.85%            fb.sh  [kernel.kallsyms]    [k] _raw_spin_lock
 > +   0.78%            fb.sh  [kernel.kallsyms]    [k] lruvec_lru_size
 > +   0.57%            fb.sh  [kernel.kallsyms]    [k] shrink_node
 > +   0.54%            fb.sh  [kernel.kallsyms]    [k] queue_work_on
 > +   0.46%            fb.sh  [kernel.kallsyms]    [k] shrink_slab_memcg

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
---
 include/linux/memcontrol.h |    2 ++
 mm/vmscan.c                |   25 +++++++++++++++++++++++--
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c6ea182ca54b..c79c4a54c0ee 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1316,6 +1316,8 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 
 		rcu_read_lock();
 		map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
+		/* Pairs with smp mb in shrink_slab() */
+		smp_mb__before_atomic();
 		set_bit(shrinker_id, map->map);
 		rcu_read_unlock();
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 96279b5f1f6d..45d153508d1c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -597,8 +597,29 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 			continue;
 
 		ret = do_shrink_slab(&sc, shrinker, priority);
-		if (ret == SHRINK_EMPTY)
-			ret = 0;
+		if (ret == SHRINK_EMPTY) {
+			clear_bit(i, map->map);
+			/*
+			 * After the shrinker reported that it had no objects to free,
+			 * but before we cleared the corresponding bit in the memcg
+			 * shrinker map, a new object might have been added. To make
+			 * sure, we have the bit set in this case, we invoke the
+			 * shrinker one more time and re-set the bit if it reports that
+			 * it is not empty anymore. The memory barrier here pairs with
+			 * the barrier in memcg_set_shrinker_bit():
+			 *
+			 * list_lru_add()     shrink_slab_memcg()
+			 *   list_add_tail()    clear_bit()
+			 *   <MB>               <MB>
+			 *   set_bit()          do_shrink_slab()
+			 */
+			smp_mb__after_atomic();
+			ret = do_shrink_slab(&sc, shrinker, priority);
+			if (ret == SHRINK_EMPTY)
+				ret = 0;
+			else
+				memcg_set_shrinker_bit(memcg, nid, i);
+		}
 		freed += ret;
 
 		if (rwsem_is_contended(&shrinker_rwsem)) {

