From: chengming.zhou@linux.dev
To: cl@linux.com, penberg@kernel.org
Cc: rientjes@google.com, iamjoonsoo.kim@lge.com,
	akpm@linux-foundation.org, vbabka@suse.cz,
	roman.gushchin@linux.dev, 42.hyeyoo@gmail.com,
	willy@infradead.org, pcc@google.com, tytso@mit.edu,
	maz@kernel.org, ruansy.fnst@fujitsu.com, vishal.moola@gmail.com,
	lrh2000@pku.edu.cn, hughd@google.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	chengming.zhou@linux.dev,
	Chengming Zhou <zhouchengming@bytedance.com>
Subject: [RFC PATCH v2 0/6] slub: Delay freezing of CPU partial slabs
Date: Sat, 21 Oct 2023 14:43:11 +0000
Message-ID: <20231021144317.3400916-1-chengming.zhou@linux.dev>

From: Chengming Zhou <zhouchengming@bytedance.com>

Changes in RFC v2:
 - Reuse the PG_workingset bit to keep track of whether a slab is on the
   per-node partial list, as suggested by Matthew Wilcox.
 - Fix an OOM problem on kernels without CONFIG_SLUB_CPU_PARTIAL, which
   was caused by leaking partial slabs in get_partial_node().
 - Add a patch to simplify acquire_slab().
 - Reorder patches a little.
 - v1: https://lore.kernel.org/all/20231017154439.3036608-1-chengming.zhou@linux.dev/

1. Problem
==========
Currently we have to freeze a slab when taking it from the node partial
list and unfreeze it when putting it back on the node partial list,
because we rely on the node list_lock to synchronize changes to the
"frozen" bit.

This implementation has some drawbacks:

 - Alloc path: two cmpxchg_double operations.
   When the allocator has used up its CPU partial slabs, it has to get
   some partial slabs from the node. So it freezes each slab (one
   cmpxchg_double) with the node list_lock held and puts those frozen
   slabs on its CPU partial list. Later, ___slab_alloc() runs a
   cmpxchg_double try-loop again if that slab is picked for use.

 - Alloc path: amplified contention on the node list_lock.
   Since we have to synchronize the "frozen" bit changes under the node
   list_lock, contention on a slab (struct page) can be transferred
   to the node list_lock. On a machine with many CPUs in one node, the
   contention on list_lock is amplified by the alloc paths of all CPUs.

   The current code works around this problem by not using a
   cmpxchg_double try-loop: it just breaks out and returns when it
   hits contention on the page and the first cmpxchg_double fails.
   But this workaround has its own problem.

 - Free path: redundant unfreeze.
   __slab_free() freezes and caches some slabs on its CPU partial list,
   and flushes them to the node partial list when the limit is exceeded,
   which has to unfreeze those slabs again under the node list_lock.
   Actually we don't need to freeze slabs on the CPU partial list, in
   which case we can save the unfreeze cmpxchg_double operations in the
   flush path.
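
To make the first drawback concrete, here is a minimal userspace sketch
of the two compare-and-swap operations the current alloc path pays per
slab. It is not kernel code: it uses C11 atomics instead of
cmpxchg_double, and struct m_slab, M_FROZEN, m_acquire() and
m_take_objects() are hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: bit 31 is "frozen", the low bits count in-use objects. */
#define M_FROZEN (UINT32_C(1) << 31)

struct m_slab {
	_Atomic uint32_t counters;	/* stands in for the freelist/counters pair */
	unsigned int cas_ops;		/* successful CAS count, single-threaded demo only */
};

static void m_slab_init(struct m_slab *s, uint32_t inuse)
{
	assert(inuse < M_FROZEN);
	atomic_init(&s->counters, inuse);
	s->cas_ops = 0;
}

/*
 * First CAS: freeze the slab while it leaves the node partial list (the
 * real code holds the node list_lock here). Only one attempt is made;
 * on contention we give up instead of looping.
 */
static bool m_acquire(struct m_slab *s)
{
	uint32_t old = atomic_load(&s->counters);

	if (old & M_FROZEN)
		return false;
	if (!atomic_compare_exchange_strong(&s->counters, &old, old | M_FROZEN))
		return false;
	s->cas_ops++;
	return true;
}

/*
 * Second CAS: a try-loop run later, when the CPU picks this frozen slab
 * and takes all of its objects.
 */
static void m_take_objects(struct m_slab *s, uint32_t objects)
{
	uint32_t old = atomic_load(&s->counters);

	assert(objects < M_FROZEN);
	while (!atomic_compare_exchange_weak(&s->counters, &old, M_FROZEN | objects))
		;	/* the failed CAS reloaded old; retry */
	s->cas_ops++;
}
```

Calling m_acquire() and then m_take_objects() on one slab leaves
cas_ops at 2, which is the "two cmpxchg_double operations" cost
described in the first item.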

2. Solution
===========
We solve these problems by leaving slabs unfrozen when they are moved
off the node partial list and while they sit on a CPU partial list, so
their "frozen" bit is 0.

These partial slabs won't be manipulated concurrently by the alloc path;
the only racer is the free path, which may manipulate a slab's list
when !inuse. So we need another way to synchronize against it: we
reuse PG_workingset to keep track of whether the slab is on the node
partial list, and only in that case may the slab list be manipulated.
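
The rule above can be sketched as a small userspace model, under the
assumption that the list-membership flag is only read and written with
the node lock held. Nothing here is taken from the patches: struct
pl_node, struct pl_slab, the spin lock built on atomic_flag and all
pl_*() helpers are hypothetical stand-ins, and the node partial list is
reduced to a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct pl_node {
	atomic_flag list_lock;	/* stands in for the node list_lock */
	size_t nr_partial;	/* slabs currently on the node partial list */
};
#define PL_NODE_INIT { ATOMIC_FLAG_INIT, 0 }

struct pl_slab {
	bool on_node_partial;	/* stands in for the reused PG_workingset bit */
	unsigned int inuse;
};

static void pl_lock(struct pl_node *n)
{
	while (atomic_flag_test_and_set_explicit(&n->list_lock, memory_order_acquire))
		;	/* spin until the lock is free */
}

static void pl_unlock(struct pl_node *n)
{
	atomic_flag_clear_explicit(&n->list_lock, memory_order_release);
}

/* Put a slab on the node partial list and mark it as being there. */
static void pl_add_partial(struct pl_node *n, struct pl_slab *s)
{
	pl_lock(n);
	assert(!s->on_node_partial);
	s->on_node_partial = true;
	n->nr_partial++;
	pl_unlock(n);
}

/* Alloc path: take a slab off the node list, e.g. toward a CPU partial list. */
static void pl_remove_partial(struct pl_node *n, struct pl_slab *s)
{
	pl_lock(n);
	assert(s->on_node_partial);
	s->on_node_partial = false;
	n->nr_partial--;
	pl_unlock(n);
}

/*
 * Free path: the last object was freed (!inuse). Touch the list only if
 * the flag says the slab is on the node partial list. Returns true if
 * the slab was removed from that list.
 */
static bool pl_free_last_object(struct pl_node *n, struct pl_slab *s)
{
	bool removed = false;

	s->inuse = 0;
	pl_lock(n);
	if (s->on_node_partial) {
		s->on_node_partial = false;
		n->nr_partial--;
		removed = true;
	}
	pl_unlock(n);
	return removed;
}
```

In this model a slab that the alloc path has already taken off the node
list is left alone by pl_free_last_object(), so the free path cannot
corrupt a list the slab is no longer on.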

A slab is frozen lazily, only when the CPU picks it for active use. It
becomes full at the same time, so we still need to rely on the "frozen"
bit to keep the free path from manipulating its list. In short, a slab
is frozen only when it is activated and unfrozen only when it is
deactivated.
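
The resulting lifecycle can be sketched in userspace as follows. This
is only a model under simplifying assumptions: the counters word, the
DF_FROZEN bit position and the df_*() helpers are hypothetical, C11
atomics replace cmpxchg_double, and the node list_lock that would
protect on_node_partial is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: bit 31 is "frozen", the low bits count in-use objects. */
#define DF_FROZEN (UINT32_C(1) << 31)

struct df_slab {
	_Atomic uint32_t counters;
	bool on_node_partial;	/* protected by the node list_lock in a real scheme */
	unsigned int cas_ops;	/* successful CAS count, single-threaded demo only */
};

/* Node partial list -> CPU partial list: no CAS, "frozen" stays 0. */
static void df_move_to_cpu_partial(struct df_slab *s)
{
	assert(s->on_node_partial);
	s->on_node_partial = false;
}

/* Activation: freeze and take all objects in a single CAS try-loop. */
static void df_activate(struct df_slab *s, uint32_t objects)
{
	uint32_t old = atomic_load(&s->counters);

	assert(objects < DF_FROZEN);
	do {
		assert(!(old & DF_FROZEN));	/* only unfrozen slabs get activated */
	} while (!atomic_compare_exchange_weak(&s->counters, &old, DF_FROZEN | objects));
	s->cas_ops++;
}

/* Deactivation: unfreeze and hand back the objects the CPU did not use. */
static void df_deactivate(struct df_slab *s, uint32_t unused)
{
	uint32_t old = atomic_load(&s->counters);
	uint32_t next;

	do {
		assert(old & DF_FROZEN);
		assert(unused <= (old & ~DF_FROZEN));
		next = (old & ~DF_FROZEN) - unused;
	} while (!atomic_compare_exchange_weak(&s->counters, &old, next));
	s->cas_ops++;
}

static bool df_is_frozen(struct df_slab *s)
{
	return atomic_load(&s->counters) & DF_FROZEN;
}
```

In this model, moving a slab to the CPU partial list costs no CAS at
all, and the only CAS on the alloc side happens in df_activate().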

3. Testing
==========
So far we have only done some simple testing on a server with 128 CPUs
(2 nodes) to compare performance.

 - perf bench sched messaging -g 5 -t -l 100000
   baseline	RFC
   7.042s	6.966s
   7.022s	7.045s
   7.054s	6.985s

 - stress-ng --rawpkt 128 --rawpkt-ops 100000000
   baseline	RFC
   2.42s	2.15s
   2.45s	2.16s
   2.44s	2.17s

The results above show about a 10% improvement on the stress-ng rawpkt
test case, but not much improvement on the perf bench sched messaging
test case.
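
For reference, the percentages can be recomputed from the three runs of
each test with a small helper like the one below. The function names
are made up for this sketch; the improvement is taken as the relative
reduction of the mean run time.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n run times, in seconds. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Relative reduction of the mean run time, in percent (positive = faster). */
static double improvement_pct(const double *base, const double *rfc, size_t n)
{
	double b = mean(base, n);
	double r = mean(rfc, n);

	return (b - r) / b * 100.0;
}
```

Feeding in the stress-ng rawpkt runs (2.42, 2.45, 2.44 vs. 2.15, 2.16,
2.17) gives a bit over 11%, while the perf bench sched messaging runs
give well under 1%.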

Thanks for any comments and code review!

Chengming Zhou (6):
  slub: Keep track of whether slub is on the per-node partial list
  slub: Prepare __slab_free() for unfrozen partial slab out of node
    partial list
  slub: Don't freeze slabs for cpu partial
  slub: Simplify acquire_slab()
  slub: Introduce get_cpu_partial()
  slub: Optimize deactivate_slab()

 include/linux/page-flags.h |   2 +
 mm/slab.h                  |  19 +++
 mm/slub.c                  | 245 +++++++++++++++++++------------------
 3 files changed, 150 insertions(+), 116 deletions(-)

-- 
2.20.1

