From: Mel Gorman <mgorman@techsingularity.net>
To: Michal Hocko <mhocko@suse.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
Vlastimil Babka <vbabka@suse.cz>,
Christoph Lameter <cl@linux.com>,
Bharata B Rao <bharata@linux.ibm.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
linux-mm@kvack.org, David Rientjes <rientjes@google.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Andrew Morton <akpm@linux-foundation.org>,
guro@fb.com, Shakeel Butt <shakeelb@google.com>,
Johannes Weiner <hannes@cmpxchg.org>,
aneesh.kumar@linux.ibm.com, Jann Horn <jannh@google.com>
Subject: Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order
Date: Thu, 28 Jan 2021 13:45:12 +0000 [thread overview]
Message-ID: <20210128134512.GF3592@techsingularity.net> (raw)
In-Reply-To: <20210126135918.GQ827@dhcp22.suse.cz>
On Tue, Jan 26, 2021 at 02:59:18PM +0100, Michal Hocko wrote:
> > > This thread shows that this is still somehow related to performance but
> > > the real reason is not clear. I believe we should be focusing on the
> > > actual reasons for the performance impact than playing with some fancy
> > > math and tuning for a benchmark on a particular machine which doesn't
> > > work for others due to subtle initialization timing issues.
> > >
> > > Fundamentally why should higher number of CPUs imply the size of slab in
> > > the first place?
> >
> > A 1st answer is that the activity and the number of threads involved
> > scales with the number of CPUs. Regarding the hackbench benchmark as
> > an example, the number of group/threads raise to a higher level on the
> > server than on the small system which doesn't seem unreasonable.
> >
> > On 8 CPUs, I run hackbench with up to 16 groups which means 16*40
> > threads. But I raise up to 256 groups, which means 256*40 threads, on
> > the 224 CPUs system. In fact, hackbench -g 1 (with 1 group) doesn't
> > regress on the 224 CPUs system. The next test with 4 groups starts
> > to regress by -7%. But the next one: hackbench -g 16 regresses by 187%
> > (duration is almost 3 times longer). It seems reasonable to assume
> > that the number of running threads and resources scale with the number
> > of CPUs because we want to run more stuff.
>
> OK, I do understand that more jobs scale with the number of CPUs but I
> would also expect that higher order pages are generally more expensive
> to get so this is not really a clear cut especially under some more
> demand on the memory where allocations are smooth. So the question
> really is whether this is not just optimizing for artificial conditions.
The flip side is that smaller orders increase zone lock contention, and
that contention can scale with the number of CPUs, so the two are partially
related. hackbench-sockets is an extreme case (pipetest is not affected)
but it's the messenger here.
On an x86-64 2-socket 40-core (80 threads) machine, comparing a revert
of the patch against vanilla 5.11-rc5 gives
hackbench-process-sockets
5.11-rc5 5.11-rc5
revert-lockstat vanilla-lockstat
Amean 1 1.1560 ( 0.00%) 1.0633 * 8.02%*
Amean 4 2.0797 ( 0.00%) 2.5470 * -22.47%*
Amean 7 3.2693 ( 0.00%) 4.3433 * -32.85%*
Amean 12 5.2043 ( 0.00%) 6.5600 * -26.05%*
Amean 21 10.5817 ( 0.00%) 11.3320 * -7.09%*
Amean 30 13.3923 ( 0.00%) 15.5817 * -16.35%*
Amean 48 20.3893 ( 0.00%) 23.6733 * -16.11%*
Amean 79 31.4210 ( 0.00%) 38.2787 * -21.83%*
Amean 110 43.6177 ( 0.00%) 53.8847 * -23.54%*
Amean 141 56.3840 ( 0.00%) 68.4257 * -21.36%*
Amean 172 70.0577 ( 0.00%) 85.0077 * -21.34%*
Amean 203 81.9717 ( 0.00%) 100.7137 * -22.86%*
Amean 234 95.1900 ( 0.00%) 116.0280 * -21.89%*
Amean 265 108.9097 ( 0.00%) 130.4307 * -19.76%*
Amean 296 119.7470 ( 0.00%) 142.3637 * -18.89%*
i.e. the patch incurs a 7% to 32% performance penalty. This bisected
cleanly yesterday when I was looking for the regression, and I then
found this thread.
Numerous caches change size. For example, kmalloc-512 goes from order-0
(vanilla) to order-2 with the revert.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
VANILLA
&zone->lock: 1202731 1203433 0.07 120.55 1555485.48 1.29 8920825 12537091 0.06 84.10 9855085.12 0.79
-----------
&zone->lock 61903 [<00000000b47dc96a>] free_one_page+0x3f/0x530
&zone->lock 7655 [<00000000099f6e05>] get_page_from_freelist+0x475/0x1370
&zone->lock 36529 [<0000000075b9b918>] free_pcppages_bulk+0x1ac/0x7d0
&zone->lock 1097346 [<00000000b8e4950a>] get_page_from_freelist+0xaf0/0x1370
-----------
&zone->lock 44716 [<00000000099f6e05>] get_page_from_freelist+0x475/0x1370
&zone->lock 69813 [<0000000075b9b918>] free_pcppages_bulk+0x1ac/0x7d0
&zone->lock 31596 [<00000000b47dc96a>] free_one_page+0x3f/0x530
&zone->lock 1057308 [<00000000b8e4950a>] get_page_from_freelist+0xaf0/0x1370
REVERT
&zone->lock: 735827 739037 0.06 66.12 699661.56 0.95 4095299 7757942 0.05 54.35 5670083.68 0.73
-----------
&zone->lock 101927 [<00000000a60d5f86>] free_one_page+0x3f/0x530
&zone->lock 626426 [<00000000122cecf3>] get_page_from_freelist+0xaf0/0x1370
&zone->lock 9207 [<0000000068b9c9a1>] free_pcppages_bulk+0x1ac/0x7d0
&zone->lock 1477 [<00000000f856e720>] get_page_from_freelist+0x475/0x1370
-----------
&zone->lock 6249 [<00000000f856e720>] get_page_from_freelist+0x475/0x1370
&zone->lock 92224 [<00000000a60d5f86>] free_one_page+0x3f/0x530
&zone->lock 19690 [<0000000068b9c9a1>] free_pcppages_bulk+0x1ac/0x7d0
&zone->lock 620874 [<00000000122cecf3>] get_page_from_freelist+0xaf0/0x1370
Each individual wait time is small, but the maximum wait (waittime-max)
roughly doubles with the patch (120us vanilla vs 66us with the revert).
Total wait time also roughly doubles due to the patch, and acquisitions
almost double.
So mostly this is down to the number of times SLUB calls into the page
allocator, which only caches order-0 pages on a per-cpu basis. I do have
a prototype for a high-order per-cpu allocator but it is very rough --
the high watermarks stop making sense, the code needs cleaning up, the
memory needed for the pcpu structures quadruples, etc.
--
Mel Gorman
SUSE Labs
Thread overview: 36+ messages
2020-11-18 8:27 [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order Bharata B Rao
2020-11-18 11:25 ` Vlastimil Babka
2020-11-18 19:34 ` Roman Gushchin
2020-11-18 19:53 ` David Rientjes
2021-01-20 17:36 ` Vincent Guittot
2021-01-21 5:30 ` Bharata B Rao
2021-01-21 9:09 ` Vincent Guittot
2021-01-21 10:01 ` Christoph Lameter
2021-01-21 10:48 ` Vincent Guittot
2021-01-21 18:19 ` Vlastimil Babka
2021-01-22 8:03 ` Vincent Guittot
2021-01-22 12:03 ` Vlastimil Babka
2021-01-22 13:16 ` Vincent Guittot
2021-01-23 5:16 ` Bharata B Rao
2021-01-23 12:32 ` Vincent Guittot
2021-01-25 11:20 ` Vlastimil Babka
2021-01-26 23:03 ` Will Deacon
2021-01-27 9:10 ` Christoph Lameter
2021-01-27 11:04 ` Vlastimil Babka
2021-02-03 11:10 ` Bharata B Rao
2021-02-04 7:32 ` Vincent Guittot
2021-02-04 9:07 ` Christoph Lameter
2021-02-04 9:33 ` Vlastimil Babka
2021-02-08 13:41 ` [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order Vlastimil Babka
2021-02-08 14:54 ` Vincent Guittot
2021-02-10 14:07 ` Mel Gorman
2021-01-22 13:05 ` [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order Jann Horn
2021-01-22 13:09 ` Jann Horn
2021-01-22 15:27 ` Vlastimil Babka
2021-01-25 4:28 ` Bharata B Rao
2021-01-26 8:52 ` Michal Hocko
2021-01-26 13:38 ` Vincent Guittot
2021-01-26 13:59 ` Michal Hocko
2021-01-28 13:45 ` Mel Gorman [this message]
2021-01-28 13:57 ` Michal Hocko
2021-01-28 14:42 ` Mel Gorman