From: Daniel Jordan <daniel.m.jordan@oracle.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>,
Daniel Jordan <daniel.m.jordan@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>,
Herbert Xu <herbert@gondor.apana.org.au>,
Steffen Klassert <steffen.klassert@secunet.com>,
Alex Williamson <alex.williamson@redhat.com>,
Alexander Duyck <alexander.h.duyck@linux.intel.com>,
Dan Williams <dan.j.williams@intel.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
David Hildenbrand <david@redhat.com>,
Jason Gunthorpe <jgg@ziepe.ca>, Jonathan Corbet <corbet@lwn.net>,
Kirill Tkhai <ktkhai@virtuozzo.com>,
Michal Hocko <mhocko@kernel.org>, Pavel Machek <pavel@ucw.cz>,
Pavel Tatashin <pasha.tatashin@soleen.com>,
Peter Zijlstra <peterz@infradead.org>,
Randy Dunlap <rdunlap@infradead.org>,
Shile Zhang <shile.zhang@linux.alibaba.com>,
Tejun Heo <tj@kernel.org>, Zi Yan <ziy@nvidia.com>,
linux-crypto@vger.kernel.org, linux-mm <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 6/7] mm: parallelize deferred_init_memmap()
Date: Mon, 4 May 2020 21:48:44 -0400
Message-ID: <20200505014844.ulp4rtih7adtcicm@ca-dmjordan1.us.oracle.com>
In-Reply-To: <CAKgT0UdBv-Wj98P2wMFGDSihPLKWFsqpu77ZmO+eA51uteZ-Ag@mail.gmail.com>
On Mon, May 04, 2020 at 05:40:19PM -0700, Alexander Duyck wrote:
> On Mon, May 4, 2020 at 4:44 PM Josh Triplett <josh@joshtriplett.org> wrote:
> >
> > On May 4, 2020 3:33:58 PM PDT, Alexander Duyck <alexander.duyck@gmail.com> wrote:
> > >On Thu, Apr 30, 2020 at 1:12 PM Daniel Jordan
> > ><daniel.m.jordan@oracle.com> wrote:
> > >> /*
> > >> - * Initialize and free pages in MAX_ORDER sized increments so
> > >> - * that we can avoid introducing any issues with the buddy
> > >> - * allocator.
> > >> + * More CPUs always led to greater speedups on tested systems, up to
> > >> + * all the nodes' CPUs. Use all since the system is otherwise idle now.
> > >> */
> > >
> > >I would be curious about your data. That isn't what I have seen in the
> > >past. Typically only up to about 8 or 10 CPUs gives you any benefit,
> > >beyond that I was usually cache/memory bandwidth bound.

On Skylake the speedup kept improving well past 8 or 10 CPUs, though on the
other machines the benefit of using all of a node's CPUs versus half or 3/4 of
them is less significant.

Given that the rest of the system is idle at this point, my main concern is
whether other archs regress past a certain thread count.

Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
  2 nodes * 26 cores * 2 threads = 104 CPUs
  384G/node = 768G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --   4056.7 (  5.5)         --   1763.3 (  4.2)
      (  1)      -2.3%   4153.3 (  2.5)      -5.3%   1861.7 (  5.5)
  12% (  6)      53.8%   2637.7 ( 38.7)     408.7%    346.7 ( 37.5)
  25% ( 13)      62.4%   2497.3 ( 38.5)     739.7%    210.0 ( 41.8)
  37% ( 19)      63.8%   2477.0 ( 19.0)     851.4%    185.3 ( 21.5)
  50% ( 26)      64.1%   2471.7 ( 21.4)     881.4%    179.7 ( 25.8)
  75% ( 39)      65.2%   2455.7 ( 33.2)     990.7%    161.7 ( 29.3)
 100% ( 52)      66.5%   2436.7 (  2.1)    1121.7%    144.3 (  5.9)

Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
  1 node * 16 cores * 2 threads = 32 CPUs
  192G/node = 192G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --   1957.3 ( 14.0)         --   1093.7 ( 12.9)
      (  1)       1.4%   1930.7 ( 10.0)       3.8%   1053.3 (  7.6)
  12% (  4)      70.0%   1151.7 (  9.0)     292.5%    278.7 (  0.6)
  25% (  8)      86.2%   1051.0 (  7.8)     514.4%    178.0 (  2.6)
  37% ( 12)      95.1%   1003.3 (  7.6)     672.0%    141.7 (  3.8)
  50% ( 16)      93.0%   1014.3 ( 20.0)     720.2%    133.3 (  3.2)
  75% ( 24)      97.8%    989.3 (  6.7)     765.7%    126.3 (  1.5)
 100% ( 32)      96.5%    996.0 (  7.2)     758.9%    127.3 (  5.1)

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
  2 nodes * 18 cores * 2 threads = 72 CPUs
  128G/node = 256G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --   1666.0 (  3.5)         --    618.0 (  3.5)
      (  1)       1.0%   1649.7 (  1.5)       3.0%    600.0 (  1.0)
  12% (  4)      34.9%   1234.7 ( 21.4)     237.7%    183.0 ( 22.5)
  25% (  9)      42.0%   1173.0 ( 10.0)     417.9%    119.3 (  9.6)
  37% ( 13)      44.4%   1153.7 ( 17.0)     524.2%     99.0 ( 15.6)
  50% ( 18)      44.8%   1150.3 ( 15.5)     534.9%     97.3 ( 16.2)
  75% ( 27)      44.8%   1150.3 (  2.5)     550.5%     95.0 (  5.6)
 100% ( 36)      45.5%   1145.3 (  1.5)     594.4%     89.0 (  1.7)

AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
  1 node * 8 cores * 2 threads = 16 CPUs
  64G/node = 64G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --   1029.7 ( 42.3)         --    253.7 (  3.1)
      (  1)       3.4%    995.3 ( 21.4)       4.5%    242.7 (  5.5)
  12% (  2)      16.3%    885.7 ( 24.4)      86.5%    136.0 (  5.2)
  25% (  4)      23.3%    835.0 ( 21.5)     195.0%     86.0 (  1.7)
  37% (  6)      28.0%    804.7 ( 15.7)     249.1%     72.7 (  2.1)
  50% (  8)      26.3%    815.3 ( 11.7)     290.3%     65.0 (  3.5)
  75% ( 12)      30.7%    787.7 (  2.1)     284.3%     66.0 (  3.6)
 100% ( 16)      30.4%    789.3 ( 15.0)     322.8%     60.0 (  5.6)

AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
  1 node * 2 cores * 2 threads = 4 CPUs
  16G/node = 16G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --    757.7 ( 17.1)         --     57.0 (  0.0)
  25% (  1)      -1.0%    765.3 (  5.5)       3.6%     55.0 (  0.0)
  50% (  2)       4.9%    722.3 ( 21.5)      74.5%     32.7 (  4.6)
  75% (  3)       3.8%    729.7 (  4.9)     119.2%     26.0 (  0.0)
 100% (  4)       6.7%    710.3 ( 15.0)     171.4%     21.0 (  0.0)

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
  1 node * 2 cores * 2 threads = 4 CPUs
  14G/node = 14G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --    656.3 (  7.1)         --     57.3 (  1.5)
  25% (  1)       1.8%    644.7 (  3.1)       0.6%     57.0 (  0.0)
  50% (  2)       7.0%    613.7 (  5.1)      68.6%     34.0 (  5.3)
  75% (  3)       7.4%    611.3 (  6.7)     135.6%     24.3 (  0.6)
 100% (  4)       9.4%    599.7 (  5.9)     168.8%     21.3 (  1.2)
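
For anyone rechecking the tables: the speedup column is consistent with the
baseline time (the thr = 0 row) divided by the measured time, minus one. That
formula is inferred from the numbers, not stated above, and the helper names
below are made up for illustration. A rough Python sketch that recomputes a few
Skylake deferred-init entries and also shows how far each is from ideal linear
scaling:

```python
def speedup_pct(baseline_ms: float, time_ms: float) -> float:
    """Percent speedup relative to the baseline (thr = 0) row.

    Assumed formula: (baseline / time - 1) * 100, inferred from the tables.
    """
    return (baseline_ms / time_ms - 1.0) * 100.0


def scaling_efficiency(baseline_ms: float, time_ms: float, threads: int) -> float:
    """Fraction of ideal linear scaling reached with `threads` threads."""
    return (baseline_ms / time_ms) / threads


# Skylake deferred-init mean times in ms, copied from the table above.
baseline = 1763.3
rows = {6: 346.7, 13: 210.0, 26: 179.7, 52: 144.3}

for thr, ms in rows.items():
    print(f"{thr:3d} thr: speedup {speedup_pct(baseline, ms):7.1f}%  "
          f"efficiency {scaling_efficiency(baseline, ms, thr):.2f}")
```

Because the tables list rounded means, values recomputed this way can differ
from the printed speedup column by a few tenths of a percent.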
> > I've found pretty much linear performance up to memory bandwidth, and on the systems I was testing, I didn't saturate memory bandwidth until about the full number of physical cores. From number of cores up to number of threads, the performance stayed about flat; it didn't get any better or worse.
>
> That doesn't sound right though based on the numbers you provided. The
> system you had was 192GB spread over 2 nodes with 48thread/24core per
> node, correct? Your numbers went from ~290ms to ~28ms so a 10x
> decrease, that doesn't sound linear when you spread the work over 24
> cores to get there. I agree that the numbers largely stay flat once
> you hit the peak, I have seen similar behavior when I was working on
> the deferred init code previously. One concern I have though is that
> we may end up seeing better performance with a subset of cores instead
> of running all of the cores/threads, especially if features such as
> turbo come into play. In addition we are talking x86 only so far. I
> would be interested in seeing if this has benefits or not for other
> architectures.
>
> Also what is the penalty that is being paid in order to break up the
> work before-hand and set it up for the parallel work? I would be
> interested in seeing what the cost is on a system with fewer cores per
> node, maybe even down to 1. That would tell us how much additional
> overhead is being added to set things up to run in parallel.
The numbers above include the 1-thread case, which measures that setup
overhead. It seems close to the noise.
> If I get
> a chance tomorrow I might try applying the patches and doing some
> testing myself.
If you end up doing that, you might find this helpful:
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=patch;h=afc72bf8478b95a1d6d174c269ff3693c60630e0
and maybe this:
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=patch;h=dff6537eab281e5a9917682c4adf9059c0574223
Thanks for looking this over.
[ By the way, I'm going to be out Tuesday but back the rest of the week. ]