From: Daniel Jordan <daniel.m.jordan@oracle.com>
To: Andrew Morton <akpm@linux-foundation.org>,
Herbert Xu <herbert@gondor.apana.org.au>,
Steffen Klassert <steffen.klassert@secunet.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
Alexander Duyck <alexander.h.duyck@linux.intel.com>,
Dan Williams <dan.j.williams@intel.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
David Hildenbrand <david@redhat.com>,
Jason Gunthorpe <jgg@ziepe.ca>, Jonathan Corbet <corbet@lwn.net>,
Josh Triplett <josh@joshtriplett.org>,
Kirill Tkhai <ktkhai@virtuozzo.com>,
Michal Hocko <mhocko@kernel.org>, Pavel Machek <pavel@ucw.cz>,
Pavel Tatashin <pasha.tatashin@soleen.com>,
Peter Zijlstra <peterz@infradead.org>,
Randy Dunlap <rdunlap@infradead.org>,
Shile Zhang <shile.zhang@linux.alibaba.com>,
Tejun Heo <tj@kernel.org>, Zi Yan <ziy@nvidia.com>,
linux-crypto@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org,
Daniel Jordan <daniel.m.jordan@oracle.com>
Subject: [PATCH 6/7] mm: parallelize deferred_init_memmap()
Date: Thu, 30 Apr 2020 16:11:24 -0400 [thread overview]
Message-ID: <20200430201125.532129-7-daniel.m.jordan@oracle.com> (raw)
In-Reply-To: <20200430201125.532129-1-daniel.m.jordan@oracle.com>
Deferred struct page init uses one thread per node, which is a
significant bottleneck at boot for big machines--often the largest.
Parallelize to reduce system downtime.
The maximum number of threads is capped at the number of CPUs on the
node because speedups always improve with additional threads on every
system tested, and at this phase of boot, the system is otherwise idle
and waiting on page init to finish.
Helper threads operate on MAX_ORDER_NR_PAGES-aligned ranges to avoid
accessing uninitialized buddy pages, so set the job's alignment
accordingly.
The minimum chunk size is also MAX_ORDER_NR_PAGES because there was
benefit to using multiple threads even on relatively small memory (1G)
systems.
Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
2 nodes * 26 cores * 2 threads = 104 CPUs
384G/node = 768G memory
kernel boot deferred init
------------------------ ------------------------
speedup time_ms (stdev) speedup time_ms (stdev)
base -- 4056.7 ( 5.5) -- 1763.3 ( 4.2)
test 39.9% 2436.7 ( 2.1) 91.8% 144.3 ( 5.9)
Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
1 node * 16 cores * 2 threads = 32 CPUs
192G/node = 192G memory
kernel boot deferred init
------------------------ ------------------------
speedup time_ms (stdev) speedup time_ms (stdev)
base -- 1957.3 ( 14.0) -- 1093.7 ( 12.9)
test 49.1% 996.0 ( 7.2) 88.4% 127.3 ( 5.1)
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
2 nodes * 18 cores * 2 threads = 72 CPUs
128G/node = 256G memory
kernel boot deferred init
------------------------ ------------------------
speedup time_ms (stdev) speedup time_ms (stdev)
base -- 1666.0 ( 3.5) -- 618.0 ( 3.5)
test 31.3% 1145.3 ( 1.5) 85.6% 89.0 ( 1.7)
AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
1 node * 8 cores * 2 threads = 16 CPUs
64G/node = 64G memory
kernel boot deferred init
------------------------ ------------------------
speedup time_ms (stdev) speedup time_ms (stdev)
base -- 1029.7 ( 42.3) -- 253.7 ( 3.1)
test 23.3% 789.3 ( 15.0) 76.3% 60.0 ( 5.6)
Server-oriented distros that enable deferred page init sometimes run in
small VMs, and they still benefit even though the fraction of boot time
saved is smaller:
AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
1 node * 2 cores * 2 threads = 4 CPUs
16G/node = 16G memory
kernel boot deferred init
------------------------ ------------------------
speedup time_ms (stdev) speedup time_ms (stdev)
base -- 757.7 ( 17.1) -- 57.0 ( 0.0)
test 6.2% 710.3 ( 15.0) 63.2% 21.0 ( 0.0)
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
1 node * 2 cores * 2 threads = 4 CPUs
14G/node = 14G memory
kernel boot deferred init
------------------------ ------------------------
speedup time_ms (stdev) speedup time_ms (stdev)
base -- 656.3 ( 7.1) -- 57.3 ( 1.5)
test 8.6% 599.7 ( 5.9) 62.8% 21.3 ( 1.2)
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
mm/Kconfig | 6 +++---
mm/page_alloc.c | 46 ++++++++++++++++++++++++++++++++++++++--------
2 files changed, 41 insertions(+), 11 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index ab80933be65ff..e5007206c7601 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -622,13 +622,13 @@ config DEFERRED_STRUCT_PAGE_INIT
depends on SPARSEMEM
depends on !NEED_PER_CPU_KM
depends on 64BIT
+ select PADATA
help
Ordinarily all struct pages are initialised during early boot in a
single thread. On very large machines this can take a considerable
amount of time. If this option is set, large machines will bring up
- a subset of memmap at boot and then initialise the rest in parallel
- by starting one-off "pgdatinitX" kernel thread for each node X. This
- has a potential performance impact on processes running early in the
+ a subset of memmap at boot and then initialise the rest in parallel.
+ This has a potential performance impact on tasks running early in the
lifetime of the system until these kthreads finish the
initialisation.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 990514d8f0d94..96d6d0d920c27 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@
#include <linux/lockdep.h>
#include <linux/nmi.h>
#include <linux/psi.h>
+#include <linux/padata.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -1729,6 +1730,25 @@ deferred_init_maxorder(struct zone *zone, unsigned long *start_pfn,
return nr_pages;
}
+struct def_init_args {
+ struct zone *zone;
+ atomic_long_t nr_pages;
+};
+
+static void __init deferred_init_memmap_chunk(unsigned long spfn,
+ unsigned long epfn, void *arg)
+{
+ struct def_init_args *args = arg;
+ unsigned long nr_pages = 0;
+
+ while (spfn < epfn) {
+ nr_pages += deferred_init_maxorder(args->zone, &spfn, epfn);
+ cond_resched();
+ }
+
+ atomic_long_add(nr_pages, &args->nr_pages);
+}
+
/* Initialise remaining memory on a node */
static int __init deferred_init_memmap(void *data)
{
@@ -1738,7 +1758,7 @@ static int __init deferred_init_memmap(void *data)
unsigned long first_init_pfn, flags;
unsigned long start = jiffies;
struct zone *zone;
- int zid;
+ int zid, max_threads;
u64 i;
/* Bind memory initialisation thread to a local node if possible */
@@ -1778,15 +1798,25 @@ static int __init deferred_init_memmap(void *data)
goto zone_empty;
/*
- * Initialize and free pages in MAX_ORDER sized increments so
- * that we can avoid introducing any issues with the buddy
- * allocator.
+ * More CPUs always led to greater speedups on tested systems, up to
+ * all the nodes' CPUs. Use all since the system is otherwise idle now.
*/
+ max_threads = max(cpumask_weight(cpumask), 1u);
+
for_each_free_mem_pfn_range_in_zone_from(i, zone, &spfn, &epfn) {
- while (spfn < epfn) {
- nr_pages += deferred_init_maxorder(zone, &spfn, epfn);
- cond_resched();
- }
+ struct def_init_args args = { zone, ATOMIC_LONG_INIT(0) };
+ struct padata_mt_job job = {
+ .thread_fn = deferred_init_memmap_chunk,
+ .fn_arg = &args,
+ .start = spfn,
+ .size = epfn - spfn,
+ .align = MAX_ORDER_NR_PAGES,
+ .min_chunk = MAX_ORDER_NR_PAGES,
+ .max_threads = max_threads,
+ };
+
+ padata_do_multithreaded(&job);
+ nr_pages += atomic_long_read(&args.nr_pages);
}
zone_empty:
/* Sanity check that the next zone really is unpopulated */
--
2.26.2
next prev parent reply other threads:[~2020-04-30 20:12 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-04-30 20:11 [PATCH 0/7] padata: parallelize deferred page init Daniel Jordan
2020-04-30 20:11 ` [PATCH 1/7] padata: remove exit routine Daniel Jordan
2020-04-30 20:11 ` [PATCH 2/7] padata: initialize earlier Daniel Jordan
2020-04-30 20:11 ` [PATCH 3/7] padata: allocate work structures for parallel jobs from a pool Daniel Jordan
2020-04-30 20:11 ` [PATCH 4/7] padata: add basic support for multithreaded jobs Daniel Jordan
2020-04-30 20:11 ` [PATCH 5/7] mm: move zone iterator outside of deferred_init_maxorder() Daniel Jordan
2020-04-30 21:43 ` Alexander Duyck
2020-05-01 2:45 ` Daniel Jordan
2020-05-04 22:10 ` Alexander Duyck
2020-05-05 0:54 ` Daniel Jordan
2020-05-05 15:27 ` Alexander Duyck
2020-05-06 22:39 ` Daniel Jordan
2020-05-07 15:26 ` Alexander Duyck
2020-05-07 20:20 ` Daniel Jordan
2020-05-07 21:18 ` Alexander Duyck
2020-05-07 22:15 ` Daniel Jordan
2020-04-30 20:11 ` Daniel Jordan [this message]
2020-05-04 22:33 ` [PATCH 6/7] mm: parallelize deferred_init_memmap() Alexander Duyck
2020-05-04 23:38 ` Josh Triplett
2020-05-05 0:40 ` Alexander Duyck
2020-05-05 1:48 ` Daniel Jordan
2020-05-05 2:09 ` Daniel Jordan
2020-05-05 14:55 ` Alexander Duyck
2020-05-06 22:21 ` Daniel Jordan
2020-05-06 22:36 ` Alexander Duyck
2020-05-06 22:43 ` Daniel Jordan
2020-05-06 23:01 ` Daniel Jordan
2020-05-05 1:26 ` Daniel Jordan
2020-04-30 20:11 ` [PATCH 7/7] padata: document multithreaded jobs Daniel Jordan
2020-04-30 21:31 ` [PATCH 0/7] padata: parallelize deferred page init Andrew Morton
2020-04-30 21:40 ` Pavel Tatashin
2020-05-01 2:40 ` Daniel Jordan
2020-05-01 0:50 ` Josh Triplett
2020-05-01 1:09 ` Josh Triplett
2020-05-01 2:48 ` Daniel Jordan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200430201125.532129-7-daniel.m.jordan@oracle.com \
--to=daniel.m.jordan@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=alex.williamson@redhat.com \
--cc=alexander.h.duyck@linux.intel.com \
--cc=corbet@lwn.net \
--cc=dan.j.williams@intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=david@redhat.com \
--cc=herbert@gondor.apana.org.au \
--cc=jgg@ziepe.ca \
--cc=josh@joshtriplett.org \
--cc=ktkhai@virtuozzo.com \
--cc=linux-crypto@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=pasha.tatashin@soleen.com \
--cc=pavel@ucw.cz \
--cc=peterz@infradead.org \
--cc=rdunlap@infradead.org \
--cc=shile.zhang@linux.alibaba.com \
--cc=steffen.klassert@secunet.com \
--cc=tj@kernel.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).