linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/7] padata: parallelize deferred page init
@ 2020-05-20 18:26 Daniel Jordan
  2020-05-20 18:26 ` [PATCH v2 1/7] padata: remove exit routine Daniel Jordan
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: Alex Williamson, Alexander Duyck, Dan Williams, Dave Hansen,
	David Hildenbrand, Jason Gunthorpe, Jonathan Corbet,
	Josh Triplett, Kirill Tkhai, Michal Hocko, Pavel Machek,
	Pavel Tatashin, Peter Zijlstra, Randy Dunlap, Robert Elliott,
	Shile Zhang, Steven Sistare, Tejun Heo, Zi Yan, linux-crypto,
	linux-mm, linux-kernel, linux-s390, linuxppc-dev, Daniel Jordan

Deferred struct page init is a bottleneck in kernel boot--the biggest
for us and probably others.  Optimizing it maximizes availability for
large-memory systems and allows spinning up short-lived VMs as needed
without having to leave them running.  It also benefits bare metal
machines hosting VMs that are sensitive to downtime.  In projects such
as VMM Fast Restart[1], where guest state is preserved across kexec
reboot, it helps prevent application and network timeouts in the guests.

So, multithread deferred init to take full advantage of system memory
bandwidth.

Extend padata, a framework that handles many parallel singlethreaded
jobs, to handle multithreaded jobs as well by adding support for
splitting up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.  More documentation in patches 4 and 7.

This series is the first step in a project to address other memory
proportional bottlenecks in the kernel such as pmem struct page init,
vfio page pinning, hugetlb fallocate, and munmap.  Deferred page init
doesn't require concurrency limits, resource control, or priority
adjustments like these other users will because it happens during boot
when the system is otherwise idle and waiting for page init to finish.

This has been run on a variety of x86 systems and speeds up kernel boot
by 3% to 49%, saving up to 1.6 out of 4 seconds.  Patch 5 has more
numbers.

Please review and test, and thanks to Alex, Andrew, Josh, and Pavel for
their feedback in the last version.

The powerpc and s390 lists are included in case they want to give this a
try, they had enabled this feature when it was configured per arch.

Series based on 5.7-rc6 plus these three from mmotm

  mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
  mm-initialize-deferred-pages-with-interrupts-enabled.patch
  mm-call-cond_resched-from-deferred_init_memmap.patch

and it's available here:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v2
  https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v2

and the future users and related features are available as
work-in-progress:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-wip-v0.4
  https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.4

v2:
 - Improve the problem statement (Andrew, Josh, Pavel)
 - Add T-b's to unchanged patches (Josh)
 - Fully initialize max-order blocks to avoid buddy issues (Alex)
 - Parallelize on section-aligned boundaries to avoid potential
   false sharing (Alex)
 - Return the maximum thread count from a function that architectures
   can override, with the generic version returning 1 (current
   behavior).  Override for x86 since that's the only arch this series
   has been tested on so far.  Other archs can test with more threads
   by dropping patch 6.
 - Rebase to v5.7-rc6, rerun tests

RFC v4 [2] -> v1:
 - merged with padata (Peter)
 - got rid of the 'task' nomenclature (Peter, Jon)

future work branch:
 - made lockdep-aware (Jason, Peter)
 - adjust workqueue worker priority with renice_or_cancel() (Tejun)
 - fixed undo problem in VFIO (Alex)

The remaining feedback, mainly resource control awareness (cgroup etc),
is TODO for later series.

[1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
    https://www.youtube.com/watch?v=pBsHnf93tcQ
    https://lore.kernel.org/linux-mm/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/

[2] https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jordan@oracle.com/

Daniel Jordan (7):
  padata: remove exit routine
  padata: initialize earlier
  padata: allocate work structures for parallel jobs from a pool
  padata: add basic support for multithreaded jobs
  mm: parallelize deferred_init_memmap()
  mm: make deferred init's max threads arch-specific
  padata: document multithreaded jobs

 Documentation/core-api/padata.rst |  41 +++--
 arch/x86/mm/init_64.c             |  12 ++
 include/linux/memblock.h          |   3 +
 include/linux/padata.h            |  43 ++++-
 init/main.c                       |   2 +
 kernel/padata.c                   | 277 ++++++++++++++++++++++++------
 mm/Kconfig                        |   6 +-
 mm/page_alloc.c                   |  67 +++++++-
 8 files changed, 373 insertions(+), 78 deletions(-)


base-commit: b9bbe6ed63b2b9f2c9ee5cbd0f2c946a2723f4ce
prerequisite-patch-id: 4ad522141e1119a325a9799dad2bd982fbac8b7c
prerequisite-patch-id: 169273327e56f5461101a71dfbd6b4cfd4570cf0
prerequisite-patch-id: 0f34692c8a9673d4c4f6a3545cf8ec3a2abf8620
-- 
2.26.2



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/7] padata: remove exit routine
  2020-05-20 18:26 [PATCH v2 0/7] padata: parallelize deferred page init Daniel Jordan
@ 2020-05-20 18:26 ` Daniel Jordan
  2020-05-20 18:26 ` [PATCH v2 2/7] padata: initialize earlier Daniel Jordan
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: Alex Williamson, Alexander Duyck, Dan Williams, Dave Hansen,
	David Hildenbrand, Jason Gunthorpe, Jonathan Corbet,
	Josh Triplett, Kirill Tkhai, Michal Hocko, Pavel Machek,
	Pavel Tatashin, Peter Zijlstra, Randy Dunlap, Robert Elliott,
	Shile Zhang, Steven Sistare, Tejun Heo, Zi Yan, linux-crypto,
	linux-mm, linux-kernel, linux-s390, linuxppc-dev, Daniel Jordan

padata_driver_exit() is unnecessary because padata isn't built as a
module and doesn't exit.

padata's init routine will soon allocate memory, so getting rid of the
exit function now avoids pointless code to free it.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 kernel/padata.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index a6afa12fb75ee..835919c745266 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -1072,10 +1072,4 @@ static __init int padata_driver_init(void)
 }
 module_init(padata_driver_init);
 
-static __exit void padata_driver_exit(void)
-{
-	cpuhp_remove_multi_state(CPUHP_PADATA_DEAD);
-	cpuhp_remove_multi_state(hp_online);
-}
-module_exit(padata_driver_exit);
 #endif
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 2/7] padata: initialize earlier
  2020-05-20 18:26 [PATCH v2 0/7] padata: parallelize deferred page init Daniel Jordan
  2020-05-20 18:26 ` [PATCH v2 1/7] padata: remove exit routine Daniel Jordan
@ 2020-05-20 18:26 ` Daniel Jordan
  2020-05-20 18:26 ` [PATCH v2 3/7] padata: allocate work structures for parallel jobs from a pool Daniel Jordan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: Alex Williamson, Alexander Duyck, Dan Williams, Dave Hansen,
	David Hildenbrand, Jason Gunthorpe, Jonathan Corbet,
	Josh Triplett, Kirill Tkhai, Michal Hocko, Pavel Machek,
	Pavel Tatashin, Peter Zijlstra, Randy Dunlap, Robert Elliott,
	Shile Zhang, Steven Sistare, Tejun Heo, Zi Yan, linux-crypto,
	linux-mm, linux-kernel, linux-s390, linuxppc-dev, Daniel Jordan

padata will soon initialize the system's struct pages in parallel, so it
needs to be ready by page_alloc_init_late().

The error return from padata_driver_init() triggers an initcall warning,
so add a warning to padata_init() to avoid silent failure.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/padata.h |  6 ++++++
 init/main.c            |  2 ++
 kernel/padata.c        | 17 ++++++++---------
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index a0d8b41850b25..476ecfa41f363 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -164,6 +164,12 @@ struct padata_instance {
 #define	PADATA_INVALID	4
 };
 
+#ifdef CONFIG_PADATA
+extern void __init padata_init(void);
+#else
+static inline void __init padata_init(void) {}
+#endif
+
 extern struct padata_instance *padata_alloc_possible(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
diff --git a/init/main.c b/init/main.c
index 03371976d3872..8ab521f7af5d2 100644
--- a/init/main.c
+++ b/init/main.c
@@ -94,6 +94,7 @@
 #include <linux/rodata_test.h>
 #include <linux/jump_label.h>
 #include <linux/mem_encrypt.h>
+#include <linux/padata.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -1482,6 +1483,7 @@ static noinline void __init kernel_init_freeable(void)
 	smp_init();
 	sched_init_smp();
 
+	padata_init();
 	page_alloc_init_late();
 	/* Initialize page ext after all struct pages are initialized. */
 	page_ext_init();
diff --git a/kernel/padata.c b/kernel/padata.c
index 835919c745266..6f709bc0fc413 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -31,7 +31,6 @@
 #include <linux/slab.h>
 #include <linux/sysfs.h>
 #include <linux/rcupdate.h>
-#include <linux/module.h>
 
 #define MAX_OBJ_NUM 1000
 
@@ -1050,26 +1049,26 @@ void padata_free_shell(struct padata_shell *ps)
 }
 EXPORT_SYMBOL(padata_free_shell);
 
-#ifdef CONFIG_HOTPLUG_CPU
-
-static __init int padata_driver_init(void)
+void __init padata_init(void)
 {
+#ifdef CONFIG_HOTPLUG_CPU
 	int ret;
 
 	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "padata:online",
 				      padata_cpu_online, NULL);
 	if (ret < 0)
-		return ret;
+		goto err;
 	hp_online = ret;
 
 	ret = cpuhp_setup_state_multi(CPUHP_PADATA_DEAD, "padata:dead",
 				      NULL, padata_cpu_dead);
 	if (ret < 0) {
 		cpuhp_remove_multi_state(hp_online);
-		return ret;
+		goto err;
 	}
-	return 0;
-}
-module_init(padata_driver_init);
 
+	return;
+err:
+	pr_warn("padata: initialization failed\n");
 #endif
+}
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 3/7] padata: allocate work structures for parallel jobs from a pool
  2020-05-20 18:26 [PATCH v2 0/7] padata: parallelize deferred page init Daniel Jordan
  2020-05-20 18:26 ` [PATCH v2 1/7] padata: remove exit routine Daniel Jordan
  2020-05-20 18:26 ` [PATCH v2 2/7] padata: initialize earlier Daniel Jordan
@ 2020-05-20 18:26 ` Daniel Jordan
  2020-05-20 18:26 ` [PATCH v2 4/7] padata: add basic support for multithreaded jobs Daniel Jordan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: Alex Williamson, Alexander Duyck, Dan Williams, Dave Hansen,
	David Hildenbrand, Jason Gunthorpe, Jonathan Corbet,
	Josh Triplett, Kirill Tkhai, Michal Hocko, Pavel Machek,
	Pavel Tatashin, Peter Zijlstra, Randy Dunlap, Robert Elliott,
	Shile Zhang, Steven Sistare, Tejun Heo, Zi Yan, linux-crypto,
	linux-mm, linux-kernel, linux-s390, linuxppc-dev, Daniel Jordan

padata allocates per-CPU, per-instance work structs for parallel jobs.
A do_parallel call assigns a job to a sequence number and hashes the
number to a CPU, where the job will eventually run using the
corresponding work.

This approach fit with how padata used to bind a job to each CPU
round-robin, makes less sense after commit bfde23ce200e6 ("padata:
unbind parallel jobs from specific CPUs") because a work isn't bound to
a particular CPU anymore, and isn't needed at all for multithreaded jobs
because they don't have sequence numbers.

Replace the per-CPU works with a preallocated pool, which allows sharing
them between existing padata users and the upcoming multithreaded user.
The pool will also facilitate setting NUMA-aware concurrency limits with
later users.

The pool is sized according to the number of possible CPUs.  With this
limit, MAX_OBJ_NUM no longer makes sense, so remove it.

If the global pool is exhausted, a parallel job is run in the current
task instead to throttle a system trying to do too much in parallel.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/padata.h |   8 +--
 kernel/padata.c        | 118 +++++++++++++++++++++++++++--------------
 2 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 476ecfa41f363..3bfa503503ac5 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -24,7 +24,6 @@
  * @list: List entry, to attach to the padata lists.
  * @pd: Pointer to the internal control structure.
  * @cb_cpu: Callback cpu for serializatioon.
- * @cpu: Cpu for parallelization.
  * @seq_nr: Sequence number of the parallelized data object.
  * @info: Used to pass information from the parallel to the serial function.
  * @parallel: Parallel execution function.
@@ -34,7 +33,6 @@ struct padata_priv {
 	struct list_head	list;
 	struct parallel_data	*pd;
 	int			cb_cpu;
-	int			cpu;
 	unsigned int		seq_nr;
 	int			info;
 	void                    (*parallel)(struct padata_priv *padata);
@@ -68,15 +66,11 @@ struct padata_serial_queue {
 /**
  * struct padata_parallel_queue - The percpu padata parallel queue
  *
- * @parallel: List to wait for parallelization.
  * @reorder: List to wait for reordering after parallel processing.
- * @work: work struct for parallelization.
  * @num_obj: Number of objects that are processed by this cpu.
  */
 struct padata_parallel_queue {
-       struct padata_list    parallel;
        struct padata_list    reorder;
-       struct work_struct    work;
        atomic_t              num_obj;
 };
 
@@ -111,7 +105,7 @@ struct parallel_data {
 	struct padata_parallel_queue	__percpu *pqueue;
 	struct padata_serial_queue	__percpu *squeue;
 	atomic_t			refcnt;
-	atomic_t			seq_nr;
+	unsigned int			seq_nr;
 	unsigned int			processed;
 	int				cpu;
 	struct padata_cpumask		cpumask;
diff --git a/kernel/padata.c b/kernel/padata.c
index 6f709bc0fc413..78ff9aa529204 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -32,7 +32,15 @@
 #include <linux/sysfs.h>
 #include <linux/rcupdate.h>
 
-#define MAX_OBJ_NUM 1000
+struct padata_work {
+	struct work_struct	pw_work;
+	struct list_head	pw_list;  /* padata_free_works linkage */
+	void			*pw_data;
+};
+
+static DEFINE_SPINLOCK(padata_works_lock);
+static struct padata_work *padata_works;
+static LIST_HEAD(padata_free_works);
 
 static void padata_free_pd(struct parallel_data *pd);
 
@@ -58,30 +66,44 @@ static int padata_cpu_hash(struct parallel_data *pd, unsigned int seq_nr)
 	return padata_index_to_cpu(pd, cpu_index);
 }
 
-static void padata_parallel_worker(struct work_struct *parallel_work)
+static struct padata_work *padata_work_alloc(void)
 {
-	struct padata_parallel_queue *pqueue;
-	LIST_HEAD(local_list);
+	struct padata_work *pw;
 
-	local_bh_disable();
-	pqueue = container_of(parallel_work,
-			      struct padata_parallel_queue, work);
+	lockdep_assert_held(&padata_works_lock);
 
-	spin_lock(&pqueue->parallel.lock);
-	list_replace_init(&pqueue->parallel.list, &local_list);
-	spin_unlock(&pqueue->parallel.lock);
+	if (list_empty(&padata_free_works))
+		return NULL;	/* No more work items allowed to be queued. */
 
-	while (!list_empty(&local_list)) {
-		struct padata_priv *padata;
+	pw = list_first_entry(&padata_free_works, struct padata_work, pw_list);
+	list_del(&pw->pw_list);
+	return pw;
+}
 
-		padata = list_entry(local_list.next,
-				    struct padata_priv, list);
+static void padata_work_init(struct padata_work *pw, work_func_t work_fn,
+			     void *data)
+{
+	INIT_WORK(&pw->pw_work, work_fn);
+	pw->pw_data = data;
+}
 
-		list_del_init(&padata->list);
+static void padata_work_free(struct padata_work *pw)
+{
+	lockdep_assert_held(&padata_works_lock);
+	list_add(&pw->pw_list, &padata_free_works);
+}
 
-		padata->parallel(padata);
-	}
+static void padata_parallel_worker(struct work_struct *parallel_work)
+{
+	struct padata_work *pw = container_of(parallel_work, struct padata_work,
+					      pw_work);
+	struct padata_priv *padata = pw->pw_data;
 
+	local_bh_disable();
+	padata->parallel(padata);
+	spin_lock(&padata_works_lock);
+	padata_work_free(pw);
+	spin_unlock(&padata_works_lock);
 	local_bh_enable();
 }
 
@@ -105,9 +127,9 @@ int padata_do_parallel(struct padata_shell *ps,
 		       struct padata_priv *padata, int *cb_cpu)
 {
 	struct padata_instance *pinst = ps->pinst;
-	int i, cpu, cpu_index, target_cpu, err;
-	struct padata_parallel_queue *queue;
+	int i, cpu, cpu_index, err;
 	struct parallel_data *pd;
+	struct padata_work *pw;
 
 	rcu_read_lock_bh();
 
@@ -135,25 +157,25 @@ int padata_do_parallel(struct padata_shell *ps,
 	if ((pinst->flags & PADATA_RESET))
 		goto out;
 
-	if (atomic_read(&pd->refcnt) >= MAX_OBJ_NUM)
-		goto out;
-
-	err = 0;
 	atomic_inc(&pd->refcnt);
 	padata->pd = pd;
 	padata->cb_cpu = *cb_cpu;
 
-	padata->seq_nr = atomic_inc_return(&pd->seq_nr);
-	target_cpu = padata_cpu_hash(pd, padata->seq_nr);
-	padata->cpu = target_cpu;
-	queue = per_cpu_ptr(pd->pqueue, target_cpu);
-
-	spin_lock(&queue->parallel.lock);
-	list_add_tail(&padata->list, &queue->parallel.list);
-	spin_unlock(&queue->parallel.lock);
+	rcu_read_unlock_bh();
 
-	queue_work(pinst->parallel_wq, &queue->work);
+	spin_lock(&padata_works_lock);
+	padata->seq_nr = ++pd->seq_nr;
+	pw = padata_work_alloc();
+	spin_unlock(&padata_works_lock);
+	if (pw) {
+		padata_work_init(pw, padata_parallel_worker, padata);
+		queue_work(pinst->parallel_wq, &pw->pw_work);
+	} else {
+		/* Maximum works limit exceeded, run in the current task. */
+		padata->parallel(padata);
+	}
 
+	return 0;
 out:
 	rcu_read_unlock_bh();
 
@@ -324,8 +346,9 @@ static void padata_serial_worker(struct work_struct *serial_work)
 void padata_do_serial(struct padata_priv *padata)
 {
 	struct parallel_data *pd = padata->pd;
+	int hashed_cpu = padata_cpu_hash(pd, padata->seq_nr);
 	struct padata_parallel_queue *pqueue = per_cpu_ptr(pd->pqueue,
-							   padata->cpu);
+							   hashed_cpu);
 	struct padata_priv *cur;
 
 	spin_lock(&pqueue->reorder.lock);
@@ -416,8 +439,6 @@ static void padata_init_pqueues(struct parallel_data *pd)
 		pqueue = per_cpu_ptr(pd->pqueue, cpu);
 
 		__padata_list_init(&pqueue->reorder);
-		__padata_list_init(&pqueue->parallel);
-		INIT_WORK(&pqueue->work, padata_parallel_worker);
 		atomic_set(&pqueue->num_obj, 0);
 	}
 }
@@ -451,7 +472,7 @@ static struct parallel_data *padata_alloc_pd(struct padata_shell *ps)
 
 	padata_init_pqueues(pd);
 	padata_init_squeues(pd);
-	atomic_set(&pd->seq_nr, -1);
+	pd->seq_nr = -1;
 	atomic_set(&pd->refcnt, 1);
 	spin_lock_init(&pd->lock);
 	pd->cpu = cpumask_first(pd->cpumask.pcpu);
@@ -1051,6 +1072,7 @@ EXPORT_SYMBOL(padata_free_shell);
 
 void __init padata_init(void)
 {
+	unsigned int i, possible_cpus;
 #ifdef CONFIG_HOTPLUG_CPU
 	int ret;
 
@@ -1062,13 +1084,27 @@ void __init padata_init(void)
 
 	ret = cpuhp_setup_state_multi(CPUHP_PADATA_DEAD, "padata:dead",
 				      NULL, padata_cpu_dead);
-	if (ret < 0) {
-		cpuhp_remove_multi_state(hp_online);
-		goto err;
-	}
+	if (ret < 0)
+		goto remove_online_state;
+#endif
+
+	possible_cpus = num_possible_cpus();
+	padata_works = kmalloc_array(possible_cpus, sizeof(struct padata_work),
+				     GFP_KERNEL);
+	if (!padata_works)
+		goto remove_dead_state;
+
+	for (i = 0; i < possible_cpus; ++i)
+		list_add(&padata_works[i].pw_list, &padata_free_works);
 
 	return;
+
+remove_dead_state:
+#ifdef CONFIG_HOTPLUG_CPU
+	cpuhp_remove_multi_state(CPUHP_PADATA_DEAD);
+remove_online_state:
+	cpuhp_remove_multi_state(hp_online);
 err:
-	pr_warn("padata: initialization failed\n");
 #endif
+	pr_warn("padata: initialization failed\n");
 }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 4/7] padata: add basic support for multithreaded jobs
  2020-05-20 18:26 [PATCH v2 0/7] padata: parallelize deferred page init Daniel Jordan
                   ` (2 preceding siblings ...)
  2020-05-20 18:26 ` [PATCH v2 3/7] padata: allocate work structures for parallel jobs from a pool Daniel Jordan
@ 2020-05-20 18:26 ` Daniel Jordan
  2020-05-20 18:26 ` [PATCH v2 5/7] mm: parallelize deferred_init_memmap() Daniel Jordan
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: Alex Williamson, Alexander Duyck, Dan Williams, Dave Hansen,
	David Hildenbrand, Jason Gunthorpe, Jonathan Corbet,
	Josh Triplett, Kirill Tkhai, Michal Hocko, Pavel Machek,
	Pavel Tatashin, Peter Zijlstra, Randy Dunlap, Robert Elliott,
	Shile Zhang, Steven Sistare, Tejun Heo, Zi Yan, linux-crypto,
	linux-mm, linux-kernel, linux-s390, linuxppc-dev, Daniel Jordan

Sometimes the kernel doesn't take full advantage of system memory
bandwidth, leading to a single CPU spending excessive time in
initialization paths where the data scales with memory size.

Multithreading naturally addresses this problem.

Extend padata, a framework that handles many parallel yet singlethreaded
jobs, to also handle multithreaded jobs by adding support for splitting
up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.

This is inspired by work from Pavel Tatashin and Steve Sistare.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/padata.h |  29 ++++++++
 kernel/padata.c        | 152 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 178 insertions(+), 3 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 3bfa503503ac5..b0affa466a841 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -4,6 +4,9 @@
  *
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert <steffen.klassert@secunet.com>
+ *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan <daniel.m.jordan@oracle.com>
  */
 
 #ifndef PADATA_H
@@ -130,6 +133,31 @@ struct padata_shell {
 	struct list_head		list;
 };
 
+/**
+ * struct padata_mt_job - represents one multithreaded job
+ *
+ * @thread_fn: Called for each chunk of work that a padata thread does.
+ * @fn_arg: The thread function argument.
+ * @start: The start of the job (units are job-specific).
+ * @size: size of this node's work (units are job-specific).
+ * @align: Ranges passed to the thread function fall on this boundary, with the
+ *         possible exceptions of the beginning and end of the job.
+ * @min_chunk: The minimum chunk size in job-specific units.  This allows
+ *             the client to communicate the minimum amount of work that's
+ *             appropriate for one worker thread to do at once.
+ * @max_threads: Max threads to use for the job, actual number may be less
+ *               depending on task size and minimum chunk size.
+ */
+struct padata_mt_job {
+	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
+	void			*fn_arg;
+	unsigned long		start;
+	unsigned long		size;
+	unsigned long		align;
+	unsigned long		min_chunk;
+	int			max_threads;
+};
+
 /**
  * struct padata_instance - The overall control structure.
  *
@@ -171,6 +199,7 @@ extern void padata_free_shell(struct padata_shell *ps);
 extern int padata_do_parallel(struct padata_shell *ps,
 			      struct padata_priv *padata, int *cb_cpu);
 extern void padata_do_serial(struct padata_priv *padata);
+extern void __init padata_do_multithreaded(struct padata_mt_job *job);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
 			      cpumask_var_t cpumask);
 extern int padata_start(struct padata_instance *pinst);
diff --git a/kernel/padata.c b/kernel/padata.c
index 78ff9aa529204..e78f57d9aef90 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -7,6 +7,9 @@
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert <steffen.klassert@secunet.com>
  *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan <daniel.m.jordan@oracle.com>
+ *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
  * version 2, as published by the Free Software Foundation.
@@ -21,6 +24,7 @@
  * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
  */
 
+#include <linux/completion.h>
 #include <linux/export.h>
 #include <linux/cpumask.h>
 #include <linux/err.h>
@@ -32,6 +36,8 @@
 #include <linux/sysfs.h>
 #include <linux/rcupdate.h>
 
+#define	PADATA_WORK_ONSTACK	1	/* Work's memory is on stack */
+
 struct padata_work {
 	struct work_struct	pw_work;
 	struct list_head	pw_list;  /* padata_free_works linkage */
@@ -42,7 +48,17 @@ static DEFINE_SPINLOCK(padata_works_lock);
 static struct padata_work *padata_works;
 static LIST_HEAD(padata_free_works);
 
+struct padata_mt_job_state {
+	spinlock_t		lock;
+	struct completion	completion;
+	struct padata_mt_job	*job;
+	int			nworks;
+	int			nworks_fini;
+	unsigned long		chunk_size;
+};
+
 static void padata_free_pd(struct parallel_data *pd);
+static void __init padata_mt_helper(struct work_struct *work);
 
 static int padata_index_to_cpu(struct parallel_data *pd, int cpu_index)
 {
@@ -81,18 +97,56 @@ static struct padata_work *padata_work_alloc(void)
 }
 
 static void padata_work_init(struct padata_work *pw, work_func_t work_fn,
-			     void *data)
+			     void *data, int flags)
 {
-	INIT_WORK(&pw->pw_work, work_fn);
+	if (flags & PADATA_WORK_ONSTACK)
+		INIT_WORK_ONSTACK(&pw->pw_work, work_fn);
+	else
+		INIT_WORK(&pw->pw_work, work_fn);
 	pw->pw_data = data;
 }
 
+static int __init padata_work_alloc_mt(int nworks, void *data,
+				       struct list_head *head)
+{
+	int i;
+
+	spin_lock(&padata_works_lock);
+	/* Start at 1 because the current task participates in the job. */
+	for (i = 1; i < nworks; ++i) {
+		struct padata_work *pw = padata_work_alloc();
+
+		if (!pw)
+			break;
+		padata_work_init(pw, padata_mt_helper, data, 0);
+		list_add(&pw->pw_list, head);
+	}
+	spin_unlock(&padata_works_lock);
+
+	return i;
+}
+
 static void padata_work_free(struct padata_work *pw)
 {
 	lockdep_assert_held(&padata_works_lock);
 	list_add(&pw->pw_list, &padata_free_works);
 }
 
+static void __init padata_works_free(struct list_head *works)
+{
+	struct padata_work *cur, *next;
+
+	if (list_empty(works))
+		return;
+
+	spin_lock(&padata_works_lock);
+	list_for_each_entry_safe(cur, next, works, pw_list) {
+		list_del(&cur->pw_list);
+		padata_work_free(cur);
+	}
+	spin_unlock(&padata_works_lock);
+}
+
 static void padata_parallel_worker(struct work_struct *parallel_work)
 {
 	struct padata_work *pw = container_of(parallel_work, struct padata_work,
@@ -168,7 +222,7 @@ int padata_do_parallel(struct padata_shell *ps,
 	pw = padata_work_alloc();
 	spin_unlock(&padata_works_lock);
 	if (pw) {
-		padata_work_init(pw, padata_parallel_worker, padata);
+		padata_work_init(pw, padata_parallel_worker, padata, 0);
 		queue_work(pinst->parallel_wq, &pw->pw_work);
 	} else {
 		/* Maximum works limit exceeded, run in the current task. */
@@ -409,6 +463,98 @@ static int pd_setup_cpumasks(struct parallel_data *pd,
 	return err;
 }
 
+static void __init padata_mt_helper(struct work_struct *w)
+{
+	struct padata_work *pw = container_of(w, struct padata_work, pw_work);
+	struct padata_mt_job_state *ps = pw->pw_data;
+	struct padata_mt_job *job = ps->job;
+	bool done;
+
+	spin_lock(&ps->lock);
+
+	while (job->size > 0) {
+		unsigned long start, size, end;
+
+		start = job->start;
+		/* So end is chunk size aligned if enough work remains. */
+		size = roundup(start + 1, ps->chunk_size) - start;
+		size = min(size, job->size);
+		end = start + size;
+
+		job->start = end;
+		job->size -= size;
+
+		spin_unlock(&ps->lock);
+		job->thread_fn(start, end, job->fn_arg);
+		spin_lock(&ps->lock);
+	}
+
+	++ps->nworks_fini;
+	done = (ps->nworks_fini == ps->nworks);
+	spin_unlock(&ps->lock);
+
+	if (done)
+		complete(&ps->completion);
+}
+
+/**
+ * padata_do_multithreaded - run a multithreaded job
+ * @job: Description of the job.
+ *
+ * See the definition of struct padata_mt_job for more details.
+ */
+void __init padata_do_multithreaded(struct padata_mt_job *job)
+{
+	/* In case threads finish at different times. */
+	static const unsigned long load_balance_factor = 4;
+	struct padata_work my_work, *pw;
+	struct padata_mt_job_state ps;
+	LIST_HEAD(works);
+	int nworks;
+
+	if (job->size == 0)
+		return;
+
+	/* Ensure at least one thread when size < min_chunk. */
+	nworks = max(job->size / job->min_chunk, 1ul);
+	nworks = min(nworks, job->max_threads);
+
+	if (nworks == 1) {
+		/* Single thread, no coordination needed, cut to the chase. */
+		job->thread_fn(job->start, job->start + job->size, job->fn_arg);
+		return;
+	}
+
+	spin_lock_init(&ps.lock);
+	init_completion(&ps.completion);
+	ps.job	       = job;
+	ps.nworks      = padata_work_alloc_mt(nworks, &ps, &works);
+	ps.nworks_fini = 0;
+
+	/*
+	 * Chunk size is the amount of work a helper does per call to the
+	 * thread function.  Load balance large jobs between threads by
+	 * increasing the number of chunks, guarantee at least the minimum
+	 * chunk size from the caller, and honor the caller's alignment.
+	 */
+	ps.chunk_size = job->size / (ps.nworks * load_balance_factor);
+	ps.chunk_size = max(ps.chunk_size, job->min_chunk);
+	ps.chunk_size = roundup(ps.chunk_size, job->align);
+
+	list_for_each_entry(pw, &works, pw_list)
+		queue_work(system_unbound_wq, &pw->pw_work);
+
+	/* Use the current thread, which saves starting a workqueue worker. */
+	padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
+	padata_mt_helper(&my_work.pw_work);
+
+	/* Wait for all the helpers to finish. */
+	wait_for_completion(&ps.completion);
+
+	destroy_work_on_stack(&my_work.pw_work);
+	padata_works_free(&works);
+}
+
 static void __padata_list_init(struct padata_list *pd_list)
 {
 	INIT_LIST_HEAD(&pd_list->list);
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 5/7] mm: parallelize deferred_init_memmap()
  2020-05-20 18:26 [PATCH v2 0/7] padata: parallelize deferred page init Daniel Jordan
                   ` (3 preceding siblings ...)
  2020-05-20 18:26 ` [PATCH v2 4/7] padata: add basic support for multithreaded jobs Daniel Jordan
@ 2020-05-20 18:26 ` Daniel Jordan
  2020-05-21  1:29   ` Alexander Duyck
  2020-05-20 18:26 ` [PATCH v2 6/7] mm: make deferred init's max threads arch-specific Daniel Jordan
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: Alex Williamson, Alexander Duyck, Dan Williams, Dave Hansen,
	David Hildenbrand, Jason Gunthorpe, Jonathan Corbet,
	Josh Triplett, Kirill Tkhai, Michal Hocko, Pavel Machek,
	Pavel Tatashin, Peter Zijlstra, Randy Dunlap, Robert Elliott,
	Shile Zhang, Steven Sistare, Tejun Heo, Zi Yan, linux-crypto,
	linux-mm, linux-kernel, linux-s390, linuxppc-dev, Daniel Jordan

Deferred struct page init is a significant bottleneck in kernel boot.
Optimizing it maximizes availability for large-memory systems and allows
spinning up short-lived VMs as needed without having to leave them
running.  It also benefits bare metal machines hosting VMs that are
sensitive to downtime.  In projects such as VMM Fast Restart[1], where
guest state is preserved across kexec reboot, it helps prevent
application and network timeouts in the guests.

Multithread to take full advantage of system memory bandwidth.

The maximum number of threads is capped at the number of CPUs on the
node because speedups always improve with additional threads on every
system tested, and at this phase of boot, the system is otherwise idle
and waiting on page init to finish.

Helper threads operate on section-aligned ranges to both avoid false
sharing when setting the pageblock's migrate type and to avoid accessing
uninitialized buddy pages, though max order alignment is enough for the
latter.

The minimum chunk size is also a section.  There was benefit to using
multiple threads even on relatively small memory (1G) systems, and this
is the smallest size that the alignment allows.

The time (milliseconds) is the slowest node to initialize since boot
blocks until all nodes finish.  intel_pstate is loaded in active mode
without hwp and with turbo enabled, and intel_idle is active as well.

    Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
      2 nodes * 26 cores * 2 threads = 104 CPUs
      384G/node = 768G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   4078.0 (  9.0)         --   1779.0 (  8.7)
       2% (  1)       1.4%   4021.3 (  2.9)       3.4%   1717.7 (  7.8)
      12% (  6)      35.1%   2644.7 ( 35.3)      80.8%    341.0 ( 35.5)
      25% ( 13)      38.7%   2498.0 ( 34.2)      89.1%    193.3 ( 32.3)
      37% ( 19)      39.1%   2482.0 ( 25.2)      90.1%    175.3 ( 31.7)
      50% ( 26)      38.8%   2495.0 (  8.7)      89.1%    193.7 (  3.5)
      75% ( 39)      39.2%   2478.0 ( 21.0)      90.3%    172.7 ( 26.7)
     100% ( 52)      40.0%   2448.0 (  2.0)      91.9%    143.3 (  1.5)

    Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
      1 node * 16 cores * 2 threads = 32 CPUs
      192G/node = 192G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1996.0 ( 18.0)         --   1104.3 (  6.7)
       3% (  1)       1.4%   1968.0 (  3.0)       2.7%   1074.7 (  9.0)
      12% (  4)      40.1%   1196.0 ( 22.7)      72.4%    305.3 ( 16.8)
      25% (  8)      47.4%   1049.3 ( 17.2)      84.2%    174.0 ( 10.6)
      37% ( 12)      48.3%   1032.0 ( 14.9)      86.8%    145.3 (  2.5)
      50% ( 16)      48.9%   1020.3 (  2.5)      88.0%    133.0 (  1.7)
      75% ( 24)      49.1%   1016.3 (  8.1)      88.4%    128.0 (  1.7)
     100% ( 32)      49.4%   1009.0 (  8.5)      88.6%    126.3 (  0.6)

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
      2 nodes * 18 cores * 2 threads = 72 CPUs
      128G/node = 256G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1682.7 (  6.7)         --    630.0 (  4.6)
       3% (  1)       0.4%   1676.0 (  2.0)       0.7%    625.3 (  3.2)
      12% (  4)      25.8%   1249.0 (  1.0)      68.2%    200.3 (  1.2)
      25% (  9)      30.0%   1178.0 (  5.2)      79.7%    128.0 (  3.5)
      37% ( 13)      30.6%   1167.7 (  3.1)      81.3%    117.7 (  1.2)
      50% ( 18)      30.6%   1167.3 (  2.3)      81.4%    117.0 (  1.0)
      75% ( 27)      31.0%   1161.3 (  4.6)      82.5%    110.0 (  6.9)
     100% ( 36)      32.1%   1142.0 (  3.6)      85.7%     90.0 (  1.0)

    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
      1 node * 8 cores * 2 threads = 16 CPUs
      64G/node = 64G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1003.7 ( 16.6)         --    243.3 (  8.1)
       6% (  1)       1.4%    990.0 (  4.6)       1.2%    240.3 (  1.5)
      12% (  2)      11.4%    889.3 ( 16.7)      44.5%    135.0 (  3.0)
      25% (  4)      16.8%    835.3 (  9.0)      65.8%     83.3 (  2.5)
      37% (  6)      18.6%    816.7 ( 17.6)      70.4%     72.0 (  1.0)
      50% (  8)      18.2%    821.0 (  5.0)      70.7%     71.3 (  1.2)
      75% ( 12)      19.0%    813.3 (  5.0)      71.8%     68.7 (  2.1)
     100% ( 16)      19.8%    805.3 ( 10.8)      76.4%     57.3 ( 15.9)

Server-oriented distros that enable deferred page init sometimes run in
small VMs, and they still benefit even though the fraction of boot time
saved is smaller:

    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
      1 node * 2 cores * 2 threads = 4 CPUs
      16G/node = 16G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --    722.3 (  9.5)         --     50.7 (  0.6)
      25% (  1)      -3.3%    746.3 (  4.7)      -2.0%     51.7 (  1.2)
      50% (  2)       0.2%    721.0 ( 11.3)      29.6%     35.7 (  4.9)
      75% (  3)      -0.3%    724.3 ( 11.2)      48.7%     26.0 (  0.0)
     100% (  4)       3.0%    700.3 ( 13.6)      55.9%     22.3 (  0.6)

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
      1 node * 2 cores * 2 threads = 4 CPUs
      14G/node = 14G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --    673.0 (  6.9)         --     57.0 (  1.0)
      25% (  1)      -0.6%    677.3 ( 19.8)       1.8%     56.0 (  1.0)
      50% (  2)       3.4%    650.0 (  3.6)      36.8%     36.0 (  5.2)
      75% (  3)       4.2%    644.7 (  7.6)      56.1%     25.0 (  1.0)
     100% (  4)       5.3%    637.0 (  5.6)      63.2%     21.0 (  0.0)

On Josh's 96-CPU and 192G memory system:

    Without this patch series:
    [    0.487132] node 0 initialised, 23398907 pages in 292ms
    [    0.499132] node 1 initialised, 24189223 pages in 304ms
    ...
    [    0.629376] Run /sbin/init as init process

    With this patch series:
    [    0.227868] node 0 initialised, 23398907 pages in 28ms
    [    0.230019] node 1 initialised, 24189223 pages in 28ms
    ...
    [    0.361069] Run /sbin/init as init process

[1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/Kconfig      |  6 ++---
 mm/page_alloc.c | 60 ++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index c1acc34c1c358..04c1da3f9f44c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -750,13 +750,13 @@ config DEFERRED_STRUCT_PAGE_INIT
 	depends on SPARSEMEM
 	depends on !NEED_PER_CPU_KM
 	depends on 64BIT
+	select PADATA
 	help
 	  Ordinarily all struct pages are initialised during early boot in a
 	  single thread. On very large machines this can take a considerable
 	  amount of time. If this option is set, large machines will bring up
-	  a subset of memmap at boot and then initialise the rest in parallel
-	  by starting one-off "pgdatinitX" kernel thread for each node X. This
-	  has a potential performance impact on processes running early in the
+	  a subset of memmap at boot and then initialise the rest in parallel.
+	  This has a potential performance impact on tasks running early in the
 	  lifetime of the system until these kthreads finish the
 	  initialisation.
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0c0d9364aa6d..9cb780e8dec78 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@
 #include <linux/lockdep.h>
 #include <linux/nmi.h>
 #include <linux/psi.h>
+#include <linux/padata.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
 	return nr_pages;
 }
 
+struct definit_args {
+	struct zone *zone;
+	atomic_long_t nr_pages;
+};
+
+static void __init
+deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
+			   void *arg)
+{
+	unsigned long spfn, epfn, nr_pages = 0;
+	struct definit_args *args = arg;
+	struct zone *zone = args->zone;
+	u64 i;
+
+	deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
+
+	/*
+	 * Initialize and free pages in MAX_ORDER sized increments so that we
+	 * can avoid introducing any issues with the buddy allocator.
+	 */
+	while (spfn < end_pfn) {
+		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		cond_resched();
+	}
+
+	atomic_long_add(nr_pages, &args->nr_pages);
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
 	pg_data_t *pgdat = data;
 	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
 	unsigned long spfn = 0, epfn = 0, nr_pages = 0;
-	unsigned long first_init_pfn, flags;
+	unsigned long first_init_pfn, flags, epfn_align;
 	unsigned long start = jiffies;
 	struct zone *zone;
-	int zid;
+	int zid, max_threads;
 	u64 i;
 
 	/* Bind memory initialisation thread to a local node if possible */
@@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
 		goto zone_empty;
 
 	/*
-	 * Initialize and free pages in MAX_ORDER sized increments so
-	 * that we can avoid introducing any issues with the buddy
-	 * allocator.
+	 * More CPUs always led to greater speedups on tested systems, up to
+	 * all the nodes' CPUs.  Use all since the system is otherwise idle now.
 	 */
+	max_threads = max(cpumask_weight(cpumask), 1u);
+
 	while (spfn < epfn) {
+		epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
+
+		if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
+		    epfn_align - spfn >= PAGES_PER_SECTION) {
+			struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
+			struct padata_mt_job job = {
+				.thread_fn   = deferred_init_memmap_chunk,
+				.fn_arg      = &arg,
+				.start       = spfn,
+				.size        = epfn_align - spfn,
+				.align       = PAGES_PER_SECTION,
+				.min_chunk   = PAGES_PER_SECTION,
+				.max_threads = max_threads,
+			};
+
+			padata_do_multithreaded(&job);
+			nr_pages += atomic_long_read(&arg.nr_pages);
+			spfn = epfn_align;
+		}
+
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
 		cond_resched();
 	}
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 6/7] mm: make deferred init's max threads arch-specific
  2020-05-20 18:26 [PATCH v2 0/7] padata: parallelize deferred page init Daniel Jordan
                   ` (4 preceding siblings ...)
  2020-05-20 18:26 ` [PATCH v2 5/7] mm: parallelize deferred_init_memmap() Daniel Jordan
@ 2020-05-20 18:26 ` Daniel Jordan
  2020-05-20 18:26 ` [PATCH v2 7/7] padata: document multithreaded jobs Daniel Jordan
  2020-05-21 23:43 ` [PATCH v2 0/7] padata: parallelize deferred page init Josh Triplett
  7 siblings, 0 replies; 15+ messages in thread
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: Alex Williamson, Alexander Duyck, Dan Williams, Dave Hansen,
	David Hildenbrand, Jason Gunthorpe, Jonathan Corbet,
	Josh Triplett, Kirill Tkhai, Michal Hocko, Pavel Machek,
	Pavel Tatashin, Peter Zijlstra, Randy Dunlap, Robert Elliott,
	Shile Zhang, Steven Sistare, Tejun Heo, Zi Yan, linux-crypto,
	linux-mm, linux-kernel, linux-s390, linuxppc-dev, Daniel Jordan

Using padata during deferred init has only been tested on x86, so for
now limit it to this architecture.

If another arch wants this, it can find the max thread limit that's best
for it and override deferred_page_init_max_threads().

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 arch/x86/mm/init_64.c    | 12 ++++++++++++
 include/linux/memblock.h |  3 +++
 mm/page_alloc.c          | 13 ++++++++-----
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b5f73f5e207c..2d749ec12ea8a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1260,6 +1260,18 @@ void __init mem_init(void)
 	mem_init_print_info(NULL);
 }
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+	/*
+	 * More CPUs always led to greater speedups on tested systems, up to
+	 * all the nodes' CPUs.  Use all since the system is otherwise idle
+	 * now.
+	 */
+	return max_t(int, cpumask_weight(node_cpumask), 1);
+}
+#endif
+
 int kernel_set_to_readonly;
 
 void mark_rodata_ro(void)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 6bc37a731d27b..2b289df44194f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -275,6 +275,9 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
 #define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
 	for (; i != U64_MAX;					  \
 	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask);
+
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9cb780e8dec78..0d7d805f98b2d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1843,6 +1843,13 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
 	atomic_long_add(nr_pages, &args->nr_pages);
 }
 
+/* An arch may override for more concurrency. */
+__weak int __init
+deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+	return 1;
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
@@ -1891,11 +1898,7 @@ static int __init deferred_init_memmap(void *data)
 						 first_init_pfn))
 		goto zone_empty;
 
-	/*
-	 * More CPUs always led to greater speedups on tested systems, up to
-	 * all the nodes' CPUs.  Use all since the system is otherwise idle now.
-	 */
-	max_threads = max(cpumask_weight(cpumask), 1u);
+	max_threads = deferred_page_init_max_threads(cpumask);
 
 	while (spfn < epfn) {
 		epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 7/7] padata: document multithreaded jobs
  2020-05-20 18:26 [PATCH v2 0/7] padata: parallelize deferred page init Daniel Jordan
                   ` (5 preceding siblings ...)
  2020-05-20 18:26 ` [PATCH v2 6/7] mm: make deferred init's max threads arch-specific Daniel Jordan
@ 2020-05-20 18:26 ` Daniel Jordan
  2020-05-21 23:43 ` [PATCH v2 0/7] padata: parallelize deferred page init Josh Triplett
  7 siblings, 0 replies; 15+ messages in thread
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: Alex Williamson, Alexander Duyck, Dan Williams, Dave Hansen,
	David Hildenbrand, Jason Gunthorpe, Jonathan Corbet,
	Josh Triplett, Kirill Tkhai, Michal Hocko, Pavel Machek,
	Pavel Tatashin, Peter Zijlstra, Randy Dunlap, Robert Elliott,
	Shile Zhang, Steven Sistare, Tejun Heo, Zi Yan, linux-crypto,
	linux-mm, linux-kernel, linux-s390, linuxppc-dev, Daniel Jordan

Add Documentation for multithreaded jobs.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 Documentation/core-api/padata.rst | 41 +++++++++++++++++++++++--------
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/Documentation/core-api/padata.rst b/Documentation/core-api/padata.rst
index 9a24c111781d9..b7e047af993e8 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -4,23 +4,26 @@
 The padata parallel execution mechanism
 =======================================
 
-:Date: December 2019
+:Date: April 2020
 
 Padata is a mechanism by which the kernel can farm jobs out to be done in
-parallel on multiple CPUs while retaining their ordering.  It was developed for
-use with the IPsec code, which needs to be able to perform encryption and
-decryption on large numbers of packets without reordering those packets.  The
-crypto developers made a point of writing padata in a sufficiently general
-fashion that it could be put to other uses as well.
+parallel on multiple CPUs while optionally retaining their ordering.
 
-Usage
-=====
+It was originally developed for IPsec, which needs to perform encryption and
+decryption on large numbers of packets without reordering those packets.  This
+is currently the sole consumer of padata's serialized job support.
+
+Padata also supports multithreaded jobs, splitting up the job evenly while load
+balancing and coordinating between threads.
+
+Running Serialized Jobs
+=======================
 
 Initializing
 ------------
 
-The first step in using padata is to set up a padata_instance structure for
-overall control of how jobs are to be run::
+The first step in using padata to run parallel jobs is to set up a
+padata_instance structure for overall control of how jobs are to be run::
 
     #include <linux/padata.h>
 
@@ -162,6 +165,24 @@ functions that correspond to the allocation in reverse::
 It is the user's responsibility to ensure all outstanding jobs are complete
 before any of the above are called.
 
+Running Multithreaded Jobs
+==========================
+
+A multithreaded job has a main thread and zero or more helper threads, with the
+main thread participating in the job and then waiting until all helpers have
+finished.  padata splits the job into units called chunks, where a chunk is a
+piece of the job that one thread completes in one call to the thread function.
+
+A user has to do three things to run a multithreaded job.  First, describe the
+job by defining a padata_mt_job structure, which is explained in the Interface
+section.  This includes a pointer to the thread function, which padata will
+call each time it assigns a job chunk to a thread.  Then, define the thread
+function, which accepts three arguments, ``start``, ``end``, and ``arg``, where
+the first two delimit the range that the thread operates on and the last is a
+pointer to the job's shared state, if any.  Prepare the shared state, which is
+typically a stack-allocated structure that wraps the required data.  Last, call
+padata_do_multithreaded(), which will return once the job is finished.
+
 Interface
 =========
 
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()
  2020-05-20 18:26 ` [PATCH v2 5/7] mm: parallelize deferred_init_memmap() Daniel Jordan
@ 2020-05-21  1:29   ` Alexander Duyck
  2020-05-21 15:00     ` Alexander Duyck
  2020-05-21 15:37     ` Daniel Jordan
  0 siblings, 2 replies; 15+ messages in thread
From: Alexander Duyck @ 2020-05-21  1:29 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Andrew Morton, Herbert Xu, Steffen Klassert, Alex Williamson,
	Alexander Duyck, Dan Williams, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Jonathan Corbet, Josh Triplett, Kirill Tkhai,
	Michal Hocko, Pavel Machek, Pavel Tatashin, Peter Zijlstra,
	Randy Dunlap, Robert Elliott, Shile Zhang, Steven Sistare,
	Tejun Heo, Zi Yan, linux-crypto, linux-mm, LKML, linux-s390,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Wed, May 20, 2020 at 11:27 AM Daniel Jordan
<daniel.m.jordan@oracle.com> wrote:
>
> Deferred struct page init is a significant bottleneck in kernel boot.
> Optimizing it maximizes availability for large-memory systems and allows
> spinning up short-lived VMs as needed without having to leave them
> running.  It also benefits bare metal machines hosting VMs that are
> sensitive to downtime.  In projects such as VMM Fast Restart[1], where
> guest state is preserved across kexec reboot, it helps prevent
> application and network timeouts in the guests.
>
> Multithread to take full advantage of system memory bandwidth.
>
> The maximum number of threads is capped at the number of CPUs on the
> node because speedups always improve with additional threads on every
> system tested, and at this phase of boot, the system is otherwise idle
> and waiting on page init to finish.
>
> Helper threads operate on section-aligned ranges to both avoid false
> sharing when setting the pageblock's migrate type and to avoid accessing
> uninitialized buddy pages, though max order alignment is enough for the
> latter.
>
> The minimum chunk size is also a section.  There was benefit to using
> multiple threads even on relatively small memory (1G) systems, and this
> is the smallest size that the alignment allows.
>
> The time (milliseconds) is the slowest node to initialize since boot
> blocks until all nodes finish.  intel_pstate is loaded in active mode
> without hwp and with turbo enabled, and intel_idle is active as well.
>
>     Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
>       2 nodes * 26 cores * 2 threads = 104 CPUs
>       384G/node = 768G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --   4078.0 (  9.0)         --   1779.0 (  8.7)
>        2% (  1)       1.4%   4021.3 (  2.9)       3.4%   1717.7 (  7.8)
>       12% (  6)      35.1%   2644.7 ( 35.3)      80.8%    341.0 ( 35.5)
>       25% ( 13)      38.7%   2498.0 ( 34.2)      89.1%    193.3 ( 32.3)
>       37% ( 19)      39.1%   2482.0 ( 25.2)      90.1%    175.3 ( 31.7)
>       50% ( 26)      38.8%   2495.0 (  8.7)      89.1%    193.7 (  3.5)
>       75% ( 39)      39.2%   2478.0 ( 21.0)      90.3%    172.7 ( 26.7)
>      100% ( 52)      40.0%   2448.0 (  2.0)      91.9%    143.3 (  1.5)
>
>     Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
>       1 node * 16 cores * 2 threads = 32 CPUs
>       192G/node = 192G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --   1996.0 ( 18.0)         --   1104.3 (  6.7)
>        3% (  1)       1.4%   1968.0 (  3.0)       2.7%   1074.7 (  9.0)
>       12% (  4)      40.1%   1196.0 ( 22.7)      72.4%    305.3 ( 16.8)
>       25% (  8)      47.4%   1049.3 ( 17.2)      84.2%    174.0 ( 10.6)
>       37% ( 12)      48.3%   1032.0 ( 14.9)      86.8%    145.3 (  2.5)
>       50% ( 16)      48.9%   1020.3 (  2.5)      88.0%    133.0 (  1.7)
>       75% ( 24)      49.1%   1016.3 (  8.1)      88.4%    128.0 (  1.7)
>      100% ( 32)      49.4%   1009.0 (  8.5)      88.6%    126.3 (  0.6)
>
>     Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
>       2 nodes * 18 cores * 2 threads = 72 CPUs
>       128G/node = 256G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --   1682.7 (  6.7)         --    630.0 (  4.6)
>        3% (  1)       0.4%   1676.0 (  2.0)       0.7%    625.3 (  3.2)
>       12% (  4)      25.8%   1249.0 (  1.0)      68.2%    200.3 (  1.2)
>       25% (  9)      30.0%   1178.0 (  5.2)      79.7%    128.0 (  3.5)
>       37% ( 13)      30.6%   1167.7 (  3.1)      81.3%    117.7 (  1.2)
>       50% ( 18)      30.6%   1167.3 (  2.3)      81.4%    117.0 (  1.0)
>       75% ( 27)      31.0%   1161.3 (  4.6)      82.5%    110.0 (  6.9)
>      100% ( 36)      32.1%   1142.0 (  3.6)      85.7%     90.0 (  1.0)
>
>     AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
>       1 node * 8 cores * 2 threads = 16 CPUs
>       64G/node = 64G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --   1003.7 ( 16.6)         --    243.3 (  8.1)
>        6% (  1)       1.4%    990.0 (  4.6)       1.2%    240.3 (  1.5)
>       12% (  2)      11.4%    889.3 ( 16.7)      44.5%    135.0 (  3.0)
>       25% (  4)      16.8%    835.3 (  9.0)      65.8%     83.3 (  2.5)
>       37% (  6)      18.6%    816.7 ( 17.6)      70.4%     72.0 (  1.0)
>       50% (  8)      18.2%    821.0 (  5.0)      70.7%     71.3 (  1.2)
>       75% ( 12)      19.0%    813.3 (  5.0)      71.8%     68.7 (  2.1)
>      100% ( 16)      19.8%    805.3 ( 10.8)      76.4%     57.3 ( 15.9)
>
> Server-oriented distros that enable deferred page init sometimes run in
> small VMs, and they still benefit even though the fraction of boot time
> saved is smaller:
>
>     AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
>       1 node * 2 cores * 2 threads = 4 CPUs
>       16G/node = 16G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --    722.3 (  9.5)         --     50.7 (  0.6)
>       25% (  1)      -3.3%    746.3 (  4.7)      -2.0%     51.7 (  1.2)
>       50% (  2)       0.2%    721.0 ( 11.3)      29.6%     35.7 (  4.9)
>       75% (  3)      -0.3%    724.3 ( 11.2)      48.7%     26.0 (  0.0)
>      100% (  4)       3.0%    700.3 ( 13.6)      55.9%     22.3 (  0.6)
>
>     Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
>       1 node * 2 cores * 2 threads = 4 CPUs
>       14G/node = 14G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --    673.0 (  6.9)         --     57.0 (  1.0)
>       25% (  1)      -0.6%    677.3 ( 19.8)       1.8%     56.0 (  1.0)
>       50% (  2)       3.4%    650.0 (  3.6)      36.8%     36.0 (  5.2)
>       75% (  3)       4.2%    644.7 (  7.6)      56.1%     25.0 (  1.0)
>      100% (  4)       5.3%    637.0 (  5.6)      63.2%     21.0 (  0.0)
>
> On Josh's 96-CPU and 192G memory system:
>
>     Without this patch series:
>     [    0.487132] node 0 initialised, 23398907 pages in 292ms
>     [    0.499132] node 1 initialised, 24189223 pages in 304ms
>     ...
>     [    0.629376] Run /sbin/init as init process
>
>     With this patch series:
>     [    0.227868] node 0 initialised, 23398907 pages in 28ms
>     [    0.230019] node 1 initialised, 24189223 pages in 28ms
>     ...
>     [    0.361069] Run /sbin/init as init process
>
> [1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> ---
>  mm/Kconfig      |  6 ++---
>  mm/page_alloc.c | 60 ++++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 58 insertions(+), 8 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c1acc34c1c358..04c1da3f9f44c 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -750,13 +750,13 @@ config DEFERRED_STRUCT_PAGE_INIT
>         depends on SPARSEMEM
>         depends on !NEED_PER_CPU_KM
>         depends on 64BIT
> +       select PADATA
>         help
>           Ordinarily all struct pages are initialised during early boot in a
>           single thread. On very large machines this can take a considerable
>           amount of time. If this option is set, large machines will bring up
> -         a subset of memmap at boot and then initialise the rest in parallel
> -         by starting one-off "pgdatinitX" kernel thread for each node X. This
> -         has a potential performance impact on processes running early in the
> +         a subset of memmap at boot and then initialise the rest in parallel.
> +         This has a potential performance impact on tasks running early in the
>           lifetime of the system until these kthreads finish the
>           initialisation.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d0c0d9364aa6d..9cb780e8dec78 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -68,6 +68,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/nmi.h>
>  #include <linux/psi.h>
> +#include <linux/padata.h>
>
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
> @@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
>         return nr_pages;
>  }
>
> +struct definit_args {
> +       struct zone *zone;
> +       atomic_long_t nr_pages;
> +};
> +
> +static void __init
> +deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> +                          void *arg)
> +{
> +       unsigned long spfn, epfn, nr_pages = 0;
> +       struct definit_args *args = arg;
> +       struct zone *zone = args->zone;
> +       u64 i;
> +
> +       deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
> +
> +       /*
> +        * Initialize and free pages in MAX_ORDER sized increments so that we
> +        * can avoid introducing any issues with the buddy allocator.
> +        */
> +       while (spfn < end_pfn) {
> +               nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +               cond_resched();
> +       }
> +
> +       atomic_long_add(nr_pages, &args->nr_pages);
> +}
> +

Personally I would get rid of nr_pages entirely. It isn't worth the
cache thrash to have this atomic variable bouncing around. You could
probably just have this function return void since all nr_pages is
used for is a pr_info  statement at the end of the initialization
which will be completely useless now anyway since we really have the
threads running in parallel anyway.

We only really need the nr_pages logic in deferred_grow_zone in order
to track if we have freed enough pages to allow us to go back to what
we were doing.

>  /* Initialise remaining memory on a node */
>  static int __init deferred_init_memmap(void *data)
>  {
>         pg_data_t *pgdat = data;
>         const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>         unsigned long spfn = 0, epfn = 0, nr_pages = 0;
> -       unsigned long first_init_pfn, flags;
> +       unsigned long first_init_pfn, flags, epfn_align;
>         unsigned long start = jiffies;
>         struct zone *zone;
> -       int zid;
> +       int zid, max_threads;
>         u64 i;
>
>         /* Bind memory initialisation thread to a local node if possible */
> @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
>                 goto zone_empty;
>
>         /*
> -        * Initialize and free pages in MAX_ORDER sized increments so
> -        * that we can avoid introducing any issues with the buddy
> -        * allocator.
> +        * More CPUs always led to greater speedups on tested systems, up to
> +        * all the nodes' CPUs.  Use all since the system is otherwise idle now.
>          */
> +       max_threads = max(cpumask_weight(cpumask), 1u);
> +
>         while (spfn < epfn) {
> +               epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> +
> +               if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> +                   epfn_align - spfn >= PAGES_PER_SECTION) {
> +                       struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> +                       struct padata_mt_job job = {
> +                               .thread_fn   = deferred_init_memmap_chunk,
> +                               .fn_arg      = &arg,
> +                               .start       = spfn,
> +                               .size        = epfn_align - spfn,
> +                               .align       = PAGES_PER_SECTION,
> +                               .min_chunk   = PAGES_PER_SECTION,
> +                               .max_threads = max_threads,
> +                       };
> +
> +                       padata_do_multithreaded(&job);
> +                       nr_pages += atomic_long_read(&arg.nr_pages);
> +                       spfn = epfn_align;
> +               }
> +
>                 nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>                 cond_resched();
>         }

This doesn't look right. You are basically adding threads in addition
to calls to deferred_init_maxorder. In addition you are spawning one
job per section instead of per range. Really you should be going for
something more along the lines of:

        while (spfn < epfn) {
                unsigned long epfn_align = ALIGN(epfn,
PAGE_PER_SECTION);
                struct definit_args arg = { zone, ATOMIC_LONG_INIT(0)
};
                struct padata_mt_job job = {
                        .thread_fn   = deferred_init_memmap_chunk,
                        .fn_arg      = &arg,
                        .start       = spfn,
                        .size        = epfn_align - spfn,
                        .align       = PAGES_PER_SECTION,
                        .min_chunk   = PAGES_PER_SECTION,
                        .max_threads = max_threads,
                };

                padata_do_multithreaded(&job);

                for_each_free_mem_pfn_range_in_zone_from(i, zone,
spfn, epfn) {
                        if (epfn_align <= spfn)
                                break;
                }
        }

This should accomplish the same thing, but much more efficiently. The
only thing you really lose is the tracking of nr_pages which really
doesn't add anything anyway since the value could shift around
depending on how many times deferred_grow_zone got called anyway.

Also the spfn should already be sectioned aligned, or at least be in a
new section unrelated to the one we just scheduled, so there is no
need for the extra checks you had.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()
  2020-05-21  1:29   ` Alexander Duyck
@ 2020-05-21 15:00     ` Alexander Duyck
  2020-05-21 15:39       ` Daniel Jordan
  2020-05-21 15:37     ` Daniel Jordan
  1 sibling, 1 reply; 15+ messages in thread
From: Alexander Duyck @ 2020-05-21 15:00 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Andrew Morton, Herbert Xu, Steffen Klassert, Alex Williamson,
	Alexander Duyck, Dan Williams, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Jonathan Corbet, Josh Triplett, Kirill Tkhai,
	Michal Hocko, Pavel Machek, Pavel Tatashin, Peter Zijlstra,
	Randy Dunlap, Robert Elliott, Shile Zhang, Steven Sistare,
	Tejun Heo, Zi Yan, linux-crypto, linux-mm, LKML, linux-s390,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Wed, May 20, 2020 at 6:29 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Wed, May 20, 2020 at 11:27 AM Daniel Jordan
> <daniel.m.jordan@oracle.com> wrote:
> >
> > Deferred struct page init is a significant bottleneck in kernel boot.
> > Optimizing it maximizes availability for large-memory systems and allows
> > spinning up short-lived VMs as needed without having to leave them
> > running.  It also benefits bare metal machines hosting VMs that are
> > sensitive to downtime.  In projects such as VMM Fast Restart[1], where
> > guest state is preserved across kexec reboot, it helps prevent
> > application and network timeouts in the guests.
> >
> > Multithread to take full advantage of system memory bandwidth.
> >
> > The maximum number of threads is capped at the number of CPUs on the
> > node because speedups always improve with additional threads on every
> > system tested, and at this phase of boot, the system is otherwise idle
> > and waiting on page init to finish.
> >
> > Helper threads operate on section-aligned ranges to both avoid false
> > sharing when setting the pageblock's migrate type and to avoid accessing
> > uninitialized buddy pages, though max order alignment is enough for the
> > latter.
> >
> > The minimum chunk size is also a section.  There was benefit to using
> > multiple threads even on relatively small memory (1G) systems, and this
> > is the smallest size that the alignment allows.
> >
> > The time (milliseconds) is the slowest node to initialize since boot
> > blocks until all nodes finish.  intel_pstate is loaded in active mode
> > without hwp and with turbo enabled, and intel_idle is active as well.
> >
> >     Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
> >       2 nodes * 26 cores * 2 threads = 104 CPUs
> >       384G/node = 768G memory
> >
> >                    kernel boot                 deferred init
> >                    ------------------------    ------------------------
> >     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
> >           (  0)         --   4078.0 (  9.0)         --   1779.0 (  8.7)
> >        2% (  1)       1.4%   4021.3 (  2.9)       3.4%   1717.7 (  7.8)
> >       12% (  6)      35.1%   2644.7 ( 35.3)      80.8%    341.0 ( 35.5)
> >       25% ( 13)      38.7%   2498.0 ( 34.2)      89.1%    193.3 ( 32.3)
> >       37% ( 19)      39.1%   2482.0 ( 25.2)      90.1%    175.3 ( 31.7)
> >       50% ( 26)      38.8%   2495.0 (  8.7)      89.1%    193.7 (  3.5)
> >       75% ( 39)      39.2%   2478.0 ( 21.0)      90.3%    172.7 ( 26.7)
> >      100% ( 52)      40.0%   2448.0 (  2.0)      91.9%    143.3 (  1.5)
> >
> >     Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
> >       1 node * 16 cores * 2 threads = 32 CPUs
> >       192G/node = 192G memory
> >
> >                    kernel boot                 deferred init
> >                    ------------------------    ------------------------
> >     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
> >           (  0)         --   1996.0 ( 18.0)         --   1104.3 (  6.7)
> >        3% (  1)       1.4%   1968.0 (  3.0)       2.7%   1074.7 (  9.0)
> >       12% (  4)      40.1%   1196.0 ( 22.7)      72.4%    305.3 ( 16.8)
> >       25% (  8)      47.4%   1049.3 ( 17.2)      84.2%    174.0 ( 10.6)
> >       37% ( 12)      48.3%   1032.0 ( 14.9)      86.8%    145.3 (  2.5)
> >       50% ( 16)      48.9%   1020.3 (  2.5)      88.0%    133.0 (  1.7)
> >       75% ( 24)      49.1%   1016.3 (  8.1)      88.4%    128.0 (  1.7)
> >      100% ( 32)      49.4%   1009.0 (  8.5)      88.6%    126.3 (  0.6)
> >
> >     Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
> >       2 nodes * 18 cores * 2 threads = 72 CPUs
> >       128G/node = 256G memory
> >
> >                    kernel boot                 deferred init
> >                    ------------------------    ------------------------
> >     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
> >           (  0)         --   1682.7 (  6.7)         --    630.0 (  4.6)
> >        3% (  1)       0.4%   1676.0 (  2.0)       0.7%    625.3 (  3.2)
> >       12% (  4)      25.8%   1249.0 (  1.0)      68.2%    200.3 (  1.2)
> >       25% (  9)      30.0%   1178.0 (  5.2)      79.7%    128.0 (  3.5)
> >       37% ( 13)      30.6%   1167.7 (  3.1)      81.3%    117.7 (  1.2)
> >       50% ( 18)      30.6%   1167.3 (  2.3)      81.4%    117.0 (  1.0)
> >       75% ( 27)      31.0%   1161.3 (  4.6)      82.5%    110.0 (  6.9)
> >      100% ( 36)      32.1%   1142.0 (  3.6)      85.7%     90.0 (  1.0)
> >
> >     AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
> >       1 node * 8 cores * 2 threads = 16 CPUs
> >       64G/node = 64G memory
> >
> >                    kernel boot                 deferred init
> >                    ------------------------    ------------------------
> >     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
> >           (  0)         --   1003.7 ( 16.6)         --    243.3 (  8.1)
> >        6% (  1)       1.4%    990.0 (  4.6)       1.2%    240.3 (  1.5)
> >       12% (  2)      11.4%    889.3 ( 16.7)      44.5%    135.0 (  3.0)
> >       25% (  4)      16.8%    835.3 (  9.0)      65.8%     83.3 (  2.5)
> >       37% (  6)      18.6%    816.7 ( 17.6)      70.4%     72.0 (  1.0)
> >       50% (  8)      18.2%    821.0 (  5.0)      70.7%     71.3 (  1.2)
> >       75% ( 12)      19.0%    813.3 (  5.0)      71.8%     68.7 (  2.1)
> >      100% ( 16)      19.8%    805.3 ( 10.8)      76.4%     57.3 ( 15.9)
> >
> > Server-oriented distros that enable deferred page init sometimes run in
> > small VMs, and they still benefit even though the fraction of boot time
> > saved is smaller:
> >
> >     AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
> >       1 node * 2 cores * 2 threads = 4 CPUs
> >       16G/node = 16G memory
> >
> >                    kernel boot                 deferred init
> >                    ------------------------    ------------------------
> >     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
> >           (  0)         --    722.3 (  9.5)         --     50.7 (  0.6)
> >       25% (  1)      -3.3%    746.3 (  4.7)      -2.0%     51.7 (  1.2)
> >       50% (  2)       0.2%    721.0 ( 11.3)      29.6%     35.7 (  4.9)
> >       75% (  3)      -0.3%    724.3 ( 11.2)      48.7%     26.0 (  0.0)
> >      100% (  4)       3.0%    700.3 ( 13.6)      55.9%     22.3 (  0.6)
> >
> >     Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
> >       1 node * 2 cores * 2 threads = 4 CPUs
> >       14G/node = 14G memory
> >
> >                    kernel boot                 deferred init
> >                    ------------------------    ------------------------
> >     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
> >           (  0)         --    673.0 (  6.9)         --     57.0 (  1.0)
> >       25% (  1)      -0.6%    677.3 ( 19.8)       1.8%     56.0 (  1.0)
> >       50% (  2)       3.4%    650.0 (  3.6)      36.8%     36.0 (  5.2)
> >       75% (  3)       4.2%    644.7 (  7.6)      56.1%     25.0 (  1.0)
> >      100% (  4)       5.3%    637.0 (  5.6)      63.2%     21.0 (  0.0)
> >
> > On Josh's 96-CPU and 192G memory system:
> >
> >     Without this patch series:
> >     [    0.487132] node 0 initialised, 23398907 pages in 292ms
> >     [    0.499132] node 1 initialised, 24189223 pages in 304ms
> >     ...
> >     [    0.629376] Run /sbin/init as init process
> >
> >     With this patch series:
> >     [    0.227868] node 0 initialised, 23398907 pages in 28ms
> >     [    0.230019] node 1 initialised, 24189223 pages in 28ms
> >     ...
> >     [    0.361069] Run /sbin/init as init process
> >
> > [1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
> >
> > Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> > ---
> >  mm/Kconfig      |  6 ++---
> >  mm/page_alloc.c | 60 ++++++++++++++++++++++++++++++++++++++++++++-----
> >  2 files changed, 58 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index c1acc34c1c358..04c1da3f9f44c 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -750,13 +750,13 @@ config DEFERRED_STRUCT_PAGE_INIT
> >         depends on SPARSEMEM
> >         depends on !NEED_PER_CPU_KM
> >         depends on 64BIT
> > +       select PADATA
> >         help
> >           Ordinarily all struct pages are initialised during early boot in a
> >           single thread. On very large machines this can take a considerable
> >           amount of time. If this option is set, large machines will bring up
> > -         a subset of memmap at boot and then initialise the rest in parallel
> > -         by starting one-off "pgdatinitX" kernel thread for each node X. This
> > -         has a potential performance impact on processes running early in the
> > +         a subset of memmap at boot and then initialise the rest in parallel.
> > +         This has a potential performance impact on tasks running early in the
> >           lifetime of the system until these kthreads finish the
> >           initialisation.
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d0c0d9364aa6d..9cb780e8dec78 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -68,6 +68,7 @@
> >  #include <linux/lockdep.h>
> >  #include <linux/nmi.h>
> >  #include <linux/psi.h>
> > +#include <linux/padata.h>
> >
> >  #include <asm/sections.h>
> >  #include <asm/tlbflush.h>
> > @@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
> >         return nr_pages;
> >  }
> >
> > +struct definit_args {
> > +       struct zone *zone;
> > +       atomic_long_t nr_pages;
> > +};
> > +
> > +static void __init
> > +deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> > +                          void *arg)
> > +{
> > +       unsigned long spfn, epfn, nr_pages = 0;
> > +       struct definit_args *args = arg;
> > +       struct zone *zone = args->zone;
> > +       u64 i;
> > +
> > +       deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
> > +
> > +       /*
> > +        * Initialize and free pages in MAX_ORDER sized increments so that we
> > +        * can avoid introducing any issues with the buddy allocator.
> > +        */
> > +       while (spfn < end_pfn) {
> > +               nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > +               cond_resched();
> > +       }
> > +
> > +       atomic_long_add(nr_pages, &args->nr_pages);
> > +}
> > +
>
> Personally I would get rid of nr_pages entirely. It isn't worth the
> cache thrash to have this atomic variable bouncing around. You could
> probably just have this function return void since all nr_pages is
> used for is a pr_info  statement at the end of the initialization
> which will be completely useless now anyway since we really have the
> threads running in parallel anyway.
>
> We only really need the nr_pages logic in deferred_grow_zone in order
> to track if we have freed enough pages to allow us to go back to what
> we were doing.
>
> >  /* Initialise remaining memory on a node */
> >  static int __init deferred_init_memmap(void *data)
> >  {
> >         pg_data_t *pgdat = data;
> >         const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> >         unsigned long spfn = 0, epfn = 0, nr_pages = 0;
> > -       unsigned long first_init_pfn, flags;
> > +       unsigned long first_init_pfn, flags, epfn_align;
> >         unsigned long start = jiffies;
> >         struct zone *zone;
> > -       int zid;
> > +       int zid, max_threads;
> >         u64 i;
> >
> >         /* Bind memory initialisation thread to a local node if possible */
> > @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
> >                 goto zone_empty;
> >
> >         /*
> > -        * Initialize and free pages in MAX_ORDER sized increments so
> > -        * that we can avoid introducing any issues with the buddy
> > -        * allocator.
> > +        * More CPUs always led to greater speedups on tested systems, up to
> > +        * all the nodes' CPUs.  Use all since the system is otherwise idle now.
> >          */
> > +       max_threads = max(cpumask_weight(cpumask), 1u);
> > +
> >         while (spfn < epfn) {
> > +               epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> > +
> > +               if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> > +                   epfn_align - spfn >= PAGES_PER_SECTION) {
> > +                       struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> > +                       struct padata_mt_job job = {
> > +                               .thread_fn   = deferred_init_memmap_chunk,
> > +                               .fn_arg      = &arg,
> > +                               .start       = spfn,
> > +                               .size        = epfn_align - spfn,
> > +                               .align       = PAGES_PER_SECTION,
> > +                               .min_chunk   = PAGES_PER_SECTION,
> > +                               .max_threads = max_threads,
> > +                       };
> > +
> > +                       padata_do_multithreaded(&job);
> > +                       nr_pages += atomic_long_read(&arg.nr_pages);
> > +                       spfn = epfn_align;
> > +               }
> > +
> >                 nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> >                 cond_resched();
> >         }
>
> This doesn't look right. You are basically adding threads in addition
> to calls to deferred_init_maxorder. In addition you are spawning one
> job per section instead of per range. Really you should be going for
> something more along the lines of:
>
>         while (spfn < epfn) {
>                 unsigned long epfn_align = ALIGN(epfn,
> PAGE_PER_SECTION);
>                 struct definit_args arg = { zone, ATOMIC_LONG_INIT(0)
> };
>                 struct padata_mt_job job = {
>                         .thread_fn   = deferred_init_memmap_chunk,
>                         .fn_arg      = &arg,
>                         .start       = spfn,
>                         .size        = epfn_align - spfn,
>                         .align       = PAGES_PER_SECTION,
>                         .min_chunk   = PAGES_PER_SECTION,
>                         .max_threads = max_threads,
>                 };
>
>                 padata_do_multithreaded(&job);
>
>                 for_each_free_mem_pfn_range_in_zone_from(i, zone,
> spfn, epfn) {
>                         if (epfn_align <= spfn)
>                                 break;
>                 }
>         }
>

So I was thinking about my suggestion further and the loop at the end
isn't quite correct as I believe it could lead to gaps. The loop on
the end should probably be:
                for_each_free_mem_pfn_range_in_zone_from(i, zone, spfn, epfn) {
                        if (epfn <= epfn_align)
                                continue;
                        if (spfn < epfn_align)
                                spfn = epfn_align;
                        break;
                }

That would generate a new range where epfn_align has actually ended
and there is a range of new PFNs to process.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()
  2020-05-21  1:29   ` Alexander Duyck
  2020-05-21 15:00     ` Alexander Duyck
@ 2020-05-21 15:37     ` Daniel Jordan
  2020-05-21 16:46       ` Alexander Duyck
  1 sibling, 1 reply; 15+ messages in thread
From: Daniel Jordan @ 2020-05-21 15:37 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Daniel Jordan, Andrew Morton, Herbert Xu, Steffen Klassert,
	Alex Williamson, Alexander Duyck, Dan Williams, Dave Hansen,
	David Hildenbrand, Jason Gunthorpe, Jonathan Corbet,
	Josh Triplett, Kirill Tkhai, Michal Hocko, Pavel Machek,
	Pavel Tatashin, Peter Zijlstra, Randy Dunlap, Robert Elliott,
	Shile Zhang, Steven Sistare, Tejun Heo, Zi Yan, linux-crypto,
	linux-mm, LKML, linux-s390,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Wed, May 20, 2020 at 06:29:32PM -0700, Alexander Duyck wrote:
> On Wed, May 20, 2020 at 11:27 AM Daniel Jordan
> > @@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
> >         return nr_pages;
> >  }
> >
> > +struct definit_args {
> > +       struct zone *zone;
> > +       atomic_long_t nr_pages;
> > +};
> > +
> > +static void __init
> > +deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> > +                          void *arg)
> > +{
> > +       unsigned long spfn, epfn, nr_pages = 0;
> > +       struct definit_args *args = arg;
> > +       struct zone *zone = args->zone;
> > +       u64 i;
> > +
> > +       deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
> > +
> > +       /*
> > +        * Initialize and free pages in MAX_ORDER sized increments so that we
> > +        * can avoid introducing any issues with the buddy allocator.
> > +        */
> > +       while (spfn < end_pfn) {
> > +               nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > +               cond_resched();
> > +       }
> > +
> > +       atomic_long_add(nr_pages, &args->nr_pages);
> > +}
> > +
> 
> Personally I would get rid of nr_pages entirely. It isn't worth the
> cache thrash to have this atomic variable bouncing around.

One of the things I tried to optimize was the managed_pages atomic adds in
__free_pages_core, but performance stayed the same on the biggest machine I
tested when it was done once at the end of page init instead of in every thread
for every pageblock.

I'm not sure this atomic would matter either, given it's less frequent.

> You could
> probably just have this function return void since all nr_pages is
> used for is a pr_info  statement at the end of the initialization
> which will be completely useless now anyway since we really have the
> threads running in parallel anyway.

The timestamp is still useful for observability, page init is a significant
part of kernel boot on big machines, over 10% sometimes with these patches.

It's mostly the time that matters though, I agree the number of pages is less
important and is probably worth removing just to simplify the code.  I'll do it
if no one sees a reason to keep it.

> We only really need the nr_pages logic in deferred_grow_zone in order
> to track if we have freed enough pages to allow us to go back to what
> we were doing.
>
> > @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
> >                 goto zone_empty;
> >
> >         /*
> > -        * Initialize and free pages in MAX_ORDER sized increments so
> > -        * that we can avoid introducing any issues with the buddy
> > -        * allocator.
> > +        * More CPUs always led to greater speedups on tested systems, up to
> > +        * all the nodes' CPUs.  Use all since the system is otherwise idle now.
> >          */
> > +       max_threads = max(cpumask_weight(cpumask), 1u);
> > +
> >         while (spfn < epfn) {
> > +               epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> > +
> > +               if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> > +                   epfn_align - spfn >= PAGES_PER_SECTION) {
> > +                       struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> > +                       struct padata_mt_job job = {
> > +                               .thread_fn   = deferred_init_memmap_chunk,
> > +                               .fn_arg      = &arg,
> > +                               .start       = spfn,
> > +                               .size        = epfn_align - spfn,
> > +                               .align       = PAGES_PER_SECTION,
> > +                               .min_chunk   = PAGES_PER_SECTION,
> > +                               .max_threads = max_threads,
> > +                       };
> > +
> > +                       padata_do_multithreaded(&job);
> > +                       nr_pages += atomic_long_read(&arg.nr_pages);
> > +                       spfn = epfn_align;
> > +               }
> > +
> >                 nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> >                 cond_resched();
> >         }
> 
> This doesn't look right. You are basically adding threads in addition
> to calls to deferred_init_maxorder.

The deferred_init_maxorder call is there to do the remaining, non-section
aligned part of a range.  It doesn't have to be done this way.

> In addition you are spawning one
> job per section instead of per range.

That's not what's happening, all the above is doing is aligning the end of the
range down to a section.  Each thread is working on way more than a section at
a time.

> Really you should be going for
> something more along the lines of:
> 
>         while (spfn < epfn) {
>                 unsigned long epfn_align = ALIGN(epfn,
> PAGE_PER_SECTION);
>                 struct definit_args arg = { zone, ATOMIC_LONG_INIT(0)
> };
>                 struct padata_mt_job job = {
>                         .thread_fn   = deferred_init_memmap_chunk,
>                         .fn_arg      = &arg,
>                         .start       = spfn,
>                         .size        = epfn_align - spfn,
>                         .align       = PAGES_PER_SECTION,
>                         .min_chunk   = PAGES_PER_SECTION,
>                         .max_threads = max_threads,
>                 };
> 
>                 padata_do_multithreaded(&job);
> 
>                 for_each_free_mem_pfn_range_in_zone_from(i, zone,
> spfn, epfn) {
>                         if (epfn_align <= spfn)
>                                 break;
>                 }
>         }

I can see what you're getting at even though I think this can leave ranges
uninitialized.  Starting with range [a,b), b is aligned up to d and the inner
loop skips [c,e).

a    b  c d        e
|         |         |   section boundaries
[    )  [          )

We could use deferred_init_mem_pfn_range_in_zone() instead of the for_each
loop.

What I was trying to avoid by aligning down is creating a discontiguous pfn
range that get passed to padata.  We already discussed how those are handled
by the zone iterator in the thread function, but job->size can be exaggerated
to include parts of the range that are never touched.  Thinking more about it
though, it's a small fraction of the total work and shouldn't matter.

> This should accomplish the same thing, but much more efficiently.

Well, more cleanly.  I'll give it a try.

> The
> only thing you really lose is the tracking of nr_pages which really
> doesn't add anything anyway since the value could shift around
> depending on how many times deferred_grow_zone got called anyway.
> 
> Also the spfn should already be sectioned aligned, or at least be in a
> new section unrelated to the one we just scheduled, so there is no
> need for the extra checks you had.

I was doing it to be robust to future changes.  Otherwise epfn_align - spfn
could be huge when aligning down, but with aligning up it won't matter and can
be removed.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()
  2020-05-21 15:00     ` Alexander Duyck
@ 2020-05-21 15:39       ` Daniel Jordan
  0 siblings, 0 replies; 15+ messages in thread
From: Daniel Jordan @ 2020-05-21 15:39 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Daniel Jordan, Andrew Morton, Herbert Xu, Steffen Klassert,
	Alex Williamson, Alexander Duyck, Dan Williams, Dave Hansen,
	David Hildenbrand, Jason Gunthorpe, Jonathan Corbet,
	Josh Triplett, Kirill Tkhai, Michal Hocko, Pavel Machek,
	Pavel Tatashin, Peter Zijlstra, Randy Dunlap, Robert Elliott,
	Shile Zhang, Steven Sistare, Tejun Heo, Zi Yan, linux-crypto,
	linux-mm, LKML, linux-s390,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, May 21, 2020 at 08:00:31AM -0700, Alexander Duyck wrote:
> So I was thinking about my suggestion further and the loop at the end
> isn't quite correct as I believe it could lead to gaps. The loop on
> the end should probably be:
>                 for_each_free_mem_pfn_range_in_zone_from(i, zone, spfn, epfn) {
>                         if (epfn <= epfn_align)
>                                 continue;
>                         if (spfn < epfn_align)
>                                 spfn = epfn_align;
>                         break;
>                 }
> 
> That would generate a new range where epfn_align has actually ended
> and there is a range of new PFNs to process.

Whoops, my email crossed with yours.  Agreed, but see the other message.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()
  2020-05-21 15:37     ` Daniel Jordan
@ 2020-05-21 16:46       ` Alexander Duyck
  2020-05-21 21:15         ` Daniel Jordan
  0 siblings, 1 reply; 15+ messages in thread
From: Alexander Duyck @ 2020-05-21 16:46 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Andrew Morton, Herbert Xu, Steffen Klassert, Alex Williamson,
	Alexander Duyck, Dan Williams, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Jonathan Corbet, Josh Triplett, Kirill Tkhai,
	Michal Hocko, Pavel Machek, Pavel Tatashin, Peter Zijlstra,
	Randy Dunlap, Robert Elliott, Shile Zhang, Steven Sistare,
	Tejun Heo, Zi Yan, linux-crypto, linux-mm, LKML, linux-s390,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, May 21, 2020 at 8:37 AM Daniel Jordan
<daniel.m.jordan@oracle.com> wrote:
>
> On Wed, May 20, 2020 at 06:29:32PM -0700, Alexander Duyck wrote:
> > On Wed, May 20, 2020 at 11:27 AM Daniel Jordan
> > > @@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
> > >         return nr_pages;
> > >  }
> > >
> > > +struct definit_args {
> > > +       struct zone *zone;
> > > +       atomic_long_t nr_pages;
> > > +};
> > > +
> > > +static void __init
> > > +deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> > > +                          void *arg)
> > > +{
> > > +       unsigned long spfn, epfn, nr_pages = 0;
> > > +       struct definit_args *args = arg;
> > > +       struct zone *zone = args->zone;
> > > +       u64 i;
> > > +
> > > +       deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
> > > +
> > > +       /*
> > > +        * Initialize and free pages in MAX_ORDER sized increments so that we
> > > +        * can avoid introducing any issues with the buddy allocator.
> > > +        */
> > > +       while (spfn < end_pfn) {
> > > +               nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > > +               cond_resched();
> > > +       }
> > > +
> > > +       atomic_long_add(nr_pages, &args->nr_pages);
> > > +}
> > > +
> >
> > Personally I would get rid of nr_pages entirely. It isn't worth the
> > cache thrash to have this atomic variable bouncing around.
>
> One of the things I tried to optimize was the managed_pages atomic adds in
> __free_pages_core, but performance stayed the same on the biggest machine I
> tested when it was done once at the end of page init instead of in every thread
> for every pageblock.
>
> I'm not sure this atomic would matter either, given it's less frequent.

It is more about not bothering with the extra tracking. We don't
really need it and having it doesn't really add much in the way of
value.

> > You could
> > probably just have this function return void since all nr_pages is
> > used for is a pr_info  statement at the end of the initialization
> > which will be completely useless now anyway since we really have the
> > threads running in parallel anyway.
>
> The timestamp is still useful for observability, page init is a significant
> part of kernel boot on big machines, over 10% sometimes with these patches.

Agreed.

> It's mostly the time that matters though, I agree the number of pages is less
> important and is probably worth removing just to simplify the code.  I'll do it
> if no one sees a reason to keep it.

Sounds good.

> > We only really need the nr_pages logic in deferred_grow_zone in order
> > to track if we have freed enough pages to allow us to go back to what
> > we were doing.
> >
> > > @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
> > >                 goto zone_empty;
> > >
> > >         /*
> > > -        * Initialize and free pages in MAX_ORDER sized increments so
> > > -        * that we can avoid introducing any issues with the buddy
> > > -        * allocator.
> > > +        * More CPUs always led to greater speedups on tested systems, up to
> > > +        * all the nodes' CPUs.  Use all since the system is otherwise idle now.
> > >          */
> > > +       max_threads = max(cpumask_weight(cpumask), 1u);
> > > +
> > >         while (spfn < epfn) {
> > > +               epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> > > +
> > > +               if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> > > +                   epfn_align - spfn >= PAGES_PER_SECTION) {
> > > +                       struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> > > +                       struct padata_mt_job job = {
> > > +                               .thread_fn   = deferred_init_memmap_chunk,
> > > +                               .fn_arg      = &arg,
> > > +                               .start       = spfn,
> > > +                               .size        = epfn_align - spfn,
> > > +                               .align       = PAGES_PER_SECTION,
> > > +                               .min_chunk   = PAGES_PER_SECTION,
> > > +                               .max_threads = max_threads,
> > > +                       };
> > > +
> > > +                       padata_do_multithreaded(&job);
> > > +                       nr_pages += atomic_long_read(&arg.nr_pages);
> > > +                       spfn = epfn_align;
> > > +               }
> > > +
> > >                 nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > >                 cond_resched();
> > >         }
> >
> > This doesn't look right. You are basically adding threads in addition
> > to calls to deferred_init_maxorder.
>
> The deferred_init_maxorder call is there to do the remaining, non-section
> aligned part of a range.  It doesn't have to be done this way.

It is also doing the advancing though isn't it?

> > In addition you are spawning one
> > job per section instead of per range.
>
> That's not what's happening, all the above is doing is aligning the end of the
> range down to a section.  Each thread is working on way more than a section at
> a time.

Yeah, now that I reread it I see that. For some reason I was thinking
you were aligning spfn, not epfn.

> > Really you should be going for
> > something more along the lines of:
> >
> >         while (spfn < epfn) {
> >                 unsigned long epfn_align = ALIGN(epfn,
> > PAGE_PER_SECTION);
> >                 struct definit_args arg = { zone, ATOMIC_LONG_INIT(0)
> > };
> >                 struct padata_mt_job job = {
> >                         .thread_fn   = deferred_init_memmap_chunk,
> >                         .fn_arg      = &arg,
> >                         .start       = spfn,
> >                         .size        = epfn_align - spfn,
> >                         .align       = PAGES_PER_SECTION,
> >                         .min_chunk   = PAGES_PER_SECTION,
> >                         .max_threads = max_threads,
> >                 };
> >
> >                 padata_do_multithreaded(&job);
> >
> >                 for_each_free_mem_pfn_range_in_zone_from(i, zone,
> > spfn, epfn) {
> >                         if (epfn_align <= spfn)
> >                                 break;
> >                 }
> >         }
>
> I can see what you're getting at even though I think this can leave ranges
> uninitialized.  Starting with range [a,b), b is aligned up to d and the inner
> loop skips [c,e).
>
> a    b  c d        e
> |         |         |   section boundaries
> [    )  [          )

I think I resolved this with the fix for it I described in the other
email. We just need to swap out spfn for epfn and make sure we align
spfn with epfn_align. Then I think that takes care of possible skips.

> We could use deferred_init_mem_pfn_range_in_zone() instead of the for_each
> loop.
>
> What I was trying to avoid by aligning down is creating a discontiguous pfn
> range that get passed to padata.  We already discussed how those are handled
> by the zone iterator in the thread function, but job->size can be exaggerated
> to include parts of the range that are never touched.  Thinking more about it
> though, it's a small fraction of the total work and shouldn't matter.

So the problem with aligning down is that you are going to be slowed
up as you have to go single threaded to initialize whatever remains.
So worst case scenario is that you have a section aligned block and
you will process all but 1 section in parallel, and then have to
process the remaining section one max order block at a time.

> > This should accomplish the same thing, but much more efficiently.
>
> Well, more cleanly.  I'll give it a try.

I agree I am not sure if it will make a big difference on x86, however
the more ranges you have to process the faster this approach should be
as it stays parallel the entire time rather than having to drop out
and process the last section one max order block at a time.

> > The
> > only thing you really lose is the tracking of nr_pages which really
> > doesn't add anything anyway since the value could shift around
> > depending on how many times deferred_grow_zone got called anyway.
> >
> > Also the spfn should already be sectioned aligned, or at least be in a
> > new section unrelated to the one we just scheduled, so there is no
> > need for the extra checks you had.
>
> I was doing it to be robust to future changes.  Otherwise epfn_align - spfn
> could be huge when aligning down, but with aligning up it won't matter and can
> be removed.

Right. So that was to catch the case where you could potentially align
down below spfn.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()
  2020-05-21 16:46       ` Alexander Duyck
@ 2020-05-21 21:15         ` Daniel Jordan
  0 siblings, 0 replies; 15+ messages in thread
From: Daniel Jordan @ 2020-05-21 21:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Daniel Jordan, Andrew Morton, Herbert Xu, Steffen Klassert,
	Alex Williamson, Alexander Duyck, Dan Williams, Dave Hansen,
	David Hildenbrand, Jason Gunthorpe, Jonathan Corbet,
	Josh Triplett, Kirill Tkhai, Michal Hocko, Pavel Machek,
	Pavel Tatashin, Peter Zijlstra, Randy Dunlap, Robert Elliott,
	Shile Zhang, Steven Sistare, Tejun Heo, Zi Yan, linux-crypto,
	linux-mm, LKML, linux-s390,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, May 21, 2020 at 09:46:35AM -0700, Alexander Duyck wrote:
> It is more about not bothering with the extra tracking. We don't
> really need it and having it doesn't really add much in the way of
> value.

Yeah, it can probably go.

> > > > @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
> > > >                 goto zone_empty;
> > > >
> > > >         /*
> > > > -        * Initialize and free pages in MAX_ORDER sized increments so
> > > > -        * that we can avoid introducing any issues with the buddy
> > > > -        * allocator.
> > > > +        * More CPUs always led to greater speedups on tested systems, up to
> > > > +        * all the nodes' CPUs.  Use all since the system is otherwise idle now.
> > > >          */
> > > > +       max_threads = max(cpumask_weight(cpumask), 1u);
> > > > +
> > > >         while (spfn < epfn) {
> > > > +               epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> > > > +
> > > > +               if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> > > > +                   epfn_align - spfn >= PAGES_PER_SECTION) {
> > > > +                       struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> > > > +                       struct padata_mt_job job = {
> > > > +                               .thread_fn   = deferred_init_memmap_chunk,
> > > > +                               .fn_arg      = &arg,
> > > > +                               .start       = spfn,
> > > > +                               .size        = epfn_align - spfn,
> > > > +                               .align       = PAGES_PER_SECTION,
> > > > +                               .min_chunk   = PAGES_PER_SECTION,
> > > > +                               .max_threads = max_threads,
> > > > +                       };
> > > > +
> > > > +                       padata_do_multithreaded(&job);
> > > > +                       nr_pages += atomic_long_read(&arg.nr_pages);
> > > > +                       spfn = epfn_align;
> > > > +               }
> > > > +
> > > >                 nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > > >                 cond_resched();
> > > >         }
> > >
> > > This doesn't look right. You are basically adding threads in addition
> > > to calls to deferred_init_maxorder.
> >
> > The deferred_init_maxorder call is there to do the remaining, non-section
> > aligned part of a range.  It doesn't have to be done this way.
> 
> It is also doing the advancing though isn't it?

Yes.  Not sure what you're getting at.  There's the 'spfn = epfn_align' before
so nothing is skipped.  It's true that the nonaligned part is done outside of
padata when it could be done by a thread that'd otherwise be waiting or idle,
which should be addressed in the next version.

> I think I resolved this with the fix for it I described in the other
> email. We just need to swap out spfn for epfn and make sure we align
> spfn with epfn_align. Then I think that takes care of possible skips.

Right, though your fix looks a lot like deferred_init_mem_pfn_range_in_zone().
Seems better to just use that and not repeat ourselves.  Lame that it's
starting at the beginning of the ranges every time, maybe it could be
generalized somehow, but I think it should be fast enough.

> > We could use deferred_init_mem_pfn_range_in_zone() instead of the for_each
> > loop.
> >
> > What I was trying to avoid by aligning down is creating a discontiguous pfn
> > range that get passed to padata.  We already discussed how those are handled
> > by the zone iterator in the thread function, but job->size can be exaggerated
> > to include parts of the range that are never touched.  Thinking more about it
> > though, it's a small fraction of the total work and shouldn't matter.
> 
> So the problem with aligning down is that you are going to be slowed
> up as you have to go single threaded to initialize whatever remains.
> So worst case scenario is that you have a section aligned block and
> you will process all but 1 section in parallel, and then have to
> process the remaining section one max order block at a time.

Yes, aligning up is better.

> > > This should accomplish the same thing, but much more efficiently.
> >
> > Well, more cleanly.  I'll give it a try.
> 
> I agree I am not sure if it will make a big difference on x86, however
> the more ranges you have to process the faster this approach should be
> as it stays parallel the entire time rather than having to drop out
> and process the last section one max order block at a time.

Right.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/7] padata: parallelize deferred page init
  2020-05-20 18:26 [PATCH v2 0/7] padata: parallelize deferred page init Daniel Jordan
                   ` (6 preceding siblings ...)
  2020-05-20 18:26 ` [PATCH v2 7/7] padata: document multithreaded jobs Daniel Jordan
@ 2020-05-21 23:43 ` Josh Triplett
  7 siblings, 0 replies; 15+ messages in thread
From: Josh Triplett @ 2020-05-21 23:43 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Andrew Morton, Herbert Xu, Steffen Klassert, Alex Williamson,
	Alexander Duyck, Dan Williams, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Jonathan Corbet, Kirill Tkhai, Michal Hocko,
	Pavel Machek, Pavel Tatashin, Peter Zijlstra, Randy Dunlap,
	Robert Elliott, Shile Zhang, Steven Sistare, Tejun Heo, Zi Yan,
	linux-crypto, linux-mm, linux-kernel, linux-s390, linuxppc-dev

On Wed, May 20, 2020 at 02:26:38PM -0400, Daniel Jordan wrote:
> Please review and test, and thanks to Alex, Andrew, Josh, and Pavel for
> their feedback in the last version.

I re-tested v2:

Tested-by: Josh Triplett <josh@joshtriplett.org>

[    0.231435] node 1 initialised, 24189223 pages in 32ms
[    0.236718] node 0 initialised, 23398907 pages in 36ms


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-05-21 23:43 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-20 18:26 [PATCH v2 0/7] padata: parallelize deferred page init Daniel Jordan
2020-05-20 18:26 ` [PATCH v2 1/7] padata: remove exit routine Daniel Jordan
2020-05-20 18:26 ` [PATCH v2 2/7] padata: initialize earlier Daniel Jordan
2020-05-20 18:26 ` [PATCH v2 3/7] padata: allocate work structures for parallel jobs from a pool Daniel Jordan
2020-05-20 18:26 ` [PATCH v2 4/7] padata: add basic support for multithreaded jobs Daniel Jordan
2020-05-20 18:26 ` [PATCH v2 5/7] mm: parallelize deferred_init_memmap() Daniel Jordan
2020-05-21  1:29   ` Alexander Duyck
2020-05-21 15:00     ` Alexander Duyck
2020-05-21 15:39       ` Daniel Jordan
2020-05-21 15:37     ` Daniel Jordan
2020-05-21 16:46       ` Alexander Duyck
2020-05-21 21:15         ` Daniel Jordan
2020-05-20 18:26 ` [PATCH v2 6/7] mm: make deferred init's max threads arch-specific Daniel Jordan
2020-05-20 18:26 ` [PATCH v2 7/7] padata: document multithreaded jobs Daniel Jordan
2020-05-21 23:43 ` [PATCH v2 0/7] padata: parallelize deferred page init Josh Triplett

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).