* [RFC PATCH v2 0/7] ktask: multithread cpu-intensive kernel work
@ 2017-08-24 20:49 ` Daniel Jordan
  0 siblings, 0 replies; 20+ messages in thread
From: Daniel Jordan @ 2017-08-24 20:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

ktask is a generic framework for parallelizing cpu-intensive work in the
kernel.  The intended use is for big machines that can use their cpu power
to speed up large tasks that can't otherwise be multithreaded in userland.
The API is generic enough to add concurrency to many different kinds of
tasks--for example, zeroing a range of pages or evicting a list of
inodes--and aims to save its clients the trouble of splitting up the work,
choosing the number of threads to use, starting these threads, and load
balancing the work between them.

Why do we need ktask when the kernel has other APIs for managing
concurrency?  After all, kthread_workers and workqueues already provide ways
to start threads, and the kernel can handle large tasks with a single thread
by periodically yielding the cpu with cond_resched or doing the work in
fixed-size batches.

Of the existing concurrency facilities, kthread_worker isn't suited for
providing parallelism because each comes with only a single thread.
Workqueues are a better fit for this, and in fact ktask is built on an
unbound workqueue, but workqueues aren't designed for splitting up a large
task.  ktask instead uses unbound workqueue threads to run "chunks" of a
task.

More background is available in the documentation commit (first commit of the
series).

This patchset is based on 4.13-rc6 and contains three ktask users so far, with
more to come:
 - clearing gigantic pages
 - fallocate for HugeTLB pages
 - deferred struct page initialization at boot time

The core ktask code is based on work by Pavel Tatashin, Steve Sistare, and
Jonathan Adams.

v1 -> v2:
 - Added deferred struct page initialization use case.
 - Explained the source of the performance improvement from parallelizing
   clear_gigantic_page (comment from Dave Hansen).
 - Fixed Documentation and build warnings from CONFIG_KTASK=n kernels.

link to v1: https://lkml.org/lkml/2017/7/14/666

Daniel Jordan (7):
  ktask: add documentation
  ktask: multithread cpu-intensive kernel work
  ktask: add /proc/sys/debug/ktask_max_threads
  mm: enlarge type of offset argument in mem_map_offset and mem_map_next
  mm: parallelize clear_gigantic_page
  hugetlbfs: parallelize hugetlbfs_fallocate with ktask
  mm: parallelize deferred struct page initialization within each node

 Documentation/core-api/index.rst |   1 +
 Documentation/core-api/ktask.rst | 104 ++++++++++
 fs/hugetlbfs/inode.c             | 117 +++++++++---
 include/linux/ktask.h            | 235 +++++++++++++++++++++++
 include/linux/ktask_internal.h   |  19 ++
 include/linux/mm.h               |   6 +
 init/Kconfig                     |   7 +
 init/main.c                      |   2 +
 kernel/Makefile                  |   2 +-
 kernel/ktask.c                   | 396 +++++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c                  |  10 +
 mm/internal.h                    |   7 +-
 mm/memory.c                      |  35 +++-
 mm/page_alloc.c                  | 174 ++++++++++-------
 14 files changed, 1014 insertions(+), 101 deletions(-)
 create mode 100644 Documentation/core-api/ktask.rst
 create mode 100644 include/linux/ktask.h
 create mode 100644 include/linux/ktask_internal.h
 create mode 100644 kernel/ktask.c

-- 
2.12.2

* [RFC PATCH v2 1/7] ktask: add documentation
  2017-08-24 20:49 ` Daniel Jordan
@ 2017-08-24 20:49   ` Daniel Jordan
  0 siblings, 0 replies; 20+ messages in thread
From: Daniel Jordan @ 2017-08-24 20:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

Motivates and explains the ktask API for kernel clients.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 Documentation/core-api/index.rst |   1 +
 Documentation/core-api/ktask.rst | 104 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 105 insertions(+)
 create mode 100644 Documentation/core-api/ktask.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 0606be3a3111..9096f4b77b42 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -15,6 +15,7 @@ Core utilities
    assoc_array
    atomic_ops
    cpu_hotplug
+   ktask
    local_ops
    workqueue
    genericirq
diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst
new file mode 100644
index 000000000000..cb4b0d87c8c6
--- /dev/null
+++ b/Documentation/core-api/ktask.rst
@@ -0,0 +1,104 @@
+============================================
+ktask: parallelize cpu-intensive kernel work
+============================================
+
+:Date: July, 2017
+:Author: Daniel Jordan <daniel.m.jordan@oracle.com>
+
+
+Introduction
+============
+
+ktask is a generic framework for parallelizing cpu-intensive work in the
+kernel.  The intended use is for big machines that can use their cpu power to
+speed up large tasks that can't otherwise be multithreaded in userland.  The
+API is generic enough to add concurrency to many different kinds of tasks--for
+example, zeroing a range of pages or evicting a list of inodes--and aims to
+save its clients the trouble of splitting up the work, choosing the number of
+threads to use, starting these threads, and load balancing the work between
+them.
+
+
+Motivation
+==========
+
+Why do we need ktask when the kernel has other APIs for managing concurrency?
+After all, kthread_workers and workqueues already provide ways to start
+threads, and the kernel can handle large tasks with a single thread by
+periodically yielding the cpu with cond_resched (e.g. hugetlbfs_fallocate,
+clear_gigantic_page) or performing the work in fixed-size batches (e.g. struct
+pagevec, struct mmu_gather).
+
+Of the existing concurrency facilities, kthread_worker isn't suited for
+providing parallelism because each comes with only a single thread.  Workqueues
+are a better fit for this, and in fact ktask is built on an unbound workqueue,
+but workqueues aren't designed for splitting up a large task.  ktask instead
+uses unbound workqueue threads to run "chunks" of a task.
+
+On top of workqueues, ktask takes care of dividing up the task into chunks,
+determining how many threads to use to complete those chunks, starting the
+threads, and load balancing across them.  This makes use of otherwise idle
+cpus, but if the system is under load, the scheduler still decides when the
+ktask threads run: existing cond_resched calls are retained in big loops that
+have been parallelized.
+
+This added concurrency boosts the performance of the system in a number of
+ways: system startup and shutdown are faster, page fault latency of a gigantic
+page goes down (zero the page in parallel), initializing many pages goes
+quicker (e.g. populating a range of pages via prefaulting, mlocking, or
+fallocating), and pages are freed back to the system in less time (e.g. on a
+large munmap(2) or on exit(2) of a large process).
+
+
+Configuration
+=============
+
+To use ktask, configure the kernel with CONFIG_KTASK=y.
+
+If CONFIG_KTASK=n, calls to the ktask API are simply #define'd to run the
+thread function that the client provides so that the task is completed without
+concurrency in the current thread.
+
+
+Concept
+=======
+
+A little terminology up front:  A 'task' is the total work there is to do and a
+'chunk' is a unit of work given to a thread.
+
+To complete a task using the ktask framework, a client provides a thread
+function that is responsible for completing one chunk.  The thread function is
+defined in a standard way, with start and end arguments that delimit the chunk
+as well as an argument that the client uses to pass data specific to the task.
+
+In addition, the client supplies an object representing the start of the task
+and an iterator function that knows how to advance some number of units in the
+task to yield another object representing the new task position.  The framework
+uses the start object and iterator internally to divide the task into chunks.
+
+Finally, the client passes the total task size and a minimum chunk size to
+indicate the minimum amount of work that's appropriate to do in one chunk.  The
+sizes are given in task-specific units (e.g. pages, inodes, bytes).  The
+framework uses these sizes, along with the number of online cpus and an
+internal maximum number of threads, to decide how many threads to start and how
+many chunks to divide the task into.
+
+For example, consider the task of clearing a gigantic page.  This used to be
+done in a single thread with a for loop that calls a page clearing function for
+each constituent base page.  To parallelize with ktask, the client first moves
+the for loop to the thread function, adapting it to operate on the range passed
+to the function.  In this simple case, the thread function's start and end
+arguments are just addresses delimiting the portion of the gigantic page to
+clear.  Then, where the for loop used to be, the client calls into ktask with
+the start address of the gigantic page, the total size of the gigantic page,
+and the thread function.  Internally, ktask will divide the address range into
+an appropriate number of chunks and start an appropriate number of threads to
+complete these chunks.
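+
+As a concrete sketch (illustrative only: the function names below are made
+up, and the real conversion of clear_gigantic_page appears later in this
+series), a client built on the interfaces from this series might look like
+this::
+
+    /* Clears the chunk of the gigantic page delimited by [start, end). */
+    static int clear_page_chunk(void *start, void *end, void *arg)
+    {
+            unsigned long addr;
+
+            for (addr = (unsigned long)start; addr < (unsigned long)end;
+                 addr += PAGE_SIZE)
+                    clear_page((void *)addr);       /* simplified */
+
+            return KTASK_RETURN_SUCCESS;
+    }
+
+    static void clear_gigantic_page_example(void *page_addr, size_t bytes)
+    {
+            /*
+             * Sizes are given in bytes to match ktask_iter_range(), which
+             * advances the position pointer by a byte count; SZ_128M is an
+             * arbitrary minimum chunk size chosen for illustration.
+             */
+            DEFINE_KTASK_CTL_RANGE(ctl, clear_page_chunk, NULL,
+                                   SZ_128M, 0, GFP_KERNEL);
+
+            ktask_run(page_addr, bytes, &ctl);
+    }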
+
+
+Interface
+=========
+
+.. Include ktask.h inline here.  This file is heavily commented and documents
+.. the ktask interface.
+.. kernel-doc:: include/linux/ktask.h
-- 
2.12.2

* [RFC PATCH v2 2/7] ktask: multithread cpu-intensive kernel work
  2017-08-24 20:49 ` Daniel Jordan
@ 2017-08-24 20:49   ` Daniel Jordan
  0 siblings, 0 replies; 20+ messages in thread
From: Daniel Jordan @ 2017-08-24 20:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

ktask is a generic framework for parallelizing cpu-intensive work in the
kernel.  The intended use is for big machines that can use their cpu power to
speed up large tasks that can't otherwise be multithreaded in userland.  The
API is generic enough to add concurrency to many different kinds of tasks--for
example, zeroing a range of pages or evicting a list of inodes--and aims to
save its clients the trouble of splitting up the work, choosing the number of
threads to use, starting these threads, and load balancing the work between
them.

The Documentation patch earlier in this series has more background.

Introduces the ktask API; consumers appear in subsequent patches.

Based on work by Pavel Tatashin, Steve Sistare, and Jonathan Adams.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Suggested-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Suggested-by: Steve Sistare <steven.sistare@oracle.com>
Suggested-by: Jonathan Adams <jonathan.adams@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 include/linux/ktask.h          | 235 ++++++++++++++++++++++++
 include/linux/ktask_internal.h |  19 ++
 include/linux/mm.h             |   6 +
 init/Kconfig                   |   7 +
 init/main.c                    |   2 +
 kernel/Makefile                |   2 +-
 kernel/ktask.c                 | 396 +++++++++++++++++++++++++++++++++++++++++
 7 files changed, 666 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/ktask.h
 create mode 100644 include/linux/ktask_internal.h
 create mode 100644 kernel/ktask.c

diff --git a/include/linux/ktask.h b/include/linux/ktask.h
new file mode 100644
index 000000000000..269c795039e7
--- /dev/null
+++ b/include/linux/ktask.h
@@ -0,0 +1,235 @@
+/*
+ * ktask.h
+ *
+ * Framework to parallelize cpu-intensive kernel work such as zeroing
+ * huge pages or freeing many pages at once.  For more information, see
+ * Documentation/core-api/ktask.rst.
+ *
+ * This is the interface to ktask; everything in this file is
+ * accessible to ktask clients.
+ *
+ * If CONFIG_KTASK=n, calls to the ktask API are simply #define'd to run the
+ * thread function that the client provides so that the task is completed
+ * without concurrency in the current thread.
+ */
+
+#ifndef _LINUX_KTASK_H
+#define _LINUX_KTASK_H
+
+#include <linux/types.h>
+
+struct ktask_ctl;
+struct ktask_node;
+
+#define	KTASK_RETURN_SUCCESS	0
+#define	KTASK_RETURN_ERROR	(-1)
+
+#ifdef CONFIG_KTASK
+
+/**
+ * ktask_run - Runs one task.  It doesn't account for NUMA locality.
+ *
+ * @start: An object that describes the start of the task.  The client thread
+ *         function interprets the object however it sees fit (e.g. an array
+ *         index, a simple pointer, or a pointer to a more complicated
+ *         representation of job position).
+ * @task_size:  The size of the task (units are task-specific).
+ * @ctl:  A control structure containing information about the task, including
+ *        the client thread function (see the definition of struct ktask_ctl).
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.
+ */
+int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl);
+
+/**
+ * ktask_run_numa - Runs one task while accounting for NUMA locality.
+ *
+ * The ktask framework ensures worker threads are scheduled on a CPU local to
+ * each chunk of a task.  The client is responsible for organizing the work
+ * along NUMA boundaries in the 'nodes' array.
+ *
+ * @nodes: An array of struct ktask_node's, each of which describes the task on
+ *         a NUMA node (see struct ktask_node).
+ * @nr_nodes:  The length of the 'nodes' array.
+ * @ctl:  A control structure containing information about the task (see
+ *        the definition of struct ktask_ctl).
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.
+ */
+int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
+		   struct ktask_ctl *ctl);
+
+/*
+ * Two possible future enhancements related to error handling, should the
+ * need arise, are:
+ *
+ * - Add client specific error reporting.  It's possible for tasks to fail for
+ *   different reasons, so let the client pass a pointer for its own error
+ *   information.
+ *
+ * - Allow clients to pass an "undo" callback to ktask that is responsible for
+ *   undoing those parts of the task that fail if an error occurs.
+ */
+
+void ktask_init(void);
+
+#else  /* CONFIG_KTASK */
+
+#define ktask_run(start, task_size, ctl)				      \
+	((ctl)->kc_thread_func((start),				              \
+			       (ctl)->kc_iter_advance((start), (task_size)),  \
+			       (ctl)->kc_thread_func_arg))
+
+#define ktask_run_numa(nodes, nr_nodes, ctl)				      \
+({									      \
+	size_t __i;							      \
+	int __ret = KTASK_RETURN_SUCCESS;				      \
+									      \
+	for (__i = 0; __i < (nr_nodes); ++__i) {			      \
+		__ret = (ctl)->kc_thread_func(				      \
+			    (nodes)[__i].kn_start,			      \
+			    (ctl)->kc_iter_advance((nodes)[__i].kn_start,     \
+						   (nodes)[__i].kn_task_size),\
+			    (ctl)->kc_thread_func_arg);			      \
+									      \
+		if (__ret == KTASK_RETURN_ERROR)			      \
+			break;						      \
+	}								      \
+									      \
+	__ret;								      \
+})
+
+static inline void ktask_init(void) { }
+
+#endif /* CONFIG_KTASK */
+
+/**
+ * struct ktask_node - Holds per-NUMA-node information about a task.
+ *
+ * @kn_start: An object that describes the start of the task on this NUMA node.
+ * @kn_task_size: The size of the task on this NUMA node (units are
+ *                task-specific).
+ * @kn_nid: The NUMA node id (or NUMA_NO_NODE, in which case the work is done on
+ *          the current node).
+ */
+struct ktask_node {
+	void		*kn_start;
+	size_t		kn_task_size;
+	int		kn_nid;
+};
+
+/**
+ * typedef ktask_thread_func
+ *
+ * Called on each chunk of work that a ktask thread does, where the chunk is
+ * delimited by [start, end).  A thread may call this multiple times during one
+ * task.
+ *
+ * @start: An object that describes the start of the chunk.
+ * @end: An object that describes the end of the chunk.
+ * @arg: The thread function argument (provided with struct ktask_ctl).
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.
+ */
+typedef int (*ktask_thread_func)(void *start, void *end, void *arg);
+
+/**
+ * typedef ktask_iter_func
+ *
+ * An iterator function that advances the position by a given number of steps.
+ *
+ * @position: An object that describes the current position in the task.
+ * @nsteps: The number of steps to advance in the task (in task-specific
+ *          units).
+ *
+ * RETURNS:
+ * An object representing the new position.
+ */
+typedef void *(*ktask_iter_func)(void *position, size_t nsteps);
+
+/**
+ * ktask_iter_range
+ *
+ * An iterator function for a contiguous range such as an array or address
+ * range.  Clients should use this to avoid reinventing the wheel for this
+ * common case.
+ *
+ * This is an instance of 'ktask_iter_func' where 'position' is interpreted as
+ * an unsigned long.
+ *
+ * @position: An object that describes the current position in the task.
+ * @nsteps: The number of steps to advance in the task (in task-specific
+ *          units).
+ *
+ * RETURNS:
+ * (position + nsteps)
+ */
+void *ktask_iter_range(void *position, size_t nsteps);
+
+/**
+ * struct ktask_ctl - Client-provided per-task control information.
+ *
+ * @kc_thread_func: A thread function that completes one chunk of the task per
+ *                  call.
+ * @kc_thread_func_arg: An argument to be passed to the thread function.
+ * @kc_iter_advance: An iterator function to advance the iterator by some number
+ *                   of task-specific units.
+ * @kc_min_chunk_size: The minimum chunk size in task-specific units.  This
+ *                     allows the client to communicate the minimum amount of
+ *                     work that's appropriate for one worker thread to do at
+ *                     once.
+ * @kc_max_threads: The maximum number of threads to use for the task.
+ *                  The actual number used may be less than this if the
+ *                  framework determines that fewer threads would be better,
+ *                  taking into account such things as total cpu count and
+ *                  task size.  Pass 0 to use ktask's default maximum.
+ * @kc_gfp_flags: gfp flags for allocating ktask metadata during the task.
+ */
+struct ktask_ctl {
+	ktask_thread_func	kc_thread_func;
+	void			*kc_thread_func_arg;
+	ktask_iter_func		kc_iter_advance;
+	size_t			kc_min_chunk_size;
+	size_t			kc_max_threads;
+	gfp_t			kc_gfp_flags;
+};
+
+#define KTASK_CTL_INITIALIZER(thread_func, thread_func_arg, iter_advance, \
+			      min_chunk_size, max_threads, gfp_flags)	  \
+	{								  \
+		.kc_thread_func = (ktask_thread_func)(thread_func),	  \
+		.kc_thread_func_arg = (thread_func_arg),		  \
+		.kc_iter_advance = (iter_advance),			  \
+		.kc_min_chunk_size = (min_chunk_size),			  \
+		.kc_max_threads = (max_threads),			  \
+		.kc_gfp_flags = (gfp_flags),				  \
+	}
+
+/*
+ * Note that KTASK_CTL_INITIALIZER casts 'thread_func' to be of type
+ * ktask_thread_func.  This is to help clients write cleaner thread functions
+ * by relieving them of the need to cast the three void * arguments.  Clients
+ * can just use the actual argument types instead.
+ */
+#define DEFINE_KTASK_CTL(ctl_name, thread_func, thread_func_arg,	  \
+			 iter_advance, min_chunk_size, max_threads,	  \
+			 gfp_flags)					  \
+	struct ktask_ctl ctl_name =					  \
+		KTASK_CTL_INITIALIZER(thread_func, thread_func_arg,	  \
+				      iter_advance, min_chunk_size,	  \
+				      max_threads, gfp_flags)
+/*
+ * Similar to DEFINE_KTASK_CTL, but omits the iterator argument in favor of
+ * using ktask_iter_range.
+ */
+#define DEFINE_KTASK_CTL_RANGE(ctl_name, thread_func, thread_func_arg,	  \
+			 min_chunk_size, max_threads, gfp_flags)	  \
+	struct ktask_ctl ctl_name =					  \
+		KTASK_CTL_INITIALIZER(thread_func, thread_func_arg,	  \
+				      ktask_iter_range, min_chunk_size,	  \
+				      max_threads, gfp_flags)
+
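+/*
+ * Illustrative example only ('zero_chunk', 'buf', and 'nbytes' are
+ * hypothetical): a client whose task covers a contiguous address range, with
+ * sizes given in bytes to match ktask_iter_range, might be set up as
+ *
+ *	DEFINE_KTASK_CTL_RANGE(ctl, zero_chunk, NULL, SZ_1M, 0, GFP_KERNEL);
+ *	err = ktask_run(buf, nbytes, &ctl);
+ */
+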
+#endif /* _LINUX_KTASK_H */
diff --git a/include/linux/ktask_internal.h b/include/linux/ktask_internal.h
new file mode 100644
index 000000000000..7b576f4b0ada
--- /dev/null
+++ b/include/linux/ktask_internal.h
@@ -0,0 +1,19 @@
+/*
+ * ktask_internal.h
+ *
+ * Framework to parallelize cpu-intensive kernel work such as zeroing
+ * huge pages or freeing many pages at once.  For more information, see
+ * Documentation/core-api/ktask.rst.
+ *
+ * This file contains implementation details of ktask for core kernel code that
+ * needs to be aware of them.  ktask clients should not include this file.
+ */
+#ifndef _LINUX_KTASK_INTERNAL_H
+#define _LINUX_KTASK_INTERNAL_H
+
+#ifdef CONFIG_KTASK
+/* Caps the number of threads that are allowed to be used in one task. */
+extern int ktask_max_threads;
+#endif
+
+#endif /* _LINUX_KTASK_INTERNAL_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5e8569..ccf9e8957098 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2557,5 +2557,11 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+/*
+ * The minimum chunk size for a task that uses base page units.  For now, say
+ * 1G's worth of pages.
+ */
+#define	KTASK_BPGS_MINCHUNK		((1ul << 30) / PAGE_SIZE)
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/init/Kconfig b/init/Kconfig
index 8514b25db21c..80ebb5296a38 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -332,6 +332,13 @@ config AUDIT_TREE
 	depends on AUDITSYSCALL
 	select FSNOTIFY
 
+config KTASK
+	bool "Multithread cpu-intensive kernel tasks"
+	depends on SMP
+	default n
+	help
+	  Parallelize expensive kernel tasks such as zeroing huge pages.
+
 source "kernel/irq/Kconfig"
 source "kernel/time/Kconfig"
 
diff --git a/init/main.c b/init/main.c
index 052481fbe363..1b7ddb6db6ad 100644
--- a/init/main.c
+++ b/init/main.c
@@ -88,6 +88,7 @@
 #include <linux/io.h>
 #include <linux/cache.h>
 #include <linux/rodata_test.h>
+#include <linux/ktask.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -1051,6 +1052,7 @@ static noinline void __init kernel_init_freeable(void)
 
 	smp_init();
 	sched_init_smp();
+	ktask_init();
 
 	page_alloc_init_late();
 
diff --git a/kernel/Makefile b/kernel/Makefile
index 4cb8e8b23c6e..297fe901be14 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y     = fork.o exec_domain.o panic.o \
 	    extable.o params.o \
 	    kthread.o sys_ni.o nsproxy.o \
 	    notifier.o ksysfs.o cred.o reboot.o \
-	    async.o range.o smpboot.o ucount.o
+	    async.o range.o smpboot.o ucount.o ktask.o
 
 obj-$(CONFIG_MULTIUSER) += groups.o
 
diff --git a/kernel/ktask.c b/kernel/ktask.c
new file mode 100644
index 000000000000..3a414b65d59f
--- /dev/null
+++ b/kernel/ktask.c
@@ -0,0 +1,396 @@
+/*
+ * ktask.c
+ *
+ * Framework to parallelize cpu-intensive kernel work such as zeroing
+ * huge pages or freeing many pages at once.  For more information, see
+ * Documentation/core-api/ktask.rst.
+ *
+ * This is the ktask implementation; everything in this file is private to
+ * ktask.
+ */
+
+#include <linux/ktask.h>
+
+#ifdef CONFIG_KTASK
+
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/completion.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/mutex.h>
+#include <linux/printk.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+
+/*
+ * Shrink each thread's share of the task by this shift amount so that there
+ * are more chunks than threads, which helps load balance across the workers.
+ */
+#define	KTASK_LOAD_BAL_SHIFT		2
+
+#define	KTASK_DEFAULT_MAX_THREADS	4
+
+/* Maximum number of threads for a single task. */
+int ktask_max_threads = KTASK_DEFAULT_MAX_THREADS;
+
+static struct workqueue_struct *ktask_wq;
+
+/* Used to pass ktask state to the workqueue API. */
+struct ktask_work {
+	struct work_struct kw_work;
+	void               *kw_state;
+};
+
+/* Internal per-task state hidden from clients. */
+struct ktask_state {
+	struct ktask_ctl	ks_ctl;
+	size_t			ks_total_size;
+	size_t			ks_chunk_size;
+	/* mutex protects nodes, nr_nodes_left, nthreads_fini, error */
+	struct mutex		ks_mutex;
+	struct ktask_node	*ks_nodes;
+	size_t			ks_nr_nodes;
+	size_t			ks_nr_nodes_left;
+	size_t			ks_nthreads;
+	size_t			ks_nthreads_fini;
+	int			ks_error; /* tracks error(s) from thread_func */
+	struct completion	ks_ktask_done;
+};
+
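+/*
+ * Returns the index of the node whose kn_nid matches the current cpu's NUMA
+ * node; failing that, the last NUMA_NO_NODE entry, or node 0 if there is
+ * neither.
+ */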
+static inline size_t ktask_get_start_node(struct ktask_node *nodes,
+					  size_t nr_nodes)
+{
+	int cur_nid = numa_node_id();
+	size_t fallback_i = 0;
+	size_t i;
+
+	for (i = 0; i < nr_nodes; ++i) {
+		if (nodes[i].kn_nid == cur_nid)
+			break;
+		else if (nodes[i].kn_nid == NUMA_NO_NODE)
+			fallback_i = i;
+	}
+
+	if (i >= nr_nodes)
+		i = fallback_i;
+
+	return i;
+}
+
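+/*
+ * Temporarily bind this worker to the cpus of the node it is about to work
+ * on, saving the worker's original cpumask the first time so it can be
+ * restored when the task is done.
+ */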
+static void ktask_node_migrate(cpumask_var_t *saved_cpumask,
+			       struct ktask_node *old_kn,
+			       struct ktask_node *kn,
+			       gfp_t gfp_flags, bool *migratedp)
+{
+	struct task_struct *p = current;
+	const struct cpumask *node_cpumask;
+	int ret;
+
+	/*
+	 * Don't migrate a user thread, migrating to NUMA_NO_NODE is
+	 * nonsense, and no need to migrate if it's the same node.
+	 */
+	if (!(p->flags & PF_KTHREAD) || kn->kn_nid == NUMA_NO_NODE ||
+	    kn->kn_nid == old_kn->kn_nid)
+		return;
+
+	node_cpumask = cpumask_of_node(kn->kn_nid);
+	/* No cpu to migrate to. */
+	if (cpumask_empty(node_cpumask))
+		return;
+
+	if (!*migratedp) {
+		/*
+		 * Save the workqueue thread's original mask so we can restore
+		 * it after the task is done.
+		 */
+		if (!alloc_cpumask_var(saved_cpumask, gfp_flags))
+			return;
+
+		cpumask_copy(*saved_cpumask, &p->cpus_allowed);
+	}
+
+	ret = set_cpus_allowed_ptr(current, node_cpumask);
+	if (ret == 0)
+		*migratedp = true;
+	else if (!*migratedp)
+		free_cpumask_var(*saved_cpumask);
+}
+
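+/*
+ * The work function each ktask worker runs: under ks_mutex, repeatedly claim
+ * the next chunk from a node that still has work, run the client's thread
+ * function on it with the mutex dropped, and record any error.  The last
+ * worker to finish signals ks_ktask_done.
+ */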
+static void ktask_task(struct work_struct *work)
+{
+	struct ktask_work  *kw;
+	struct ktask_state *ks;
+	struct ktask_ctl   *kc;
+	struct ktask_node  *kn;
+	size_t             nidx;
+	bool               done;
+	bool               migrated = false;
+	cpumask_var_t      saved_cpumask;
+
+	kw = container_of(work, struct ktask_work, kw_work);
+	ks = kw->kw_state;
+	kc = &ks->ks_ctl;
+
+	if (ks->ks_nr_nodes > 1)
+		nidx = ktask_get_start_node(ks->ks_nodes, ks->ks_nr_nodes);
+	else
+		nidx = 0;
+
+	WARN_ON(nidx >= ks->ks_nr_nodes);
+	kn = &ks->ks_nodes[nidx];
+
+	mutex_lock(&ks->ks_mutex);
+
+	while (ks->ks_total_size > 0 && ks->ks_error == KTASK_RETURN_SUCCESS) {
+		void *start, *end;
+		size_t nsteps;
+		int ret;
+
+		if (kn->kn_task_size == 0) {
+			/* The current node is out of work; pick a new one. */
+			size_t remaining_nodes_seen = 0;
+			size_t new_idx = prandom_u32_max(ks->ks_nr_nodes_left);
+			struct ktask_node *old_kn;
+
+			WARN_ON(ks->ks_nr_nodes_left == 0);
+			WARN_ON(new_idx >= ks->ks_nr_nodes_left);
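+			/*
+			 * Walk to the new_idx-th node (counting from zero)
+			 * among those that still have work left.
+			 */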
+			for (nidx = 0; nidx < ks->ks_nr_nodes; ++nidx) {
+				if (ks->ks_nodes[nidx].kn_task_size == 0)
+					continue;
+
+				if (remaining_nodes_seen >= new_idx)
+					break;
+
+				++remaining_nodes_seen;
+			}
+			/* We should have found work on another node. */
+			WARN_ON(nidx >= ks->ks_nr_nodes);
+
+			old_kn = kn;
+			kn = &ks->ks_nodes[nidx];
+
+			/* Temporarily migrate to the node we just chose. */
+			ktask_node_migrate(&saved_cpumask, old_kn, kn,
+					   kc->kc_gfp_flags, &migrated);
+		}
+
+		start = kn->kn_start;
+		nsteps = min(ks->ks_chunk_size, kn->kn_task_size);
+		end = kc->kc_iter_advance(start, nsteps);
+		kn->kn_start = end;
+		WARN_ON(kn->kn_task_size < nsteps);
+		kn->kn_task_size -= nsteps;
+		WARN_ON(ks->ks_total_size < nsteps);
+		ks->ks_total_size -= nsteps;
+		if (kn->kn_task_size == 0) {
+			WARN_ON(ks->ks_nr_nodes_left == 0);
+			ks->ks_nr_nodes_left--;
+		}
+
+		mutex_unlock(&ks->ks_mutex);
+
+		ret = kc->kc_thread_func(start, end, kc->kc_thread_func_arg);
+
+		mutex_lock(&ks->ks_mutex);
+
+		if (ret == KTASK_RETURN_ERROR)
+			ks->ks_error = KTASK_RETURN_ERROR;
+	}
+
+	WARN_ON(ks->ks_nr_nodes_left > 0 &&
+		ks->ks_error == KTASK_RETURN_SUCCESS);
+
+	++ks->ks_nthreads_fini;
+	WARN_ON(ks->ks_nthreads_fini > ks->ks_nthreads);
+	done = (ks->ks_nthreads_fini == ks->ks_nthreads);
+	mutex_unlock(&ks->ks_mutex);
+
+	if (migrated) {
+		set_cpus_allowed_ptr(current, saved_cpumask);
+		free_cpumask_var(saved_cpumask);
+	}
+
+	if (done)
+		complete(&ks->ks_ktask_done);
+}
+
+/* Returns the number of threads to use for this task. */
+static inline size_t ktask_nthreads(size_t task_size, size_t min_chunk_size,
+				    size_t max_threads)
+{
+	size_t nthreads;
+
+	if (max_threads == 0)
+		max_threads = ktask_max_threads;
+
+	/* Ensure at least one thread when task_size < min_chunk_size. */
+	nthreads = DIV_ROUND_UP(task_size, min_chunk_size);
+
+	nthreads = min_t(size_t, nthreads, num_online_cpus());
+
+	nthreads = min_t(size_t, nthreads, max_threads);
+
+	return nthreads;
+}
+
+/*
+ * Returns the number of chunks to break this task into.
+ *
+ * The number of chunks will be at least the number of threads, but in the
+ * common case of a large task, the number of chunks will be greater to load
+ * balance the work between threads in case some threads finish their work more
+ * quickly than others.
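+ *
+ * For example, with KTASK_LOAD_BAL_SHIFT == 2, an 8 GiB task run by 4
+ * threads is cut into (8 GiB / 4) >> 2 = 512 MiB chunks, 16 chunks in all,
+ * so a thread that runs slowly holds up completion by at most roughly one
+ * chunk's worth of work.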
+ */
+static inline size_t ktask_chunk_size(size_t task_size, size_t min_chunk_size,
+				    size_t nthreads)
+{
+	size_t chunk_size;
+
+	if (nthreads == 1)
+		return task_size;
+
+	chunk_size = (task_size / nthreads) >> KTASK_LOAD_BAL_SHIFT;
+
+	/*
+	 * chunk_size should be a multiple of min_chunk_size for tasks that
+	 * need to operate in fixed-size batches.
+	 */
+	if (chunk_size > min_chunk_size)
+		chunk_size = rounddown(chunk_size, min_chunk_size);
+
+	return max(chunk_size, min_chunk_size);
+}
+
+int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
+		   struct ktask_ctl *ctl)
+{
+	size_t i;
+	struct ktask_work *kw;
+	struct ktask_state ks = {
+		.ks_ctl             = *ctl,
+		.ks_total_size        = 0,
+		.ks_nodes           = nodes,
+		.ks_nr_nodes        = nr_nodes,
+		.ks_nr_nodes_left   = nr_nodes,
+		.ks_nthreads_fini   = 0,
+		.ks_error           = KTASK_RETURN_SUCCESS,
+	};
+
+	for (i = 0; i < nr_nodes; ++i) {
+		ks.ks_total_size += nodes[i].kn_task_size;
+		if (nodes[i].kn_task_size == 0)
+			ks.ks_nr_nodes_left--;
+
+		WARN_ON(nodes[i].kn_nid >= MAX_NUMNODES);
+	}
+
+	if (ks.ks_total_size == 0)
+		return KTASK_RETURN_SUCCESS;
+
+	mutex_init(&ks.ks_mutex);
+
+	ks.ks_nthreads = ktask_nthreads(ks.ks_total_size,
+					ctl->kc_min_chunk_size,
+					ctl->kc_max_threads);
+
+	ks.ks_chunk_size = ktask_chunk_size(ks.ks_total_size,
+					ctl->kc_min_chunk_size, ks.ks_nthreads);
+
+	init_completion(&ks.ks_ktask_done);
+
+	kw = kmalloc_array(ks.ks_nthreads, sizeof(struct ktask_work),
+			    ctl->kc_gfp_flags);
+	if (unlikely(!kw || !ktask_wq)) {
+		/* Low on memory or no workqueue; fall back to one thread. */
+		struct ktask_work stack_kw = {
+			.kw_work = __WORK_INITIALIZER(stack_kw.kw_work,
+						      ktask_task),
+			.kw_state = &ks
+		};
+
+		ks.ks_nthreads = 1;
+
+		ktask_task(&stack_kw.kw_work);
+		mutex_destroy(&ks.ks_mutex);
+		/* kw may have been allocated even if ktask_wq is missing. */
+		kfree(kw);
+
+		return ks.ks_error;
+	}
+
+	for (i = 1; i < ks.ks_nthreads; ++i) {
+		int cpu;
+		struct ktask_node *kn;
+
+		INIT_WORK(&kw[i].kw_work, ktask_task);
+		kw[i].kw_state = &ks;
+
+		/*
+		 * Spread workers evenly across nodes with work to do,
+		 * starting each worker on a cpu local to the nid of their
+		 * part of the task.
+		 */
+		kn = &nodes[i % nr_nodes];
+
+		if (kn->kn_nid == NUMA_NO_NODE) {
+			cpu = smp_processor_id();
+		} else {
+			/*
+			 * WQ_UNBOUND workqueues execute work on a cpu from
+			 * the node of the cpu we pass to queue_work_on, so
+			 * just pick any cpu to stand for the node.
+			 */
+			cpu = cpumask_any(cpumask_of_node(kn->kn_nid));
+		}
+
+		queue_work_on(cpu, ktask_wq, &kw[i].kw_work);
+	}
+
+	/*
+	 * Make ourselves one of the threads, which saves launching a workqueue
+	 * worker.
+	 */
+	INIT_WORK(&kw[0].kw_work, ktask_task);
+	kw[0].kw_state = &ks;
+	ktask_task(&kw[0].kw_work);
+
+	/* Wait for all the jobs to finish. */
+	wait_for_completion(&ks.ks_ktask_done);
+
+	kfree(kw);
+	mutex_destroy(&ks.ks_mutex);
+
+	return ks.ks_error;
+}
+EXPORT_SYMBOL_GPL(ktask_run_numa);
+
+int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl)
+{
+	struct ktask_node node;
+
+	node.kn_start = start;
+	node.kn_task_size = task_size;
+	node.kn_nid = NUMA_NO_NODE;
+
+	return ktask_run_numa(&node, 1, ctl);
+}
+EXPORT_SYMBOL_GPL(ktask_run);
+
+void __init ktask_init(void)
+{
+	ktask_wq = alloc_workqueue("ktask_wq", WQ_UNBOUND, 0);
+	if (!ktask_wq)
+		pr_warn("%s: alloc_workqueue failed\n", __func__);
+}
+
+#endif /* CONFIG_KTASK */
+
+/*
+ * This function is defined outside CONFIG_KTASK so it can be called in the
+ * ktask_run and ktask_run_numa macros defined in ktask.h for CONFIG_KTASK=n
+ * kernels.
+ */
+void *ktask_iter_range(void *position, size_t nsteps)
+{
+	return (char *)position + nsteps;
+}
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v2 2/7] ktask: multithread cpu-intensive kernel work
@ 2017-08-24 20:49   ` Daniel Jordan
  0 siblings, 0 replies; 20+ messages in thread
From: Daniel Jordan @ 2017-08-24 20:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

ktask is a generic framework for parallelizing cpu-intensive work in the
kernel.  The intended use is for big machines that can use their cpu power to
speed up large tasks that can't otherwise be multithreaded in userland.  The
API is generic enough to add concurrency to many different kinds of tasks--for
example, zeroing a range of pages or evicting a list of inodes--and aims to
save its clients the trouble of splitting up the work, choosing the number of
threads to use, starting these threads, and load balancing the work between
them.

The Documentation patch earlier in this series has more background.

Introduces the ktask API; consumers appear in subsequent patches.

Based on work by Pavel Tatashin, Steve Sistare, and Jonathan Adams.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Suggested-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Suggested-by: Steve Sistare <steven.sistare@oracle.com>
Suggested-by: Jonathan Adams <jonathan.adams@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 include/linux/ktask.h          | 235 ++++++++++++++++++++++++
 include/linux/ktask_internal.h |  19 ++
 include/linux/mm.h             |   6 +
 init/Kconfig                   |   7 +
 init/main.c                    |   2 +
 kernel/Makefile                |   2 +-
 kernel/ktask.c                 | 396 +++++++++++++++++++++++++++++++++++++++++
 7 files changed, 666 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/ktask.h
 create mode 100644 include/linux/ktask_internal.h
 create mode 100644 kernel/ktask.c

diff --git a/include/linux/ktask.h b/include/linux/ktask.h
new file mode 100644
index 000000000000..269c795039e7
--- /dev/null
+++ b/include/linux/ktask.h
@@ -0,0 +1,235 @@
+/*
+ * ktask.h
+ *
+ * Framework to parallelize cpu-intensive kernel work such as zeroing
+ * huge pages or freeing many pages at once.  For more information, see
+ * Documentation/core-api/ktask.rst.
+ *
+ * This is the interface to ktask; everything in this file is
+ * accessible to ktask clients.
+ *
+ * If CONFIG_KTASK=n, calls to the ktask API are simply #define'd to run the
+ * thread function that the client provides so that the task is completed
+ * without concurrency in the current thread.
+ */
+
+#ifndef _LINUX_KTASK_H
+#define _LINUX_KTASK_H
+
+#include <linux/types.h>
+
+struct ktask_ctl;
+struct ktask_node;
+
+#define	KTASK_RETURN_SUCCESS	0
+#define	KTASK_RETURN_ERROR	(-1)
+
+#ifdef CONFIG_KTASK
+
+/**
+ * ktask_run - Runs one task.  It doesn't account for NUMA locality.
+ *
+ * @start: An object that describes the start of the task.  The client thread
+ *         function interprets the object however it sees fit (e.g. an array
+ *         index, a simple pointer, or a pointer to a more complicated
+ *         representation of job position.
+ * @task_size:  The size of the task (units are task-specific).
+ * @ctl:  A control structure containing information about the task, including
+ *        the client thread function (see the definition of struct ktask_ctl).
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.
+ */
+int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl);
+
+/**
+ * ktask_run_numa - Runs one task while accounting for NUMA locality.
+ *
+ * The ktask framework ensures worker threads are scheduled on a CPU local to
+ * each chunk of a task.  The client is responsible for organizing the work
+ * along NUMA boundaries in the 'nodes' array.
+ *
+ * @nodes: An array of struct ktask_node's, each of which describes the task on
+ *         a NUMA node (see struct ktask_node).
+ * @nr_nodes:  The length of the 'nodes' array.
+ * @ctl:  A control structure containing information about the task (see
+ *        the definition of struct ktask_ctl).
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.
+ */
+int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
+		   struct ktask_ctl *ctl);
+
+/*
+ * Two possible future enhancements related to error handling, should the
+ * need arise, are:
+ *
+ * - Add client specific error reporting.  It's possible for tasks to fail for
+ *   different reasons, so let the client pass a pointer for its own error
+ *   information.
+ *
+ * - Allow clients to pass an "undo" callback to ktask that is responsible for
+ *   undoing those parts of the task that fail if an error occurs.
+ */
+
+void ktask_init(void);
+
+#else  /* CONFIG_KTASK */
+
+#define ktask_run(start, task_size, ctl)				      \
+	((ctl)->kc_thread_func((start),				              \
+			       (ctl)->kc_iter_advance((start), (task_size)),  \
+			       (ctl)->kc_thread_func_arg))
+
+#define ktask_run_numa(nodes, nr_nodes, ctl)				      \
+({									      \
+	size_t __i;							      \
+	int __ret = KTASK_RETURN_SUCCESS;				      \
+									      \
+	for (__i = 0; __i < (nr_nodes); ++__i) {			      \
+		__ret = (ctl)->kc_thread_func(				      \
+			    (nodes)->kn_start,				      \
+			    (ctl)->kc_iter_advance((nodes)->kn_start,	      \
+						   (nodes)->kn_task_size),    \
+			    (ctl)->kc_thread_func_arg);			      \
+									      \
+		if (__ret == KTASK_RETURN_ERROR)			      \
+			break;						      \
+	}								      \
+									      \
+	__ret;								      \
+})
+
+static inline void ktask_init(void) { }
+
+#endif /* CONFIG_KTASK */
+
+/**
+ * struct ktask_node - Holds per-NUMA-node information about a task.
+ *
+ * @kn_start: An object that describes the start of the task on this NUMA node.
+ * @kn_task_size: The size of the task on this NUMA node (units are
+ *                task-specific).
+ * @kn_nid: The NUMA node id (or NUMA_NO_NODE, in which case the work is done on
+ *          the current node).
+ */
+struct ktask_node {
+	void		*kn_start;
+	size_t		kn_task_size;
+	int		kn_nid;
+};
+
+/**
+ * typedef ktask_thread_func
+ *
+ * Called on each chunk of work that a ktask thread does, where the chunk is
+ * delimited by [start, end).  A thread may call this multiple times during one
+ * task.
+ *
+ * @start: An object that describes the start of the chunk.
+ * @end: An object that describes the end of the chunk.
+ * @arg: The thread function argument (provided with struct ktask_ctl).
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.
+ */
+typedef int (*ktask_thread_func)(void *start, void *end, void *arg);
+
+/**
+ * typedef ktask_iter_func
+ *
+ * An iterator function that advances the position by a given number of steps.
+ *
+ * @position: An object that describes the current position in the task.
+ * @nsteps: The number of steps to advance in the task (in task-specific
+ *          units).
+ *
+ * RETURNS:
+ * An object representing the new position.
+ */
+typedef void *(*ktask_iter_func)(void *position, size_t nsteps);
+
+/**
+ * ktask_iter_range
+ *
+ * An iterator function for a contiguous range such as an array or address
+ * range.  Clients should use this to avoid reinventing the wheel for this
+ * common case.
+ *
+ * This is an instance of 'ktask_iter_func' where 'position' is interpreted as
+ * an unsigned long.
+ *
+ * @position: An object that describes the current position in the task.
+ * @nsteps: The number of steps to advance in the task (in task-specific
+ *          units).
+ *
+ * RETURNS:
+ * (position + nsteps)
+ */
+void *ktask_iter_range(void *position, size_t nsteps);
+
+/**
+ * struct ktask_ctl - Client-provided per-task control information.
+ *
+ * @kc_thread_func: A thread function that completes one chunk of the task per
+ *                  call.
+ * @kc_thread_func_arg: An argument to be passed to the thread function.
+ * @kc_iter_advance: An iterator function to advance the iterator by some number
+ *                   of task-specific units.
+ * @kc_min_chunk_size: The minimum chunk size in task-specific units.  This
+ *                     allows the client to communicate the minimum amount of
+ *                     work that's appropriate for one worker thread to do at
+ *                     once.
+ * @kc_max_threads: The maximum number of threads to use for the task.
+ *                  The actual number used may be less than this if the
+ *                  framework determines that fewer threads would be better,
+ *                  taking into account such things as total cpu count and
+ *                  task size.  Pass 0 to use ktask's default maximum.
+ * @kc_gfp_flags: gfp flags for allocating ktask metadata during the task.
+ */
+struct ktask_ctl {
+	ktask_thread_func	kc_thread_func;
+	void			*kc_thread_func_arg;
+	ktask_iter_func		kc_iter_advance;
+	size_t			kc_min_chunk_size;
+	size_t			kc_max_threads;
+	gfp_t			kc_gfp_flags;
+};
+
+#define KTASK_CTL_INITIALIZER(thread_func, thread_func_arg, iter_advance, \
+			      min_chunk_size, max_threads, gfp_flags)	  \
+	{								  \
+		.kc_thread_func = (ktask_thread_func)(thread_func),	  \
+		.kc_thread_func_arg = (thread_func_arg),		  \
+		.kc_iter_advance = (iter_advance),			  \
+		.kc_min_chunk_size = (min_chunk_size),			  \
+		.kc_max_threads = (max_threads),			  \
+		.kc_gfp_flags = (gfp_flags),				  \
+	}
+
+/*
+ * Note that KTASK_CTL_INITIALIZER casts 'thread_func' to be of type
+ * ktask_thread_func.  This is to help clients write cleaner thread functions
+ * by relieving them of the need to cast the three void * arguments.  Clients
+ * can just use the actual argument types instead.
+ */
+#define DEFINE_KTASK_CTL(ctl_name, thread_func, thread_func_arg,	  \
+			 iter_advance, min_chunk_size, max_threads,	  \
+			 gfp_flags)					  \
+	struct ktask_ctl ctl_name =					  \
+		KTASK_CTL_INITIALIZER(thread_func, thread_func_arg,	  \
+				      iter_advance, min_chunk_size,	  \
+				      kc_max_threads, gfp_flags)
+/*
+ * Similar to DEFINE_KTASK_CTL, but omits the iterator argument in favor of
+ * using ktask_iter_range.
+ */
+#define DEFINE_KTASK_CTL_RANGE(ctl_name, thread_func, thread_func_arg,	  \
+			 min_chunk_size, kc_max_threads, gfp_flags)	  \
+	struct ktask_ctl ctl_name =					  \
+		KTASK_CTL_INITIALIZER(thread_func, thread_func_arg,	  \
+				      ktask_iter_range, min_chunk_size,	  \
+				      kc_max_threads, gfp_flags)
+
+#endif /* _LINUX_KTASK_H */
diff --git a/include/linux/ktask_internal.h b/include/linux/ktask_internal.h
new file mode 100644
index 000000000000..7b576f4b0ada
--- /dev/null
+++ b/include/linux/ktask_internal.h
@@ -0,0 +1,19 @@
+/*
+ * ktask_internal.h
+ *
+ * Framework to parallelize cpu-intensive kernel work such as zeroing
+ * huge pages or freeing many pages at once.  For more information, see
+ * Documentation/core-api/ktask.rst.
+ *
+ * This file contains implementation details of ktask for core kernel code that
+ * needs to be aware of them.  ktask clients should not include this file.
+ */
+#ifndef _LINUX_KTASK_INTERNAL_H
+#define _LINUX_KTASK_INTERNAL_H
+
+#ifdef CONFIG_KTASK
+/* Caps the number of threads that are allowed to be used in one task. */
+extern int ktask_max_threads;
+#endif
+
+#endif /* _LINUX_KTASK_INTERNAL_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5e8569..ccf9e8957098 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2557,5 +2557,11 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+/*
+ * The minimum chunk size for a task that uses base page units.  For now, say
+ * 1G's worth of pages.
+ */
+#define	KTASK_BPGS_MINCHUNK		((1ul << 30) / PAGE_SIZE)
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
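
To make the constant concrete (arithmetic only, assuming 4 KiB base pages;
architectures with larger base pages scale accordingly):

/*
 * KTASK_BPGS_MINCHUNK = (1ul << 30) / 4096 = 262144 base pages,
 * so each chunk handed to a ktask worker covers at least 1 GiB of memory.
 */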
diff --git a/init/Kconfig b/init/Kconfig
index 8514b25db21c..80ebb5296a38 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -332,6 +332,13 @@ config AUDIT_TREE
 	depends on AUDITSYSCALL
 	select FSNOTIFY
 
+config KTASK
+	bool "Multithread cpu-intensive kernel tasks"
+	depends on SMP
+	default n
+	help
+	  Parallelize expensive kernel tasks such as zeroing huge pages.
+
 source "kernel/irq/Kconfig"
 source "kernel/time/Kconfig"
 
diff --git a/init/main.c b/init/main.c
index 052481fbe363..1b7ddb6db6ad 100644
--- a/init/main.c
+++ b/init/main.c
@@ -88,6 +88,7 @@
 #include <linux/io.h>
 #include <linux/cache.h>
 #include <linux/rodata_test.h>
+#include <linux/ktask.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -1051,6 +1052,7 @@ static noinline void __init kernel_init_freeable(void)
 
 	smp_init();
 	sched_init_smp();
+	ktask_init();
 
 	page_alloc_init_late();
 
diff --git a/kernel/Makefile b/kernel/Makefile
index 4cb8e8b23c6e..297fe901be14 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y     = fork.o exec_domain.o panic.o \
 	    extable.o params.o \
 	    kthread.o sys_ni.o nsproxy.o \
 	    notifier.o ksysfs.o cred.o reboot.o \
-	    async.o range.o smpboot.o ucount.o
+	    async.o range.o smpboot.o ucount.o ktask.o
 
 obj-$(CONFIG_MULTIUSER) += groups.o
 
diff --git a/kernel/ktask.c b/kernel/ktask.c
new file mode 100644
index 000000000000..3a414b65d59f
--- /dev/null
+++ b/kernel/ktask.c
@@ -0,0 +1,396 @@
+/*
+ * ktask.c
+ *
+ * Framework to parallelize cpu-intensive kernel work such as zeroing
+ * huge pages or freeing many pages at once.  For more information, see
+ * Documentation/core-api/ktask.rst.
+ *
+ * This is the ktask implementation; everything in this file is private to
+ * ktask.
+ */
+
+#include <linux/ktask.h>
+
+#ifdef CONFIG_KTASK
+
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/completion.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/mutex.h>
+#include <linux/printk.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+
+/*
+ * Shrink each thread's share of the task by this shift amount so there are
+ * more chunks than threads, which load balances between the worker threads.
+ */
+#define	KTASK_LOAD_BAL_SHIFT		2
+
+#define	KTASK_DEFAULT_MAX_THREADS	4
+
+/* Maximum number of threads for a single task. */
+int ktask_max_threads = KTASK_DEFAULT_MAX_THREADS;
+
+static struct workqueue_struct *ktask_wq;
+
+/* Used to pass ktask state to the workqueue API. */
+struct ktask_work {
+	struct work_struct kw_work;
+	void               *kw_state;
+};
+
+/* Internal per-task state hidden from clients. */
+struct ktask_state {
+	struct ktask_ctl	ks_ctl;
+	size_t			ks_total_size;
+	size_t			ks_chunk_size;
+	/* Protects total_size, nodes, nr_nodes_left, nthreads_fini, error */
+	struct mutex		ks_mutex;
+	struct ktask_node	*ks_nodes;
+	size_t			ks_nr_nodes;
+	size_t			ks_nr_nodes_left;
+	size_t			ks_nthreads;
+	size_t			ks_nthreads_fini;
+	int			ks_error; /* tracks error(s) from thread_func */
+	struct completion	ks_ktask_done;
+};
+
+static inline size_t ktask_get_start_node(struct ktask_node *nodes,
+					  size_t nr_nodes)
+{
+	int cur_nid = numa_node_id();
+	size_t fallback_i = 0;
+	size_t i;
+
+	for (i = 0; i < nr_nodes; ++i) {
+		if (nodes[i].kn_nid == cur_nid)
+			break;
+		else if (nodes[i].kn_nid == NUMA_NO_NODE)
+			fallback_i = i;
+	}
+
+	if (i >= nr_nodes)
+		i = fallback_i;
+
+	return i;
+}
+
+static void ktask_node_migrate(cpumask_var_t *saved_cpumask,
+			       struct ktask_node *old_kn,
+			       struct ktask_node *kn,
+			       gfp_t gfp_flags, bool *migratedp)
+{
+	struct task_struct *p = current;
+	const struct cpumask *node_cpumask;
+	int ret;
+
+	/*
+	 * Don't migrate a user thread.  Migrating to NUMA_NO_NODE is
+	 * nonsense, and there's no need to migrate within the same node.
+	 */
+	if (!(p->flags & PF_KTHREAD) || kn->kn_nid == NUMA_NO_NODE ||
+	    kn->kn_nid == old_kn->kn_nid)
+		return;
+
+	node_cpumask = cpumask_of_node(kn->kn_nid);
+	/* No cpu to migrate to. */
+	if (cpumask_empty(node_cpumask))
+		return;
+
+	if (!*migratedp) {
+		/*
+		 * Save the workqueue thread's original mask so we can restore
+		 * it after the task is done.
+		 */
+		if (!alloc_cpumask_var(saved_cpumask, gfp_flags))
+			return;
+
+		cpumask_copy(*saved_cpumask, &p->cpus_allowed);
+	}
+
+	ret = set_cpus_allowed_ptr(current, node_cpumask);
+	if (ret == 0)
+		*migratedp = true;
+	else if (!*migratedp)
+		free_cpumask_var(*saved_cpumask);
+}
+
+static void ktask_task(struct work_struct *work)
+{
+	struct ktask_work  *kw;
+	struct ktask_state *ks;
+	struct ktask_ctl   *kc;
+	struct ktask_node  *kn;
+	size_t             nidx;
+	bool               done;
+	bool               migrated = false;
+	cpumask_var_t      saved_cpumask;
+
+	kw = container_of(work, struct ktask_work, kw_work);
+	ks = kw->kw_state;
+	kc = &ks->ks_ctl;
+
+	if (ks->ks_nr_nodes > 1)
+		nidx = ktask_get_start_node(ks->ks_nodes, ks->ks_nr_nodes);
+	else
+		nidx = 0;
+
+	WARN_ON(nidx >= ks->ks_nr_nodes);
+	kn = &ks->ks_nodes[nidx];
+
+	mutex_lock(&ks->ks_mutex);
+
+	while (ks->ks_total_size > 0 && ks->ks_error == KTASK_RETURN_SUCCESS) {
+		void *start, *end;
+		size_t nsteps;
+		int ret;
+
+		if (kn->kn_task_size == 0) {
+			/* The current node is out of work; pick a new one. */
+			size_t remaining_nodes_seen = 0;
+			size_t new_idx = prandom_u32_max(ks->ks_nr_nodes_left);
+			struct ktask_node *old_kn;
+
+			WARN_ON(ks->ks_nr_nodes_left == 0);
+			WARN_ON(new_idx >= ks->ks_nr_nodes_left);
+			for (nidx = 0; nidx < ks->ks_nr_nodes; ++nidx) {
+				if (ks->ks_nodes[nidx].kn_task_size == 0)
+					continue;
+
+				if (remaining_nodes_seen >= new_idx)
+					break;
+
+				++remaining_nodes_seen;
+			}
+			/* We should have found work on another node. */
+			WARN_ON(nidx >= ks->ks_nr_nodes);
+
+			old_kn = kn;
+			kn = &ks->ks_nodes[nidx];
+
+			/* Temporarily migrate to the node we just chose. */
+			ktask_node_migrate(&saved_cpumask, old_kn, kn,
+					   kc->kc_gfp_flags, &migrated);
+		}
+
+		start = kn->kn_start;
+		nsteps = min(ks->ks_chunk_size, kn->kn_task_size);
+		end = kc->kc_iter_advance(start, nsteps);
+		kn->kn_start = end;
+		WARN_ON(kn->kn_task_size < nsteps);
+		kn->kn_task_size -= nsteps;
+		WARN_ON(ks->ks_total_size < nsteps);
+		ks->ks_total_size -= nsteps;
+		if (kn->kn_task_size == 0) {
+			WARN_ON(ks->ks_nr_nodes_left == 0);
+			ks->ks_nr_nodes_left--;
+		}
+
+		mutex_unlock(&ks->ks_mutex);
+
+		ret = kc->kc_thread_func(start, end, kc->kc_thread_func_arg);
+
+		mutex_lock(&ks->ks_mutex);
+
+		if (ret == KTASK_RETURN_ERROR)
+			ks->ks_error = KTASK_RETURN_ERROR;
+	}
+
+	WARN_ON(ks->ks_nr_nodes_left > 0 &&
+		ks->ks_error == KTASK_RETURN_SUCCESS);
+
+	++ks->ks_nthreads_fini;
+	WARN_ON(ks->ks_nthreads_fini > ks->ks_nthreads);
+	done = (ks->ks_nthreads_fini == ks->ks_nthreads);
+	mutex_unlock(&ks->ks_mutex);
+
+	if (migrated) {
+		set_cpus_allowed_ptr(current, saved_cpumask);
+		free_cpumask_var(saved_cpumask);
+	}
+
+	if (done)
+		complete(&ks->ks_ktask_done);
+}
+
+/* Returns the number of threads to use for this task. */
+static inline size_t ktask_nthreads(size_t task_size, size_t min_chunk_size,
+				    size_t max_threads)
+{
+	size_t nthreads;
+
+	if (max_threads == 0)
+		max_threads = ktask_max_threads;
+
+	/* Ensure at least one thread when task_size < min_chunk_size. */
+	nthreads = DIV_ROUND_UP(task_size, min_chunk_size);
+
+	nthreads = min_t(size_t, nthreads, num_online_cpus());
+
+	nthreads = min_t(size_t, nthreads, max_threads);
+
+	return nthreads;
+}
+
+/*
+ * Returns the number of chunks to break this task into.
+ *
+ * The number of chunks will be at least the number of threads, but in the
+ * common case of a large task, the number of chunks will be greater to load
+ * balance the work between threads in case some threads finish their work more
+ * quickly than others.
+ */
+static inline size_t ktask_chunk_size(size_t task_size, size_t min_chunk_size,
+				    size_t nthreads)
+{
+	size_t chunk_size;
+
+	if (nthreads == 1)
+		return task_size;
+
+	chunk_size = (task_size / nthreads) >> KTASK_LOAD_BAL_SHIFT;
+
+	/*
+	 * chunk_size should be a multiple of min_chunk_size for tasks that
+	 * need to operate in fixed-size batches.
+	 */
+	if (chunk_size > min_chunk_size)
+		chunk_size = rounddown(chunk_size, min_chunk_size);
+
+	return max(chunk_size, min_chunk_size);
+}
+
+int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
+		   struct ktask_ctl *ctl)
+{
+	size_t i;
+	struct ktask_work *kw;
+	struct ktask_state ks = {
+		.ks_ctl             = *ctl,
+		.ks_total_size      = 0,
+		.ks_nodes           = nodes,
+		.ks_nr_nodes        = nr_nodes,
+		.ks_nr_nodes_left   = nr_nodes,
+		.ks_nthreads_fini   = 0,
+		.ks_error           = KTASK_RETURN_SUCCESS,
+	};
+
+	for (i = 0; i < nr_nodes; ++i) {
+		ks.ks_total_size += nodes[i].kn_task_size;
+		if (nodes[i].kn_task_size == 0)
+			ks.ks_nr_nodes_left--;
+
+		WARN_ON(nodes[i].kn_nid >= MAX_NUMNODES);
+	}
+
+	if (ks.ks_total_size == 0)
+		return KTASK_RETURN_SUCCESS;
+
+	mutex_init(&ks.ks_mutex);
+
+	ks.ks_nthreads = ktask_nthreads(ks.ks_total_size,
+					ctl->kc_min_chunk_size,
+					ctl->kc_max_threads);
+
+	ks.ks_chunk_size = ktask_chunk_size(ks.ks_total_size,
+					ctl->kc_min_chunk_size, ks.ks_nthreads);
+
+	init_completion(&ks.ks_ktask_done);
+
+	kw = kmalloc_array(ks.ks_nthreads, sizeof(struct ktask_work),
+			    ctl->kc_gfp_flags);
+	if (unlikely(!kw || !ktask_wq)) {
+		/*
+		 * Low on memory or no workqueue; fall back to running the
+		 * whole task in this thread.
+		 */
+		struct ktask_work kw_fallback = {
+			.kw_work = __WORK_INITIALIZER(kw_fallback.kw_work,
+						      ktask_task),
+			.kw_state = &ks
+		};
+
+		kfree(kw);	/* kfree tolerates NULL */
+		ks.ks_nthreads = 1;
+
+		ktask_task(&kw_fallback.kw_work);
+		mutex_destroy(&ks.ks_mutex);
+
+		return ks.ks_error;
+	}
+
+	for (i = 1; i < ks.ks_nthreads; ++i) {
+		int cpu;
+		struct ktask_node *kn;
+
+		INIT_WORK(&kw[i].kw_work, ktask_task);
+		kw[i].kw_state = &ks;
+
+		/*
+		 * Spread workers evenly across nodes with work to do,
+		 * starting each worker on a cpu local to the nid of their
+		 * part of the task.
+		 */
+		kn = &nodes[i % nr_nodes];
+
+		if (kn->kn_nid == NUMA_NO_NODE) {
+			cpu = raw_smp_processor_id();
+		} else {
+			/*
+			 * WQ_UNBOUND workqueues execute work on a cpu from
+			 * the node of the cpu we pass to queue_work_on, so
+			 * just pick any cpu to stand for the node.
+			 */
+			cpu = cpumask_any(cpumask_of_node(kn->kn_nid));
+		}
+
+		queue_work_on(cpu, ktask_wq, &kw[i].kw_work);
+	}
+
+	/*
+	 * Make ourselves one of the threads, which saves launching a workqueue
+	 * worker.
+	 */
+	INIT_WORK(&kw[0].kw_work, ktask_task);
+	kw[0].kw_state = &ks;
+	ktask_task(&kw[0].kw_work);
+
+	/* Wait for all the jobs to finish. */
+	wait_for_completion(&ks.ks_ktask_done);
+
+	kfree(kw);
+	mutex_destroy(&ks.ks_mutex);
+
+	return ks.ks_error;
+}
+EXPORT_SYMBOL_GPL(ktask_run_numa);
+
+int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl)
+{
+	struct ktask_node node;
+
+	node.kn_start = start;
+	node.kn_task_size = task_size;
+	node.kn_nid = NUMA_NO_NODE;
+
+	return ktask_run_numa(&node, 1, ctl);
+}
+EXPORT_SYMBOL_GPL(ktask_run);
+
+void __init ktask_init(void)
+{
+	ktask_wq = alloc_workqueue("ktask_wq", WQ_UNBOUND, 0);
+	if (!ktask_wq)
+		pr_warn("%s: alloc_workqueue failed\n", __func__);
+}
+
+#endif /* CONFIG_KTASK */
+
+/*
+ * This function is defined outside CONFIG_KTASK so it can be called in the
+ * ktask_run and ktask_run_numa macros defined in ktask.h for CONFIG_KTASK=n
+ * kernels.
+ */
+void *ktask_iter_range(void *position, size_t nsteps)
+{
+	return (char *)position + nsteps;
+}
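
To make the sizing policy above concrete, here is one worked example.  The
numbers are illustrative only and assume num_online_cpus() >= 4 and the
default ktask_max_threads of 4:

/*
 * ktask_nthreads(64, 2, 0):
 *   DIV_ROUND_UP(64, 2) = 32, capped by num_online_cpus(), then by
 *   ktask_max_threads                            ->  nthreads   = 4
 *
 * ktask_chunk_size(64, 2, 4):
 *   (64 / 4) >> KTASK_LOAD_BAL_SHIFT = 16 >> 2 = 4
 *   rounddown(4, 2) = 4, max(4, 2) = 4           ->  chunk_size = 4
 *
 * The task is carved into sixteen 4-unit chunks for 4 workers, so a fast
 * worker can pick up extra chunks if another worker lags.
 */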
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v2 3/7] ktask: add /proc/sys/debug/ktask_max_threads
  2017-08-24 20:49 ` Daniel Jordan
@ 2017-08-24 20:50   ` Daniel Jordan
  -1 siblings, 0 replies; 20+ messages in thread
From: Daniel Jordan @ 2017-08-24 20:50 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

Adds a proc file to control the maximum number of ktask threads in use
for any one job.  Its primary use is to aid in debugging.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 kernel/sysctl.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbbb8157..bc22c61b5d12 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -67,6 +67,7 @@
 #include <linux/kexec.h>
 #include <linux/bpf.h>
 #include <linux/mount.h>
+#include <linux/ktask_internal.h>
 
 #include <linux/uaccess.h>
 #include <asm/processor.h>
@@ -1876,6 +1877,15 @@ static struct ctl_table debug_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#if defined(CONFIG_KTASK)
+	{
+		.procname	= "ktask_max_threads",
+		.data		= &ktask_max_threads,
+		.maxlen		= sizeof(ktask_max_threads),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 	{ }
 };
 
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v2 4/7] mm: enlarge type of offset argument in mem_map_offset and mem_map_next
  2017-08-24 20:49 ` Daniel Jordan
@ 2017-08-24 20:50   ` Daniel Jordan
  -1 siblings, 0 replies; 20+ messages in thread
From: Daniel Jordan @ 2017-08-24 20:50 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

Changes the type of 'offset' from int to unsigned long in both
mem_map_offset and mem_map_next.

This facilitates ktask's use of mem_map_next with its unsigned long
types to avoid silent truncation when these unsigned longs are passed as
ints.

It also fixes the preexisting truncation of 'offset' from unsigned long
to int by the sole caller of mem_map_offset, follow_hugetlb_page.
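
As a contrived illustration of the narrowing this avoids (not code from the
patch; the value is arbitrary and assumes a 64-bit kernel):

	void takes_int(int offset);

	unsigned long big = 1UL << 32;
	takes_int(big);		/* compiles quietly, but 'big' arrives as 0 */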

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 mm/internal.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 4ef49fc55e58..a033b47c44d3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -365,7 +365,8 @@ static inline void mlock_migrate_page(struct page *new, struct page *old) { }
  * the maximally aligned gigantic page 'base'.  Handle any discontiguity
  * in the mem_map at MAX_ORDER_NR_PAGES boundaries.
  */
-static inline struct page *mem_map_offset(struct page *base, int offset)
+static inline struct page *mem_map_offset(struct page *base,
+					  unsigned long offset)
 {
 	if (unlikely(offset >= MAX_ORDER_NR_PAGES))
 		return nth_page(base, offset);
@@ -376,8 +377,8 @@ static inline struct page *mem_map_offset(struct page *base, int offset)
  * Iterator over all subpages within the maximally aligned gigantic
  * page 'base'.  Handle any discontiguity in the mem_map.
  */
-static inline struct page *mem_map_next(struct page *iter,
-						struct page *base, int offset)
+static inline struct page *mem_map_next(struct page *iter, struct page *base,
+					unsigned long offset)
 {
 	if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
 		unsigned long pfn = page_to_pfn(base) + offset;
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v2 5/7] mm: parallelize clear_gigantic_page
  2017-08-24 20:49 ` Daniel Jordan
@ 2017-08-24 20:50   ` Daniel Jordan
  -1 siblings, 0 replies; 20+ messages in thread
From: Daniel Jordan @ 2017-08-24 20:50 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

Parallelize clear_gigantic_page, which zeroes any page larger than
8M (e.g. 1G on x86 or 2G on SPARC).

Performance results (the default number of threads is 4; higher thread
counts shown for context only):

Machine: SPARC T7-4, 1024 cpus, 504G memory
Test:    Clear a range of gigantic pages

nthread   speedup   size (GiB)   min time (s)   stdev
      1                     50           7.77    0.02
      2     1.97x           50           3.95    0.04
      4     3.85x           50           2.02    0.05
      8     6.27x           50           1.24    0.10
     16     9.84x           50           0.79    0.06

      1                    100          15.50    0.07
      2     1.91x          100           8.10    0.05
      4     3.48x          100           4.45    0.07
      8     5.18x          100           2.99    0.05
     16     7.79x          100           1.99    0.12

      1                    200          31.03    0.15
      2     1.88x          200          16.47    0.02
      4     3.37x          200           9.20    0.14
      8     5.16x          200           6.01    0.19
     16     7.04x          200           4.41    0.06

Machine:  Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 cpus, 1T memory
Test:     Clear a range of gigantic pages

nthread   speedup   size (GiB)   min time (s)   stdev
      1                    100          41.13    0.03
      2     2.03x          100          20.26    0.14
      4     4.28x          100           9.62    0.09
      8     8.39x          100           4.90    0.05
     16    10.44x          100           3.94    0.03

      1                    200          89.68    0.35
      2     2.21x          200          40.64    0.18
      4     4.64x          200          19.33    0.32
      8     8.99x          200           9.98    0.04
     16    11.27x          200           7.96    0.04

      1                    400         188.20    1.57
      2     2.30x          400          81.84    0.09
      4     4.63x          400          40.62    0.26
      8     8.92x          400          21.09    0.50
     16    11.78x          400          15.97    0.25

      1                    800         434.91    1.81
      2     2.54x          800         170.97    1.46
      4     4.98x          800          87.38    1.91
      8    10.15x          800          42.86    2.59
     16    12.99x          800          33.48    0.83

The speedups are mostly due to the fact that more threads can use more
memory bandwidth.  The loop we're stressing on the x86 chip in this test
is clear_page_erms, which tops out at a bandwidth of 2550 MiB/s with one
thread.  We get the same bandwidth per thread for 2, 4, or 8 threads,
but at 16 threads the per-thread bandwidth drops to 1420 MiB/s.

However, the performance also improves over a single thread because of
the ktask threads' NUMA awareness (ktask migrates worker threads to the
node local to the work being done).  This becomes a bigger factor as the
number of pages to zero grows to include memory from multiple nodes, so
that speedups increase as the size increases.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 mm/memory.c | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fe2fba27ded2..d1f603a24186 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,6 +69,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/dax.h>
 #include <linux/oom.h>
+#include <linux/ktask.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -4325,27 +4326,45 @@ EXPORT_SYMBOL(__might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
-static void clear_gigantic_page(struct page *page,
-				unsigned long addr,
-				unsigned int pages_per_huge_page)
+
+struct cgp_args {
+	struct page	*base_page;
+	unsigned long	addr;
+};
+
+static int clear_gigantic_page_chunk(unsigned long start, unsigned long end,
+				     struct cgp_args *args)
 {
-	int i;
-	struct page *p = page;
+	struct page *base_page = args->base_page;
+	struct page *p = mem_map_offset(base_page, start);
+	unsigned long addr = args->addr;
+	unsigned long i;
 
 	might_sleep();
-	for (i = 0; i < pages_per_huge_page;
-	     i++, p = mem_map_next(p, page, i)) {
+	for (i = start; i < end; ++i) {
 		cond_resched();
 		clear_user_highpage(p, addr + i * PAGE_SIZE);
+
+		/* Advance p to the page at index i + 1. */
+		p = mem_map_next(p, base_page, i + 1);
 	}
+
+	return KTASK_RETURN_SUCCESS;
 }
+
 void clear_huge_page(struct page *page,
 		     unsigned long addr, unsigned int pages_per_huge_page)
 {
 	int i;
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, addr, pages_per_huge_page);
+		struct cgp_args args = {page, addr};
+		struct ktask_node node = {0, pages_per_huge_page,
+					  page_to_nid(page)};
+		DEFINE_KTASK_CTL_RANGE(ctl, clear_gigantic_page_chunk, &args,
+				       KTASK_BPGS_MINCHUNK, 0, GFP_KERNEL);
+
+		ktask_run_numa(&node, 1, &ctl);
+
 		return;
 	}
 
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v2 6/7] hugetlbfs: parallelize hugetlbfs_fallocate with ktask
  2017-08-24 20:49 ` Daniel Jordan
@ 2017-08-24 20:50   ` Daniel Jordan
  -1 siblings, 0 replies; 20+ messages in thread
From: Daniel Jordan @ 2017-08-24 20:50 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

hugetlbfs_fallocate preallocates huge pages to back a file in a
hugetlbfs filesystem.  The time this call takes grows linearly with the
size of the request.

ktask performs well with its default thread count of 4; higher thread
counts are given for context only.

Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 cpus, 1T memory
Test:    fallocate(1) a file on a hugetlbfs filesystem

nthread   speedup   size (GiB)   min time (s)   stdev
      1                    200         127.53    2.19
      2     3.09x          200          41.30    2.11
      4     5.72x          200          22.29    0.51
      8     9.45x          200          13.50    2.58
     16     9.74x          200          13.09    1.64

      1                    400         193.09    2.47
      2     2.14x          400          90.31    3.39
      4     3.84x          400          50.32    0.44
      8     5.11x          400          37.75    1.23
     16     6.12x          400          31.54    3.13

Machine: SPARC T7-4, 1024 cpus, 504G memory
Test:    fallocate(1) a file on a hugetlbfs filesystem

nthread   speedup   size (GiB)   min time (s)   stdev

      1                    100          15.55    0.05
      2     1.92x          100           8.08    0.01
      4     3.55x          100           4.38    0.02
      8     5.87x          100           2.65    0.06
     16     6.45x          100           2.41    0.09

      1                    200          31.26    0.02
      2     1.92x          200          16.26    0.02
      4     3.58x          200           8.73    0.04
      8     5.54x          200           5.64    0.16
     16     6.96x          200           4.49    0.35

      1                    400          62.18    0.09
      2     1.98x          400          31.36    0.04
      4     3.55x          400          17.52    0.03
      8     5.53x          400          11.25    0.04
     16     6.61x          400           9.40    0.17

The primary obstacle to better scaling at higher thread counts is contention on
hugetlb_fault_mutex_table[hash].  perf showed L1-dcache-loads increase
with 8 threads and again sharply with 16 threads, and a cpu counter
profile showed that 31% of the L1d misses were on
hugetlb_fault_mutex_table[hash] in the 16-thread case.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 fs/hugetlbfs/inode.c | 117 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 95 insertions(+), 22 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 28d2753be094..7eb8c9f988aa 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -36,6 +36,7 @@
 #include <linux/magic.h>
 #include <linux/migrate.h>
 #include <linux/uio.h>
+#include <linux/ktask.h>
 
 #include <linux/uaccess.h>
 
@@ -86,11 +87,16 @@ static const match_table_t tokens = {
 };
 
 #ifdef CONFIG_NUMA
+static inline struct shared_policy *hugetlb_get_shared_policy(
+							struct inode *inode)
+{
+	return &HUGETLBFS_I(inode)->policy;
+}
+
 static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
-					struct inode *inode, pgoff_t index)
+				struct shared_policy *policy, pgoff_t index)
 {
-	vma->vm_policy = mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
-							index);
+	vma->vm_policy = mpol_shared_policy_lookup(policy, index);
 }
 
 static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
@@ -98,8 +104,14 @@ static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
 	mpol_cond_put(vma->vm_policy);
 }
 #else
+static inline struct shared_policy *hugetlb_get_shared_policy(
+							struct inode *inode)
+{
+	return NULL;
+}
+
 static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
-					struct inode *inode, pgoff_t index)
+				struct shared_policy *policy, pgoff_t index)
 {
 }
 
@@ -551,19 +563,29 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	return 0;
 }
 
+struct hf_args {
+	struct file		*file;
+	struct task_struct	*parent_task;
+	struct mm_struct	*mm;
+	struct shared_policy	*shared_policy;
+	struct hstate		*hstate;
+	struct address_space	*mapping;
+	int			error;
+};
+
+static int hugetlbfs_fallocate_chunk(pgoff_t start, pgoff_t end,
+				     struct hf_args *args);
+
 static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 				loff_t len)
 {
 	struct inode *inode = file_inode(file);
-	struct address_space *mapping = inode->i_mapping;
 	struct hstate *h = hstate_inode(inode);
-	struct vm_area_struct pseudo_vma;
-	struct mm_struct *mm = current->mm;
 	loff_t hpage_size = huge_page_size(h);
 	unsigned long hpage_shift = huge_page_shift(h);
-	pgoff_t start, index, end;
+	pgoff_t start, end;
+	struct hf_args hf_args;
 	int error;
-	u32 hash;
 
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
 		return -EOPNOTSUPP;
@@ -586,16 +608,67 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	if (error)
 		goto out;
 
+	hf_args.file = file;
+	hf_args.parent_task = current;
+	hf_args.mm = current->mm;
+	hf_args.shared_policy = hugetlb_get_shared_policy(inode);
+	hf_args.hstate = h;
+	hf_args.mapping = inode->i_mapping;
+	hf_args.error = 0;
+
+	if (unlikely(hstate_is_gigantic(h))) {
+		/*
+		 * Use multiple threads in clear_gigantic_page instead of here,
+		 * so just do a 1-threaded hugetlbfs_fallocate_chunk.
+		 */
+		error = hugetlbfs_fallocate_chunk(start, end, &hf_args);
+	} else {
+		DEFINE_KTASK_CTL_RANGE(ctl, hugetlbfs_fallocate_chunk,
+				       &hf_args, KTASK_BPGS_MINCHUNK,
+				       0, GFP_KERNEL);
+
+		error = ktask_run((void *)start, end - start, &ctl);
+	}
+
+	if (error == KTASK_RETURN_ERROR)
+		error = hf_args.error;	/* errno saved by the failing chunk */
+	if (error && error != -EINTR)
+		goto out;
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
+		i_size_write(inode, offset + len);
+	inode->i_ctime = current_time(inode);
+out:
+	inode_unlock(inode);
+	return error;
+}
+
+static int hugetlbfs_fallocate_chunk(pgoff_t start, pgoff_t end,
+				     struct hf_args *args)
+{
+	struct file		*file		= args->file;
+	struct task_struct	*parent_task	= args->parent_task;
+	struct mm_struct	*mm		= args->mm;
+	struct shared_policy	*shared_policy	= args->shared_policy;
+	struct hstate		*h		= args->hstate;
+	struct address_space	*mapping	= args->mapping;
+	int			error		= 0;
+	pgoff_t			index;
+	struct vm_area_struct	pseudo_vma;
+	loff_t			hpage_size;
+	u32			hash;
+
+	hpage_size = huge_page_size(h);
+
 	/*
 	 * Initialize a pseudo vma as this is required by the huge page
 	 * allocation routines.  If NUMA is configured, use page index
-	 * as input to create an allocation policy.
+	 * as input to create an allocation policy.  Each thread gets its
+	 * own pseudo vma because mempolicies can differ by page.
 	 */
 	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
 	pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
 	pseudo_vma.vm_file = file;
 
-	for (index = start; index < end; index++) {
+	for (index = start; index < end; ++index) {
 		/*
 		 * This is supposed to be the vaddr where the page is being
 		 * faulted in, but we have no vaddr here.
@@ -610,13 +683,13 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		 * fallocate(2) manpage permits EINTR; we may have been
 		 * interrupted because we are using up too much memory.
 		 */
-		if (signal_pending(current)) {
+		if (signal_pending(parent_task) || signal_pending(current)) {
 			error = -EINTR;
-			break;
+			goto err;
 		}
 
 		/* Set numa allocation policy based on index */
-		hugetlb_set_vma_policy(&pseudo_vma, inode, index);
+		hugetlb_set_vma_policy(&pseudo_vma, shared_policy, index);
 
 		/* addr is the offset within the file (zero based) */
 		addr = index * hpage_size;
@@ -641,7 +714,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		if (IS_ERR(page)) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			error = PTR_ERR(page);
-			goto out;
+			goto err;
 		}
 		clear_huge_page(page, addr, pages_per_huge_page(h));
 		__SetPageUptodate(page);
@@ -649,7 +722,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		if (unlikely(error)) {
 			put_page(page);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			goto out;
+			goto err;
 		}
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
@@ -662,12 +735,12 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		unlock_page(page);
 	}
 
-	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
-		i_size_write(inode, offset + len);
-	inode->i_ctime = current_time(inode);
-out:
-	inode_unlock(inode);
-	return error;
+	return KTASK_RETURN_SUCCESS;
+
+err:
+	args->error = error;
+
+	return KTASK_RETURN_ERROR;
 }
 
 static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v2 7/7] mm: parallelize deferred struct page initialization within each node
  2017-08-24 20:49 ` Daniel Jordan
@ 2017-08-24 20:50   ` Daniel Jordan
  -1 siblings, 0 replies; 20+ messages in thread
From: Daniel Jordan @ 2017-08-24 20:50 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

Deferred struct page initialization currently uses one thread per node
(pgdatinit threads), but this can still be a bottleneck during boot on
big machines.  To reduce boot time, use ktask within each pgdatinit
thread to parallelize the struct page initialization on each node,
allowing the system to use more memory bandwidth.

The number of cpus used normally depends on a few factors, including the
amount of memory on the node (see the Documentation commit earlier in the
series for more information).  In this special case, though, the cpus are
not doing much else at this phase of boot, so we raise ktask's per-task
thread cap to the number of cpus on the node, and up to that many cpus
participate in initializing struct pages on each node.

Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 cpus, 1T memory
Test:    Boot the machine with deferred struct page initialization

kernel                   speedup   min time per   stdev
                                   node (ms)

baseline (4.13-rc5)                         483     0.5
ktask (4.13-rc5 based)     3.66x            132     1.5

Machine: SPARC M6 30-node LDom, 256 cpus, 30T memory
Test:    Boot the machine with deferred struct page initialization

kernel                   speedup   min time per   stdev
                                   node (ms)

baseline (4.13-rc5)                        9566     1.4
ktask (4.13-rc5 based)     1.55x           6172    19.5

[There is a patch series under review upstream to defer the zeroing of
struct pages to pgdatinit threads:
    complete deferred page initialization
    http://www.spinics.net/lists/linux-mm/msg132805.html
We get bigger speedups and save more boot time when incorporating this
pending series because there is more work to parallelize.]

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Suggested-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 mm/page_alloc.c | 174 ++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 107 insertions(+), 67 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1bad301820c7..6850f58fa720 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -66,6 +66,7 @@
 #include <linux/kthread.h>
 #include <linux/memcontrol.h>
 #include <linux/ftrace.h>
+#include <linux/ktask.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -1268,8 +1269,6 @@ static void __init __free_pages_boot_core(struct page *page, unsigned int order)
 	}
 	__ClearPageReserved(p);
 	set_page_count(p, 0);
-
-	page_zone(page)->managed_pages += nr_pages;
 	set_page_refcounted(page);
 	__free_pages(page, order);
 }
@@ -1333,7 +1332,8 @@ void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
 {
 	if (early_page_uninitialised(pfn))
 		return;
-	return __free_pages_boot_core(page, order);
+	__free_pages_boot_core(page, order);
+	page_zone(page)->managed_pages += (1ul << order);
 }
 
 /*
@@ -1441,12 +1441,99 @@ static inline void __init pgdat_init_report_one_done(void)
 		complete(&pgdat_init_all_done_comp);
 }
 
+struct deferred_init_args {
+	int nid;
+	int zid;
+	struct zone *zone;
+	atomic64_t nr_pages;
+};
+
+int __init deferred_init_memmap_chunk(unsigned long start_pfn,
+				      unsigned long end_pfn,
+				      struct deferred_init_args *args)
+{
+	unsigned long pfn;
+	int nid = args->nid;
+	int zid = args->zid;
+	struct zone *zone = args->zone;
+	struct page *page = NULL;
+	struct page *free_base_page = NULL;
+	unsigned long free_base_pfn = 0;
+	unsigned long nr_pages = 0;
+	int nr_to_free = 0;
+	struct mminit_pfnnid_cache nid_init_state = { };
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+		if (!pfn_valid_within(pfn))
+			goto free_range;
+
+		/*
+		 * Ensure pfn_valid is checked every
+		 * pageblock_nr_pages for memory holes
+		 */
+		if ((pfn & (pageblock_nr_pages - 1)) == 0) {
+			if (!pfn_valid(pfn)) {
+				page = NULL;
+				goto free_range;
+			}
+		}
+
+		if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
+			page = NULL;
+			goto free_range;
+		}
+
+		/* Minimise pfn page lookups and scheduler checks */
+		if (page && (pfn & (pageblock_nr_pages - 1)) != 0) {
+			page++;
+		} else {
+			nr_pages += nr_to_free;
+			deferred_free_range(free_base_page,
+					free_base_pfn, nr_to_free);
+			free_base_page = NULL;
+			free_base_pfn = nr_to_free = 0;
+
+			page = pfn_to_page(pfn);
+			cond_resched();
+		}
+
+		if (page->flags) {
+			VM_BUG_ON(page_zone(page) != zone);
+			goto free_range;
+		}
+
+		__init_single_page(page, pfn, zid, nid);
+		if (!free_base_page) {
+			free_base_page = page;
+			free_base_pfn = pfn;
+			nr_to_free = 0;
+		}
+		nr_to_free++;
+
+		/* Where possible, batch up pages for a single free */
+		continue;
+free_range:
+		/* Free the current block of pages to allocator */
+		nr_pages += nr_to_free;
+		deferred_free_range(free_base_page, free_base_pfn,
+							nr_to_free);
+		free_base_page = NULL;
+		free_base_pfn = nr_to_free = 0;
+	}
+	/* Free the last block of pages to allocator */
+	nr_pages += nr_to_free;
+	deferred_free_range(free_base_page, free_base_pfn, nr_to_free);
+
+	atomic64_add(nr_pages, &args->nr_pages);
+
+	return KTASK_RETURN_SUCCESS;
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
 	pg_data_t *pgdat = data;
 	int nid = pgdat->node_id;
-	struct mminit_pfnnid_cache nid_init_state = { };
 	unsigned long start = jiffies;
 	unsigned long nr_pages = 0;
 	unsigned long walk_start, walk_end;
@@ -1454,6 +1541,7 @@ static int __init deferred_init_memmap(void *data)
 	struct zone *zone;
 	unsigned long first_init_pfn = pgdat->first_deferred_pfn;
 	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+	unsigned long nr_node_cpus = cpumask_weight(cpumask);
 
 	if (first_init_pfn == ULONG_MAX) {
 		pgdat_init_report_one_done();
@@ -1478,10 +1566,12 @@ static int __init deferred_init_memmap(void *data)
 
 	for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
 		unsigned long pfn, end_pfn;
-		struct page *page = NULL;
-		struct page *free_base_page = NULL;
-		unsigned long free_base_pfn = 0;
-		int nr_to_free = 0;
+		struct ktask_node kn;
+		struct deferred_init_args args = { nid, zid, zone,
+						   ATOMIC64_INIT(0) };
+		DEFINE_KTASK_CTL_RANGE(ctl, deferred_init_memmap_chunk, &args,
+				       KTASK_BPGS_MINCHUNK, nr_node_cpus,
+				       GFP_KERNEL);
 
 		end_pfn = min(walk_end, zone_end_pfn(zone));
 		pfn = first_init_pfn;
@@ -1490,73 +1580,23 @@ static int __init deferred_init_memmap(void *data)
 		if (pfn < zone->zone_start_pfn)
 			pfn = zone->zone_start_pfn;
 
-		for (; pfn < end_pfn; pfn++) {
-			if (!pfn_valid_within(pfn))
-				goto free_range;
-
-			/*
-			 * Ensure pfn_valid is checked every
-			 * pageblock_nr_pages for memory holes
-			 */
-			if ((pfn & (pageblock_nr_pages - 1)) == 0) {
-				if (!pfn_valid(pfn)) {
-					page = NULL;
-					goto free_range;
-				}
-			}
-
-			if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
-				page = NULL;
-				goto free_range;
-			}
-
-			/* Minimise pfn page lookups and scheduler checks */
-			if (page && (pfn & (pageblock_nr_pages - 1)) != 0) {
-				page++;
-			} else {
-				nr_pages += nr_to_free;
-				deferred_free_range(free_base_page,
-						free_base_pfn, nr_to_free);
-				free_base_page = NULL;
-				free_base_pfn = nr_to_free = 0;
-
-				page = pfn_to_page(pfn);
-				cond_resched();
-			}
-
-			if (page->flags) {
-				VM_BUG_ON(page_zone(page) != zone);
-				goto free_range;
-			}
-
-			__init_single_page(page, pfn, zid, nid);
-			if (!free_base_page) {
-				free_base_page = page;
-				free_base_pfn = pfn;
-				nr_to_free = 0;
-			}
-			nr_to_free++;
-
-			/* Where possible, batch up pages for a single free */
+		if (pfn >= end_pfn)
 			continue;
-free_range:
-			/* Free the current block of pages to allocator */
-			nr_pages += nr_to_free;
-			deferred_free_range(free_base_page, free_base_pfn,
-								nr_to_free);
-			free_base_page = NULL;
-			free_base_pfn = nr_to_free = 0;
-		}
-		/* Free the last block of pages to allocator */
-		nr_pages += nr_to_free;
-		deferred_free_range(free_base_page, free_base_pfn, nr_to_free);
+
+		kn.kn_start	= (void *)pfn;
+		kn.kn_task_size	= end_pfn - pfn;
+		kn.kn_nid	= nid;
+		(void) ktask_run_numa(&kn, 1, &ctl);
 
 		first_init_pfn = max(end_pfn, first_init_pfn);
+		nr_pages += atomic64_read(&args.nr_pages);
 	}
 
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
+	zone->managed_pages += nr_pages;
+
 	pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_pages,
 					jiffies_to_msecs(jiffies - start));
 
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 1/7] ktask: add documentation
  2017-08-24 20:49   ` Daniel Jordan
@ 2017-08-24 23:07     ` Randy Dunlap
  -1 siblings, 0 replies; 20+ messages in thread
From: Randy Dunlap @ 2017-08-24 23:07 UTC (permalink / raw)
  To: Daniel Jordan, linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

On 08/24/2017 01:49 PM, Daniel Jordan wrote:
> Motivates and explains the ktask API for kernel clients.
> 
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
> Cc: Aaron Lu <aaron.lu@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> ---
>  Documentation/core-api/index.rst |   1 +
>  Documentation/core-api/ktask.rst | 104 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 105 insertions(+)
>  create mode 100644 Documentation/core-api/ktask.rst
> 
> diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst
> new file mode 100644
> index 000000000000..cb4b0d87c8c6
> --- /dev/null
> +++ b/Documentation/core-api/ktask.rst
> @@ -0,0 +1,104 @@
> +============================================
> +ktask: parallelize cpu-intensive kernel work
> +============================================

Hi,

I would prefer to use CPU instead of cpu.
Otherwise, Reviewed-by: Randy Dunlap <rdunlap@infradead.org>


-- 
~Randy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 1/7] ktask: add documentation
  2017-08-24 23:07     ` Randy Dunlap
@ 2017-08-25 14:12       ` Daniel Jordan
  -1 siblings, 0 replies; 20+ messages in thread
From: Daniel Jordan @ 2017-08-25 14:12 UTC (permalink / raw)
  To: Randy Dunlap, linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

On 08/24/2017 07:07 PM, Randy Dunlap wrote:
> On 08/24/2017 01:49 PM, Daniel Jordan wrote:
>> diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst
>> new file mode 100644
>> index 000000000000..cb4b0d87c8c6
>> --- /dev/null
>> +++ b/Documentation/core-api/ktask.rst
>> @@ -0,0 +1,104 @@
>> +============================================
>> +ktask: parallelize cpu-intensive kernel work
>> +============================================
> Hi,
>
> I would prefer to use CPU instead of cpu.

Ok, a quick grep through Documentation shows that CPU is used more often 
than cpu, so for consistency I'll change it.

> Otherwise, Reviewed-by: Randy Dunlap <rdunlap@infradead.org>

Thanks for the review, Randy.

Daniel

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2017-08-25 14:12 UTC | newest]

Thread overview: 20+ messages
2017-08-24 20:49 [RFC PATCH v2 0/7] ktask: multithread cpu-intensive kernel work Daniel Jordan
2017-08-24 20:49 ` Daniel Jordan
2017-08-24 20:49 ` [RFC PATCH v2 1/7] ktask: add documentation Daniel Jordan
2017-08-24 20:49   ` Daniel Jordan
2017-08-24 23:07   ` Randy Dunlap
2017-08-24 23:07     ` Randy Dunlap
2017-08-25 14:12     ` Daniel Jordan
2017-08-25 14:12       ` Daniel Jordan
2017-08-24 20:49 ` [RFC PATCH v2 2/7] ktask: multithread cpu-intensive kernel work Daniel Jordan
2017-08-24 20:49   ` Daniel Jordan
2017-08-24 20:50 ` [RFC PATCH v2 3/7] ktask: add /proc/sys/debug/ktask_max_threads Daniel Jordan
2017-08-24 20:50   ` Daniel Jordan
2017-08-24 20:50 ` [RFC PATCH v2 4/7] mm: enlarge type of offset argument in mem_map_offset and mem_map_next Daniel Jordan
2017-08-24 20:50   ` Daniel Jordan
2017-08-24 20:50 ` [RFC PATCH v2 5/7] mm: parallelize clear_gigantic_page Daniel Jordan
2017-08-24 20:50   ` Daniel Jordan
2017-08-24 20:50 ` [RFC PATCH v2 6/7] hugetlbfs: parallelize hugetlbfs_fallocate with ktask Daniel Jordan
2017-08-24 20:50   ` Daniel Jordan
2017-08-24 20:50 ` [RFC PATCH v2 7/7] mm: parallelize deferred struct page initialization within each node Daniel Jordan
2017-08-24 20:50   ` Daniel Jordan
