* [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work
@ 2017-12-05 19:52 Daniel Jordan
  2017-12-05 19:52 ` [RFC PATCH v3 1/7] ktask: add documentation Daniel Jordan
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: Daniel Jordan @ 2017-12-05 19:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

What do people think of the overall design and direction?

There's documentation describing the design in the first patch of the
series and the second patch has the API in ktask.h.

         Thanks,
            Daniel


Changelog:

v2 -> v3:
 - Changed cpu to CPU in the ktask Documentation, as suggested by Randy Dunlap
 - Saved more boot time now that Pavel Tatashin's deferred struct page init
   patches are in mainline (https://lkml.org/lkml/2017/10/13/692).  New
   performance results in patch 7.
 - Added resource limits, per-node and system-wide, to maintain efficient
   concurrency levels (addresses a concern from my Plumbers talk)
 - ktask no longer allocates memory internally during a task so it can be used
   in sensitive contexts
 - Added the option to run work anywhere on the system rather than always
   confining it to a specific node
 - Updated Documentation patch with these changes and reworked motivation
   section

v1 -> v2:
 - Added deferred struct page initialization use case.
 - Explained the source of the performance improvement from parallelizing
   clear_gigantic_page (comment from Dave Hansen).
 - Fixed Documentation and build warnings from CONFIG_KTASK=n kernels.
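
The per-node and system-wide resource limits added in v3 can be pictured
with a small userspace sketch.  This is not the code from kernel/ktask.c: the
sketch_* names, the fixed node bound, the "at least one" floor and the lack of
locking are all assumptions made for illustration.  Only the 4/5 CPU fraction
and the fallback rules (a full node sends the work item anywhere, a full
system refuses it) come from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUMA_NO_NODE		(-1)
#define KTASK_CPUFRAC_NUMER	4	/* 80% of CPUs, as in kernel/ktask.c */
#define KTASK_CPUFRAC_DENOM	5
#define SKETCH_MAX_NODES	8	/* hypothetical fixed bound */

/* Hypothetical stand-in for the ktask_rlim_* globals. */
struct sketch_rlim {
	size_t cur, max;			/* system-wide */
	size_t node_cur[SKETCH_MAX_NODES];	/* per node */
	size_t node_max[SKETCH_MAX_NODES];
	size_t nr_nodes;
};

/* Assumption: never let a limit round down to zero. */
static size_t sketch_cpufrac(size_t ncpus)
{
	size_t n = ncpus * KTASK_CPUFRAC_NUMER / KTASK_CPUFRAC_DENOM;

	return n ? n : 1;
}

static void sketch_rlim_init(struct sketch_rlim *r,
			     const size_t *cpus_per_node, size_t nr_nodes)
{
	size_t i, total = 0;

	assert(nr_nodes <= SKETCH_MAX_NODES);
	r->cur = 0;
	r->nr_nodes = nr_nodes;
	for (i = 0; i < nr_nodes; i++) {
		r->node_cur[i] = 0;
		r->node_max[i] = sketch_cpufrac(cpus_per_node[i]);
		total += cpus_per_node[i];
	}
	r->max = sketch_cpufrac(total);
}

/*
 * Try to account for one more work item.  Returns false if the system-wide
 * limit is reached.  Otherwise stores in *queue_nid the node to queue on:
 * the requested node if it has room, or NUMA_NO_NODE (run anywhere) if the
 * node is full or no node was requested.  Every item counts system-wide.
 */
static bool sketch_rlim_reserve(struct sketch_rlim *r, int nid, int *queue_nid)
{
	if (r->cur >= r->max)
		return false;

	if (nid != NUMA_NO_NODE && r->node_cur[nid] < r->node_max[nid]) {
		r->node_cur[nid]++;
		*queue_nid = nid;
	} else {
		*queue_nid = NUMA_NO_NODE;
	}
	r->cur++;
	return true;
}
```

With two 5-CPU nodes this gives a per-node limit of 4 and a system-wide limit
of 8, so the fifth request for node 0 is redirected to run anywhere.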

My Linux Plumbers Unconference Talk:
  https://www.linuxplumbersconf.org/2017/ocw/proposals/4837
  (please ignore OpenID's misapprehension that James Bottomley was the speaker)

ktask is a generic framework for parallelizing CPU-intensive work in the
kernel.  The intended use is for big machines that can use their CPU power
to speed up large tasks that can't otherwise be multithreaded in userland.
The API is generic enough to add concurrency to many different kinds of
tasks--for example, zeroing a range of pages or evicting a list of
inodes--and aims to save its clients the trouble of splitting up the work,
choosing the number of threads to use, starting these threads, and load
balancing the work between them.
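
To give a feel for the calling convention, here is a minimal userspace
sketch.  The typedefs, struct ktask_ctl and the ktask_run body follow
include/linux/ktask.h from patch 2, where ktask_run is the single-threaded
CONFIG_KTASK=n fallback.  fill_range() and fill_buffer() are hypothetical
client code, and the macro omits the function-pointer cast that the real
DEFINE_KTASK_CTL performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KTASK_RETURN_SUCCESS	0
#define KTASK_RETURN_ERROR	(-1)

typedef int (*ktask_thread_func)(void *start, void *end, void *arg);
typedef void *(*ktask_iter_func)(void *position, size_t nsteps);

struct ktask_ctl {
	ktask_thread_func	kc_thread_func;
	void			*kc_thread_func_arg;
	size_t			kc_min_chunk_size;
	ktask_iter_func		kc_iter_func;
	size_t			kc_max_threads;
};

/* Default iterator: treat the position as an integer and step forward. */
static void *ktask_iter_range(void *position, size_t nsteps)
{
	return (void *)((uintptr_t)position + nsteps);
}

#define DEFINE_KTASK_CTL(ctl_name, thread_func, thread_func_arg,	\
			 min_chunk_size)				\
	struct ktask_ctl ctl_name = {					\
		.kc_thread_func = (thread_func),			\
		.kc_thread_func_arg = (thread_func_arg),		\
		.kc_min_chunk_size = (min_chunk_size),			\
		.kc_iter_func = ktask_iter_range,			\
		.kc_max_threads = 0,					\
	}

/* CONFIG_KTASK=n behaviour: one call covering the whole task. */
static int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl)
{
	return ctl->kc_thread_func(start,
				   ctl->kc_iter_func(start, task_size),
				   ctl->kc_thread_func_arg);
}

/* Hypothetical thread function: fill the bytes in [start, end) with *arg. */
static int fill_range(void *start, void *end, void *arg)
{
	unsigned char value = *(unsigned char *)arg;

	assert((char *)end >= (char *)start);
	memset(start, value, (size_t)((char *)end - (char *)start));
	return KTASK_RETURN_SUCCESS;
}

/* Hypothetical client: the old open-coded loop becomes one ktask_run call. */
static int fill_buffer(unsigned char *buf, size_t len, unsigned char value)
{
	DEFINE_KTASK_CTL(ctl, fill_range, &value, 4096);

	return ktask_run(buf, len, &ctl);
}
```

The client only writes the per-chunk function and the one-line setup.  With
CONFIG_KTASK=y the same call site would have its range split into chunks and
handed to several threads.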

This patchset is based on 4.15-rc2 plus one mmots fix[*] and contains three
ktask users:
 - deferred struct page initialization at boot time
 - clearing gigantic pages
 - fallocate for HugeTLB pages

Work in progress:
 - Parallelizing page freeing in the exit/munmap paths
 - CPU hotplug support

The core ktask code is based on work by Pavel Tatashin, Steve Sistare, and
Jonathan Adams.

ktask v1 RFC: https://lkml.org/lkml/2017/7/14/666
ktask v2 RFC: https://lkml.org/lkml/2017/8/24/801

[*] http://ozlabs.org/~akpm/mmots/broken-out/mm-split-deferred_init_range-into-initializing-and-freeing-parts.patch


Daniel Jordan (7):
  ktask: add documentation
  ktask: multithread CPU-intensive kernel work
  ktask: add /proc/sys/debug/ktask_max_threads
  mm: enlarge type of offset argument in mem_map_offset and mem_map_next
  mm: parallelize clear_gigantic_page
  hugetlbfs: parallelize hugetlbfs_fallocate with ktask
  mm: parallelize deferred struct page initialization within each node

 Documentation/core-api/index.rst |   1 +
 Documentation/core-api/ktask.rst | 173 ++++++++++++
 fs/hugetlbfs/inode.c             | 116 ++++++--
 include/linux/ktask.h            | 255 ++++++++++++++++++
 include/linux/ktask_internal.h   |  22 ++
 include/linux/mm.h               |   6 +
 init/Kconfig                     |  12 +
 init/main.c                      |   2 +
 kernel/Makefile                  |   2 +-
 kernel/ktask.c                   | 556 +++++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c                  |  10 +
 mm/internal.h                    |   7 +-
 mm/memory.c                      |  35 ++-
 mm/page_alloc.c                  |  78 ++++--
 14 files changed, 1226 insertions(+), 49 deletions(-)
 create mode 100644 Documentation/core-api/ktask.rst
 create mode 100644 include/linux/ktask.h
 create mode 100644 include/linux/ktask_internal.h
 create mode 100644 kernel/ktask.c

-- 
2.15.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .


* [RFC PATCH v3 1/7] ktask: add documentation
  2017-12-05 19:52 [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work Daniel Jordan
@ 2017-12-05 19:52 ` Daniel Jordan
  2017-12-05 20:59   ` Daniel Jordan
  2017-12-06 14:35   ` Michal Hocko
  2017-12-05 19:52 ` [RFC PATCH v3 2/7] ktask: multithread CPU-intensive kernel work Daniel Jordan
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 17+ messages in thread
From: Daniel Jordan @ 2017-12-05 19:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

Motivates and explains the ktask API for kernel clients.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 Documentation/core-api/index.rst |   1 +
 Documentation/core-api/ktask.rst | 173 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 174 insertions(+)
 create mode 100644 Documentation/core-api/ktask.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index d5bbe035316d..255724095814 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -15,6 +15,7 @@ Core utilities
    assoc_array
    atomic_ops
    cpu_hotplug
+   ktask
    local_ops
    workqueue
    genericirq
diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst
new file mode 100644
index 000000000000..703f200c7d36
--- /dev/null
+++ b/Documentation/core-api/ktask.rst
@@ -0,0 +1,173 @@
+============================================
+ktask: parallelize CPU-intensive kernel work
+============================================
+
+:Date: December, 2017
+:Author: Daniel Jordan <daniel.m.jordan@oracle.com>
+
+
+Introduction
+============
+
+ktask is a generic framework for parallelizing CPU-intensive work in the
+kernel.  The intended use is for big machines that can use their CPU power to
+speed up large tasks that can't otherwise be multithreaded in userland.  The
+API is generic enough to add concurrency to many different kinds of tasks--for
+example, zeroing a range of pages or evicting a list of inodes--and aims to
+save its clients the trouble of splitting up the work, choosing the number of
+threads to use, maintaining an efficient concurrency level, starting these
+threads, and load balancing the work between them.
+
+
+Motivation
+==========
+
+To ensure that applications and the kernel itself continue to perform well as
+core counts and memory sizes increase, the kernel needs to scale.  For example,
+when a system call requests a certain fraction of system resources, the kernel
+should respond in kind by devoting a similar fraction of system resources to
+service the request.
+
+Before ktask, for example, when booting a NUMA machine with many CPUs, only one
+thread per node was used to initialize struct pages.  Using additional CPUs
+that would otherwise be idle until the machine is fully up avoids a needless
+bottleneck during system boot and allows the kernel to take advantage of unused
+memory bandwidth.
+
+Why a new framework when there are existing kernel APIs for managing
+concurrency and other ways to improve performance?  Of the existing facilities,
+workqueues aren't designed to divide work up (although ktask is built on
+unbound workqueues), and kthread_worker supports only one thread.  Existing
+scalability techniques in the kernel such as doing work or holding locks in
+batches are helpful and should be applied first for performance problems, but
+eventually a single thread hits a wall.
+
+
+Concept
+=======
+
+A little terminology up front:  A 'task' is the total work there is to do and a
+'chunk' is a unit of work given to a thread.
+
+To complete a task using the ktask framework, a client provides a thread
+function that is responsible for completing one chunk.  The thread function is
+defined in a standard way, with start and end arguments that delimit the chunk
+as well as an argument that the client uses to pass data specific to the task.
+
+In addition, the client supplies an object representing the start of the task
+and an iterator function that knows how to advance some number of units in the
+task to yield another object representing the new task position.  The framework
+uses the start object and iterator internally to divide the task into chunks.
+
+Finally, the client passes the total task size and a minimum chunk size to
+indicate the minimum amount of work that's appropriate to do in one chunk.  The
+sizes are given in task-specific units (e.g. pages, inodes, bytes).  The
+framework uses these sizes, along with the number of online CPUs and an
+internal maximum number of threads, to decide how many threads to start and how
+many chunks to divide the task into.
+
+For example, consider the task of clearing a gigantic page.  This used to be
+done in a single thread with a for loop that calls a page clearing function for
+each constituent base page.  To parallelize with ktask, the client first moves
+the for loop to the thread function, adapting it to operate on the range passed
+to the function.  In this simple case, the thread function's start and end
+arguments are just addresses delimiting the portion of the gigantic page to
+clear.  Then, where the for loop used to be, the client calls into ktask with
+the start address of the gigantic page, the total size of the gigantic page,
+and the thread function.  Internally, ktask will divide the address range into
+an appropriate number of chunks and start an appropriate number of threads to
+complete these chunks.
+
+
+Configuration
+=============
+
+To use ktask, configure the kernel with CONFIG_KTASK=y.
+
+If CONFIG_KTASK=n, calls to the ktask API compile to static inline stubs that
+run the client's thread function directly, so the task is completed without
+concurrency in the current thread.
+
+
+Interface
+=========
+
+.. Include ktask.h inline here.  This file is heavily commented and documents
+.. the ktask interface.
+.. kernel-doc:: include/linux/ktask.h
+
+
+Resource Limits and Auto-Tuning
+===============================
+
+ktask has resource limits on the number of workqueue items it queues.  In
+ktask, a workqueue item is a thread that runs chunks of the task until the task
+is finished.
+
+These limits support the different ways ktask uses workqueues:
+ - ktask_run to run threads on the calling thread's node.
+ - ktask_run_numa to run threads on the node(s) specified.
+ - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the
+   system.
+
+To support these different ways of queueing work while maintaining an efficient
+concurrency level, we need both system-wide and per-node limits on the number
+of threads.  Without per-node limits, a node might become oversubscribed
+despite ktask staying within the system-wide limit, and without a system-wide
+limit, we can't properly account for work that can run on any node.
+
+The system-wide limit is based on the total number of CPUs, and the per-node
+limit on the CPU count for each node.  A per-node work item counts against the
+system-wide limit.  Workqueue's max_active can't accommodate both types of
+limit, no matter how many workqueues are used, so ktask implements its own.
+
+If a per-node limit is reached, the work item is allowed to run anywhere on the
+machine to avoid overwhelming the node.  If the global limit is also reached,
+ktask won't queue additional work items until we fall below the limit again.
+
+These limits apply only to workqueue items--that is, additional threads beyond
+the one starting the task.  That way, one thread per task is always allowed to
+run.
+
+Within the resource limits, ktask uses a default maximum number of threads per
+task to avoid disturbing other processes on the system.  Callers can change the
+limit with ktask_ctl_set_max_threads.  For example, this might be used to raise
+the maximum number of threads for a boot-time initialization task when more
+CPUs than usual are idle.
+
+
+Backward Compatibility
+======================
+
+ktask is written so that existing calls to the API will be backwards compatible
+should the API gain new features in the future.  This is accomplished by
+restricting API changes to members of struct ktask_ctl and having clients make
+an opaque initialization call (DEFINE_KTASK_CTL).  This initialization can then
+be modified to include any new arguments so that existing call sites stay the
+same.
+
+
+Error Handling
+==============
+
+Calls to ktask fail only if the provided thread function fails.  In particular,
+ktask avoids allocating memory internally during a task, so it's safe to use in
+sensitive contexts.
+
+To avoid adding features before they're used, ktask currently has only basic
+error handling.  Each call to ktask_run and ktask_run_numa returns a simple
+error code, KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.  As usage of the
+framework expands, however, error handling will likely need to be enhanced in
+two ways.
+
+First, ktask may need client-specific error reporting.  It's possible for tasks
+to fail for different reasons, so the framework should have a way to
+communicate client-specific error information.  For this purpose, ktask could
+let the client pass a pointer to its own error information in struct ktask_ctl.
+
+Second, tasks can fail midway through their work.  To recover, the finished
+chunks of work need to be undone in a task-specific way, so ktask should allow
+clients to pass an "undo" callback that is responsible for undoing one chunk of
+work.  To avoid multiple levels of error handling, this "undo" callback should
+not be allowed to fail.  The iterator used for the original task can simply be
+reused for the undo operation.
-- 
2.15.0


* [RFC PATCH v3 2/7] ktask: multithread CPU-intensive kernel work
  2017-12-05 19:52 [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work Daniel Jordan
  2017-12-05 19:52 ` [RFC PATCH v3 1/7] ktask: add documentation Daniel Jordan
@ 2017-12-05 19:52 ` Daniel Jordan
  2017-12-05 22:21   ` Andrew Morton
  2017-12-05 19:52 ` [RFC PATCH v3 3/7] ktask: add /proc/sys/debug/ktask_max_threads Daniel Jordan
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Daniel Jordan @ 2017-12-05 19:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

ktask is a generic framework for parallelizing CPU-intensive work in the
kernel.  The intended use is for big machines that can use their CPU power to
speed up large tasks that can't otherwise be multithreaded in userland.  The
API is generic enough to add concurrency to many different kinds of tasks--for
example, zeroing a range of pages or evicting a list of inodes--and aims to
save its clients the trouble of splitting up the work, choosing the number of
threads to use, maintaining an efficient concurrency level, starting these
threads, and load balancing the work between them.

The Documentation patch earlier in this series has more background.

Introduces the ktask API; consumers appear in subsequent patches.

Based on work by Pavel Tatashin, Steve Sistare, and Jonathan Adams.
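
As a rough illustration of "splitting up the work" and "choosing the number of
threads", the sketch below shows one plausible sizing policy.  It is an
assumption made for this note, not a copy of kernel/ktask.c: the sketch_*
functions are hypothetical, and only KTASK_LOAD_BAL_SHIFT and the inputs
(task size, minimum chunk size, online CPUs, maximum threads) are taken from
the patch and its documentation.

```c
#include <assert.h>
#include <stddef.h>

#define KTASK_LOAD_BAL_SHIFT	2	/* value used in kernel/ktask.c */

static size_t sketch_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical: start no more threads than there are minimum-sized chunks,
 * online CPUs, or threads allowed for this task, and always at least one.
 */
static size_t sketch_nthreads(size_t task_size, size_t min_chunk_size,
			      size_t online_cpus, size_t max_threads)
{
	size_t nthreads;

	assert(min_chunk_size > 0);
	/* Round up without risking overflow in task_size + min_chunk_size. */
	nthreads = task_size / min_chunk_size +
		   (task_size % min_chunk_size != 0);
	nthreads = sketch_min(nthreads, online_cpus);
	nthreads = sketch_min(nthreads, max_threads);
	return nthreads ? nthreads : 1;
}

/*
 * Hypothetical: give each thread several chunks (2^KTASK_LOAD_BAL_SHIFT of
 * them) so that fast threads can pick up extra work, but never go below the
 * client's minimum chunk size.
 */
static size_t sketch_chunk_size(size_t task_size, size_t min_chunk_size,
				size_t nthreads)
{
	size_t chunk;

	if (nthreads == 1)
		return task_size;

	chunk = (task_size / nthreads) >> KTASK_LOAD_BAL_SHIFT;
	return chunk > min_chunk_size ? chunk : min_chunk_size;
}
```

Under this policy a task of 16777216 units with a minimum chunk of 262144,
64 online CPUs and a 4-thread cap would use 4 threads and chunks of 1048576
units, so each thread handles about four chunks.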

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Suggested-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Suggested-by: Steve Sistare <steven.sistare@oracle.com>
Suggested-by: Jonathan Adams <jonathan.adams@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 include/linux/ktask.h          | 255 +++++++++++++++++++
 include/linux/ktask_internal.h |  22 ++
 include/linux/mm.h             |   6 +
 init/Kconfig                   |  12 +
 init/main.c                    |   2 +
 kernel/Makefile                |   2 +-
 kernel/ktask.c                 | 556 +++++++++++++++++++++++++++++++++++++++++
 7 files changed, 854 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/ktask.h
 create mode 100644 include/linux/ktask_internal.h
 create mode 100644 kernel/ktask.c

diff --git a/include/linux/ktask.h b/include/linux/ktask.h
new file mode 100644
index 000000000000..16232bdcaaef
--- /dev/null
+++ b/include/linux/ktask.h
@@ -0,0 +1,255 @@
+/*
+ * ktask.h
+ *
+ * Framework to parallelize CPU-intensive kernel work such as zeroing
+ * huge pages or freeing many pages at once.  For more information, see
+ * Documentation/core-api/ktask.rst.
+ *
+ * This is the interface to ktask; everything in this file is
+ * accessible to ktask clients.
+ *
+ * If CONFIG_KTASK=n, calls to the ktask API compile to static inline stubs
+ * that run the client's thread function directly, so the task is completed
+ * without concurrency in the current thread.
+ */
+
+#ifndef _LINUX_KTASK_H
+#define _LINUX_KTASK_H
+
+#include <linux/types.h>
+
+#define	KTASK_RETURN_SUCCESS	0
+#define	KTASK_RETURN_ERROR	(-1)
+
+/**
+ * struct ktask_node - Holds per-NUMA-node information about a task.
+ *
+ * @kn_start: An object that describes the start of the task on this NUMA node.
+ * @kn_task_size: The size of the task on this NUMA node (units are
+ *                task-specific).
+ * @kn_nid: The NUMA node id (or NUMA_NO_NODE, in which case the work may run on
+ *          any node).
+ */
+struct ktask_node {
+	void		*kn_start;
+	size_t		kn_task_size;
+	int		kn_nid;
+};
+
+/**
+ * typedef ktask_thread_func
+ *
+ * Called on each chunk of work that a ktask thread does, where the chunk is
+ * delimited by [start, end).  A thread may call this multiple times during one
+ * task.
+ *
+ * @start: An object that describes the start of the chunk.
+ * @end: An object that describes the end of the chunk.
+ * @arg: The thread function argument (provided with struct ktask_ctl).
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.
+ */
+typedef int (*ktask_thread_func)(void *start, void *end, void *arg);
+
+/**
+ * typedef ktask_iter_func
+ *
+ * An iterator function that advances the position by a given number of steps.
+ *
+ * @position: An object that describes the current position in the task.
+ * @nsteps: The number of steps to advance in the task (in task-specific
+ *          units).
+ *
+ * RETURNS:
+ * An object representing the new position.
+ */
+typedef void *(*ktask_iter_func)(void *position, size_t nsteps);
+
+/**
+ * ktask_iter_range
+ *
+ * An iterator function for a contiguous range such as an array or address
+ * range.  This is the default iterator; clients may override with
+ * ktask_ctl_set_iter_func.
+ *
+ * @position: An object that describes the current position in the task.
+ *            Interpreted as an unsigned long.
+ * @nsteps: The number of steps to advance in the task (in task-specific
+ *          units).
+ *
+ * RETURNS:
+ * (position + nsteps)
+ */
+void *ktask_iter_range(void *position, size_t nsteps);
+
+/**
+ * struct ktask_ctl - Client-provided per-task control information.
+ *
+ * @kc_thread_func: A thread function that completes one chunk of the task per
+ *                  call.
+ * @kc_thread_func_arg: An argument to be passed to the thread function.
+ * @kc_iter_func: An iterator function to advance the position by some number
+ *                of task-specific units.
+ * @kc_min_chunk_size: The minimum chunk size in task-specific units.  This
+ *                     allows the client to communicate the minimum amount of
+ *                     work that's appropriate for one worker thread to do at
+ *                     once.
+ * @kc_max_threads: The maximum number of threads to use for the task.
+ *                  The actual number used may be less than this if the
+ *                  framework determines that fewer threads would be better,
+ *                  taking into account such things as total CPU count and
+ *                  task size.  Pass 0 to use ktask's default maximum.
+ */
+struct ktask_ctl {
+	/* Required arguments set with DEFINE_KTASK_CTL. */
+	ktask_thread_func	kc_thread_func;
+	void			*kc_thread_func_arg;
+	size_t			kc_min_chunk_size;
+
+	/*
+	 * Optional arguments set with ktask_ctl_set_* functions.  Defaults
+	 * listed to the side.
+	 */
+	ktask_iter_func		kc_iter_func;    /* ktask_iter_range */
+	size_t			kc_max_threads;  /* 0 (uses internal limit) */
+};
+
+#define KTASK_CTL_INITIALIZER(thread_func, thread_func_arg, min_chunk_size)  \
+	{								     \
+		.kc_thread_func = (ktask_thread_func)(thread_func),	     \
+		.kc_thread_func_arg = (thread_func_arg),		     \
+		.kc_min_chunk_size = (min_chunk_size),			     \
+		.kc_iter_func = (ktask_iter_range),			     \
+		.kc_max_threads = (0),					     \
+	}
+
+/*
+ * Note that KTASK_CTL_INITIALIZER casts 'thread_func' to be of type
+ * ktask_thread_func.  This is to help clients write cleaner thread functions
+ * by relieving them of the need to cast the three void * arguments.  Clients
+ * can just use the actual argument types instead.
+ */
+#define DEFINE_KTASK_CTL(ctl_name, thread_func, thread_func_arg,	  \
+			 min_chunk_size)				  \
+	struct ktask_ctl ctl_name =					  \
+		KTASK_CTL_INITIALIZER(thread_func, thread_func_arg,	  \
+				      min_chunk_size)
+
+/**
+ * ktask_ctl_set_iter_func - Set a task-specific iterator
+ *
+ * This overrides the default iterator, ktask_iter_range.
+ *
+ * @ctl:  A control structure containing information about the task.
+ * @iter_func:  Client-provided iterator function that conforms to the
+ *              declaration of ktask_iter_func.
+ */
+static inline void ktask_ctl_set_iter_func(struct ktask_ctl *ctl,
+					   ktask_iter_func iter_func)
+{
+	ctl->kc_iter_func = iter_func;
+}
+
+/**
+ * ktask_ctl_set_max_threads - Set a task-specific maximum number of threads
+ *
+ * This overrides the default maximum, which is KTASK_DEFAULT_MAX_THREADS
+ * initially and may be changed via /proc/sys/debug/ktask_max_threads.
+ *
+ * @ctl:  A control structure containing information about the task.
+ * @max_threads:  The maximum number of threads to be started for this task.
+ *                The actual number of threads may be less than this.
+ */
+static inline void ktask_ctl_set_max_threads(struct ktask_ctl *ctl,
+					     size_t max_threads)
+{
+	ctl->kc_max_threads = max_threads;
+}
+
+#ifdef CONFIG_KTASK
+
+/**
+ * ktask_run - Runs one task.
+ *
+ * Starts threads to complete one task with the given thread function.  Waits
+ * for the task to finish before returning.
+ *
+ * On a NUMA system, threads run on the current node.  This is designed to
+ * mirror other parts of the kernel that favor locality, such as the default
+ * memory policy of allocating pages from the same node as the calling thread.
+ * ktask_run_numa may be used to get more control over where threads run.
+ *
+ * @start: An object that describes the start of the task.  The client thread
+ *         function interprets the object however it sees fit (e.g. an array
+ *         index, a simple pointer, or a pointer to a more complicated
+ *         representation of job position).
+ * @task_size:  The size of the task (units are task-specific).
+ * @ctl:  A control structure containing information about the task, including
+ *        the client thread function.
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.
+ */
+int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl);
+
+/**
+ * ktask_run_numa - Runs one task while accounting for NUMA locality.
+ *
+ * Starts threads on the requested nodes to complete one task with the given
+ * thread function.  The client is responsible for organizing the work along
+ * NUMA boundaries in the 'nodes' array.  Waits for the task to finish before
+ * returning.
+ *
+ * In the special case of NUMA_NO_NODE, threads are allowed to run on any node.
+ * This is distinct from ktask_run, which runs threads on the current node.
+ *
+ * @nodes: An array of struct ktask_node's, each of which describes the task on
+ *         a NUMA node (see struct ktask_node).
+ * @nr_nodes:  The length of the 'nodes' array.
+ * @ctl:  A control structure containing information about the task (see
+ *        the definition of struct ktask_ctl).
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.
+ */
+int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
+		   struct ktask_ctl *ctl);
+
+void ktask_init(void);
+
+#else  /* CONFIG_KTASK */
+
+static inline int ktask_run(void *start, size_t task_size,
+			    struct ktask_ctl *ctl)
+{
+	return ctl->kc_thread_func(start,
+				   ctl->kc_iter_func(start, task_size),
+				   ctl->kc_thread_func_arg);
+}
+
+static inline int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
+				 struct ktask_ctl *ctl)
+{
+	size_t i;
+	int err = KTASK_RETURN_SUCCESS;
+
+	for (i = 0; i < nr_nodes; ++i) {
+		err = ctl->kc_thread_func(
+			    nodes[i].kn_start,
+			    ctl->kc_iter_func(nodes[i].kn_start,
+						 nodes[i].kn_task_size),
+			    ctl->kc_thread_func_arg);
+
+		if (err == KTASK_RETURN_ERROR)
+			break;
+	}
+
+	return err;
+}
+
+static inline void ktask_init(void) { }
+
+#endif /* CONFIG_KTASK */
+
+#endif /* _LINUX_KTASK_H */
diff --git a/include/linux/ktask_internal.h b/include/linux/ktask_internal.h
new file mode 100644
index 000000000000..50d339d6eed1
--- /dev/null
+++ b/include/linux/ktask_internal.h
@@ -0,0 +1,22 @@
+/*
+ * ktask_internal.h
+ *
+ * Framework to parallelize CPU-intensive kernel work such as zeroing
+ * huge pages or freeing many pages at once.  For more information, see
+ * Documentation/core-api/ktask.rst.
+ *
+ * This file contains implementation details of ktask for core kernel code that
+ * needs to be aware of them.  ktask clients should not include this file.
+ */
+#ifndef _LINUX_KTASK_INTERNAL_H
+#define _LINUX_KTASK_INTERNAL_H
+
+#include <linux/ktask.h>
+
+#ifdef CONFIG_KTASK
+/* Caps the number of threads that are allowed to be used in one task. */
+extern int ktask_max_threads;
+
+#endif /* CONFIG_KTASK */
+
+#endif /* _LINUX_KTASK_INTERNAL_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ea818ff739cd..50fa9b3d9d2c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2669,5 +2669,11 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+/*
+ * The minimum chunk size for a task that uses base page units.  For now, say
+ * 1G's worth of pages.
+ */
+#define	KTASK_BPGS_MINCHUNK		((1ul << 30) / PAGE_SIZE)
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/init/Kconfig b/init/Kconfig
index 2934249fba46..2a7b120de4d4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -319,6 +319,18 @@ config AUDIT_TREE
 	depends on AUDITSYSCALL
 	select FSNOTIFY
 
+config KTASK
+	bool "Multithread CPU-intensive kernel tasks"
+	depends on SMP
+	depends on NR_CPUS > 16
+	default n
+	help
+	  Parallelize expensive kernel tasks such as zeroing huge pages.  This
+	  feature is designed for big machines that can take advantage of their
+	  CPU count to speed up large kernel tasks.
+
+	  If unsure, say 'N'.
+
 source "kernel/irq/Kconfig"
 source "kernel/time/Kconfig"
 
diff --git a/init/main.c b/init/main.c
index dfec3809e740..e771199f0c60 100644
--- a/init/main.c
+++ b/init/main.c
@@ -88,6 +88,7 @@
 #include <linux/io.h>
 #include <linux/cache.h>
 #include <linux/rodata_test.h>
+#include <linux/ktask.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -1060,6 +1061,7 @@ static noinline void __init kernel_init_freeable(void)
 
 	smp_init();
 	sched_init_smp();
+	ktask_init();
 
 	page_alloc_init_late();
 
diff --git a/kernel/Makefile b/kernel/Makefile
index 172d151d429c..f8d1ed267ebd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y     = fork.o exec_domain.o panic.o \
 	    extable.o params.o \
 	    kthread.o sys_ni.o nsproxy.o \
 	    notifier.o ksysfs.o cred.o reboot.o \
-	    async.o range.o smpboot.o ucount.o
+	    async.o range.o smpboot.o ucount.o ktask.o
 
 obj-$(CONFIG_MODULES) += kmod.o
 obj-$(CONFIG_MULTIUSER) += groups.o
diff --git a/kernel/ktask.c b/kernel/ktask.c
new file mode 100644
index 000000000000..7b075075b56b
--- /dev/null
+++ b/kernel/ktask.c
@@ -0,0 +1,556 @@
+/*
+ * ktask.c
+ *
+ * Framework to parallelize CPU-intensive kernel work such as zeroing
+ * huge pages or freeing many pages at once.  For more information, see
+ * Documentation/core-api/ktask.rst.
+ *
+ * This is the ktask implementation; everything in this file is private to
+ * ktask.
+ */
+
+#define pr_fmt(fmt)	"ktask: " fmt
+
+#include <linux/ktask.h>
+
+#ifdef CONFIG_KTASK
+
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/completion.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/ktask_internal.h>
+#include <linux/mutex.h>
+#include <linux/printk.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/workqueue.h>
+
+/* Resource limits on the amount of workqueue items queued through ktask. */
+spinlock_t ktask_rlim_lock;
+/* Work items queued on all nodes (includes NUMA_NO_NODE) */
+size_t ktask_rlim_cur;
+size_t ktask_rlim_max;
+/* Work items queued per node */
+size_t *ktask_rlim_node_cur;
+size_t *ktask_rlim_node_max;
+
+/* Allow only 80% of the cpus to be running additional ktask threads. */
+#define	KTASK_CPUFRAC_NUMER	4
+#define	KTASK_CPUFRAC_DENOM	5
+
+/* Used to pass ktask data to the workqueue API. */
+struct ktask_work {
+	struct work_struct	kw_work;
+	struct ktask_task	*kw_task;
+	int			kw_ktask_node_i;
+	int			kw_queue_nid;
+	struct list_head	kw_list;	/* ktask_free_works linkage */
+};
+
+static LIST_HEAD(ktask_free_works);
+struct ktask_work *ktask_works;
+
+/* Represents one task.  This is for internal use only. */
+struct ktask_task {
+	struct ktask_ctl	kt_ctl;
+	size_t			kt_total_size;
+	size_t			kt_chunk_size;
+	/* mutex protects nodes, nr_nodes_left, nthreads_fini, error */
+	struct mutex		kt_mutex;
+	struct ktask_node	*kt_nodes;
+	size_t			kt_nr_nodes;
+	size_t			kt_nr_nodes_left;
+	size_t			kt_nthreads;
+	size_t			kt_nthreads_fini;
+	int			kt_error; /* tracks error(s) from thread_func */
+	struct completion	kt_ktask_done;
+};
+
+/*
+ * Shrink the size of each job by this shift amount to load balance between the
+ * worker threads.
+ */
+#define	KTASK_LOAD_BAL_SHIFT		2
+
+#define	KTASK_DEFAULT_MAX_THREADS	4
+
+/* Maximum number of threads for a single task. */
+int ktask_max_threads = KTASK_DEFAULT_MAX_THREADS;
+
+static struct workqueue_struct *ktask_wq;
+static struct workqueue_struct *ktask_nonuma_wq;
+
+static void ktask_thread(struct work_struct *work);
+
+static inline void ktask_init_work(struct ktask_work *kw, struct ktask_task *kt,
+			    size_t ktask_node_i, int queue_nid)
+{
+	INIT_WORK(&kw->kw_work, ktask_thread);
+	kw->kw_task = kt;
+	kw->kw_ktask_node_i = ktask_node_i;
+	kw->kw_queue_nid = queue_nid;
+}
+
+static void ktask_queue_work(struct ktask_work *kw)
+{
+	struct workqueue_struct *wq;
+	int cpu;
+
+	if (kw->kw_queue_nid == NUMA_NO_NODE) {
+		/*
+		 * If no node is specified, use ktask_nonuma_wq to
+		 * allow the thread to run on any node, but fall back
+		 * to ktask_wq if we couldn't allocate ktask_nonuma_wq.
+		 */
+		cpu = WORK_CPU_UNBOUND;
+		wq = (ktask_nonuma_wq) ?: ktask_wq;
+	} else {
+		/*
+		 * WQ_UNBOUND workqueues, such as the one ktask uses,
+		 * execute work on some CPU from the node of the CPU we
+		 * pass to queue_work_on, so just pick any CPU to stand
+		 * for the node on NUMA systems.
+		 *
+		 * On non-NUMA systems, cpumask_of_node becomes
+		 * cpu_online_mask.
+		 */
+		cpu = cpumask_any(cpumask_of_node(kw->kw_queue_nid));
+		wq = ktask_wq;
+	}
+
+	WARN_ON(!queue_work_on(cpu, wq, &kw->kw_work));
+}
+
+#ifdef CONFIG_NUMA
+
+/* Returns true if we're migrating this part of the task to another node. */
+static bool ktask_node_migrate(struct ktask_node *old_kn, struct ktask_node *kn,
+			       size_t ktask_node_i, struct ktask_work *kw,
+			       struct ktask_task *kt)
+{
+	int new_queue_nid;
+
+	/*
+	 * Don't migrate a user thread, otherwise migrate only if we're going
+	 * to a different node.
+	 */
+	if (!(current->flags & PF_KTHREAD) || kn->kn_nid == old_kn->kn_nid ||
+	    num_online_nodes() == 1)
+		return false;
+
+	/* Adjust resource limits. */
+	spin_lock(&ktask_rlim_lock);
+	if (kw->kw_queue_nid != NUMA_NO_NODE)
+		--ktask_rlim_node_cur[kw->kw_queue_nid];
+
+	if (kn->kn_nid != NUMA_NO_NODE &&
+	    ktask_rlim_node_cur[kn->kn_nid] <
+	    ktask_rlim_node_max[kn->kn_nid]) {
+		new_queue_nid = kn->kn_nid;
+		++ktask_rlim_node_cur[new_queue_nid];
+	} else {
+		new_queue_nid = NUMA_NO_NODE;
+	}
+	spin_unlock(&ktask_rlim_lock);
+
+	ktask_init_work(kw, kt, ktask_node_i, new_queue_nid);
+	ktask_queue_work(kw);
+
+	return true;
+}
+
+#else /* CONFIG_NUMA */
+
+static bool ktask_node_migrate(struct ktask_node *old_kn, struct ktask_node *kn,
+			       size_t ktask_node_i, struct ktask_work *kw,
+			       struct ktask_task *kt)
+{
+	return false;
+}
+
+#endif /* CONFIG_NUMA */
+
+static void ktask_thread(struct work_struct *work)
+{
+	struct ktask_work  *kw;
+	struct ktask_task  *kt;
+	struct ktask_ctl   *kc;
+	struct ktask_node  *kn;
+	bool               done;
+
+	kw = container_of(work, struct ktask_work, kw_work);
+	kt = kw->kw_task;
+	kc = &kt->kt_ctl;
+	kn = &kt->kt_nodes[kw->kw_ktask_node_i];
+
+	mutex_lock(&kt->kt_mutex);
+
+	while (kt->kt_total_size > 0 && kt->kt_error == KTASK_RETURN_SUCCESS) {
+		void *start, *end;
+		size_t nsteps;
+		int ret;
+
+		if (kn->kn_task_size == 0) {
+			/* The current node is out of work; pick a new one. */
+			size_t remaining_nodes_seen = 0;
+			size_t new_idx = prandom_u32_max(kt->kt_nr_nodes_left);
+			struct ktask_node *old_kn;
+			size_t i;
+
+			WARN_ON(kt->kt_nr_nodes_left == 0);
+			WARN_ON(new_idx >= kt->kt_nr_nodes_left);
+			for (i = 0; i < kt->kt_nr_nodes; ++i) {
+				if (kt->kt_nodes[i].kn_task_size == 0)
+					continue;
+
+				if (remaining_nodes_seen >= new_idx)
+					break;
+
+				++remaining_nodes_seen;
+			}
+			/* We should have found work on another node. */
+			WARN_ON(i >= kt->kt_nr_nodes);
+
+			old_kn = kn;
+			kn = &kt->kt_nodes[i];
+
+			/* Start another worker on the node we've chosen. */
+			if (ktask_node_migrate(old_kn, kn, i, kw, kt)) {
+				mutex_unlock(&kt->kt_mutex);
+				return;
+			}
+		}
+
+		start = kn->kn_start;
+		nsteps = min(kt->kt_chunk_size, kn->kn_task_size);
+		end = kc->kc_iter_func(start, nsteps);
+		kn->kn_start = end;
+		WARN_ON(kn->kn_task_size < nsteps);
+		kn->kn_task_size -= nsteps;
+		WARN_ON(kt->kt_total_size < nsteps);
+		kt->kt_total_size -= nsteps;
+		if (kn->kn_task_size == 0) {
+			WARN_ON(kt->kt_nr_nodes_left == 0);
+			kt->kt_nr_nodes_left--;
+		}
+
+		mutex_unlock(&kt->kt_mutex);
+
+		ret = kc->kc_thread_func(start, end, kc->kc_thread_func_arg);
+
+		mutex_lock(&kt->kt_mutex);
+
+		if (ret == KTASK_RETURN_ERROR)
+			kt->kt_error = KTASK_RETURN_ERROR;
+	}
+
+	WARN_ON(kt->kt_nr_nodes_left > 0 &&
+		kt->kt_error == KTASK_RETURN_SUCCESS);
+
+	++kt->kt_nthreads_fini;
+	WARN_ON(kt->kt_nthreads_fini > kt->kt_nthreads);
+	done = (kt->kt_nthreads_fini == kt->kt_nthreads);
+	mutex_unlock(&kt->kt_mutex);
+
+	if (done)
+		complete(&kt->kt_ktask_done);
+}
+
+/*
+ * Returns the size of the chunks to break this task into.
+ *
+ * The number of chunks will be at least the number of threads, but in the
+ * common case of a large task, the number of chunks will be greater to load
+ * balance the work between threads in case some threads finish their work more
+ * quickly than others.
+ */
+static inline size_t ktask_chunk_size(size_t task_size, size_t min_chunk_size,
+				    size_t nthreads)
+{
+	size_t chunk_size;
+
+	if (nthreads == 1)
+		return task_size;
+
+	chunk_size = (task_size / nthreads) >> KTASK_LOAD_BAL_SHIFT;
+
+	/*
+	 * chunk_size should be a multiple of min_chunk_size for tasks that
+	 * need to operate in fixed-size batches.
+	 */
+	if (chunk_size > min_chunk_size)
+		chunk_size = rounddown(chunk_size, min_chunk_size);
+
+	return max(chunk_size, min_chunk_size);
+}
+
+/*
+ * Prepares to run the task by computing the number of threads, checking
+ * the ktask resource limits, finding the chunk size, and initializing the
+ * work items.
+ */
+static size_t ktask_prepare_threads(struct ktask_node *nodes, size_t nr_nodes,
+				    struct ktask_task *kt,
+				    struct list_head *to_queue)
+{
+	size_t i, nthreads, nthreads_check;
+	size_t min_chunk_size = kt->kt_ctl.kc_min_chunk_size;
+	size_t max_threads    = kt->kt_ctl.kc_max_threads;
+
+	if (!ktask_wq)
+		return 1;
+
+	if (max_threads == 0)
+		max_threads = ktask_max_threads;
+
+	/* Ensure at least one thread when task_size < min_chunk_size. */
+	nthreads_check = DIV_ROUND_UP(kt->kt_total_size, min_chunk_size);
+	nthreads_check = min_t(size_t, nthreads_check, num_online_cpus());
+	nthreads_check = min_t(size_t, nthreads_check, max_threads);
+
+	/*
+	 * Use at least the current thread for this task; check whether
+	 * ktask_rlim allows additional work items to be queued.
+	 */
+	nthreads = 1;
+	spin_lock(&ktask_rlim_lock);
+	for (i = nthreads; i < nthreads_check; ++i) {
+		/* Spread threads across nodes evenly. */
+		size_t ktask_node_i = i % nr_nodes;
+		struct ktask_node *kn = &nodes[ktask_node_i];
+		struct ktask_work *kw;
+		int nid = kn->kn_nid;
+		int queue_nid;
+
+		WARN_ON(ktask_rlim_cur > ktask_rlim_max);
+		if (ktask_rlim_cur == ktask_rlim_max)
+			break;	/* No more work items allowed to be queued. */
+
+		/* Allowed to queue on requested node? */
+		if (nid != NUMA_NO_NODE &&
+		    ktask_rlim_node_cur[nid] < ktask_rlim_node_max[nid]) {
+			WARN_ON(ktask_rlim_node_cur[nid] > ktask_rlim_cur);
+			++ktask_rlim_node_cur[nid];
+			queue_nid = nid;
+		} else {
+			queue_nid = NUMA_NO_NODE;
+		}
+
+		BUG_ON(list_empty(&ktask_free_works));
+		kw = list_first_entry(&ktask_free_works, struct ktask_work,
+				      kw_list);
+		list_move_tail(&kw->kw_list, to_queue);
+		ktask_init_work(kw, kt, ktask_node_i, queue_nid);
+
+		++ktask_rlim_cur;
+		++nthreads;
+	}
+	spin_unlock(&ktask_rlim_lock);
+
+	return nthreads;
+}
+
+int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
+		   struct ktask_ctl *ctl)
+{
+	size_t i;
+	struct ktask_work kw;
+	struct ktask_work *kw_cur, *kw_next;
+	LIST_HEAD(to_queue);
+	struct ktask_task kt = {
+		.kt_ctl             = *ctl,
+		.kt_total_size      = 0,
+		.kt_nodes           = nodes,
+		.kt_nr_nodes        = nr_nodes,
+		.kt_nr_nodes_left   = nr_nodes,
+		.kt_nthreads_fini   = 0,
+		.kt_error           = KTASK_RETURN_SUCCESS,
+	};
+
+	for (i = 0; i < nr_nodes; ++i) {
+		kt.kt_total_size += nodes[i].kn_task_size;
+		if (nodes[i].kn_task_size == 0)
+			kt.kt_nr_nodes_left--;
+
+		WARN_ON(nodes[i].kn_nid >= MAX_NUMNODES);
+	}
+
+	if (kt.kt_total_size == 0)
+		return KTASK_RETURN_SUCCESS;
+
+	mutex_init(&kt.kt_mutex);
+
+	init_completion(&kt.kt_ktask_done);
+
+	kt.kt_nthreads = ktask_prepare_threads(nodes, nr_nodes, &kt, &to_queue);
+	kt.kt_chunk_size = ktask_chunk_size(kt.kt_total_size,
+					    ctl->kc_min_chunk_size,
+					    kt.kt_nthreads);
+
+	list_for_each_entry_safe(kw_cur, kw_next, &to_queue, kw_list)
+		ktask_queue_work(kw_cur);
+
+	/*
+	 * Make ourselves one of the threads, which saves launching a workqueue
+	 * worker.
+	 */
+	INIT_WORK(&kw.kw_work, ktask_thread);
+	kw.kw_task = &kt;
+	kw.kw_ktask_node_i = 0;
+	ktask_thread(&kw.kw_work);
+
+	/* Wait for all the jobs to finish. */
+	wait_for_completion(&kt.kt_ktask_done);
+
+	spin_lock(&ktask_rlim_lock);
+
+	/* Put the works back on the free list, adjusting rlimits. */
+	list_for_each_entry_safe(kw_cur, kw_next, &to_queue, kw_list) {
+		if (kw_cur->kw_queue_nid != NUMA_NO_NODE) {
+			WARN_ON(ktask_rlim_node_cur[kw_cur->kw_queue_nid] == 0);
+			--ktask_rlim_node_cur[kw_cur->kw_queue_nid];
+		}
+		WARN_ON(ktask_rlim_cur == 0);
+		--ktask_rlim_cur;
+	}
+	list_splice(&to_queue, &ktask_free_works);
+	spin_unlock(&ktask_rlim_lock);
+
+	mutex_destroy(&kt.kt_mutex);
+
+	return kt.kt_error;
+}
+EXPORT_SYMBOL_GPL(ktask_run_numa);
+
+int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl)
+{
+	struct ktask_node node;
+
+	node.kn_start = start;
+	node.kn_task_size = task_size;
+	node.kn_nid = numa_node_id();
+
+	return ktask_run_numa(&node, 1, ctl);
+}
+EXPORT_SYMBOL_GPL(ktask_run);
+
+/*
+ * Initialize internal limits on work items queued.  Work items submitted to
+ * cmwq are capped at 80% of online CPUs, both system-wide and per-node, to
+ * maintain an efficient level of parallelization at these respective levels.
+ */
+bool ktask_rlim_init(void)
+{
+	int node;
+	unsigned nr_node_cpus;
+
+	spin_lock_init(&ktask_rlim_lock);
+
+	ktask_rlim_node_cur = kcalloc(num_possible_nodes(),
+					       sizeof(size_t),
+					       GFP_KERNEL);
+	if (!ktask_rlim_node_cur) {
+		pr_warn("can't alloc rlim counts (ktask disabled)\n");
+		return false;
+	}
+
+	ktask_rlim_node_max = kmalloc_array(num_possible_nodes(),
+						     sizeof(size_t),
+						     GFP_KERNEL);
+	if (!ktask_rlim_node_max) {
+		kfree(ktask_rlim_node_cur);
+		pr_warn("can't alloc rlim maximums (ktask disabled)\n");
+		return false;
+	}
+
+	ktask_rlim_max = mult_frac(num_online_cpus(), KTASK_CPUFRAC_NUMER,
+						      KTASK_CPUFRAC_DENOM);
+	for_each_node(node) {
+		nr_node_cpus = cpumask_weight(cpumask_of_node(node));
+		ktask_rlim_node_max[node] = mult_frac(nr_node_cpus,
+						      KTASK_CPUFRAC_NUMER,
+						      KTASK_CPUFRAC_DENOM);
+	}
+
+	return true;
+}
+
+void __init ktask_init(void)
+{
+	struct workqueue_attrs *attrs;
+	int i, ret;
+
+	if (!ktask_rlim_init())
+		goto out;
+
+	ktask_works = kmalloc_array(ktask_rlim_max, sizeof(struct ktask_work),
+				    GFP_KERNEL);
+	if (!ktask_works) {
+		pr_warn("failed to alloc ktask_works (ktask disabled)\n");
+		goto out;
+	}
+	for (i = 0; i < ktask_rlim_max; ++i)
+		list_add_tail(&ktask_works[i].kw_list, &ktask_free_works);
+
+	ktask_wq = alloc_workqueue("ktask_wq", WQ_UNBOUND, 0);
+	if (!ktask_wq) {
+		pr_warn("failed to alloc ktask_wq (ktask disabled)\n");
+		goto out;
+	}
+
+	/*
+	 * Threads executing work from this workqueue can run on any node on
+	 * the system.  If we get any failures below, use ktask_wq in its
+	 * place.  It's better than nothing.
+	 */
+	ktask_nonuma_wq = alloc_workqueue("ktask_nonuma_wq", WQ_UNBOUND, 0);
+	if (!ktask_nonuma_wq) {
+		pr_warn("failed to alloc ktask_nonuma_wq\n");
+		goto out;
+	}
+
+	attrs = alloc_workqueue_attrs(GFP_KERNEL);
+	if (!attrs) {
+		pr_warn("alloc_workqueue_attrs failed\n");
+		goto alloc_fail;
+	}
+
+	attrs->no_numa = true;
+
+	ret = apply_workqueue_attrs(ktask_nonuma_wq, attrs);
+	if (ret != 0) {
+		pr_warn("apply_workqueue_attrs failed\n");
+		goto apply_fail;
+	}
+
+	free_workqueue_attrs(attrs);
+out:
+	return;
+
+apply_fail:
+	free_workqueue_attrs(attrs);
+alloc_fail:
+	destroy_workqueue(ktask_nonuma_wq);
+	ktask_nonuma_wq = NULL;
+}
+
+#endif /* CONFIG_KTASK */
+
+/*
+ * This function is defined outside CONFIG_KTASK so it can be called in the
+ * !CONFIG_KTASK versions of ktask_run and ktask_run_numa.
+ */
+void *ktask_iter_range(void *position, size_t nsteps)
+{
+	return (char *)position + nsteps;
+}
-- 
2.15.0
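
For readers following the sizing arithmetic in this patch, here is a
minimal userspace sketch of two calculations: the chunk size chosen by
ktask_chunk_size() and the 80% cap computed in ktask_rlim_init().  The
names model_chunk_size and model_rlim_max are made up for this
illustration, the sizes passed in are arbitrary, and mult_frac() is
spelled out by hand on the assumption that it multiplies by a fraction
without forming the full intermediate product.  It is a model, not the
kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as KTASK_LOAD_BAL_SHIFT and KTASK_CPUFRAC_* in the patch. */
#define LOAD_BAL_SHIFT	2
#define CPUFRAC_NUMER	4
#define CPUFRAC_DENOM	5

/* Hypothetical userspace model of ktask_chunk_size(). */
static size_t model_chunk_size(size_t task_size, size_t min_chunk_size,
			       size_t nthreads)
{
	size_t chunk_size;

	assert(min_chunk_size > 0 && nthreads > 0);

	/* A single thread takes the whole task in one chunk. */
	if (nthreads == 1)
		return task_size;

	/* Aim for roughly 2^LOAD_BAL_SHIFT chunks per thread. */
	chunk_size = (task_size / nthreads) >> LOAD_BAL_SHIFT;

	/* Round down to a multiple of min_chunk_size, like rounddown(). */
	if (chunk_size > min_chunk_size)
		chunk_size -= chunk_size % min_chunk_size;

	return chunk_size > min_chunk_size ? chunk_size : min_chunk_size;
}

/* Hypothetical model of the 80% cap: x * 4 / 5 in integer arithmetic. */
static size_t model_rlim_max(size_t nr_cpus)
{
	size_t quot = nr_cpus / CPUFRAC_DENOM;
	size_t rem = nr_cpus % CPUFRAC_DENOM;

	return quot * CPUFRAC_NUMER + (rem * CPUFRAC_NUMER) / CPUFRAC_DENOM;
}
```

With a task of 262144 steps, a minimum chunk of 512 steps (an arbitrary
value chosen for the example) and 4 threads, model_chunk_size() returns
16384, so the task is handed out in 16 chunks.  model_rlim_max(288)
returns 230 and model_rlim_max(1024) returns 819, which would be the
system-wide caps on machines with 288 and 1024 online CPUs.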

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH v3 3/7] ktask: add /proc/sys/debug/ktask_max_threads
  2017-12-05 19:52 [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work Daniel Jordan
  2017-12-05 19:52 ` [RFC PATCH v3 1/7] ktask: add documentation Daniel Jordan
  2017-12-05 19:52 ` [RFC PATCH v3 2/7] ktask: multithread CPU-intensive kernel work Daniel Jordan
@ 2017-12-05 19:52 ` Daniel Jordan
  2017-12-05 19:52 ` [RFC PATCH v3 4/7] mm: enlarge type of offset argument in mem_map_offset and mem_map_next Daniel Jordan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Daniel Jordan @ 2017-12-05 19:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

Adds a proc file to control the maximum number of ktask threads in use
for any one job.  Its primary use is to aid in debugging.
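
A minimal sketch of reading and setting the knob from a C program,
assuming a kernel built with CONFIG_KTASK and this patch applied.  The
helper names read_int_file and write_int_file are made up for the
example, and writing to the proc file requires root.

```c
#include <assert.h>
#include <stdio.h>

/* Path added by this patch; only present on a CONFIG_KTASK kernel. */
#define KTASK_MAX_THREADS_PATH "/proc/sys/debug/ktask_max_threads"

/* Read one integer from a sysctl-style file; 0 on success, -1 on error. */
static int read_int_file(const char *path, int *value)
{
	FILE *f;
	int ok;

	assert(path && value);

	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Write one integer plus a newline; 0 on success, -1 on error. */
static int write_int_file(const char *path, int value)
{
	FILE *f;
	int ok;

	assert(path);

	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = (fprintf(f, "%d\n", value) > 0);
	/* fclose() flushes the buffer, so a rejected write shows up here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

For example, write_int_file(KTASK_MAX_THREADS_PATH, 8) followed by
read_int_file(KTASK_MAX_THREADS_PATH, &n) should leave n at 8 on a
kernel that has the file; on any other kernel both calls return -1.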

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 kernel/sysctl.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 557d46728577..e296906e609e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -67,6 +67,7 @@
 #include <linux/bpf.h>
 #include <linux/mount.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/ktask_internal.h>
 
 #include <linux/uaccess.h>
 #include <asm/processor.h>
@@ -1867,6 +1868,15 @@ static struct ctl_table debug_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+#endif
+#if defined(CONFIG_KTASK)
+	{
+		.procname	= "ktask_max_threads",
+		.data		= &ktask_max_threads,
+		.maxlen		= sizeof(ktask_max_threads),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #endif
 	{ }
 };
-- 
2.15.0


* [RFC PATCH v3 4/7] mm: enlarge type of offset argument in mem_map_offset and mem_map_next
  2017-12-05 19:52 [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work Daniel Jordan
                   ` (2 preceding siblings ...)
  2017-12-05 19:52 ` [RFC PATCH v3 3/7] ktask: add /proc/sys/debug/ktask_max_threads Daniel Jordan
@ 2017-12-05 19:52 ` Daniel Jordan
  2017-12-05 19:52 ` [RFC PATCH v3 5/7] mm: parallelize clear_gigantic_page Daniel Jordan
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Daniel Jordan @ 2017-12-05 19:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

Changes the type of 'offset' from int to unsigned long in both
mem_map_offset and mem_map_next.

This facilitates ktask's use of mem_map_next with its unsigned long
types to avoid silent truncation when these unsigned longs are passed as
ints.

It also fixes the preexisting truncation of 'offset' from unsigned long
to int by the sole caller of mem_map_offset, follow_hugetlb_page.
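
To make the truncation concrete, here is a small userspace sketch.  It
uses unsigned long long as a stand-in for the kernel's 64-bit unsigned
long so the result does not depend on the platform's long width, and it
assumes int is narrower than 64 bits.  The function names are made up
for the illustration.  The cast is written out explicitly here, whereas
at a real call site the narrowing happens implicitly; the value produced
by an out-of-range conversion is implementation-defined in C.

```c
#include <assert.h>
#include <limits.h>

/* Stand-in for the old prototypes: the offset parameter is a plain int. */
static unsigned long long offset_seen_by_callee(int offset)
{
	/* Widening back cannot restore bits that the int never held. */
	return (unsigned long long)offset;
}

/* Returns 1 if 'offset' keeps its value when passed through an int. */
static int offset_survives_int(unsigned long long offset)
{
	return offset_seen_by_callee((int)offset) == offset;
}

/* Offsets up to INT_MAX round-trip; anything larger cannot. */
static void truncation_demo(void)
{
	assert(offset_survives_int(4096ULL));
	assert(offset_survives_int((unsigned long long)INT_MAX));
	assert(!offset_survives_int((unsigned long long)INT_MAX + 1));
}
```

Calling truncation_demo() passes all three checks: small offsets are
unaffected, which is why the old int prototypes only misbehave once an
offset exceeds INT_MAX.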

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 mm/internal.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..cee1325fa682 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -366,7 +366,8 @@ static inline void mlock_migrate_page(struct page *new, struct page *old) { }
  * the maximally aligned gigantic page 'base'.  Handle any discontiguity
  * in the mem_map at MAX_ORDER_NR_PAGES boundaries.
  */
-static inline struct page *mem_map_offset(struct page *base, int offset)
+static inline struct page *mem_map_offset(struct page *base,
+					  unsigned long offset)
 {
 	if (unlikely(offset >= MAX_ORDER_NR_PAGES))
 		return nth_page(base, offset);
@@ -377,8 +378,8 @@ static inline struct page *mem_map_offset(struct page *base, int offset)
  * Iterator over all subpages within the maximally aligned gigantic
  * page 'base'.  Handle any discontiguity in the mem_map.
  */
-static inline struct page *mem_map_next(struct page *iter,
-						struct page *base, int offset)
+static inline struct page *mem_map_next(struct page *iter, struct page *base,
+					unsigned long offset)
 {
 	if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
 		unsigned long pfn = page_to_pfn(base) + offset;
-- 
2.15.0


* [RFC PATCH v3 5/7] mm: parallelize clear_gigantic_page
  2017-12-05 19:52 [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work Daniel Jordan
                   ` (3 preceding siblings ...)
  2017-12-05 19:52 ` [RFC PATCH v3 4/7] mm: enlarge type of offset argument in mem_map_offset and mem_map_next Daniel Jordan
@ 2017-12-05 19:52 ` Daniel Jordan
  2017-12-05 19:52 ` [RFC PATCH v3 6/7] hugetlbfs: parallelize hugetlbfs_fallocate with ktask Daniel Jordan
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Daniel Jordan @ 2017-12-05 19:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

Parallelize clear_gigantic_page, which zeroes pages larger than 8M
(e.g. 1G on x86 or 2G on SPARC).

Performance results (the default number of threads is 4; higher thread
counts shown for context only):

Machine: SPARC T7-4, 1024 CPUs, 504G memory
Test:    Clear a range of gigantic pages

nthread   speedup   size (GiB)   min time (s)   stdev
      1                     50           7.77    0.02
      2     1.97x           50           3.95    0.04
      4     3.85x           50           2.02    0.05
      8     6.27x           50           1.24    0.10
     16     9.84x           50           0.79    0.06

      1                    100          15.50    0.07
      2     1.91x          100           8.10    0.05
      4     3.48x          100           4.45    0.07
      8     5.18x          100           2.99    0.05
     16     7.79x          100           1.99    0.12

      1                    200          31.03    0.15
      2     1.88x          200          16.47    0.02
      4     3.37x          200           9.20    0.14
      8     5.16x          200           6.01    0.19
     16     7.04x          200           4.41    0.06

Machine:  Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
Test:     Clear a range of gigantic pages

nthread   speedup   size (GiB)   min time (s)   stdev
      1                    100          41.13    0.03
      2     2.03x          100          20.26    0.14
      4     4.28x          100           9.62    0.09
      8     8.39x          100           4.90    0.05
     16    10.44x          100           3.94    0.03

      1                    200          89.68    0.35
      2     2.21x          200          40.64    0.18
      4     4.64x          200          19.33    0.32
      8     8.99x          200           9.98    0.04
     16    11.27x          200           7.96    0.04

      1                    400         188.20    1.57
      2     2.30x          400          81.84    0.09
      4     4.63x          400          40.62    0.26
      8     8.92x          400          21.09    0.50
     16    11.78x          400          15.97    0.25

      1                    800         434.91    1.81
      2     2.54x          800         170.97    1.46
      4     4.98x          800          87.38    1.91
      8    10.15x          800          42.86    2.59
     16    12.99x          800          33.48    0.83

The speedups are mostly due to the fact that more threads can use more
memory bandwidth.  The loop we're stressing on the x86 chip in this test
is clear_page_erms, which tops out at a bandwidth of 2550 MiB/s with one
thread.  We get the same bandwidth per thread for 2, 4, or 8 threads,
but at 16 threads the per-thread bandwidth drops to 1420 MiB/s.
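
As a back-of-the-envelope check on those figures, the aggregate
bandwidth is the thread count times the per-thread rate.  The helper
below is a hypothetical illustration; it assumes every thread sustains
the quoted per-thread figure for the whole run, which the timings in the
tables need not match exactly.

```c
#include <assert.h>

/* Aggregate bandwidth in MiB/s if each thread sustains per_thread_mib_s. */
static double aggregate_mib_per_s(unsigned int nthreads,
				  double per_thread_mib_s)
{
	assert(nthreads > 0);
	return nthreads * per_thread_mib_s;
}
```

aggregate_mib_per_s(8, 2550.0) gives 20400 MiB/s and
aggregate_mib_per_s(16, 1420.0) gives 22720 MiB/s, so under these
assumptions doubling the threads from 8 to 16 adds only about 11% of
aggregate bandwidth.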

However, the performance also improves over a single thread because of
the ktask threads' NUMA awareness (ktask migrates worker threads to the
node local to the work being done).  This becomes a bigger factor as the
number of pages to zero grows to include memory from multiple nodes, so
that speedups increase as the size increases.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 mm/memory.c | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5eb3d2524bdc..ca0a9a05ac7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -70,6 +70,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/dax.h>
 #include <linux/oom.h>
+#include <linux/ktask.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -4532,20 +4533,31 @@ EXPORT_SYMBOL(__might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
-static void clear_gigantic_page(struct page *page,
-				unsigned long addr,
-				unsigned int pages_per_huge_page)
+
+struct cgp_args {
+	struct page	*base_page;
+	unsigned long	addr;
+};
+
+static int clear_gigantic_page_chunk(unsigned long start, unsigned long end,
+				     struct cgp_args *args)
 {
-	int i;
-	struct page *p = page;
+	struct page *base_page = args->base_page;
+	struct page *p = mem_map_offset(base_page, start);
+	unsigned long addr = args->addr;
+	unsigned long i;
 
 	might_sleep();
-	for (i = 0; i < pages_per_huge_page;
-	     i++, p = mem_map_next(p, page, i)) {
+	for (i = start; i < end; ++i) {
 		cond_resched();
 		clear_user_highpage(p, addr + i * PAGE_SIZE);
+
+		p = mem_map_next(p, base_page, i + 1);
 	}
+
+	return KTASK_RETURN_SUCCESS;
 }
+
 void clear_huge_page(struct page *page,
 		     unsigned long addr_hint, unsigned int pages_per_huge_page)
 {
@@ -4554,7 +4566,14 @@ void clear_huge_page(struct page *page,
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, addr, pages_per_huge_page);
+		struct cgp_args args = {page, addr};
+		struct ktask_node node = {0, pages_per_huge_page,
+					  page_to_nid(page)};
+		DEFINE_KTASK_CTL(ctl, clear_gigantic_page_chunk, &args,
+				 KTASK_BPGS_MINCHUNK);
+
+		ktask_run_numa(&node, 1, &ctl);
+
 		return;
 	}
 
-- 
2.15.0


* [RFC PATCH v3 6/7] hugetlbfs: parallelize hugetlbfs_fallocate with ktask
  2017-12-05 19:52 [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work Daniel Jordan
                   ` (4 preceding siblings ...)
  2017-12-05 19:52 ` [RFC PATCH v3 5/7] mm: parallelize clear_gigantic_page Daniel Jordan
@ 2017-12-05 19:52 ` Daniel Jordan
  2017-12-05 19:52 ` [RFC PATCH v3 7/7] mm: parallelize deferred struct page initialization within each node Daniel Jordan
  2017-12-05 22:23 ` [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work Andrew Morton
  7 siblings, 0 replies; 17+ messages in thread
From: Daniel Jordan @ 2017-12-05 19:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

hugetlbfs_fallocate preallocates huge pages to back a file in a
hugetlbfs filesystem.  The time to call this function grows linearly
with size.

ktask performs well with its default thread count of 4; higher thread
counts are given for context only.

Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
Test:    fallocate(1) a file on a hugetlbfs filesystem

nthread   speedup   size (GiB)   min time (s)   stdev
      1                    200         127.53    2.19
      2     3.09x          200          41.30    2.11
      4     5.72x          200          22.29    0.51
      8     9.45x          200          13.50    2.58
     16     9.74x          200          13.09    1.64

      1                    400         193.09    2.47
      2     2.14x          400          90.31    3.39
      4     3.84x          400          50.32    0.44
      8     5.11x          400          37.75    1.23
     16     6.12x          400          31.54    3.13

Machine: SPARC T7-4, 1024 CPUs, 504G memory
Test:    fallocate(1) a file on a hugetlbfs filesystem

nthread   speedup   size (GiB)   min time (s)   stdev

      1                    100          15.55    0.05
      2     1.92x          100           8.08    0.01
      4     3.55x          100           4.38    0.02
      8     5.87x          100           2.65    0.06
     16     6.45x          100           2.41    0.09

      1                    200          31.26    0.02
      2     1.92x          200          16.26    0.02
      4     3.58x          200           8.73    0.04
      8     5.54x          200           5.64    0.16
     16     6.96x          200           4.49    0.35

      1                    400          62.18    0.09
      2     1.98x          400          31.36    0.04
      4     3.55x          400          17.52    0.03
      8     5.53x          400          11.25    0.04
     16     6.61x          400           9.40    0.17
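
The speedup column in these tables is the single-thread time divided by
the n-thread time.  A minimal sketch of that calculation follows, along
with parallel efficiency (speedup divided by thread count), which is not
in the tables and is added here only for illustration.  The function
names are hypothetical.

```c
#include <assert.h>

/* Speedup of an n-thread run relative to the single-thread run. */
static double speedup(double t1_seconds, double tn_seconds)
{
	assert(tn_seconds > 0.0);
	return t1_seconds / tn_seconds;
}

/* Fraction of ideal linear scaling achieved with nthreads threads. */
static double efficiency(double t1_seconds, double tn_seconds,
			 unsigned int nthreads)
{
	assert(nthreads > 0);
	return speedup(t1_seconds, tn_seconds) / nthreads;
}
```

For the 200 GiB x86 rows, speedup(127.53, 22.29) is about 5.72, matching
the 4-thread entry, and efficiency(127.53, 13.09, 16) is about 0.61,
which puts a number on how far the 16-thread run is from linear scaling.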

The primary bottleneck for better scaling at higher thread counts is
hugetlb_fault_mutex_table[hash].  perf showed L1-dcache-loads increase
with 8 threads and again sharply with 16 threads, and a CPU counter
profile showed that 31% of the L1d misses were on
hugetlb_fault_mutex_table[hash] in the 16-thread case.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 fs/hugetlbfs/inode.c | 116 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 94 insertions(+), 22 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 8a85f3f53446..b027ba917239 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -36,6 +36,7 @@
 #include <linux/magic.h>
 #include <linux/migrate.h>
 #include <linux/uio.h>
+#include <linux/ktask.h>
 
 #include <linux/uaccess.h>
 
@@ -86,11 +87,16 @@ static const match_table_t tokens = {
 };
 
 #ifdef CONFIG_NUMA
+static inline struct shared_policy *hugetlb_get_shared_policy(
+							struct inode *inode)
+{
+	return &HUGETLBFS_I(inode)->policy;
+}
+
 static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
-					struct inode *inode, pgoff_t index)
+				struct shared_policy *policy, pgoff_t index)
 {
-	vma->vm_policy = mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
-							index);
+	vma->vm_policy = mpol_shared_policy_lookup(policy, index);
 }
 
 static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
@@ -98,8 +104,14 @@ static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
 	mpol_cond_put(vma->vm_policy);
 }
 #else
+static inline struct shared_policy *hugetlb_get_shared_policy(
+							struct inode *inode)
+{
+	return NULL;
+}
+
 static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
-					struct inode *inode, pgoff_t index)
+				struct shared_policy *policy, pgoff_t index)
 {
 }
 
@@ -535,19 +547,29 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	return 0;
 }
 
+struct hf_args {
+	struct file		*file;
+	struct task_struct	*parent_task;
+	struct mm_struct	*mm;
+	struct shared_policy	*shared_policy;
+	struct hstate		*hstate;
+	struct address_space	*mapping;
+	int			error;
+};
+
+static int hugetlbfs_fallocate_chunk(pgoff_t start, pgoff_t end,
+				     struct hf_args *args);
+
 static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 				loff_t len)
 {
 	struct inode *inode = file_inode(file);
-	struct address_space *mapping = inode->i_mapping;
 	struct hstate *h = hstate_inode(inode);
-	struct vm_area_struct pseudo_vma;
-	struct mm_struct *mm = current->mm;
 	loff_t hpage_size = huge_page_size(h);
 	unsigned long hpage_shift = huge_page_shift(h);
-	pgoff_t start, index, end;
+	pgoff_t start, end;
+	struct hf_args hf_args;
 	int error;
-	u32 hash;
 
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
 		return -EOPNOTSUPP;
@@ -570,16 +592,66 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	if (error)
 		goto out;
 
+	hf_args.file = file;
+	hf_args.parent_task = current;
+	hf_args.mm = current->mm;
+	hf_args.shared_policy = hugetlb_get_shared_policy(inode);
+	hf_args.hstate = h;
+	hf_args.mapping = inode->i_mapping;
+	hf_args.error = 0;
+
+	if (unlikely(hstate_is_gigantic(h))) {
+		/*
+		 * Use multiple threads in clear_gigantic_page instead of here,
+		 * so just do a 1-threaded hugetlbfs_fallocate_chunk.
+		 */
+		error = hugetlbfs_fallocate_chunk(start, end, &hf_args);
+	} else {
+		DEFINE_KTASK_CTL(ctl, hugetlbfs_fallocate_chunk,
+				 &hf_args, KTASK_BPGS_MINCHUNK);
+
+		error = ktask_run((void *)start, end - start, &ctl);
+	}
+
+	if (error == KTASK_RETURN_ERROR && hf_args.error != -EINTR)
+		goto out;
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
+		i_size_write(inode, offset + len);
+	inode->i_ctime = current_time(inode);
+out:
+	inode_unlock(inode);
+	return error;
+}
+
+static int hugetlbfs_fallocate_chunk(pgoff_t start, pgoff_t end,
+				     struct hf_args *args)
+{
+	struct file		*file		= args->file;
+	struct task_struct	*parent_task	= args->parent_task;
+	struct mm_struct	*mm		= args->mm;
+	struct shared_policy	*shared_policy	= args->shared_policy;
+	struct hstate		*h		= args->hstate;
+	struct address_space	*mapping	= args->mapping;
+	int			error		= 0;
+	pgoff_t			index;
+	struct vm_area_struct	pseudo_vma;
+	loff_t			hpage_size;
+	u32			hash;
+
+	hpage_size = huge_page_size(h);
+
 	/*
 	 * Initialize a pseudo vma as this is required by the huge page
 	 * allocation routines.  If NUMA is configured, use page index
-	 * as input to create an allocation policy.
+	 * as input to create an allocation policy.  Each thread gets its
+	 * own pseudo vma because mempolicies can differ by page.
 	 */
 	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
 	pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
 	pseudo_vma.vm_file = file;
 
-	for (index = start; index < end; index++) {
+	for (index = start; index < end; ++index) {
 		/*
 		 * This is supposed to be the vaddr where the page is being
 		 * faulted in, but we have no vaddr here.
@@ -594,13 +666,13 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		 * fallocate(2) manpage permits EINTR; we may have been
 		 * interrupted because we are using up too much memory.
 		 */
-		if (signal_pending(current)) {
+		if (signal_pending(parent_task) || signal_pending(current)) {
 			error = -EINTR;
-			break;
+			goto err;
 		}
 
 		/* Set numa allocation policy based on index */
-		hugetlb_set_vma_policy(&pseudo_vma, inode, index);
+		hugetlb_set_vma_policy(&pseudo_vma, shared_policy, index);
 
 		/* addr is the offset within the file (zero based) */
 		addr = index * hpage_size;
@@ -625,7 +697,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		if (IS_ERR(page)) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			error = PTR_ERR(page);
-			goto out;
+			goto err;
 		}
 		clear_huge_page(page, addr, pages_per_huge_page(h));
 		__SetPageUptodate(page);
@@ -633,7 +705,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		if (unlikely(error)) {
 			put_page(page);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			goto out;
+			goto err;
 		}
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
@@ -646,12 +718,12 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		put_page(page);
 	}
 
-	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
-		i_size_write(inode, offset + len);
-	inode->i_ctime = current_time(inode);
-out:
-	inode_unlock(inode);
-	return error;
+	return KTASK_RETURN_SUCCESS;
+
+err:
+	args->error = error;
+
+	return KTASK_RETURN_ERROR;
 }
 
 static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
-- 
2.15.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH v3 7/7] mm: parallelize deferred struct page initialization within each node
  2017-12-05 19:52 [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work Daniel Jordan
                   ` (5 preceding siblings ...)
  2017-12-05 19:52 ` [RFC PATCH v3 6/7] hugetlbfs: parallelize hugetlbfs_fallocate with ktask Daniel Jordan
@ 2017-12-05 19:52 ` Daniel Jordan
  2017-12-05 22:23 ` [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work Andrew Morton
  7 siblings, 0 replies; 17+ messages in thread
From: Daniel Jordan @ 2017-12-05 19:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen

Deferred struct page initialization currently uses one thread per node
(pgdatinit threads), but this is a bottleneck during boot on big
machines, so use ktask within each pgdatinit thread to parallelize the
struct page initialization on each node, allowing the system to take
better advantage of its memory bandwidth.

Because the system is not fully up yet and most CPUs are idle, use more
than the default maximum number of ktask threads.

Machine: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 88 CPUs, 503G memory,
         2 sockets
Test:    Boot the machine with deferred struct page init three times

kernel                   speedup   max time per   stdev
                                   node (ms)

baseline (4.15-rc2)                        5860     8.6
ktask                      9.56x            613    12.4

---

Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
         8 sockets
Test:    Boot the machine with deferred struct page init three times

kernel                   speedup   max time per   stdev
                                   node (ms)
baseline (4.15-rc2)                        1261     1.9
ktask                      3.88x            325     5.0

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Suggested-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 mm/page_alloc.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 63 insertions(+), 15 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1f4af28df5b5..68d1261ce99d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -67,6 +67,7 @@
 #include <linux/ftrace.h>
 #include <linux/lockdep.h>
 #include <linux/nmi.h>
+#include <linux/ktask.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -1280,8 +1281,6 @@ static void __init __free_pages_boot_core(struct page *page, unsigned int order)
 	}
 	__ClearPageReserved(p);
 	set_page_count(p, 0);
-
-	page_zone(page)->managed_pages += nr_pages;
 	set_page_refcounted(page);
 	__free_pages(page, order);
 }
@@ -1345,7 +1344,8 @@ void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
 {
 	if (early_page_uninitialised(pfn))
 		return;
-	return __free_pages_boot_core(page, order);
+	__free_pages_boot_core(page, order);
+	page_zone(page)->managed_pages += (1ul << order);
 }
 
 /*
@@ -1483,23 +1483,32 @@ deferred_pfn_valid(int nid, unsigned long pfn,
 	return true;
 }
 
+struct deferred_args {
+	int nid;
+	int zid;
+	atomic64_t nr_pages;
+};
+
 /*
  * Free pages to buddy allocator. Try to free aligned pages in
  * pageblock_nr_pages sizes.
  */
-static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
-				       unsigned long end_pfn)
+static int __init deferred_free_chunk(unsigned long pfn, unsigned long end_pfn,
+				      struct deferred_args *args)
 {
 	struct mminit_pfnnid_cache nid_init_state = { };
 	unsigned long nr_pgmask = pageblock_nr_pages - 1;
-	unsigned long nr_free = 0;
+	unsigned long nr_free = 0, nr_pages = 0;
+	int nid = args->nid;
 
 	for (; pfn < end_pfn; pfn++) {
 		if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
 			deferred_free_range(pfn - nr_free, nr_free);
+			nr_pages += nr_free;
 			nr_free = 0;
 		} else if (!(pfn & nr_pgmask)) {
 			deferred_free_range(pfn - nr_free, nr_free);
+			nr_pages += nr_free;
 			nr_free = 1;
 			cond_resched();
 		} else {
@@ -1508,21 +1517,26 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
 	}
 	/* Free the last block of pages to allocator */
 	deferred_free_range(pfn - nr_free, nr_free);
+	nr_pages += nr_free;
+
+	atomic64_add(nr_pages, &args->nr_pages);
+	return KTASK_RETURN_SUCCESS;
 }
 
 /*
  * Initialize struct pages.  We minimize pfn page lookups and scheduler checks
  * by performing it only once every pageblock_nr_pages.
- * Return number of pages initialized.
+ * Return number of pages initialized in deferred_args.
  */
-static unsigned long  __init deferred_init_pages(int nid, int zid,
-						 unsigned long pfn,
-						 unsigned long end_pfn)
+static int __init deferred_init_chunk(unsigned long pfn, unsigned long end_pfn,
+				      struct deferred_args *args)
 {
 	struct mminit_pfnnid_cache nid_init_state = { };
 	unsigned long nr_pgmask = pageblock_nr_pages - 1;
 	unsigned long nr_pages = 0;
 	struct page *page = NULL;
+	int nid = args->nid;
+	int zid = args->zid;
 
 	for (; pfn < end_pfn; pfn++) {
 		if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
@@ -1537,7 +1551,8 @@ static unsigned long  __init deferred_init_pages(int nid, int zid,
 		__init_single_page(page, pfn, zid, nid);
 		nr_pages++;
 	}
-	return (nr_pages);
+	atomic64_add(nr_pages, &args->nr_pages);
+	return KTASK_RETURN_SUCCESS;
 }
 
 /* Initialise remaining memory on a node */
@@ -1546,7 +1561,7 @@ static int __init deferred_init_memmap(void *data)
 	pg_data_t *pgdat = data;
 	int nid = pgdat->node_id;
 	unsigned long start = jiffies;
-	unsigned long nr_pages = 0;
+	unsigned long nr_init = 0, nr_free = 0;
 	unsigned long spfn, epfn;
 	phys_addr_t spa, epa;
 	int zid;
@@ -1554,6 +1569,8 @@ static int __init deferred_init_memmap(void *data)
 	unsigned long first_init_pfn = pgdat->first_deferred_pfn;
 	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
 	u64 i;
+	unsigned long nr_node_cpus = cpumask_weight(cpumask) * 4;
+	struct ktask_node kn;
 
 	if (first_init_pfn == ULONG_MAX) {
 		pgdat_init_report_one_done();
@@ -1564,6 +1581,12 @@ static int __init deferred_init_memmap(void *data)
 	if (!cpumask_empty(cpumask))
 		set_cpus_allowed_ptr(current, cpumask);
 
+	/*
+	 * We'd like to know the memory bandwidth of the chip to calculate the
+	 * right number of CPUs, but we can't so make a guess.
+	 */
+	nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
+
 	/* Sanity check boundaries */
 	BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
 	BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
@@ -1584,20 +1607,45 @@ static int __init deferred_init_memmap(void *data)
 	 * page in __free_one_page()).
 	 */
 	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
+		struct deferred_args args = { nid, zid, ATOMIC64_INIT(0) };
+		DEFINE_KTASK_CTL(ctl, deferred_init_chunk, &args,
+				 KTASK_BPGS_MINCHUNK);
+		ktask_ctl_set_max_threads(&ctl, nr_node_cpus);
+
 		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
 		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
-		nr_pages += deferred_init_pages(nid, zid, spfn, epfn);
+
+		kn.kn_start	= (void *)spfn;
+		kn.kn_task_size	= (spfn < epfn) ? epfn - spfn : 0;
+		kn.kn_nid	= nid;
+		(void) ktask_run_numa(&kn, 1, &ctl);
+
+		nr_init += atomic64_read(&args.nr_pages);
 	}
 	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
+		struct deferred_args args = { nid, zid, ATOMIC64_INIT(0) };
+		DEFINE_KTASK_CTL(ctl, deferred_free_chunk, &args,
+				 KTASK_BPGS_MINCHUNK);
+		ktask_ctl_set_max_threads(&ctl, nr_node_cpus);
+
 		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
 		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
-		deferred_free_pages(nid, zid, spfn, epfn);
+
+		kn.kn_start	= (void *)spfn;
+		kn.kn_task_size	= (spfn < epfn) ? epfn - spfn : 0;
+		kn.kn_nid	= nid;
+		(void) ktask_run_numa(&kn, 1, &ctl);
+
+		nr_free += atomic64_read(&args.nr_pages);
 	}
 
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
+	VM_BUG_ON(nr_init != nr_free);
+
+	zone->managed_pages += nr_free;
 
-	pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_pages,
+	pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_free,
 					jiffies_to_msecs(jiffies - start));
 
 	pgdat_init_report_one_done();
-- 
2.15.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 1/7] ktask: add documentation
  2017-12-05 19:52 ` [RFC PATCH v3 1/7] ktask: add documentation Daniel Jordan
@ 2017-12-05 20:59   ` Daniel Jordan
  2017-12-06 14:35   ` Michal Hocko
  1 sibling, 0 replies; 17+ messages in thread
From: Daniel Jordan @ 2017-12-05 20:59 UTC (permalink / raw)
  To: Daniel Jordan, linux-mm, linux-kernel
  Cc: aaron.lu, akpm, dave.hansen, mgorman, mhocko, mike.kravetz,
	pasha.tatashin, steven.sistare, tim.c.chen, rdunlap

Forgot to cc Randy Dunlap and add his Reviewed-by from v2.

On 12/05/2017 02:52 PM, Daniel Jordan wrote:
> Motivates and explains the ktask API for kernel clients.
> 
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>

Daniel

> Cc: Aaron Lu <aaron.lu@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> ---
>   Documentation/core-api/index.rst |   1 +
>   Documentation/core-api/ktask.rst | 173 +++++++++++++++++++++++++++++++++++++++
>   2 files changed, 174 insertions(+)
>   create mode 100644 Documentation/core-api/ktask.rst
> 
> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index d5bbe035316d..255724095814 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -15,6 +15,7 @@ Core utilities
>      assoc_array
>      atomic_ops
>      cpu_hotplug
> +   ktask
>      local_ops
>      workqueue
>      genericirq
> diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst
> new file mode 100644
> index 000000000000..703f200c7d36
> --- /dev/null
> +++ b/Documentation/core-api/ktask.rst
> @@ -0,0 +1,173 @@
> +============================================
> +ktask: parallelize CPU-intensive kernel work
> +============================================
> +
> +:Date: December, 2017
> +:Author: Daniel Jordan <daniel.m.jordan@oracle.com>
> +
> +
> +Introduction
> +============
> +
> +ktask is a generic framework for parallelizing CPU-intensive work in the
> +kernel.  The intended use is for big machines that can use their CPU power to
> +speed up large tasks that can't otherwise be multithreaded in userland.  The
> +API is generic enough to add concurrency to many different kinds of tasks--for
> +example, zeroing a range of pages or evicting a list of inodes--and aims to
> +save its clients the trouble of splitting up the work, choosing the number of
> +threads to use, maintaining an efficient concurrency level, starting these
> +threads, and load balancing the work between them.
> +
> +
> +Motivation
> +==========
> +
> +To ensure that applications and the kernel itself continue to perform well as
> +core counts and memory sizes increase, the kernel needs to scale.  For example,
> +when a system call requests a certain fraction of system resources, the kernel
> +should respond in kind by devoting a similar fraction of system resources to
> +service the request.
> +
> +Before ktask, for example, when booting a NUMA machine with many CPUs, only one
> +thread per node was used to initialize struct pages.  Using additional CPUs
> +that would otherwise be idle until the machine is fully up avoids a needless
> +bottleneck during system boot and allows the kernel to take advantage of unused
> +memory bandwidth.
> +
> +Why a new framework when there are existing kernel APIs for managing
> +concurrency and other ways to improve performance?  Of the existing facilities,
> +workqueues aren't designed to divide work up (although ktask is built on
> +unbound workqueues), and kthread_worker supports only one thread.  Existing
> +scalability techniques in the kernel such as doing work or holding locks in
> +batches are helpful and should be applied first for performance problems, but
> +eventually a single thread hits a wall.
> +
> +
> +Concept
> +=======
> +
> +A little terminology up front:  A 'task' is the total work there is to do and a
> +'chunk' is a unit of work given to a thread.
> +
> +To complete a task using the ktask framework, a client provides a thread
> +function that is responsible for completing one chunk.  The thread function is
> +defined in a standard way, with start and end arguments that delimit the chunk
> +as well as an argument that the client uses to pass data specific to the task.
> +
> +In addition, the client supplies an object representing the start of the task
> +and an iterator function that knows how to advance some number of units in the
> +task to yield another object representing the new task position.  The framework
> +uses the start object and iterator internally to divide the task into chunks.
> +
> +Finally, the client passes the total task size and a minimum chunk size to
> +indicate the minimum amount of work that's appropriate to do in one chunk.  The
> +sizes are given in task-specific units (e.g. pages, inodes, bytes).  The
> +framework uses these sizes, along with the number of online CPUs and an
> +internal maximum number of threads, to decide how many threads to start and how
> +many chunks to divide the task into.
> +
> +For example, consider the task of clearing a gigantic page.  This used to be
> +done in a single thread with a for loop that calls a page clearing function for
> +each constituent base page.  To parallelize with ktask, the client first moves
> +the for loop to the thread function, adapting it to operate on the range passed
> +to the function.  In this simple case, the thread function's start and end
> +arguments are just addresses delimiting the portion of the gigantic page to
> +clear.  Then, where the for loop used to be, the client calls into ktask with
> +the start address of the gigantic page, the total size of the gigantic page,
> +and the thread function.  Internally, ktask will divide the address range into
> +an appropriate number of chunks and start an appropriate number of threads to
> +complete these chunks.
> +
> +
> +Configuration
> +=============
> +
> +To use ktask, configure the kernel with CONFIG_KTASK=y.
> +
> +If CONFIG_KTASK=n, calls to the ktask API are simply #define'd to run the
> +thread function that the client provides so that the task is completed without
> +concurrency in the current thread.
> +
> +
> +Interface
> +=========
> +
> +.. Include ktask.h inline here.  This file is heavily commented and documents
> +.. the ktask interface.
> +.. kernel-doc:: include/linux/ktask.h
> +
> +
> +Resource Limits and Auto-Tuning
> +===============================
> +
> +ktask has resource limits on the number of workqueue items it queues.  In
> +ktask, a workqueue item is a thread that runs chunks of the task until the task
> +is finished.
> +
> +These limits support the different ways ktask uses workqueues:
> + - ktask_run to run threads on the calling thread's node.
> + - ktask_run_numa to run threads on the node(s) specified.
> + - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the
> +   system.
> +
> +To support these different ways of queueing work while maintaining an efficient
> +concurrency level, we need both system-wide and per-node limits on the number
> +of threads.  Without per-node limits, a node might become oversubscribed
> +despite ktask staying within the system-wide limit, and without a system-wide
> +limit, we can't properly account for work that can run on any node.
> +
> +The system-wide limit is based on the total number of CPUs, and the per-node
> +limit on the CPU count for each node.  A per-node work item counts against the
> +system-wide limit.  Workqueue's max_active can't accommodate both types of
> +limit, no matter how many workqueues are used, so ktask implements its own.
> +
> +If a per-node limit is reached, the work item is allowed to run anywhere on the
> +machine to avoid overwhelming the node.  If the global limit is also reached,
> +ktask won't queue additional work items until we fall below the limit again.
> +
> +These limits apply only to workqueue items--that is, additional threads beyond
> +the one starting the task.  That way, one thread per task is always allowed to
> +run.
> +
> +Within the resource limits, ktask uses a default maximum number of threads per
> +task to avoid disturbing other processes on the system.  Callers can change the
> +limit with ktask_ctl_set_max_threads.  For example, this might be used to raise
> +the maximum number of threads for a boot-time initialization task when more
> +CPUs than usual are idle.
> +
> +
> +Backward Compatibility
> +======================
> +
> +ktask is written so that existing calls to the API will be backwards compatible
> +should the API gain new features in the future.  This is accomplished by
> +restricting API changes to members of struct ktask_ctl and having clients make
> +an opaque initialization call (DEFINE_KTASK_CTL).  This initialization can then
> +be modified to include any new arguments so that existing call sites stay the
> +same.
> +
> +
> +Error Handling
> +==============
> +
> +Calls to ktask fail only if the provided thread function fails.  In particular,
> +ktask avoids allocating memory internally during a task, so it's safe to use in
> +sensitive contexts.
> +
> +To avoid adding features before they're used, ktask currently has only basic
> +error handling.  Each call to ktask_run and ktask_run_numa returns a simple
> +error code, KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.  As usage of the
> +framework expands, however, error handling will likely need to be enhanced in
> +two ways.
> +
> +First, ktask may need client-specific error reporting.  It's possible for tasks
> +to fail for different reasons, so the framework should have a way to
> +communicate client-specific error information.  For this purpose, allow the
> +client to pass a pointer for its own error information in struct ktask_ctl.
> +
> +Second, tasks can fail midway through their work.  To recover, the finished
> +chunks of work need to be undone in a task-specific way, so ktask should allow
> +clients to pass an "undo" callback that is responsible for undoing one chunk of
> +work.  To avoid multiple levels of error handling, this "undo" callback should
> +not be allowed to fail.  The iterator used for the original task can simply be
> +reused for the undo operation.
> 
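
For readers who want something concrete, the Concept and Error Handling
sections quoted above can be illustrated with a minimal userspace sketch.
It is single-threaded and is not ktask's implementation: run_chunks(),
struct clear_args, and the SKETCH_* codes are hypothetical names, and the
splitting rule (fixed-size chunks, with a short tail folded into the last
one) is an assumption made only to show the shape of a thread function that
receives a [start, end) range plus a client argument and reports
client-specific error detail through that argument.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for KTASK_RETURN_SUCCESS / KTASK_RETURN_ERROR. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

/* Thread function shape: one [start, end) chunk plus a client argument. */
typedef int (*chunk_fn)(size_t start, size_t end, void *arg);

struct clear_args {
	unsigned char *base;	/* buffer standing in for a gigantic page */
	size_t chunks_run;	/* number of chunks handed to the function */
	int error;		/* client-specific error detail */
};

/* Clears one chunk; stashes an errno-style code in args on failure. */
static int clear_chunk(size_t start, size_t end, void *arg)
{
	struct clear_args *args = arg;

	if (!args->base) {
		args->error = -EINVAL;
		return SKETCH_ERROR;
	}
	for (size_t i = start; i < end; i++)
		args->base[i] = 0;
	args->chunks_run++;
	return SKETCH_SUCCESS;
}

/*
 * Walk the task in chunks of min_chunk units.  A tail shorter than
 * min_chunk is merged into the final chunk so that no chunk is smaller
 * than the minimum (unless the whole task is).  Stops at the first error.
 */
static int run_chunks(size_t start, size_t task_size, size_t min_chunk,
		      chunk_fn fn, void *arg)
{
	size_t end = start + task_size;

	if (min_chunk == 0)
		min_chunk = 1;

	while (start < end) {
		size_t remaining = end - start;
		size_t len = (remaining < 2 * min_chunk) ? remaining : min_chunk;

		if (fn(start, start + len, arg) != SKETCH_SUCCESS)
			return SKETCH_ERROR;
		start += len;
	}
	return SKETCH_SUCCESS;
}
```

In the real framework the chunks would be handed to several worker threads
and the chunk count would also depend on the CPU count and thread limits;
the sketch only shows the contract between the caller and the thread
function.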


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 2/7] ktask: multithread CPU-intensive kernel work
  2017-12-05 19:52 ` [RFC PATCH v3 2/7] ktask: multithread CPU-intensive kernel work Daniel Jordan
@ 2017-12-05 22:21   ` Andrew Morton
  2017-12-06 14:21     ` Daniel Jordan
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2017-12-05 22:21 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: linux-mm, linux-kernel, aaron.lu, dave.hansen, mgorman, mhocko,
	mike.kravetz, pasha.tatashin, steven.sistare, tim.c.chen

On Tue,  5 Dec 2017 14:52:15 -0500 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:

> ktask is a generic framework for parallelizing CPU-intensive work in the
> kernel.  The intended use is for big machines that can use their CPU power to
> speed up large tasks that can't otherwise be multithreaded in userland.  The
> API is generic enough to add concurrency to many different kinds of tasks--for
> example, zeroing a range of pages or evicting a list of inodes--and aims to
> save its clients the trouble of splitting up the work, choosing the number of
> threads to use, maintaining an efficient concurrency level, starting these
> threads, and load balancing the work between them.
> 
> The Documentation patch earlier in this series has more background.
> 
> Introduces the ktask API; consumers appear in subsequent patches.
> 
> Based on work by Pavel Tatashin, Steve Sistare, and Jonathan Adams.
>
> ...
>
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -319,6 +319,18 @@ config AUDIT_TREE
>  	depends on AUDITSYSCALL
>  	select FSNOTIFY
>  
> +config KTASK
> +	bool "Multithread cpu-intensive kernel tasks"
> +	depends on SMP
> +	depends on NR_CPUS > 16

Why this?

It would make sense to relax (or eliminate) this at least for the
development/test period, so more people actually run and test the new
code.

> +	default n
> +	help
> +	  Parallelize expensive kernel tasks such as zeroing huge pages.  This
> +          feature is designed for big machines that can take advantage of their
> +          cpu count to speed up large kernel tasks.
> +
> +          If unsure, say 'N'.
> +
>  source "kernel/irq/Kconfig"
>  source "kernel/time/Kconfig"
>  
>
> ...
>
> +/*
> + * Initialize internal limits on work items queued.  Work items submitted to
> + * cmwq capped at 80% of online cpus both system-wide and per-node to maintain
> + * an efficient level of parallelization at these respective levels.
> + */
> +bool ktask_rlim_init(void)

Why not static __init?

> +{
> +	int node;
> +	unsigned nr_node_cpus;
> +
> +	spin_lock_init(&ktask_rlim_lock);

This can be done at compile time.  Unless there's a real reason for
ktask_rlim_init to be non-static, non-__init, in which case I'm
worried: reinitializing a static spinlock is weird.

> +	ktask_rlim_node_cur = kcalloc(num_possible_nodes(),
> +					       sizeof(size_t),
> +					       GFP_KERNEL);
> +	if (!ktask_rlim_node_cur) {
> +		pr_warn("can't alloc rlim counts (ktask disabled)");
> +		return false;
> +	}
> +
> +	ktask_rlim_node_max = kmalloc_array(num_possible_nodes(),
> +						     sizeof(size_t),
> +						     GFP_KERNEL);
> +	if (!ktask_rlim_node_max) {
> +		kfree(ktask_rlim_node_cur);
> +		pr_warn("can't alloc rlim maximums (ktask disabled)");
> +		return false;
> +	}
> +
> +	ktask_rlim_max = mult_frac(num_online_cpus(), KTASK_CPUFRAC_NUMER,
> +						      KTASK_CPUFRAC_DENOM);
> +	for_each_node(node) {
> +		nr_node_cpus = cpumask_weight(cpumask_of_node(node));
> +		ktask_rlim_node_max[node] = mult_frac(nr_node_cpus,
> +						      KTASK_CPUFRAC_NUMER,
> +						      KTASK_CPUFRAC_DENOM);
> +	}
> +
> +	return true;
> +}
>
> ...
>
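
As an aside on the limit arithmetic in the quoted ktask_rlim_init(), here is
a small userspace sketch of what the mult_frac() calls compute.  The 4/5
ratio is an assumption inferred from the "80% of online cpus" comment; the
actual values of KTASK_CPUFRAC_NUMER and KTASK_CPUFRAC_DENOM are not shown
in the quoted hunk.  The sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed ratio: the quoted comment says 80%, so 4/5 is a guess. */
#define SKETCH_CPUFRAC_NUMER 4
#define SKETCH_CPUFRAC_DENOM 5

/*
 * Same arithmetic as the kernel's mult_frac(): x * numer / denom, computed
 * from the quotient and remainder so the intermediate product x * numer
 * is never formed.
 */
static size_t sketch_mult_frac(size_t x, size_t numer, size_t denom)
{
	size_t quot = x / denom;
	size_t rem = x % denom;

	return quot * numer + (rem * numer) / denom;
}

/* Work-item limit for a system or node with nr_cpus CPUs. */
static size_t sketch_rlim_max(size_t nr_cpus)
{
	return sketch_mult_frac(nr_cpus, SKETCH_CPUFRAC_NUMER,
				SKETCH_CPUFRAC_DENOM);
}
```

Under that assumed ratio, an 88-CPU system would get a system-wide limit of
70 work items and a 44-CPU node a limit of 35.  A single-CPU node rounds
down to 0, which would still leave the thread that starts the task.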


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work
  2017-12-05 19:52 [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work Daniel Jordan
                   ` (6 preceding siblings ...)
  2017-12-05 19:52 ` [RFC PATCH v3 7/7] mm: parallelize deferred struct page initialization within each node Daniel Jordan
@ 2017-12-05 22:23 ` Andrew Morton
  2017-12-06 14:21   ` Daniel Jordan
  7 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2017-12-05 22:23 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: linux-mm, linux-kernel, aaron.lu, dave.hansen, mgorman, mhocko,
	mike.kravetz, pasha.tatashin, steven.sistare, tim.c.chen

On Tue,  5 Dec 2017 14:52:13 -0500 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:

> This patchset is based on 4.15-rc2 plus one mmots fix[*] and contains three
> ktask users:
>  - deferred struct page initialization at boot time
>  - clearing gigantic pages
>  - fallocate for HugeTLB pages

Performance improvements are nice.  How much overall impact is there in
real-world workloads?

> Work in progress:
>  - Parallelizing page freeing in the exit/munmap paths

Also sounds interesting.  Have you identified any other parallelizable
operations?  vfs object teardown at umount time may be one...

>  - CPU hotplug support

Of what?  The ktask infrastructure itself?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 2/7] ktask: multithread CPU-intensive kernel work
  2017-12-05 22:21   ` Andrew Morton
@ 2017-12-06 14:21     ` Daniel Jordan
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Jordan @ 2017-12-06 14:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, aaron.lu, dave.hansen, mgorman, mhocko,
	mike.kravetz, pasha.tatashin, steven.sistare, tim.c.chen

Thanks for looking at this, Andrew.  Responses below.


On 12/05/2017 05:21 PM, Andrew Morton wrote:
> On Tue,  5 Dec 2017 14:52:15 -0500 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> 
>> ktask is a generic framework for parallelizing CPU-intensive work in the
>> kernel.  The intended use is for big machines that can use their CPU power to
>> speed up large tasks that can't otherwise be multithreaded in userland.  The
>> API is generic enough to add concurrency to many different kinds of tasks--for
>> example, zeroing a range of pages or evicting a list of inodes--and aims to
>> save its clients the trouble of splitting up the work, choosing the number of
>> threads to use, maintaining an efficient concurrency level, starting these
>> threads, and load balancing the work between them.
>>
>> The Documentation patch earlier in this series has more background.
>>
>> Introduces the ktask API; consumers appear in subsequent patches.
>>
>> Based on work by Pavel Tatashin, Steve Sistare, and Jonathan Adams.
>>
>> ...
>>
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -319,6 +319,18 @@ config AUDIT_TREE
>>   	depends on AUDITSYSCALL
>>   	select FSNOTIFY
>>   
>> +config KTASK
>> +	bool "Multithread cpu-intensive kernel tasks"
>> +	depends on SMP
>> +	depends on NR_CPUS > 16
> 
> Why this?

Good question.  I picked 16 to represent a big machine, but as with most 
cutoffs it's somewhat arbitrary.

> It would make sense to relax (or eliminate) this at least for the
> development/test period, so more people actually run and test the new
> code.

Ok, that makes sense.  I'll remove it for now.

Since many (most?) distributions ship with a high NR_CPUS, maybe 
deciding whether to enable the framework at runtime based on online CPUs 
and memory is a better option.  A static branch might do it.
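
A rough userspace sketch of that idea, with loud caveats: the thresholds
below are hypothetical (16 CPUs echoes the cutoff discussed above, the
memory figure is invented for illustration), and a plain bool stands in for
the static key.  In the kernel this would presumably use
DEFINE_STATIC_KEY_FALSE(), static_branch_enable() and
static_branch_unlikely() from <linux/jump_label.h>.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical thresholds for what counts as a "big machine". */
#define SKETCH_MIN_ONLINE_CPUS	16
#define SKETCH_MIN_MEM_GIB	64

/* Stand-in for a static key that defaults to false. */
static bool sketch_ktask_enabled;

/* Decide once, at init time, whether the framework should be active. */
static void sketch_ktask_init(unsigned int online_cpus, unsigned long mem_gib)
{
	if (online_cpus > SKETCH_MIN_ONLINE_CPUS &&
	    mem_gib >= SKETCH_MIN_MEM_GIB)
		sketch_ktask_enabled = true;
}

/*
 * Number of threads a task may use: when the framework is disabled, the
 * task runs in the calling thread only, as in the CONFIG_KTASK=n case.
 */
static unsigned int sketch_nthreads(unsigned int wanted)
{
	if (!sketch_ktask_enabled)
		return 1;
	return wanted ? wanted : 1;
}
```

The point of a static key here would be that the disabled case costs
almost nothing on small machines, since the check is patched out of the
hot path rather than tested on every call.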

> 
>> +	default n
>> +	help
>> +	  Parallelize expensive kernel tasks such as zeroing huge pages.  This
>> +          feature is designed for big machines that can take advantage of their
>> +          cpu count to speed up large kernel tasks.
>> +
>> +          If unsure, say 'N'.
>> +
>>   source "kernel/irq/Kconfig"
>>   source "kernel/time/Kconfig"
>>   
>>
>> ...
>>
>> +/*
>> + * Initialize internal limits on work items queued.  Work items submitted to
>> + * cmwq capped at 80% of online cpus both system-wide and per-node to maintain
>> + * an efficient level of parallelization at these respective levels.
>> + */
>> +bool ktask_rlim_init(void)
> 
> Why not static __init?

I forgot both.  I added them, thanks.

> 
>> +{
>> +	int node;
>> +	unsigned nr_node_cpus;
>> +
>> +	spin_lock_init(&ktask_rlim_lock);
> 
> This can be done at compile time.  Unless there's a real reason for
> ktask_rlim_init to be non-static, non-__init, in which case I'm
> worried: reinitializing a static spinlock is weird.

You're right, I should have used DEFINE_SPINLOCK.  This is fixed.
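
For readers outside the kernel, the difference between initializing a lock at run time and defining it already initialized can be sketched with a userspace analogy. This is not the ktask code: a pthread mutex with PTHREAD_MUTEX_INITIALIZER stands in for DEFINE_SPINLOCK, and the names demo_rlim_lock, demo_rlim_cur and demo_rlim_inc are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Initialized at compile time, so no init function has to touch the lock. */
static pthread_mutex_t demo_rlim_lock = PTHREAD_MUTEX_INITIALIZER;
static size_t demo_rlim_cur;	/* hypothetical counter guarded by the lock */

/* Bump the counter under the lock and return the new value. */
static size_t demo_rlim_inc(void)
{
	size_t val;
	int err;

	err = pthread_mutex_lock(&demo_rlim_lock);
	assert(err == 0);
	val = ++demo_rlim_cur;
	err = pthread_mutex_unlock(&demo_rlim_lock);
	assert(err == 0);
	return val;
}
```

Because the lock is valid from program start, there is no window in which a caller could take it before an init routine has run, and nothing can reinitialize it later by accident.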


The patch at the bottom covers these changes and gets rid of a mismerge 
in this patch.

Daniel


diff --git a/init/Kconfig b/init/Kconfig
index 2a7b120de4d4..28c234791819 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -322,15 +322,12 @@ config AUDIT_TREE
  config KTASK
         bool "Multithread cpu-intensive kernel tasks"
         depends on SMP
-       depends on NR_CPUS > 16
-       default n
+       default y
         help
           Parallelize expensive kernel tasks such as zeroing huge pages.  This
            feature is designed for big machines that can take advantage of their
            cpu count to speed up large kernel tasks.

-          If unsure, say 'N'.
-
  source "kernel/irq/Kconfig"
  source "kernel/time/Kconfig"

diff --git a/kernel/ktask.c b/kernel/ktask.c
index 7b075075b56b..4db38fe59bdb 100644
--- a/kernel/ktask.c
+++ b/kernel/ktask.c
@@ -29,7 +29,7 @@
  #include <linux/workqueue.h>

  /* Resource limits on the amount of workqueue items queued through ktask. */
-spinlock_t ktask_rlim_lock;
+static DEFINE_SPINLOCK(ktask_rlim_lock);
  /* Work items queued on all nodes (includes NUMA_NO_NODE) */
  size_t ktask_rlim_cur;
  size_t ktask_rlim_max;
@@ -382,14 +382,6 @@ int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
                 return KTASK_RETURN_SUCCESS;

         mutex_init(&kt.kt_mutex);
-
-       kt.kt_nthreads = ktask_nthreads(kt.kt_total_size,
-                                       ctl->kc_min_chunk_size,
-                                       ctl->kc_max_threads);
-
-       kt.kt_chunk_size = ktask_chunk_size(kt.kt_total_size,
-                                       ctl->kc_min_chunk_size, kt.kt_nthreads);
-
         init_completion(&kt.kt_ktask_done);

         kt.kt_nthreads = ktask_prepare_threads(nodes, nr_nodes, &kt, &to_queue);
@@ -449,13 +441,11 @@ EXPORT_SYMBOL_GPL(ktask_run);
   * cmwq capped at 80% of online cpus both system-wide and per-node to maintain
   * an efficient level of parallelization at these respective levels.
   */
-bool ktask_rlim_init(void)
+static bool __init ktask_rlim_init(void)
  {
         int node;
         unsigned nr_node_cpus;

-       spin_lock_init(&ktask_rlim_lock);
-
         ktask_rlim_node_cur = kcalloc(num_possible_nodes(),
                                                sizeof(size_t),
                                                GFP_KERNEL);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work
  2017-12-05 22:23 ` [RFC PATCH v3 0/7] ktask: multithread CPU-intensive kernel work Andrew Morton
@ 2017-12-06 14:21   ` Daniel Jordan
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Jordan @ 2017-12-06 14:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, aaron.lu, dave.hansen, mgorman, mhocko,
	mike.kravetz, pasha.tatashin, steven.sistare, tim.c.chen

On 12/05/2017 05:23 PM, Andrew Morton wrote:
> On Tue,  5 Dec 2017 14:52:13 -0500 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> 
>> This patchset is based on 4.15-rc2 plus one mmots fix[*] and contains three
>> ktask users:
>>   - deferred struct page initialization at boot time
>>   - clearing gigantic pages
>>   - fallocate for HugeTLB pages
> 
> Performance improvements are nice.  How much overall impact is there in
> real-world workloads?

All of the users so far are mainly for initialization/startup, so the 
impact depends on how often users are rebooting (deferred struct page 
init) and starting applications such as RDBMS'es (hugetlbfs_fallocate).

ktask saves 5 seconds of boot time on the two-socket machine I tested on 
with deferred init, which is half the time it takes for the kernel to 
get to systemd, so for big machines that are frequently updated, the 
savings would add up.

> 
>> Work in progress:
>>   - Parallelizing page freeing in the exit/munmap paths
> 
> Also sounds interesting.

Parallelizing this efficiently depends on scaling lru_lock and 
zone->lock, which I've been working on separately.

> Have you identified any other parallelizable
> operations?  vfs object teardown at umount time may be one...

By vfs object teardown, are you referring to evict_inodes/dispose_list?

If so, I actually have tried parallelizing that and there were good 
speedups during unmount with many cached pages.  It's just a matter of 
parallelizing well across inodes with different amounts of pages in cache.

I've also gotten good results with __get_user_pages.  If we want to keep 
the return value of __get_user_pages consistent on error (and I'm 
assuming that's a given), there needs to be logic that undoes the work 
past the first non-pinned page in the range so we continue to return the 
number of pages pinned from the start.  That seems ok since it's a slow 
path.
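
The undo step described above can be sketched in userspace C. This is only a model under stated assumptions, not the real __get_user_pages logic: the pinned[] array and the function name undo_past_first_failure are hypothetical, each entry records whether some parallel worker managed to pin that page, and error codes are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: pinned[i] is true if a worker pinned page i.
 * Unpins everything from the first non-pinned page onward and returns
 * the number of pages that remain pinned contiguously from the start.
 */
static size_t undo_past_first_failure(bool *pinned, size_t nr_pages)
{
	size_t first_bad = nr_pages;
	size_t i;

	/* Find the first page that no worker managed to pin. */
	for (i = 0; i < nr_pages; i++) {
		if (!pinned[i]) {
			first_bad = i;
			break;
		}
	}

	/* Roll back pages that other workers pinned beyond the failure. */
	for (i = first_bad; i < nr_pages; i++)
		pinned[i] = false;

	return first_bad;
}
```

With six pages where only the third failed to pin, the function releases the last three and reports two pages pinned, which is what a serial walk that stopped at the first failure would have returned.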

The shmem page free path (shmem_undo_range), struct page initialization 
on memory hotplug, and huge page copying are others I've considered but 
haven't implemented yet.

>>   - CPU hotplug support
> 
> Of what?  The ktask infrastructure itself?

Yes, ktask itself.  When CPUs come up or down, ktask's resource limits 
and preallocated data (the struct ktask_work's passed to the workqueue 
code) need to be adjusted for the new CPU count, at least as it's 
written now.

Thanks for the comments,
Daniel


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 1/7] ktask: add documentation
  2017-12-05 19:52 ` [RFC PATCH v3 1/7] ktask: add documentation Daniel Jordan
  2017-12-05 20:59   ` Daniel Jordan
@ 2017-12-06 14:35   ` Michal Hocko
  2017-12-06 20:32     ` Daniel Jordan
  1 sibling, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-12-06 14:35 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: linux-mm, linux-kernel, aaron.lu, akpm, dave.hansen, mgorman,
	mike.kravetz, pasha.tatashin, steven.sistare, tim.c.chen

Please note that I haven't checked any code in this patch series. I've
just started here to see how the thing is supposed to work and what is
the overall design

On Tue 05-12-17 14:52:14, Daniel Jordan wrote:
[...]
> +Resource Limits and Auto-Tuning
> +===============================
> +
> +ktask has resource limits on the number of workqueue items it queues.  In
> +ktask, a workqueue item is a thread that runs chunks of the task until the task
> +is finished.
> +
> +These limits support the different ways ktask uses workqueues:
> + - ktask_run to run threads on the calling thread's node.
> + - ktask_run_numa to run threads on the node(s) specified.
> + - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the
> +   system.
> +
> +To support these different ways of queueing work while maintaining an efficient
> +concurrency level, we need both system-wide and per-node limits on the number
> +of threads.  Without per-node limits, a node might become oversubscribed
> +despite ktask staying within the system-wide limit, and without a system-wide
> +limit, we can't properly account for work that can run on any node.
> +
> +The system-wide limit is based on the total number of CPUs, and the per-node
> +limit on the CPU count for each node.  A per-node work item counts against the
> +system-wide limit.  Workqueue's max_active can't accommodate both types of
> +limit, no matter how many workqueues are used, so ktask implements its own.
> +
> +If a per-node limit is reached, the work item is allowed to run anywhere on the
> +machine to avoid overwhelming the node.  If the global limit is also reached,
> +ktask won't queue additional work items until we fall below the limit again.
> +
> +These limits apply only to workqueue items--that is, additional threads beyond
> +the one starting the task.  That way, one thread per task is always allowed to
> +run.
> +
> +Within the resource limits, ktask uses a default maximum number of threads per
> +task to avoid disturbing other processes on the system.  Callers can change the
> +limit with ktask_ctl_set_max_threads.  For example, this might be used to raise
> +the maximum number of threads for a boot-time initialization task when more
> +CPUs than usual are idle.
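
The accounting described in the quoted documentation might look roughly like the following userspace sketch. It is a simplified model, not the ktask implementation: the struct and function names, the fixed two-node layout, and the NO_NODE constant are all hypothetical, locking is omitted, and the 80% cap is taken from the comment on ktask_rlim_init elsewhere in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 2	/* hypothetical node count for the sketch */
#define NO_NODE  (-1)	/* stand-in for NUMA_NO_NODE */

struct rlim {
	size_t cur;			/* items queued system-wide */
	size_t max;			/* system-wide limit */
	size_t node_cur[NR_NODES];	/* items queued per node */
	size_t node_max[NR_NODES];	/* per-node limits */
};

enum placement { QUEUED_ON_NODE, QUEUED_ANYWHERE, NOT_QUEUED };

/* Cap at 80% of the given CPU count. */
static size_t rlim_cap(size_t nr_cpus)
{
	return nr_cpus * 4 / 5;
}

/* Decide where, if anywhere, one more work item may be queued. */
static enum placement rlim_queue(struct rlim *r, int nid)
{
	if (r->cur >= r->max)
		return NOT_QUEUED;	/* global limit reached */

	r->cur++;			/* every item counts system-wide */
	if (nid != NO_NODE && r->node_cur[nid] < r->node_max[nid]) {
		r->node_cur[nid]++;
		return QUEUED_ON_NODE;
	}
	return QUEUED_ANYWHERE;		/* node full, or any node requested */
}
```

On a model machine with two 5-CPU nodes, the per-node cap is 4 and the system-wide cap is 8, so a fifth item aimed at one node spills over to run anywhere, and a ninth item is not queued at all.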

The last time something like this came up (maybe even this specific
approach - I do not remember), the main objection was the auto-tuning.
Unless I've missed anything here, all the tuning is based on counters
rather than the _current_ system utilization. There is also no mention
of other characteristics (e.g. power management), resource isolation
etc. So let me ask again. How do you control that the parallelized
operation doesn't run outside of the limit imposed on the calling
context? How do you control whether a larger number of workers should
be fired when the system is idle but we want to keep many cpus idle due
to power constraints? How do you control how many workers are fired
based on cpu utilization? Do you talk to the scheduler to see
overall/per node utilization?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 1/7] ktask: add documentation
  2017-12-06 14:35   ` Michal Hocko
@ 2017-12-06 20:32     ` Daniel Jordan
  2017-12-08 12:43       ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Jordan @ 2017-12-06 20:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, aaron.lu, akpm, dave.hansen, mgorman,
	mike.kravetz, pasha.tatashin, steven.sistare, tim.c.chen

On 12/06/2017 09:35 AM, Michal Hocko wrote:
> Please note that I haven't checked any code in this patch series. I've
> just started here to see how the thing is supposed to work and what is
> the overall design

Thanks for taking a look, Michal.

> 
> On Tue 05-12-17 14:52:14, Daniel Jordan wrote:
> [...]
>> +Resource Limits and Auto-Tuning
>> +===============================
>> +
>> +ktask has resource limits on the number of workqueue items it queues.  In
>> +ktask, a workqueue item is a thread that runs chunks of the task until the task
>> +is finished.
>> +
>> +These limits support the different ways ktask uses workqueues:
>> + - ktask_run to run threads on the calling thread's node.
>> + - ktask_run_numa to run threads on the node(s) specified.
>> + - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the
>> +   system.
>> +
>> +To support these different ways of queueing work while maintaining an efficient
>> +concurrency level, we need both system-wide and per-node limits on the number
>> +of threads.  Without per-node limits, a node might become oversubscribed
>> +despite ktask staying within the system-wide limit, and without a system-wide
>> +limit, we can't properly account for work that can run on any node.
>> +
>> +The system-wide limit is based on the total number of CPUs, and the per-node
>> +limit on the CPU count for each node.  A per-node work item counts against the
>> +system-wide limit.  Workqueue's max_active can't accommodate both types of
>> +limit, no matter how many workqueues are used, so ktask implements its own.
>> +
>> +If a per-node limit is reached, the work item is allowed to run anywhere on the
>> +machine to avoid overwhelming the node.  If the global limit is also reached,
>> +ktask won't queue additional work items until we fall below the limit again.
>> +
>> +These limits apply only to workqueue items--that is, additional threads beyond
>> +the one starting the task.  That way, one thread per task is always allowed to
>> +run.
>> +
>> +Within the resource limits, ktask uses a default maximum number of threads per
>> +task to avoid disturbing other processes on the system.  Callers can change the
>> +limit with ktask_ctl_set_max_threads.  For example, this might be used to raise
>> +the maximum number of threads for a boot-time initialization task when more
>> +CPUs than usual are idle.
> 
> The last time something like this came up (maybe even this specific
> approach - I do not remember), the main objection was the auto-tuning.
> Unless I've missed anything here, all the tuning is based on counters
> rather than the _current_ system utilization.

That's right, as it's written now, it's just counters.

> There is also no mention about other
> characteristics (e.g. power management), resource isolation etc. So
> let me ask again. How do you control that the parallelized operation
> doesn't run outside of the limit imposed to the calling context?

The current code doesn't do this, and the answer is the same for the 
rest of your questions.

For resource isolation, I'll experiment with moving ktask threads into 
and out of the cgroup of the calling thread.

Do any resources not covered by cgroup come to mind?  I'm trying to 
think if I've left anything out.

> How
> do you control whether a larger number of workers should be fired when
> the system is idle but we want to keep many cpus idle due to power
> constraints?

For power management, I'm going to look into how ktask can use the 
current cpufreq settings and the scheduler hooks called by cpufreq.

We could make decisions about starting additional threads (if any) based 
on the CPU frequency range or policy.

> How do you control how many workers are fired based on
> cpu utilization? Do you talk to the scheduler to see overall/per node
> utilization?

We'd have to go off of past and present scheduler data to predict the 
future.  Even the best heuristic might get it wrong, but heuristics 
could be better than nothing.  I'll look into what data the scheduler 
exports.


Anyway, I think scalability bottlenecks should be weighed with the rest 
of this.  It seems wrong that the kernel should always assume that one 
thread is enough to free all of a process's memory or evict all the 
pages of a file system no matter how much work there is to do.

Daniel


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 1/7] ktask: add documentation
  2017-12-06 20:32     ` Daniel Jordan
@ 2017-12-08 12:43       ` Michal Hocko
  2017-12-08 13:46         ` Daniel Jordan
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-12-08 12:43 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: linux-mm, linux-kernel, aaron.lu, akpm, dave.hansen, mgorman,
	mike.kravetz, pasha.tatashin, steven.sistare, tim.c.chen

On Wed 06-12-17 15:32:48, Daniel Jordan wrote:
> On 12/06/2017 09:35 AM, Michal Hocko wrote:
[...]
> > There is also no mention about other
> > characteristics (e.g. power management), resource isolation etc. So
> > let me ask again. How do you control that the parallelized operation
> > doesn't run outside of the limit imposed to the calling context?
> 
> The current code doesn't do this, and the answer is the same for the rest of
> your questions.

I really believe this should be addressed before this can be considered
for merging. While what you have might be sufficient for early boot
initialization stuff I am not sure the amount of code is really
justified by that usecase alone. Any runtime enabled parallelized work
really has to care about the rest of the system. The last thing you
really want to see is to make a highly utilized system overloaded just
because of some optimization. And I do not see how you can achieve that
with a limit on the number of parallelization threads.

> For resource isolation, I'll experiment with moving ktask threads into and
> out of the cgroup of the calling thread.
> 
> Do any resources not covered by cgroup come to mind?  I'm trying to think if
> I've left anything out.

This is mostly about cpu so dealing with the cpu cgroup controller
should do the work.

[...]

> Anyway, I think scalability bottlenecks should be weighed with the rest of
> this.  It seems wrong that the kernel should always assume that one thread
> is enough to free all of a process's memory or evict all the pages of a file
> system no matter how much work there is to do.

Well, this will always be a double-edged sword. Sure, if you have spare
cycles (whatever that means) then using them is really nice. But the
last thing you really want is to turn an optimization into a
utilization nightmare where a few processes dominate the whole machine
even though they could be easily contained normally inside a single
execution context.

Your work targets larger machines and I understand that you are mainly
focused on a single large workload running on that machine but there are
many others running with many smaller workloads which would like to be
independent. Not everything is a large DB running on a large HW.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v3 1/7] ktask: add documentation
  2017-12-08 12:43       ` Michal Hocko
@ 2017-12-08 13:46         ` Daniel Jordan
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Jordan @ 2017-12-08 13:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, aaron.lu, akpm, dave.hansen, mgorman,
	mike.kravetz, pasha.tatashin, steven.sistare, tim.c.chen

On 12/08/2017 07:43 AM, Michal Hocko wrote:
> On Wed 06-12-17 15:32:48, Daniel Jordan wrote:
>> On 12/06/2017 09:35 AM, Michal Hocko wrote:
> [...]
>>> There is also no mention about other
>>> characteristics (e.g. power management), resource isolation etc. So
>>> let me ask again. How do you control that the parallelized operation
>>> doesn't run outside of the limit imposed to the calling context?
>>
>> The current code doesn't do this, and the answer is the same for the rest of
>> your questions.
> 
> I really believe this should be addressed before this can be considered
> for merging. While what you have might be sufficient for early boot
> initialization stuff I am not sure the amount of code is really
> justified by that usecase alone. Any runtime enabled parallelized work
> really has to care about the rest of the system. The last thing you
> really want to see is to make a highly utilized system overloaded just
> because of some optimization. And I do not see how you can achieve that
> with a limit on the number of parallelization threads.

That's fair, I'll see what I can do in the next version.

> 
>> For resource isolation, I'll experiment with moving ktask threads into and
>> out of the cgroup of the calling thread.
>>
>> Do any resources not covered by cgroup come to mind?  I'm trying to think if
>> I've left anything out.
> 
> This is mostly about cpu so dealing with the cpu cgroup controller
> should do the work.

Ok, thanks.  Luckily cgroup v2's cpu controller was recently merged.

> 
> [...]
> 
>> Anyway, I think scalability bottlenecks should be weighed with the rest of
>> this.  It seems wrong that the kernel should always assume that one thread
>> is enough to free all of a process's memory or evict all the pages of a file
>> system no matter how much work there is to do.
> 
> Well, this will always be a double-edged sword. Sure, if you have spare
> cycles (whatever that means) then using them is really nice. But the
> last thing you really want is to turn an optimization into a
> utilization nightmare where a few processes dominate the whole machine
> even though they could be easily contained normally inside a single
> execution context.
>
> Your work targets larger machines and I understand that you are mainly
> focused on a single large workload running on that machine but there are
> many others running with many smaller workloads which would like to be
> independent. Not everything is a large DB running on a large HW.

Well of course, yes, but the struct page initialization stuff benefits 
any large-memory machine (9x faster on a 2-socket machine!) and the 
(forthcoming) page freeing parallelization will similarly benefit a 
variety of workloads.

Anyway, I'll put more controls in and see where I get.  Thanks for the 
feedback.

Daniel


^ permalink raw reply	[flat|nested] 17+ messages in thread

