linux-mm.kvack.org archive mirror
* [RFC PATCH v1 0/6] ktask: multithread cpu-intensive kernel work
@ 2017-07-14 22:16 daniel.m.jordan
  2017-07-14 22:16 ` [RFC PATCH v1 1/6] ktask: add documentation daniel.m.jordan
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: daniel.m.jordan @ 2017-07-14 22:16 UTC (permalink / raw)
  To: linux-kernel, linux-mm

ktask is a generic framework for parallelizing cpu-intensive work in the
kernel.  The intended use is for big machines that can use their cpu power to
speed up large tasks that can't otherwise be multithreaded in userland.  The
API is generic enough to add concurrency to many different kinds of tasks--for
example, zeroing a range of pages or evicting a list of inodes--and aims to
save its clients the trouble of splitting up the work, choosing the number of
threads to use, starting these threads, and load balancing the work between
them.

Why do we need ktask when the kernel has other APIs for managing concurrency?
After all, kthread_workers and workqueues already provide ways to start
threads, and the kernel can handle large tasks with a single thread by
periodically yielding the cpu with cond_resched or doing the work in fixed size
batches.

Of the existing concurrency facilities, kthread_worker isn't suited for
providing parallelism because each kthread_worker runs only one thread.
Workqueues are a better fit, and in fact ktask is built on an unbound
workqueue, but workqueues aren't designed for splitting up a large task.
ktask instead uses unbound workqueue threads to run "chunks" of a task.

More background is available in the documentation commit (first commit of the
series).
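
To give a feel for the interface, here is a minimal, hypothetical sketch of a
client of the API introduced in patch 2.  The zeroing helper and the sizes are
made up for illustration; real consumers appear in patches 5 and 6:

    /* Thread function: zero one chunk of the buffer, [start, end). */
    static int zero_chunk(char *start, char *end, void *arg)
    {
            memset(start, 0, end - start);
            return KTASK_RETURN_SUCCESS;
    }

    static void zero_buffer(char *buf, size_t len)
    {
            /* Chunks are at least 1024 bytes; ktask picks the thread count. */
            DEFINE_KTASK_CTL_RANGE(ctl, zero_chunk, NULL, 1024, GFP_KERNEL, 0);

            ktask_run(buf, len, &ctl);
    }

With the defaults in patch 2 (a cap of 4 threads per task and a load balancing
shift of 2), a 64 MiB buffer on a machine with at least four online cpus would
be split among 4 threads into 16 chunks of (64 MiB / 4) >> 2 = 4 MiB each.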

There are two ktask consumers included, with more to come later.  Other
consumers that are in the works include:

  - Page table walking and/or mmu_gather struct page freeing to optimize exit(2)
    and munmap(2) for large processes.  This is inspired by Aaron Lu's work:
        http://marc.info/?l=linux-mm&m=148793643210514&w=2

  - struct page initialization in early boot (use more threads than the current
    pgdatinit threads to reduce boot time).

The core ktask code is based on work by Pavel Tatashin, Steve Sistare, and
Jonathan Adams.

This series is based on 4.12.

Daniel Jordan (6):
  ktask: add documentation
  ktask: multithread cpu-intensive kernel work
  ktask: add /proc/sys/debug/ktask_max_threads
  mm: enlarge type of offset argument in mem_map_offset and
    mem_map_next
  mm: parallelize clear_gigantic_page
  hugetlbfs: parallelize hugetlbfs_fallocate with ktask

 Documentation/core-api/index.rst |    1 +
 Documentation/core-api/ktask.rst |  104 ++++++++++
 fs/hugetlbfs/inode.c             |  122 ++++++++++---
 include/linux/ktask.h            |  228 ++++++++++++++++++++++
 include/linux/ktask_internal.h   |   19 ++
 include/linux/mm.h               |    5 +
 init/Kconfig                     |    7 +
 kernel/Makefile                  |    2 +-
 kernel/ktask.c                   |  389 ++++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c                  |   10 +
 mm/internal.h                    |    7 +-
 mm/memory.c                      |   35 +++-
 12 files changed, 895 insertions(+), 34 deletions(-)
 create mode 100644 Documentation/core-api/ktask.rst
 create mode 100644 include/linux/ktask.h
 create mode 100644 include/linux/ktask_internal.h
 create mode 100644 kernel/ktask.c


* [RFC PATCH v1 1/6] ktask: add documentation
  2017-07-14 22:16 [RFC PATCH v1 0/6] ktask: multithread cpu-intensive kernel work daniel.m.jordan
@ 2017-07-14 22:16 ` daniel.m.jordan
  2017-07-14 22:16 ` [RFC PATCH v1 2/6] ktask: multithread cpu-intensive kernel work daniel.m.jordan
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: daniel.m.jordan @ 2017-07-14 22:16 UTC (permalink / raw)
  To: linux-kernel, linux-mm

Motivates and explains the ktask API for kernel clients.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 Documentation/core-api/index.rst |    1 +
 Documentation/core-api/ktask.rst |  104 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 105 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/core-api/ktask.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 62abd36..2be3ca4 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -15,6 +15,7 @@ Core utilities
    assoc_array
    atomic_ops
    cpu_hotplug
+   ktask
    local_ops
    workqueue
    genericirq
diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst
new file mode 100644
index 0000000..cb4b0d8
--- /dev/null
+++ b/Documentation/core-api/ktask.rst
@@ -0,0 +1,104 @@
+============================================
+ktask: parallelize cpu-intensive kernel work
+============================================
+
+:Date: July, 2017
+:Author: Daniel Jordan <daniel.m.jordan@oracle.com>
+
+
+Introduction
+============
+
+ktask is a generic framework for parallelizing cpu-intensive work in the
+kernel.  The intended use is for big machines that can use their cpu power to
+speed up large tasks that can't otherwise be multithreaded in userland.  The
+API is generic enough to add concurrency to many different kinds of tasks--for
+example, zeroing a range of pages or evicting a list of inodes--and aims to
+save its clients the trouble of splitting up the work, choosing the number of
+threads to use, starting these threads, and load balancing the work between
+them.
+
+
+Motivation
+==========
+
+Why do we need ktask when the kernel has other APIs for managing concurrency?
+After all, kthread_workers and workqueues already provide ways to start
+threads, and the kernel can handle large tasks with a single thread by
+periodically yielding the cpu with cond_resched (e.g. hugetlbfs_fallocate,
+clear_gigantic_page) or performing the work in fixed size batches (e.g. struct
+pagevec, struct mmu_gather).
+
+Of the existing concurrency facilities, kthread_worker isn't suited for
+providing parallelism because each kthread_worker runs only one thread.
+Workqueues are a better fit, and in fact ktask is built on an unbound
+workqueue, but workqueues aren't designed for splitting up a large task.
+ktask instead uses unbound workqueue threads to run "chunks" of a task.
+
+On top of workqueues, ktask takes care of dividing up the task into chunks,
+determining how many threads to use to complete those chunks, starting the
+threads, and load balancing across them.  This makes use of otherwise idle
+cpus, but if the system is under load, the scheduler still decides when the
+ktask threads run: existing cond_resched calls are retained in big loops that
+have been parallelized.
+
+This added concurrency boosts the performance of the system in a number of
+ways: system startup and shutdown are faster, page fault latency of a gigantic
+page goes down (the page is zeroed in parallel), initializing many pages is
+faster (e.g. populating a range of pages via prefaulting, mlocking, or
+fallocating), and pages are freed back to the system in less time (e.g. on a
+large munmap(2) or on exit(2) of a large process).
+
+
+Configuration
+=============
+
+To use ktask, configure the kernel with CONFIG_KTASK=y.
+
+If CONFIG_KTASK=n, calls to the ktask API are simply #define'd to run the
+thread function that the client provides so that the task is completed without
+concurrency in the current thread.
+
+
+Concept
+=======
+
+A little terminology up front:  A 'task' is the total work there is to do and a
+'chunk' is a unit of work given to a thread.
+
+To complete a task using the ktask framework, a client provides a thread
+function that is responsible for completing one chunk.  The thread function is
+defined in a standard way, with start and end arguments that delimit the chunk
+as well as an argument that the client uses to pass data specific to the task.
+
+In addition, the client supplies an object representing the start of the task
+and an iterator function that knows how to advance some number of units in the
+task to yield another object representing the new task position.  The framework
+uses the start object and iterator internally to divide the task into chunks.
+
+Finally, the client passes the total task size and a minimum chunk size to
+indicate the minimum amount of work that's appropriate to do in one chunk.  The
+sizes are given in task-specific units (e.g. pages, inodes, bytes).  The
+framework uses these sizes, along with the number of online cpus and an
+internal maximum number of threads, to decide how many threads to start and how
+many chunks to divide the task into.
+
+For example, consider the task of clearing a gigantic page.  This used to be
+done in a single thread with a for loop that calls a page clearing function for
+each constituent base page.  To parallelize with ktask, the client first moves
+the for loop to the thread function, adapting it to operate on the range passed
+to the function.  In this simple case, the thread function's start and end
+arguments are just addresses delimiting the portion of the gigantic page to
+clear.  Then, where the for loop used to be, the client calls into ktask with
+the start address of the gigantic page, the total size of the gigantic page,
+and the thread function.  Internally, ktask will divide the address range into
+an appropriate number of chunks and start an appropriate number of threads to
+complete these chunks.
+
+
+Interface
+=========
+
+.. Include ktask.h inline here.  This file is heavily commented and documents
+.. the ktask interface.
+.. kernel-doc:: include/linux/ktask.h
-- 
1.7.1


* [RFC PATCH v1 2/6] ktask: multithread cpu-intensive kernel work
  2017-07-14 22:16 [RFC PATCH v1 0/6] ktask: multithread cpu-intensive kernel work daniel.m.jordan
  2017-07-14 22:16 ` [RFC PATCH v1 1/6] ktask: add documentation daniel.m.jordan
@ 2017-07-14 22:16 ` daniel.m.jordan
  2017-07-14 22:16 ` [RFC PATCH v1 3/6] ktask: add /proc/sys/debug/ktask_max_threads daniel.m.jordan
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: daniel.m.jordan @ 2017-07-14 22:16 UTC (permalink / raw)
  To: linux-kernel, linux-mm

ktask is a generic framework for parallelizing cpu-intensive work in the
kernel.  The intended use is for big machines that can use their cpu power to
speed up large tasks that can't otherwise be multithreaded in userland.  The
API is generic enough to add concurrency to many different kinds of tasks--for
example, zeroing a range of pages or evicting a list of inodes--and aims to
save its clients the trouble of splitting up the work, choosing the number of
threads to use, starting these threads, and load balancing the work between
them.

The Documentation patch earlier in this series has more background.

Introduces the ktask API; consumers appear in subsequent patches.

Based on work by Pavel Tatashin, Steve Sistare, and Jonathan Adams.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Suggested-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Suggested-by: Steve Sistare <steven.sistare@oracle.com>
Suggested-by: Jonathan Adams <jonathan.adams@oracle.com>
---
 include/linux/ktask.h          |  228 +++++++++++++++++++++++
 include/linux/ktask_internal.h |   19 ++
 include/linux/mm.h             |    5 +
 init/Kconfig                   |    7 +
 kernel/Makefile                |    2 +-
 kernel/ktask.c                 |  389 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 649 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/ktask.h
 create mode 100644 include/linux/ktask_internal.h
 create mode 100644 kernel/ktask.c

diff --git a/include/linux/ktask.h b/include/linux/ktask.h
new file mode 100644
index 0000000..c224e9d
--- /dev/null
+++ b/include/linux/ktask.h
@@ -0,0 +1,228 @@
+/*
+ * ktask.h
+ *
+ * Framework to parallelize cpu-intensive kernel work such as zeroing
+ * huge pages or freeing many pages at once.  For more information, see
+ * Documentation/core-api/ktask.rst.
+ *
+ * This is the interface to ktask; everything in this file is
+ * accessible to ktask clients.
+ *
+ * If CONFIG_KTASK=n, calls to the ktask API are simply #define'd to run the
+ * thread function that the client provides so that the task is completed
+ * without concurrency in the current thread.
+ */
+
+#ifndef _LINUX_KTASK_H
+#define _LINUX_KTASK_H
+
+#include <linux/types.h>
+
+struct ktask_ctl;
+struct ktask_node;
+
+#define	KTASK_RETURN_SUCCESS	0
+#define	KTASK_RETURN_ERROR	(-1)
+
+#ifdef CONFIG_KTASK
+
+/**
+ * ktask_run() runs one task.  It doesn't account for NUMA locality.
+ *
+ * @start: An object that describes the start of the task.  The client thread
+ *         function interprets the object however it sees fit (e.g. an array
+ *         index, a simple pointer, or a pointer to a more complicated
+ *         representation of job position).
+ * @task_size:  The size of the task (units are task-specific).
+ * @ctl:  A control structure containing information about the task, including
+ *        the client thread function (see the definition of struct ktask_ctl).
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.
+ *
+ * XXX include ks_error in ktask_ctl instead so client can provide own error
+ * information in void *?
+ */
+int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl);
+
+/**
+ * ktask_run_numa() runs one task while accounting for NUMA locality.
+ *
+ * The ktask framework ensures worker threads are scheduled on a CPU local to
+ * each chunk of a task.  The client is responsible for organizing the work
+ * along NUMA boundaries in the 'nodes' array.
+ *
+ * @nodes: An array of struct ktask_node's, each of which describes the task on
+ *         a NUMA node (see struct ktask_node).
+ * @nr_nodes:  The length of the 'nodes' array.
+ * @ctl:  A control structure containing information about the task (see
+ *        the definition of struct ktask_ctl).
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.
+ */
+int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
+		   struct ktask_ctl *ctl);
+
+/*
+ * XXX Two possible future enhancements related to error handling, should the
+ * need arise, are:
+ *
+ * - Add client specific error reporting.  It's possible for tasks to fail for
+ *   different reasons, so let the client pass a pointer for its own error
+ *   information.
+ *
+ * - Allow clients to pass an "undo" callback to ktask that is responsible for
+ *   undoing those parts of the task that fail if an error occurs.
+ */
+
+#else  /* CONFIG_KTASK */
+
+#define ktask_run(start, task_size, ctl)				      \
+	((ctl)->kc_thread_func((start),				              \
+			       (ctl)->kc_iter_advance((start), (task_size)),  \
+			       (ctl)->kc_thread_func_arg))
+
+#define ktask_run_numa(nodes, nr_nodes, ctl)				      \
+({									      \
+	size_t __i;							      \
+	int __ret = KTASK_RETURN_SUCCESS;				      \
+									      \
+	for (__i = 0; __i < (nr_nodes); ++__i) {			      \
+		__ret = (ctl)->kc_thread_func(				      \
+			    (nodes)[__i].kn_start,			      \
+			    (ctl)->kc_iter_advance((nodes)[__i].kn_start,     \
+						   (nodes)[__i].kn_task_size),\
+			    (ctl)->kc_thread_func_arg);			      \
+									      \
+		if (__ret == KTASK_RETURN_ERROR)			      \
+			break;						      \
+	}								      \
+									      \
+	__ret;								      \
+})
+
+#endif /* CONFIG_KTASK */
+
+/**
+ * Holds per-NUMA-node information about a task.
+ *
+ * @kn_start: An object that describes the start of the task on this NUMA node.
+ * @kn_task_size: The size of the task on this NUMA node (units are
+ *                task-specific).
+ * @kn_nid: The NUMA node id (or NUMA_NO_NODE, in which case the work is done on
+ *          the current node).
+ */
+struct ktask_node {
+	void		*kn_start;
+	size_t		kn_task_size;
+	int		kn_nid;
+};
+
+/**
+ * @KTASK_ASYNC: ktask_run* won't wait for the task to finish before returning.
+ */
+enum ktask_flags {
+	/* XXX last ktask thread has to kfree the struct ktask_node's */
+	KTASK_ASYNC = 0x1,
+};
+
+/**
+ * Called on each chunk of work that a ktask thread does, where the chunk is
+ * delimited by [start, end).  A thread may call this multiple times during one
+ * task.
+ *
+ * @start: An object that describes the start of the chunk.
+ * @end: An object that describes the end of the chunk.
+ * @arg: The thread function argument (provided with struct ktask_ctl).
+ *
+ * RETURNS:
+ * KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR.
+ */
+typedef int (*ktask_thread_func)(void *start, void *end, void *arg);
+
+/**
+ * An iterator function that advances the given position 'nsteps'.
+ *
+ * @position: An object that describes the current position in the task.
+ * @nsteps: The number of steps to advance in the task (in task-specific
+ *          units).
+ *
+ * RETURNS:
+ * An object representing the new position.
+ */
+typedef void *(*ktask_iter_func)(void *position, size_t nsteps);
+
+/**
+ * An iterator function for a contiguous range.  Clients should use this to
+ * avoid reinventing the wheel for this common case.
+ *
+ * The arguments and return value are the same as the ktask_iter_func function
+ * pointer type.
+ */
+void *ktask_iter_range(void *position, size_t nsteps);
+
+/**
+ * Client-provided per-task control information.
+ *
+ * @kc_thread_func: A thread function that completes one chunk of the task per
+ *                  call.
+ * @kc_thread_func_arg: An argument to be passed to the thread function.
+ * @kc_iter_advance: An iterator function to advance the iterator by some number
+ *                   of task-specific units.
+ * @kc_min_chunk_size: The minimum chunk size in task-specific units.  This
+ *                     allows the client to communicate the minimum amount of
+ *                     work that's appropriate for one worker thread to do at
+ *                     once.
+ * @kc_gfp_flags: gfp flags for allocating ktask metadata during the task.
+ * @kc_flags: ktask flags that control the behavior of the job:
+ *    - KTASK_ASYNC: ktask API calls may return before the task is finished.
+ *                   By default, ktask API calls do not return until the
+ *                   task is finished.
+ */
+struct ktask_ctl {
+	ktask_thread_func	kc_thread_func;
+	void			*kc_thread_func_arg;
+	ktask_iter_func		kc_iter_advance;
+	size_t			kc_min_chunk_size;
+	gfp_t			kc_gfp_flags;
+	int			kc_flags;
+};
+
+#define KTASK_CTL_INITIALIZER(thread_func, thread_func_arg, iter_advance, \
+			      min_chunk_size, gfp_flags, flags)		  \
+	{								  \
+		.kc_thread_func = (ktask_thread_func)(thread_func),	  \
+		.kc_thread_func_arg = (thread_func_arg),		  \
+		.kc_iter_advance = (iter_advance),			  \
+		.kc_min_chunk_size = (min_chunk_size),			  \
+		.kc_gfp_flags = (gfp_flags),				  \
+		.kc_flags = (flags),					  \
+	}
+
+/**
+ * Convenience macro to pack a struct ktask_ctl.
+ *
+ * Note that KTASK_CTL_INITIALIZER casts 'thread_func' to be of type
+ * ktask_thread_func.  This is to help clients write cleaner thread functions
+ * by relieving them of the need to cast the three void * arguments.  Clients
+ * can just use the actual argument types instead.
+ */
+#define DEFINE_KTASK_CTL(ctl_name, thread_func, thread_func_arg,	  \
+			 iter_advance, min_chunk_size, gfp_flags, flags)  \
+	struct ktask_ctl ctl_name =					  \
+		KTASK_CTL_INITIALIZER(thread_func, thread_func_arg,	  \
+				      iter_advance, min_chunk_size,	  \
+				      gfp_flags, flags)
+/**
+ * Similar to DEFINE_KTASK_CTL, but omits the iterator argument in favor of
+ * using ktask_iter_range.
+ */
+#define DEFINE_KTASK_CTL_RANGE(ctl_name, thread_func, thread_func_arg,	  \
+			 min_chunk_size, gfp_flags, flags)		  \
+	struct ktask_ctl ctl_name =					  \
+		KTASK_CTL_INITIALIZER(thread_func, thread_func_arg,	  \
+				      ktask_iter_range, min_chunk_size,	  \
+				      gfp_flags, flags)
+
+#endif /* _LINUX_KTASK_H */
diff --git a/include/linux/ktask_internal.h b/include/linux/ktask_internal.h
new file mode 100644
index 0000000..7b576f4
--- /dev/null
+++ b/include/linux/ktask_internal.h
@@ -0,0 +1,19 @@
+/*
+ * ktask_internal.h
+ *
+ * Framework to parallelize cpu-intensive kernel work such as zeroing
+ * huge pages or freeing many pages at once.  For more information, see
+ * Documentation/core-api/ktask.rst.
+ *
+ * This file contains implementation details of ktask for core kernel code that
+ * needs to be aware of them.  ktask clients should not include this file.
+ */
+#ifndef _LINUX_KTASK_INTERNAL_H
+#define _LINUX_KTASK_INTERNAL_H
+
+#ifdef CONFIG_KTASK
+/* Caps the number of threads that are allowed to be used in one task. */
+extern int ktask_max_threads;
+#endif
+
+#endif /* _LINUX_KTASK_INTERNAL_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f543a4..dc7099a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2557,5 +2557,10 @@ static inline bool page_is_guard(struct page *page)
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+#ifdef CONFIG_KTASK
+/* The minimum chunk size for a task that uses base page units. */
+#define	KTASK_BPGS_MINCHUNK	256
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/init/Kconfig b/init/Kconfig
index 1d3475f..b058fd7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -332,6 +332,13 @@ config AUDIT_TREE
 	depends on AUDITSYSCALL
 	select FSNOTIFY
 
+config KTASK
+	bool "Multithread cpu-intensive kernel tasks"
+	depends on SMP
+	default n
+	help
+          Parallelize expensive kernel tasks such as zeroing huge pages.
+
 source "kernel/irq/Kconfig"
 source "kernel/time/Kconfig"
 
diff --git a/kernel/Makefile b/kernel/Makefile
index 72aa080..7e6a2c3 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y     = fork.o exec_domain.o panic.o \
 	    extable.o params.o \
 	    kthread.o sys_ni.o nsproxy.o \
 	    notifier.o ksysfs.o cred.o reboot.o \
-	    async.o range.o smpboot.o ucount.o
+	    async.o range.o smpboot.o ucount.o ktask.o
 
 obj-$(CONFIG_MULTIUSER) += groups.o
 
diff --git a/kernel/ktask.c b/kernel/ktask.c
new file mode 100644
index 0000000..8437871
--- /dev/null
+++ b/kernel/ktask.c
@@ -0,0 +1,389 @@
+/*
+ * ktask.c
+ *
+ * Framework to parallelize cpu-intensive kernel work such as zeroing
+ * huge pages or freeing many pages at once.  For more information, see
+ * Documentation/core-api/ktask.rst.
+ *
+ * This is the ktask implementation; everything in this file is private to
+ * ktask.
+ */
+
+#include <linux/ktask.h>
+
+#ifdef CONFIG_KTASK
+
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/completion.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/mutex.h>
+#include <linux/printk.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+
+/*
+ * Shrink each thread's share of the task by this shift amount to produce
+ * smaller chunks, which helps load balance the work between worker threads.
+ */
+#define	KTASK_LOAD_BAL_SHIFT		2
+
+#define	KTASK_DEFAULT_MAX_THREADS	4
+
+/* Maximum number of threads for a single task. */
+int ktask_max_threads = KTASK_DEFAULT_MAX_THREADS;
+
+static struct workqueue_struct *ktask_wq;
+
+/* Used to pass ktask state to the workqueue API. */
+struct ktask_work {
+	struct work_struct kw_work;
+	void               *kw_state;
+};
+
+/* Internal per-task state hidden from clients. */
+struct ktask_state {
+	struct ktask_ctl	ks_ctl;
+	size_t			ks_total_size;
+	size_t			ks_chunk_size;
+	/* mutex protects nodes, nr_nodes_left, nthreads_fini, error */
+	struct mutex		ks_mutex;
+	struct ktask_node	*ks_nodes;
+	size_t			ks_nr_nodes;
+	size_t			ks_nr_nodes_left;
+	size_t			ks_nthreads;
+	size_t			ks_nthreads_fini;
+	int			ks_error; /* tracks error(s) from thread_func */
+	struct completion	ks_ktask_done;
+};
+
+static inline size_t ktask_get_start_node(struct ktask_node *nodes,
+					  size_t nr_nodes)
+{
+	int cur_nid = numa_node_id();
+	size_t fallback_i = 0;
+	size_t i;
+
+	for (i = 0; i < nr_nodes; ++i) {
+		if (nodes[i].kn_nid == cur_nid)
+			break;
+		else if (nodes[i].kn_nid == NUMA_NO_NODE)
+			fallback_i = i;
+	}
+
+	if (i >= nr_nodes)
+		i = fallback_i;
+
+	return i;
+}
+
+static void ktask_node_migrate(cpumask_var_t *saved_cpumask,
+			       struct ktask_node *kn,
+			       gfp_t gfp_flags, bool *migratedp)
+{
+	struct task_struct *p = current;
+	const struct cpumask *node_cpumask;
+	int ret;
+
+	/* Don't migrate a user thread; migrating to NUMA_NO_NODE is nonsense */
+	if (!(p->flags & PF_KTHREAD) || kn->kn_nid == NUMA_NO_NODE)
+		return;
+
+	node_cpumask = cpumask_of_node(kn->kn_nid);
+	/* No cpu to migrate to. */
+	if (cpumask_empty(node_cpumask))
+		return;
+
+	if (!*migratedp) {
+		/*
+		 * Save the workqueue thread's original mask so we can restore
+		 * it after the task is done.
+		 */
+		if (!alloc_cpumask_var(saved_cpumask, gfp_flags))
+			return;
+
+		cpumask_copy(*saved_cpumask, &p->cpus_allowed);
+	}
+
+	ret = set_cpus_allowed_ptr(current, node_cpumask);
+	if (ret == 0)
+		*migratedp = true;
+	else if (!*migratedp)
+		free_cpumask_var(*saved_cpumask);
+}
+
+static void ktask_task(struct work_struct *work)
+{
+	struct ktask_work  *kw;
+	struct ktask_state *ks;
+	struct ktask_ctl   *kc;
+	struct ktask_node  *kn;
+	size_t             nidx;
+	bool               done;
+	bool               migrated = false;
+	cpumask_var_t      saved_cpumask;
+
+	kw = container_of(work, struct ktask_work, kw_work);
+	ks = kw->kw_state;
+	kc = &ks->ks_ctl;
+
+	if (ks->ks_nr_nodes > 1)
+		nidx = ktask_get_start_node(ks->ks_nodes, ks->ks_nr_nodes);
+	else
+		nidx = 0;
+
+	WARN_ON(nidx >= ks->ks_nr_nodes);
+	kn = &ks->ks_nodes[nidx];
+
+	mutex_lock(&ks->ks_mutex);
+
+	while (ks->ks_total_size > 0 && ks->ks_error == KTASK_RETURN_SUCCESS) {
+		void *start, *end;
+		size_t nsteps;
+		int ret;
+
+		if (kn->kn_task_size == 0) {
+			/* The current node is out of work; pick a new one. */
+			size_t remaining_nodes_seen = 0;
+			size_t new_idx = prandom_u32_max(ks->ks_nr_nodes_left);
+
+			WARN_ON(ks->ks_nr_nodes_left == 0);
+			WARN_ON(new_idx >= ks->ks_nr_nodes_left);
+			for (nidx = 0; nidx < ks->ks_nr_nodes; ++nidx) {
+				if (ks->ks_nodes[nidx].kn_task_size == 0)
+					continue;
+
+				if (remaining_nodes_seen >= new_idx)
+					break;
+
+				++remaining_nodes_seen;
+			}
+			/* We should have found work on another node. */
+			WARN_ON(nidx >= ks->ks_nr_nodes);
+
+			kn = &ks->ks_nodes[nidx];
+
+			/* Temporarily migrate to the node we just chose. */
+			ktask_node_migrate(&saved_cpumask, kn, kc->kc_gfp_flags,
+					   &migrated);
+		}
+
+		start = kn->kn_start;
+		nsteps = min(ks->ks_chunk_size, kn->kn_task_size);
+		end = kc->kc_iter_advance(start, nsteps);
+		kn->kn_start = end;
+		WARN_ON(kn->kn_task_size < nsteps);
+		kn->kn_task_size -= nsteps;
+		WARN_ON(ks->ks_total_size < nsteps);
+		ks->ks_total_size -= nsteps;
+		if (kn->kn_task_size == 0) {
+			WARN_ON(ks->ks_nr_nodes_left == 0);
+			ks->ks_nr_nodes_left--;
+		}
+
+		mutex_unlock(&ks->ks_mutex);
+
+		ret = kc->kc_thread_func(start, end, kc->kc_thread_func_arg);
+
+		mutex_lock(&ks->ks_mutex);
+
+		if (ret == KTASK_RETURN_ERROR)
+			ks->ks_error = KTASK_RETURN_ERROR;
+	}
+
+	WARN_ON(ks->ks_nr_nodes_left > 0 &&
+		ks->ks_error == KTASK_RETURN_SUCCESS);
+
+	++ks->ks_nthreads_fini;
+	WARN_ON(ks->ks_nthreads_fini > ks->ks_nthreads);
+	done = (ks->ks_nthreads_fini == ks->ks_nthreads);
+	mutex_unlock(&ks->ks_mutex);
+
+	if (migrated) {
+		set_cpus_allowed_ptr(current, saved_cpumask);
+		free_cpumask_var(saved_cpumask);
+	}
+
+	if (done)
+		complete(&ks->ks_ktask_done);
+}
+
+/* Returns the number of threads to use for this task. */
+static inline size_t ktask_nthreads(size_t task_size, size_t min_chunk_size)
+{
+	size_t nthreads;
+
+	/* Ensure at least one thread when task_size < min_chunk_size. */
+	nthreads = DIV_ROUND_UP(task_size, min_chunk_size);
+
+	nthreads = min_t(size_t, nthreads, num_online_cpus());
+
+	nthreads = min_t(size_t, nthreads, ktask_max_threads);
+
+	return nthreads;
+}
+
+/*
+ * Returns the number of chunks to break this task into.
+ *
+ * The number of chunks will be at least the number of threads, but in the
+ * common case of a large task, the number of chunks will be greater to load
+ * balance the work between threads in case some threads finish their work more
+ * quickly than others.
+ */
+static inline size_t ktask_chunk_size(size_t task_size, size_t min_chunk_size,
+				    size_t nthreads)
+{
+	size_t chunk_size;
+
+	if (nthreads == 1)
+		return task_size;
+
+	chunk_size = (task_size / nthreads) >> KTASK_LOAD_BAL_SHIFT;
+
+	/*
+	 * chunk_size should be a multiple of min_chunk_size for tasks that
+	 * need to operate in fixed-size batches.
+	 */
+	if (chunk_size > min_chunk_size)
+		chunk_size -= (chunk_size % min_chunk_size);
+
+	return max(chunk_size, min_chunk_size);
+}
+
+int ktask_run_numa(struct ktask_node *nodes, size_t nr_nodes,
+		   struct ktask_ctl *ctl)
+{
+	size_t i;
+	struct ktask_work *kw;
+	struct ktask_state ks = {
+		.ks_ctl             = *ctl,
+		.ks_total_size        = 0,
+		.ks_nodes           = nodes,
+		.ks_nr_nodes        = nr_nodes,
+		.ks_nr_nodes_left   = nr_nodes,
+		.ks_nthreads_fini   = 0,
+		.ks_error           = KTASK_RETURN_SUCCESS,
+	};
+
+	for (i = 0; i < nr_nodes; ++i) {
+		ks.ks_total_size += nodes[i].kn_task_size;
+		if (nodes[i].kn_task_size == 0)
+			ks.ks_nr_nodes_left--;
+
+		WARN_ON(nodes[i].kn_nid >= MAX_NUMNODES);
+	}
+
+	if (ks.ks_total_size == 0)
+		return KTASK_RETURN_SUCCESS;
+
+	mutex_init(&ks.ks_mutex);
+
+	ks.ks_nthreads = ktask_nthreads(ks.ks_total_size,
+					ctl->kc_min_chunk_size);
+
+	ks.ks_chunk_size = ktask_chunk_size(ks.ks_total_size,
+					ctl->kc_min_chunk_size, ks.ks_nthreads);
+
+	init_completion(&ks.ks_ktask_done);
+
+	kw = kmalloc_array(ks.ks_nthreads, sizeof(struct ktask_work),
+			    ctl->kc_gfp_flags);
+	if (unlikely(!kw)) {
+		/* Low on memory; fall back to a single thread. */
+		struct ktask_work kw = {
+			.kw_work = __WORK_INITIALIZER(kw.kw_work, ktask_task),
+			.kw_state = &ks
+		};
+
+		ks.ks_nthreads = 1;
+
+		ktask_task(&kw.kw_work);
+		mutex_destroy(&ks.ks_mutex);
+
+		return ks.ks_error;
+	}
+
+	for (i = 1; i < ks.ks_nthreads; ++i) {
+		int cpu;
+		struct ktask_node *kn;
+
+		INIT_WORK(&kw[i].kw_work, ktask_task);
+		kw[i].kw_state = &ks;
+
+		/*
+		 * Spread workers evenly across nodes with work to do,
+		 * starting each worker on a cpu local to the nid of its
+		 * part of the task.
+		 */
+		kn = &nodes[i % nr_nodes];
+
+		if (kn->kn_nid == NUMA_NO_NODE) {
+			cpu = smp_processor_id();
+		} else {
+			/*
+			 * WQ_UNBOUND workqueues execute work on a cpu from
+			 * the node of the cpu we pass to queue_work_on, so
+			 * just pick any cpu to stand for the node.
+			 */
+			cpu = cpumask_any(cpumask_of_node(kn->kn_nid));
+		}
+
+		queue_work_on(cpu, ktask_wq, &kw[i].kw_work);
+	}
+
+	/*
+	 * Make ourselves one of the threads, which saves launching a workqueue
+	 * worker.
+	 */
+	INIT_WORK(&kw[0].kw_work, ktask_task);
+	kw[0].kw_state = &ks;
+	ktask_task(&kw[0].kw_work);
+
+	/* Wait for all the jobs to finish. */
+	wait_for_completion(&ks.ks_ktask_done);
+
+	kfree(kw);
+	mutex_destroy(&ks.ks_mutex);
+
+	return ks.ks_error;
+}
+EXPORT_SYMBOL_GPL(ktask_run_numa);
+
+int ktask_run(void *start, size_t task_size, struct ktask_ctl *ctl)
+{
+	struct ktask_node node;
+
+	node.kn_start = start;
+	node.kn_task_size = task_size;
+	node.kn_nid = NUMA_NO_NODE;
+
+	return ktask_run_numa(&node, 1, ctl);
+}
+EXPORT_SYMBOL_GPL(ktask_run);
+
+static int __init ktask_init(void)
+{
+	ktask_wq = alloc_workqueue("ktask_wq", WQ_UNBOUND, 0);
+	if (!ktask_wq) {
+		pr_err("%s: alloc_workqueue failed\n", __func__);
+		return -1;
+	}
+
+	return 0;
+}
+core_initcall(ktask_init);
+
+#endif /* CONFIG_KTASK */
+
+/*
+ * This function is defined outside CONFIG_KTASK so it can be called in the
+ * ktask_run and ktask_run_numa macros defined in ktask.h for CONFIG_KTASK=n
+ * kernels.
+ */
+void *ktask_iter_range(void *position, size_t nsteps)
+{
+	return (char *)position + nsteps;
+}
-- 
1.7.1


* [RFC PATCH v1 3/6] ktask: add /proc/sys/debug/ktask_max_threads
  2017-07-14 22:16 [RFC PATCH v1 0/6] ktask: multithread cpu-intensive kernel work daniel.m.jordan
  2017-07-14 22:16 ` [RFC PATCH v1 1/6] ktask: add documentation daniel.m.jordan
  2017-07-14 22:16 ` [RFC PATCH v1 2/6] ktask: multithread cpu-intensive kernel work daniel.m.jordan
@ 2017-07-14 22:16 ` daniel.m.jordan
  2017-07-14 22:16 ` [RFC PATCH v1 4/6] mm: enlarge type of offset argument in mem_map_offset and mem_map_next daniel.m.jordan
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: daniel.m.jordan @ 2017-07-14 22:16 UTC (permalink / raw)
  To: linux-kernel, linux-mm

Adds a proc file to control the maximum number of ktask threads in use
for any one job.  Its primary use is to aid in debugging.
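
For example, "echo 16 > /proc/sys/debug/ktask_max_threads" raises the per-task
thread cap from its default of 4 (KTASK_DEFAULT_MAX_THREADS, set in the
previous patch), and reading the file back reports the current cap.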

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 kernel/sysctl.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4dfba1a..7fe142b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -67,6 +67,7 @@
 #include <linux/kexec.h>
 #include <linux/bpf.h>
 #include <linux/mount.h>
+#include <linux/ktask_internal.h>
 
 #include <linux/uaccess.h>
 #include <asm/processor.h>
@@ -1850,6 +1851,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.extra2		= &one,
 	},
 #endif
+#if defined(CONFIG_KTASK)
+	{
+		.procname	= "ktask_max_threads",
+		.data		= &ktask_max_threads,
+		.maxlen		= sizeof(ktask_max_threads),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 	{ }
 };
 
-- 
1.7.1


* [RFC PATCH v1 4/6] mm: enlarge type of offset argument in mem_map_offset and mem_map_next
  2017-07-14 22:16 [RFC PATCH v1 0/6] ktask: multithread cpu-intensive kernel work daniel.m.jordan
                   ` (2 preceding siblings ...)
  2017-07-14 22:16 ` [RFC PATCH v1 3/6] ktask: add /proc/sys/debug/ktask_max_threads daniel.m.jordan
@ 2017-07-14 22:16 ` daniel.m.jordan
  2017-07-14 22:16 ` [RFC PATCH v1 5/6] mm: parallelize clear_gigantic_page daniel.m.jordan
  2017-07-14 22:16 ` [RFC PATCH v1 6/6] hugetlbfs: parallelize hugetlbfs_fallocate with ktask daniel.m.jordan
  5 siblings, 0 replies; 9+ messages in thread
From: daniel.m.jordan @ 2017-07-14 22:16 UTC (permalink / raw)
  To: linux-kernel, linux-mm

Changes the type of 'offset' from int to unsigned long in both
mem_map_offset and mem_map_next.

This facilitates ktask's use of mem_map_next with its unsigned long
types to avoid silent truncation when these unsigned longs are passed as
ints.

It also fixes the preexisting truncation of 'offset' from unsigned long
to int by the sole caller of mem_map_offset, follow_hugetlb_page.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/internal.h |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558..96d9669 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -365,7 +365,8 @@ static inline void mlock_migrate_page(struct page *new, struct page *old) { }
  * the maximally aligned gigantic page 'base'.  Handle any discontiguity
  * in the mem_map at MAX_ORDER_NR_PAGES boundaries.
  */
-static inline struct page *mem_map_offset(struct page *base, int offset)
+static inline struct page *mem_map_offset(struct page *base,
+					  unsigned long offset)
 {
 	if (unlikely(offset >= MAX_ORDER_NR_PAGES))
 		return nth_page(base, offset);
@@ -376,8 +377,8 @@ static inline void mlock_migrate_page(struct page *new, struct page *old) { }
  * Iterator over all subpages within the maximally aligned gigantic
  * page 'base'.  Handle any discontiguity in the mem_map.
  */
-static inline struct page *mem_map_next(struct page *iter,
-						struct page *base, int offset)
+static inline struct page *mem_map_next(struct page *iter, struct page *base,
+					unsigned long offset)
 {
 	if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
 		unsigned long pfn = page_to_pfn(base) + offset;
-- 
1.7.1


* [RFC PATCH v1 5/6] mm: parallelize clear_gigantic_page
  2017-07-14 22:16 [RFC PATCH v1 0/6] ktask: multithread cpu-intensive kernel work daniel.m.jordan
                   ` (3 preceding siblings ...)
  2017-07-14 22:16 ` [RFC PATCH v1 4/6] mm: enlarge type of offset argument in mem_map_offset and mem_map_next daniel.m.jordan
@ 2017-07-14 22:16 ` daniel.m.jordan
  2017-07-17 16:02   ` Dave Hansen
  2017-07-14 22:16 ` [RFC PATCH v1 6/6] hugetlbfs: parallelize hugetlbfs_fallocate with ktask daniel.m.jordan
  5 siblings, 1 reply; 9+ messages in thread
From: daniel.m.jordan @ 2017-07-14 22:16 UTC (permalink / raw)
  To: linux-kernel, linux-mm

Parallelize clear_gigantic_page, which zeroes any page size larger than
8M (e.g. 1G on x86 or 2G on SPARC).

Performance results (the default number of threads is 4; higher thread
counts shown for context only):

Machine: SPARC T7-4, 1024 cpus, 504G memory
Test:    Clear a range of gigantic pages

nthread   speedup   size (GiB)   min time (s)   stdev
      1                     50           7.77    0.02
      2     1.97x           50           3.95    0.04
      4     3.85x           50           2.02    0.05
      8     6.27x           50           1.24    0.10
     16     9.84x           50           0.79    0.06

      1                    100          15.50    0.07
      2     1.91x          100           8.10    0.05
      4     3.48x          100           4.45    0.07
      8     5.18x          100           2.99    0.05
     16     7.79x          100           1.99    0.12

      1                    200          31.03    0.15
      2     1.88x          200          16.47    0.02
      4     3.37x          200           9.20    0.14
      8     5.16x          200           6.01    0.19
     16     7.04x          200           4.41    0.06

Machine:  Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 cpus, 1T memory
Test:    Clear a range of gigantic pages

nthread   speedup   size (GiB)   min time (s)   stdev
      1                    100          41.13    0.03
      2     2.03x          100          20.26    0.14
      4     4.28x          100           9.62    0.09
      8     8.39x          100           4.90    0.05
     16    10.44x          100           3.94    0.03

      1                    200          89.68    0.35
      2     2.21x          200          40.64    0.18
      4     4.64x          200          19.33    0.32
      8     8.99x          200           9.98    0.04
     16    11.27x          200           7.96    0.04

      1                    400         188.20    1.57
      2     2.30x          400          81.84    0.09
      4     4.63x          400          40.62    0.26
      8     8.92x          400          21.09    0.50
     16    11.78x          400          15.97    0.25

      1                    800         434.91    1.81
      2     2.54x          800         170.97    1.46
      4     4.98x          800          87.38    1.91
      8    10.15x          800          42.86    2.59
     16    12.99x          800          33.48    0.83

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/memory.c |   35 +++++++++++++++++++++++++++--------
 1 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index bb11c47..073745f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -68,6 +68,7 @@
 #include <linux/debugfs.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/dax.h>
+#include <linux/ktask.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -4294,27 +4295,45 @@ void __might_fault(const char *file, int line)
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
-static void clear_gigantic_page(struct page *page,
-				unsigned long addr,
-				unsigned int pages_per_huge_page)
+
+struct cgp_args {
+	struct page	*base_page;
+	unsigned long	addr;
+};
+
+static int clear_gigantic_page_chunk(unsigned long start, unsigned long end,
+				     struct cgp_args *args)
 {
-	int i;
-	struct page *p = page;
+	struct page *base_page = args->base_page;
+	struct page *p = base_page;
+	unsigned long addr = args->addr;
+	unsigned long i;
 
 	might_sleep();
-	for (i = 0; i < pages_per_huge_page;
-	     i++, p = mem_map_next(p, page, i)) {
+	for (i = start; i < end; ++i) {
 		cond_resched();
 		clear_user_highpage(p, addr + i * PAGE_SIZE);
+
+		p = mem_map_next(p, base_page, i);
 	}
+
+	return KTASK_RETURN_SUCCESS;
 }
+
 void clear_huge_page(struct page *page,
 		     unsigned long addr, unsigned int pages_per_huge_page)
 {
 	int i;
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, addr, pages_per_huge_page);
+		struct cgp_args args = {page, addr};
+		struct ktask_node node = {0, pages_per_huge_page,
+					  page_to_nid(page)};
+		DEFINE_KTASK_CTL_RANGE(ctl, clear_gigantic_page_chunk, &args,
+				       KTASK_BPGS_MINCHUNK, GFP_KERNEL, 0);
+
+		ktask_run_numa(&node, 1, &ctl);
+
 		return;
 	}
 
-- 
1.7.1


* [RFC PATCH v1 6/6] hugetlbfs: parallelize hugetlbfs_fallocate with ktask
  2017-07-14 22:16 [RFC PATCH v1 0/6] ktask: multithread cpu-intensive kernel work daniel.m.jordan
                   ` (4 preceding siblings ...)
  2017-07-14 22:16 ` [RFC PATCH v1 5/6] mm: parallelize clear_gigantic_page daniel.m.jordan
@ 2017-07-14 22:16 ` daniel.m.jordan
  5 siblings, 0 replies; 9+ messages in thread
From: daniel.m.jordan @ 2017-07-14 22:16 UTC (permalink / raw)
  To: linux-kernel, linux-mm

hugetlbfs_fallocate preallocates huge pages to back a file in a
hugetlbfs filesystem.  The time to call this function grows linearly
with size.

ktask performs well with its default thread count of 4; higher thread
counts are given for context only.

Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 cpus, 1T memory
Test:    fallocate(1) a file on a hugetlbfs filesystem

nthread   speedup   size (GiB)   min time (s)   stdev
      1                    200         127.53    2.19
      2     3.09x          200          41.30    2.11
      4     5.72x          200          22.29    0.51
      8     9.45x          200          13.50    2.58
     16     9.74x          200          13.09    1.64

      1                    400         193.09    2.47
      2     2.14x          400          90.31    3.39
      4     3.84x          400          50.32    0.44
      8     5.11x          400          37.75    1.23
     16     6.12x          400          31.54    3.13

Machine: SPARC T7-4, 1024 cpus, 504G memory
Test:    fallocate(1) a file on a hugetlbfs filesystem

nthread   speedup   size (GiB)   min time (s)   stdev

      1                    100          15.55    0.05
      2     1.92x          100           8.08    0.01
      4     3.55x          100           4.38    0.02
      8     5.87x          100           2.65    0.06
     16     6.45x          100           2.41    0.09

      1                    200          31.26    0.02
      2     1.92x          200          16.26    0.02
      4     3.58x          200           8.73    0.04
      8     5.54x          200           5.64    0.16
     16     6.96x          200           4.49    0.35

      1                    400          62.18    0.09
      2     1.98x          400          31.36    0.04
      4     3.55x          400          17.52    0.03
      8     5.53x          400          11.25    0.04
     16     6.61x          400           9.40    0.17

The primary bottleneck for better scaling at higher thread counts is
hugetlb_fault_mutex_table[hash].  perf showed L1-dcache-loads increase
with 8 threads and again sharply with 16 threads, and a cpu counter
profile showed that 31% of the L1d misses were on
hugetlb_fault_mutex_table[hash] in the 16-thread case.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 fs/hugetlbfs/inode.c |  122 +++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 100 insertions(+), 22 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d44f545..1e1d136 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -36,6 +36,7 @@
 #include <linux/magic.h>
 #include <linux/migrate.h>
 #include <linux/uio.h>
+#include <linux/ktask.h>
 
 #include <linux/uaccess.h>
 
@@ -86,11 +87,16 @@ enum {
 };
 
 #ifdef CONFIG_NUMA
+static inline struct shared_policy *hugetlb_get_shared_policy(
+							struct inode *inode)
+{
+	return &HUGETLBFS_I(inode)->policy;
+}
+
 static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
-					struct inode *inode, pgoff_t index)
+				struct shared_policy *policy, pgoff_t index)
 {
-	vma->vm_policy = mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
-							index);
+	vma->vm_policy = mpol_shared_policy_lookup(policy, index);
 }
 
 static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
@@ -98,8 +104,14 @@ static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
 	mpol_cond_put(vma->vm_policy);
 }
 #else
+static inline struct shared_policy *hugetlb_get_shared_policy(
+							struct inode *inode)
+{
+	return NULL;
+}
+
 static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
-					struct inode *inode, pgoff_t index)
+				struct shared_policy *policy, pgoff_t index)
 {
 }
 
@@ -551,19 +563,29 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	return 0;
 }
 
+struct hf_args {
+	struct file		*file;
+	struct task_struct	*parent_task;
+	struct mm_struct	*mm;
+	struct shared_policy	*shared_policy;
+	struct hstate		*hstate;
+	struct address_space	*mapping;
+	int			error;
+};
+
+static int hugetlbfs_fallocate_chunk(pgoff_t start, pgoff_t end,
+				     struct hf_args *args);
+
 static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 				loff_t len)
 {
 	struct inode *inode = file_inode(file);
-	struct address_space *mapping = inode->i_mapping;
 	struct hstate *h = hstate_inode(inode);
-	struct vm_area_struct pseudo_vma;
-	struct mm_struct *mm = current->mm;
 	loff_t hpage_size = huge_page_size(h);
 	unsigned long hpage_shift = huge_page_shift(h);
-	pgoff_t start, index, end;
+	pgoff_t start, end;
+	struct hf_args hf_args;
 	int error;
-	u32 hash;
 
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
 		return -EOPNOTSUPP;
@@ -586,16 +608,72 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	if (error)
 		goto out;
 
+	hf_args.file = file;
+	hf_args.parent_task = current;
+	hf_args.mm = current->mm;
+	hf_args.shared_policy = hugetlb_get_shared_policy(inode);
+	hf_args.hstate = h;
+	hf_args.mapping = inode->i_mapping;
+	hf_args.error = 0;
+
+	if (unlikely(hstate_is_gigantic(h))) {
+		/*
+		 * Use multiple threads in clear_gigantic_page instead of here,
+		 * so just do a 1-threaded hugetlbfs_fallocate_chunk.
+		 */
+		error = hugetlbfs_fallocate_chunk(start, end, &hf_args);
+	} else {
+		DEFINE_KTASK_CTL_RANGE(ctl, hugetlbfs_fallocate_chunk,
+				       &hf_args, KTASK_BPGS_MINCHUNK,
+				       GFP_KERNEL, 0);
+
+		error = ktask_run((void *)start, end - start, &ctl);
+	}
+
+	if (error == KTASK_RETURN_ERROR && hf_args.error != -EINTR)
+		goto out;
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
+		i_size_write(inode, offset + len);
+	inode->i_ctime = current_time(inode);
+out:
+	inode_unlock(inode);
+	return error;
+}
+
+static int hugetlbfs_fallocate_chunk(pgoff_t start, pgoff_t end,
+				     struct hf_args *args)
+{
+	pgoff_t			index;
+	struct vm_area_struct	pseudo_vma;
+	struct task_struct	*parent_task;
+	struct mm_struct	*mm;
+	struct shared_policy	*shared_policy;
+	struct file		*file;
+	struct address_space	*mapping;
+	struct hstate		*h;
+	loff_t			hpage_size;
+	int			error = 0;
+	u32			hash;
+
+	file = args->file;
+	parent_task = args->parent_task;
+	mm = args->mm;
+	shared_policy = args->shared_policy;
+	h = args->hstate;
+	mapping = args->mapping;
+	hpage_size = huge_page_size(h);
+
 	/*
 	 * Initialize a pseudo vma as this is required by the huge page
 	 * allocation routines.  If NUMA is configured, use page index
-	 * as input to create an allocation policy.
+	 * as input to create an allocation policy.  Each thread gets its
+	 * own pseudo vma because mempolicies can differ by page.
 	 */
 	memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
 	pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
 	pseudo_vma.vm_file = file;
 
-	for (index = start; index < end; index++) {
+	for (index = (pgoff_t)start; index < (pgoff_t)end; ++index) {
 		/*
 		 * This is supposed to be the vaddr where the page is being
 		 * faulted in, but we have no vaddr here.
@@ -610,13 +688,13 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		 * fallocate(2) manpage permits EINTR; we may have been
 		 * interrupted because we are using up too much memory.
 		 */
-		if (signal_pending(current)) {
+		if (signal_pending(parent_task) || signal_pending(current)) {
 			error = -EINTR;
-			break;
+			goto err;
 		}
 
 		/* Set numa allocation policy based on index */
-		hugetlb_set_vma_policy(&pseudo_vma, inode, index);
+		hugetlb_set_vma_policy(&pseudo_vma, shared_policy, index);
 
 		/* addr is the offset within the file (zero based) */
 		addr = index * hpage_size;
@@ -641,7 +719,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		if (IS_ERR(page)) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			error = PTR_ERR(page);
-			goto out;
+			goto err;
 		}
 		clear_huge_page(page, addr, pages_per_huge_page(h));
 		__SetPageUptodate(page);
@@ -649,7 +727,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		if (unlikely(error)) {
 			put_page(page);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			goto out;
+			goto err;
 		}
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
@@ -662,12 +740,12 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		unlock_page(page);
 	}
 
-	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
-		i_size_write(inode, offset + len);
-	inode->i_ctime = current_time(inode);
-out:
-	inode_unlock(inode);
-	return error;
+	return KTASK_RETURN_SUCCESS;
+
+err:
+	args->error = error;
+
+	return KTASK_RETURN_ERROR;
 }
 
 static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
-- 
1.7.1


* Re: [RFC PATCH v1 5/6] mm: parallelize clear_gigantic_page
  2017-07-14 22:16 ` [RFC PATCH v1 5/6] mm: parallelize clear_gigantic_page daniel.m.jordan
@ 2017-07-17 16:02   ` Dave Hansen
  2017-07-18  1:49     ` Daniel Jordan
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Hansen @ 2017-07-17 16:02 UTC (permalink / raw)
  To: daniel.m.jordan, linux-kernel, linux-mm

On 07/14/2017 03:16 PM, daniel.m.jordan@oracle.com wrote:
> Machine:  Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 cpus, 1T memory
> Test:    Clear a range of gigantic pages
> nthread   speedup   size (GiB)   min time (s)   stdev
>       1                    100          41.13    0.03
>       2     2.03x          100          20.26    0.14
>       4     4.28x          100           9.62    0.09
>       8     8.39x          100           4.90    0.05
>      16    10.44x          100           3.94    0.03
...
>       1                    800         434.91    1.81
>       2     2.54x          800         170.97    1.46
>       4     4.98x          800          87.38    1.91
>       8    10.15x          800          42.86    2.59
>      16    12.99x          800          33.48    0.83

What was the actual test here?  Did you just use sysfs to allocate 800GB
of 1GB huge pages?

This test should be entirely memory-bandwidth-limited, right?  Are you
contending here that a single core can only use 1/10th of the memory
bandwidth when clearing a page?

Or, does all the gain here come because we are round-robin-allocating
the pages across all 8 NUMA nodes' memory controllers and the speedup
here is because we're not doing the clearing across the interconnect?


* Re: [RFC PATCH v1 5/6] mm: parallelize clear_gigantic_page
  2017-07-17 16:02   ` Dave Hansen
@ 2017-07-18  1:49     ` Daniel Jordan
  0 siblings, 0 replies; 9+ messages in thread
From: Daniel Jordan @ 2017-07-18  1:49 UTC (permalink / raw)
  To: Dave Hansen, linux-mm, linux-kernel

On 07/17/2017 12:02 PM, Dave Hansen wrote:
> On 07/14/2017 03:16 PM, daniel.m.jordan@oracle.com wrote:
>> Machine:  Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 cpus, 1T memory
>> Test:    Clear a range of gigantic pages
>> nthread   speedup   size (GiB)   min time (s)   stdev
>>        1                    100          41.13    0.03
>>        2     2.03x          100          20.26    0.14
>>        4     4.28x          100           9.62    0.09
>>        8     8.39x          100           4.90    0.05
>>       16    10.44x          100           3.94    0.03
> ...
>>        1                    800         434.91    1.81
>>        2     2.54x          800         170.97    1.46
>>        4     4.98x          800          87.38    1.91
>>        8    10.15x          800          42.86    2.59
>>       16    12.99x          800          33.48    0.83
> What was the actual test here?  Did you just use sysfs to allocate 800GB
> of 1GB huge pages?

I used fallocate(1) on a hugetlbfs, so this test is similar to the 6th 
patch in the series, but here we parallelize only the page clearing 
function since gigantic pages are large enough to benefit from multiple 
threads, whereas we parallelize at the level of hugetlbfs_fallocate in 
patch 6 for smaller page sizes (e.g. 2M on x86).

> This test should be entirely memory-bandwidth-limited, right?

That's right, the page clearing function dominates the test, so it's 
memory-bandwidth-limited.

> Are you
> contending here that a single core can only use 1/10th of the memory
> bandwidth when clearing a page?

Yes, this is the biggest factor here.  More threads can use more memory 
bandwidth.

And yes, in the page clearing loop exercised in the test, a single 
thread can use only a fraction of the chip's theoretical memory 
bandwidth.  This is the page clearing loop I'm stressing:

ENTRY(clear_page_erms)
movl $4096,%ecx
xorl %eax,%eax
rep stosb
ret

On my test machine, it tops out at around 2550 MiB/s with 1 thread, and 
I get that same rate for each of 2, 4, or 8 threads when running on the 
same chip (i.e. group of 18 cores for this machine).  It's only at 16 
threads on the same chip that it starts to drop, falling to something 
around 1420 MiB/s.

> Or, does all the gain here come because we are round-robin-allocating
> the pages across all 8 NUMA nodes' memory controllers and the speedup
> here is because we're not doing the clearing across the interconnect?

The default NUMA policy was used for all results shown, so there was no 
round-robin'ing at small sizes.  For example, in the 100 GiB case, all 
pages were allocated from the same node.  But when it gets up to 800 
GiB, obviously we're allocating from many nodes so that we get a sort of 
round-robin effect and NUMA starts to matter.  For instance, the 
1-thread case does better on 100 GiB with all local accesses than on 800 
GiB with mostly remote accesses.

ktask's ability to run NUMA-aware threads helps out here so that we're 
not clearing across the interconnect, which is why the speedups get 
better as the sizes get larger.

Thanks for your questions.

Daniel


Thread overview: 9+ messages
2017-07-14 22:16 [RFC PATCH v1 0/6] ktask: multithread cpu-intensive kernel work daniel.m.jordan
2017-07-14 22:16 ` [RFC PATCH v1 1/6] ktask: add documentation daniel.m.jordan
2017-07-14 22:16 ` [RFC PATCH v1 2/6] ktask: multithread cpu-intensive kernel work daniel.m.jordan
2017-07-14 22:16 ` [RFC PATCH v1 3/6] ktask: add /proc/sys/debug/ktask_max_threads daniel.m.jordan
2017-07-14 22:16 ` [RFC PATCH v1 4/6] mm: enlarge type of offset argument in mem_map_offset and mem_map_next daniel.m.jordan
2017-07-14 22:16 ` [RFC PATCH v1 5/6] mm: parallelize clear_gigantic_page daniel.m.jordan
2017-07-17 16:02   ` Dave Hansen
2017-07-18  1:49     ` Daniel Jordan
2017-07-14 22:16 ` [RFC PATCH v1 6/6] hugetlbfs: parallelize hugetlbfs_fallocate with ktask daniel.m.jordan
