From: Daniel Jordan <daniel.m.jordan@oracle.com>
To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: aarcange@redhat.com, aaron.lu@intel.com,
akpm@linux-foundation.org, alex.williamson@redhat.com,
bsd@redhat.com, daniel.m.jordan@oracle.com,
darrick.wong@oracle.com, dave.hansen@linux.intel.com,
jgg@mellanox.com, jwadams@google.com, jiangshanlai@gmail.com,
mhocko@kernel.org, mike.kravetz@oracle.com,
Pavel.Tatashin@microsoft.com, prasad.singamsetty@oracle.com,
rdunlap@infradead.org, steven.sistare@oracle.com,
tim.c.chen@intel.com, tj@kernel.org, vbabka@suse.cz
Subject: [RFC PATCH v4 01/13] ktask: add documentation
Date: Mon, 5 Nov 2018 11:55:46 -0500 [thread overview]
Message-ID: <20181105165558.11698-2-daniel.m.jordan@oracle.com> (raw)
In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com>
Motivates and explains the ktask API for kernel clients.
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
Documentation/core-api/index.rst | 1 +
Documentation/core-api/ktask.rst | 213 +++++++++++++++++++++++++++++++
2 files changed, 214 insertions(+)
create mode 100644 Documentation/core-api/ktask.rst
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 3adee82be311..c143a280a5b1 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -18,6 +18,7 @@ Core utilities
refcount-vs-atomic
cpu_hotplug
idr
+ ktask
local_ops
workqueue
genericirq
diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst
new file mode 100644
index 000000000000..c3c00e1f802f
--- /dev/null
+++ b/Documentation/core-api/ktask.rst
@@ -0,0 +1,213 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+============================================
+ktask: parallelize CPU-intensive kernel work
+============================================
+
+:Date: November, 2018
+:Author: Daniel Jordan <daniel.m.jordan@oracle.com>
+
+
+Introduction
+============
+
+ktask is a generic framework for parallelizing CPU-intensive work in the
+kernel. The intended use is for big machines that can use their CPU power to
+speed up large tasks that can't otherwise be multithreaded in userland. The
+API is generic enough to add concurrency to many different kinds of tasks--for
+example, page clearing over an address range or freeing a list of pages--and
+aims to save its clients the trouble of splitting up the work, choosing the
+number of helper threads to use, maintaining an efficient concurrency level,
+starting these threads, and load balancing the work between them.
+
+
+Motivation
+==========
+
+A single CPU can spend an excessive amount of time in the kernel operating on
+large amounts of data. Often these situations arise during initialization- and
+destruction-related tasks, where the data involved scales with system size.
+These long-running jobs can slow startup and shutdown of applications and the
+system itself while extra CPUs sit idle.
+
+To ensure that applications and the kernel continue to perform well as core
+counts and memory sizes increase, the kernel harnesses these idle CPUs to
+complete such jobs more quickly.
+
+For example, when booting a large NUMA machine, ktask uses additional CPUs that
+would otherwise be idle until the machine is fully up to avoid a needless
+bottleneck during system boot and allow the kernel to take advantage of unused
+memory bandwidth. Similarly, when starting a large VM using VFIO, ktask takes
+advantage of the VM's idle CPUs during VFIO page pinning rather than have the
+VM's boot blocked on one thread doing all the work.
+
+ktask is not a substitute for single-threaded optimization. However, there is
+a point where a single CPU hits a wall despite performance tuning, so
+parallelize!
+
+
+Concept
+=======
+
+ktask is built on unbound workqueues to take advantage of the thread management
+facilities it provides: creation, destruction, flushing, priority setting, and
+NUMA affinity.
+
+A little terminology up front: A 'task' is the total work there is to do and a
+'chunk' is a unit of work given to a thread.
+
+To complete a task using the ktask framework, a client provides a thread
+function that is responsible for completing one chunk. The thread function is
+defined in a standard way, with start and end arguments that delimit the chunk
+as well as an argument that the client uses to pass data specific to the task.
+
+In addition, the client supplies an object representing the start of the task
+and an iterator function that knows how to advance some number of units in the
+task to yield another object representing the new task position. The framework
+uses the start object and iterator internally to divide the task into chunks.
+
+Finally, the client passes the total task size and a minimum chunk size to
+indicate the minimum amount of work that's appropriate to do in one chunk. The
+sizes are given in task-specific units (e.g. pages, inodes, bytes). The
+framework uses these sizes, along with the number of online CPUs and an
+internal maximum number of threads, to decide how many threads to start and how
+many chunks to divide the task into.
+
+For example, consider the task of clearing a gigantic page. This used to be
+done in a single thread with a for loop that calls a page clearing function for
+each constituent base page. To parallelize with ktask, the client first moves
+the for loop to the thread function, adapting it to operate on the range passed
+to the function. In this simple case, the thread function's start and end
+arguments are just addresses delimiting the portion of the gigantic page to
+clear. Then, where the for loop used to be, the client calls into ktask with
+the start address of the gigantic page, the total size of the gigantic page,
+and the thread function. Internally, ktask will divide the address range into
+an appropriate number of chunks and start an appropriate number of threads to
+complete these chunks.
+
+
+Configuration
+=============
+
+To use ktask, configure the kernel with CONFIG_KTASK=y.
+
+If CONFIG_KTASK=n, calls to the ktask API are simply #define'd to run the
+thread function that the client provides so that the task is completed without
+concurrency in the current thread.
+
+
+Interface
+=========
+
+.. kernel-doc:: include/linux/ktask.h
+
+
+Resource Limits
+===============
+
+ktask has resource limits on the number of work items it sends to workqueue.
+In ktask, a workqueue item is a thread that runs chunks of the task until the
+task is finished.
+
+These limits support the different ways ktask uses workqueues:
+ - ktask_run to run threads on the calling thread's node.
+ - ktask_run_numa to run threads on the node(s) specified.
+ - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the
+ system.
+
+To support these different ways of queueing work while maintaining an efficient
+concurrency level, we need both system-wide and per-node limits on the number
+of threads. Without per-node limits, a node might become oversubscribed
+despite ktask staying within the system-wide limit, and without a system-wide
+limit, we can't properly account for work that can run on any node.
+
+The system-wide limit is based on the total number of CPUs, and the per-node
+limit on the CPU count for each node. A per-node work item counts against the
+system-wide limit. Workqueue's max_active can't accommodate both types of
+limit, no matter how many workqueues are used, so ktask implements its own.
+
+If a per-node limit is reached, the work item is allowed to run anywhere on the
+machine to avoid overwhelming the node. If the global limit is also reached,
+ktask won't queue additional work items until we fall below the limit again.
+
+These limits apply only to workqueue items--that is, helper threads beyond the
+one starting the task. That way, one thread per task is always allowed to run.
+
+
+Scheduler Interaction
+=====================
+
+Even within the resource limits, ktask must take care to run a number of
+threads appropriate for the system's current CPU load. Under high CPU usage,
+starting excessive helper threads may disturb other tasks, unfairly taking CPU
+time away from them for the sake of an optimized kernel code path.
+
+ktask plays nicely in this case by setting helper threads to the lowest
+scheduling priority on the system (MAX_NICE). This way, helpers' CPU time is
+appropriately throttled on a busy system and other tasks are not disturbed.
+
+The main thread initiating the task remains at its original priority so that it
+still makes progress on a busy system.
+
+It is possible for a helper thread to start running and then be forced off-CPU
+by a higher priority thread. With the helper's CPU time curtailed by MAX_NICE,
+the main thread may wait longer for the task to finish than it would have had
+it not started any helpers, so to ensure forward progress at a single-threaded
+pace, once the main thread is finished with all outstanding work in the task,
+the main thread wills its priority to one helper thread at a time. At least
+one thread will then always be running at the priority of the calling thread.
+
+
+Cgroup Awareness
+================
+
+Given the potentially large amount of CPU time ktask threads may consume, they
+should be aware of the cgroup of the task that called into ktask and
+appropriately throttled.
+
+TODO: Implement cgroup-awareness in unbound workqueues.
+
+
+Power Management
+================
+
+Starting additional helper threads may cause the system to consume more energy,
+which is undesirable on energy-conscious devices. Therefore ktask needs to be
+aware of cpufreq policies and scaling governors.
+
+If an energy-conscious policy is in use (e.g. powersave, conservative) on any
+part of the system, that is a signal that the user has strong power management
+preferences, in which case ktask is disabled.
+
+TODO: Implement this.
+
+
+Backward Compatibility
+======================
+
+ktask is written so that existing calls to the API will be backwards compatible
+should the API gain new features in the future. This is accomplished by
+restricting API changes to members of struct ktask_ctl and having clients make
+an opaque initialization call (DEFINE_KTASK_CTL). This initialization can then
+be modified to include any new arguments so that existing call sites stay the
+same.
+
+
+Error Handling
+==============
+
+Calls to ktask fail only if the provided thread function fails. In particular,
+ktask avoids allocating memory internally during a task, so it's safe to use in
+sensitive contexts.
+
+Tasks can fail midway through their work. To recover, the finished chunks of
+work need to be undone in a task-specific way, so ktask allows clients to pass
+an "undo" callback that is responsible for undoing one chunk of work. To avoid
+multiple levels of error handling, this "undo" callback should not be allowed
+to fail. For simplicity and because it's a slow path, undoing is not
+multithreaded.
+
+Each call to ktask_run and ktask_run_numa returns a single value,
+KTASK_RETURN_SUCCESS or a client-specific value. Since threads can fail for
+different reasons, however, ktask may need the ability to return
+thread-specific error information. This can be added later if needed.
--
2.19.1
next prev parent reply other threads:[~2018-11-05 16:55 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-11-05 16:55 [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work Daniel Jordan
2018-11-05 16:55 ` Daniel Jordan [this message]
2018-11-05 21:19 ` [RFC PATCH v4 01/13] ktask: add documentation Randy Dunlap
2018-11-06 2:27 ` Daniel Jordan
2018-11-06 8:49 ` Peter Zijlstra
2018-11-06 20:34 ` Daniel Jordan
2018-11-06 20:51 ` Jason Gunthorpe
2018-11-07 10:27 ` Peter Zijlstra
2018-11-07 20:21 ` Daniel Jordan
2018-11-07 10:35 ` Peter Zijlstra
2018-11-07 21:20 ` Daniel Jordan
2018-11-08 17:26 ` Jonathan Corbet
2018-11-08 19:15 ` Daniel Jordan
2018-11-08 19:24 ` Jonathan Corbet
2018-11-27 19:50 ` Pavel Machek
2018-11-28 16:56 ` Daniel Jordan
2018-11-05 16:55 ` [RFC PATCH v4 02/13] ktask: multithread CPU-intensive kernel work Daniel Jordan
2018-11-05 20:51 ` Randy Dunlap
2018-11-06 2:24 ` Daniel Jordan
2018-11-05 16:55 ` [RFC PATCH v4 03/13] ktask: add undo support Daniel Jordan
2018-11-05 16:55 ` [RFC PATCH v4 04/13] ktask: run helper threads at MAX_NICE Daniel Jordan
2018-11-05 16:55 ` [RFC PATCH v4 05/13] workqueue, ktask: renice helper threads to prevent starvation Daniel Jordan
2018-11-13 16:34 ` Tejun Heo
2018-11-19 16:45 ` Daniel Jordan
2018-11-20 16:33 ` Tejun Heo
2018-11-20 17:03 ` Daniel Jordan
2018-11-05 16:55 ` [RFC PATCH v4 06/13] vfio: parallelize vfio_pin_map_dma Daniel Jordan
2018-11-05 21:51 ` Alex Williamson
2018-11-06 2:42 ` Daniel Jordan
2018-11-05 16:55 ` [RFC PATCH v4 07/13] mm: change locked_vm's type from unsigned long to atomic_long_t Daniel Jordan
2018-11-05 16:55 ` [RFC PATCH v4 08/13] vfio: remove unnecessary mmap_sem writer acquisition around locked_vm Daniel Jordan
2018-11-05 16:55 ` [RFC PATCH v4 09/13] vfio: relieve mmap_sem reader cacheline bouncing by holding it longer Daniel Jordan
2018-11-05 16:55 ` [RFC PATCH v4 10/13] mm: enlarge type of offset argument in mem_map_offset and mem_map_next Daniel Jordan
2018-11-05 16:55 ` [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node Daniel Jordan
2018-11-10 3:48 ` Elliott, Robert (Persistent Memory)
2018-11-12 16:54 ` Daniel Jordan
2018-11-12 22:15 ` Elliott, Robert (Persistent Memory)
2018-11-19 16:01 ` Daniel Jordan
2018-11-27 0:12 ` Elliott, Robert (Persistent Memory)
2018-11-27 20:23 ` Daniel Jordan
2018-11-19 16:29 ` Daniel Jordan
2018-11-05 16:55 ` [RFC PATCH v4 12/13] mm: parallelize clear_gigantic_page Daniel Jordan
2018-11-05 16:55 ` [RFC PATCH v4 13/13] hugetlbfs: parallelize hugetlbfs_fallocate with ktask Daniel Jordan
2018-11-05 17:29 ` [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work Michal Hocko
2018-11-06 1:29 ` Daniel Jordan
2018-11-06 9:21 ` Michal Hocko
2018-11-07 20:17 ` Daniel Jordan
2018-11-05 18:49 ` Zi Yan
2018-11-06 2:20 ` Daniel Jordan
2018-11-06 2:48 ` Zi Yan
2018-11-06 19:00 ` Daniel Jordan
2018-11-30 19:18 ` Tejun Heo
2018-12-01 0:13 ` Daniel Jordan
2018-12-03 16:16 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20181105165558.11698-2-daniel.m.jordan@oracle.com \
--to=daniel.m.jordan@oracle.com \
--cc=Pavel.Tatashin@microsoft.com \
--cc=aarcange@redhat.com \
--cc=aaron.lu@intel.com \
--cc=akpm@linux-foundation.org \
--cc=alex.williamson@redhat.com \
--cc=bsd@redhat.com \
--cc=darrick.wong@oracle.com \
--cc=dave.hansen@linux.intel.com \
--cc=jgg@mellanox.com \
--cc=jiangshanlai@gmail.com \
--cc=jwadams@google.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=mike.kravetz@oracle.com \
--cc=prasad.singamsetty@oracle.com \
--cc=rdunlap@infradead.org \
--cc=steven.sistare@oracle.com \
--cc=tim.c.chen@intel.com \
--cc=tj@kernel.org \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).