From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: from mail-vk0-f70.google.com (mail-vk0-f70.google.com [209.85.213.70])
	by kanga.kvack.org (Postfix) with ESMTP id 1D6FC6B0033
	for ; Tue,  5 Dec 2017 14:49:30 -0500 (EST)
Received: by mail-vk0-f70.google.com with SMTP id i68so801143vkd.15
	for ; Tue, 05 Dec 2017 11:49:30 -0800 (PST)
Received: from userp2120.oracle.com (userp2120.oracle.com. [156.151.31.85])
	by mx.google.com with ESMTPS id y125si660300vkf.22.2017.12.05.11.49.28
	for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
	Tue, 05 Dec 2017 11:49:28 -0800 (PST)
From: Daniel Jordan
Subject: [RFC PATCH v3 1/7] ktask: add documentation
Date: Tue, 5 Dec 2017 14:52:14 -0500
Message-Id: <20171205195220.28208-2-daniel.m.jordan@oracle.com>
In-Reply-To: <20171205195220.28208-1-daniel.m.jordan@oracle.com>
References: <20171205195220.28208-1-daniel.m.jordan@oracle.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: aaron.lu@intel.com, akpm@linux-foundation.org, dave.hansen@linux.intel.com,
	mgorman@techsingularity.net, mhocko@kernel.org, mike.kravetz@oracle.com,
	pasha.tatashin@oracle.com, steven.sistare@oracle.com, tim.c.chen@intel.com

Motivates and explains the ktask API for kernel clients.

Signed-off-by: Daniel Jordan
Reviewed-by: Steve Sistare
Cc: Aaron Lu
Cc: Andrew Morton
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Pavel Tatashin
Cc: Tim Chen
---
 Documentation/core-api/index.rst |   1 +
 Documentation/core-api/ktask.rst | 173 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 174 insertions(+)
 create mode 100644 Documentation/core-api/ktask.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index d5bbe035316d..255724095814 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -15,6 +15,7 @@ Core utilities
    assoc_array
    atomic_ops
    cpu_hotplug
+   ktask
    local_ops
    workqueue
    genericirq
diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst
new file mode 100644
index 000000000000..703f200c7d36
--- /dev/null
+++ b/Documentation/core-api/ktask.rst
@@ -0,0 +1,173 @@
+============================================
+ktask: parallelize CPU-intensive kernel work
+============================================
+
+:Date: December, 2017
+:Author: Daniel Jordan
+
+
+Introduction
+============
+
+ktask is a generic framework for parallelizing CPU-intensive work in the
+kernel. The intended use is for big machines that can use their CPU power to
+speed up large tasks that can't otherwise be multithreaded in userland. The
+API is generic enough to add concurrency to many different kinds of tasks--for
+example, zeroing a range of pages or evicting a list of inodes--and aims to
+save its clients the trouble of splitting up the work, choosing the number of
+threads to use, maintaining an efficient concurrency level, starting these
+threads, and load balancing the work between them.
+
+
+Motivation
+==========
+
+To ensure that applications and the kernel itself continue to perform well as
+core counts and memory sizes increase, the kernel needs to scale. For example,
+when a system call requests a certain fraction of system resources, the kernel
+should respond in kind by devoting a similar fraction of system resources to
+service the request.
+
+Before ktask, for example, when booting a NUMA machine with many CPUs, only one
+thread per node was used to initialize struct pages. Using additional CPUs
+that would otherwise be idle until the machine is fully up avoids a needless
+bottleneck during system boot and allows the kernel to take advantage of unused
+memory bandwidth.
+
+Why a new framework when there are existing kernel APIs for managing
+concurrency and other ways to improve performance? Of the existing facilities,
+workqueues aren't designed to divide work up (although ktask is built on
+unbound workqueues), and kthread_worker supports only one thread. Existing
+scalability techniques in the kernel such as doing work or holding locks in
+batches are helpful and should be applied first for performance problems, but
+eventually a single thread hits a wall.
+
+
+Concept
+=======
+
+A little terminology up front: A 'task' is the total work there is to do and a
+'chunk' is a unit of work given to a thread.
+
+To complete a task using the ktask framework, a client provides a thread
+function that is responsible for completing one chunk. The thread function is
+defined in a standard way, with start and end arguments that delimit the chunk
+as well as an argument that the client uses to pass data specific to the task.
+
+In addition, the client supplies an object representing the start of the task
+and an iterator function that knows how to advance some number of units in the
+task to yield another object representing the new task position. The framework
+uses the start object and iterator internally to divide the task into chunks.
+
+Finally, the client passes the total task size and a minimum chunk size to
+indicate the minimum amount of work that's appropriate to do in one chunk. The
+sizes are given in task-specific units (e.g. pages, inodes, bytes). The
+framework uses these sizes, along with the number of online CPUs and an
+internal maximum number of threads, to decide how many threads to start and how
+many chunks to divide the task into.
+
+For example, consider the task of clearing a gigantic page. This used to be
+done in a single thread with a for loop that calls a page clearing function for
+each constituent base page. To parallelize with ktask, the client first moves
+the for loop to the thread function, adapting it to operate on the range passed
+to the function. In this simple case, the thread function's start and end
+arguments are just addresses delimiting the portion of the gigantic page to
+clear. Then, where the for loop used to be, the client calls into ktask with
+the start address of the gigantic page, the total size of the gigantic page,
+and the thread function. Internally, ktask will divide the address range into
+an appropriate number of chunks and start an appropriate number of threads to
+complete these chunks.
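+
+The sketch below shows the resulting shape of the code. It is only an
+illustration: the authoritative interface is documented in
+include/linux/ktask.h, and the helper names, the DEFINE_KTASK_CTL argument
+order, and the chunk size used here are assumptions made for this example
+rather than part of the ktask API::
+
+   /* Illustrative sketch only; see include/linux/ktask.h for the API. */
+   #include <linux/ktask.h>
+   #include <linux/mm.h>
+
+   /* Thread function: clear the base pages in [start, end). */
+   static int clear_page_chunk(void *start, void *end, void *arg)
+   {
+           unsigned long addr;
+
+           for (addr = (unsigned long)start; addr < (unsigned long)end;
+                addr += PAGE_SIZE)
+                   clear_page((void *)addr);
+
+           return KTASK_RETURN_SUCCESS;
+   }
+
+   /* Minimum work per chunk, in bytes; an arbitrary example value. */
+   #define CLEAR_MIN_CHUNK (16UL << 20)
+
+   static void clear_gigantic_page_example(unsigned long base,
+                                           unsigned long size)
+   {
+           /* Thread function, its argument, and the minimum chunk size. */
+           DEFINE_KTASK_CTL(ctl, clear_page_chunk, NULL, CLEAR_MIN_CHUNK);
+
+           /* Start object, total task size in bytes, and the control. */
+           ktask_run((void *)base, size, &ctl);
+   }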
+
+
+Configuration
+=============
+
+To use ktask, configure the kernel with CONFIG_KTASK=y.
+
+If CONFIG_KTASK=n, calls to the ktask API are simply #define'd to run the
+thread function that the client provides so that the task is completed without
+concurrency in the current thread.
+
+
+Interface
+=========
+
+.. Include ktask.h inline here. This file is heavily commented and documents
+.. the ktask interface.
+.. kernel-doc:: include/linux/ktask.h
+
+
+Resource Limits and Auto-Tuning
+===============================
+
+ktask has resource limits on the number of workqueue items it queues. In
+ktask, a workqueue item is a thread that runs chunks of the task until the task
+is finished.
+
+These limits support the different ways ktask uses workqueues:
+ - ktask_run to run threads on the calling thread's node.
+ - ktask_run_numa to run threads on the node(s) specified.
+ - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the
+   system.
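+
+For illustration, the three cases might look as follows. The prototypes,
+struct ktask_node, and its field names are shown here only as assumed shapes
+for this example; the authoritative definitions are in include/linux/ktask.h::
+
+   /* Illustrative sketch only; a real client would pick one mode. */
+   #include <linux/ktask.h>
+   #include <linux/numa.h>
+
+   static int queueing_mode_examples(void *start, size_t task_size,
+                                     struct ktask_ctl *ctl)
+   {
+           struct ktask_node one_node = {
+                   .kn_start = start, .kn_task_size = task_size,
+                   .kn_nid = 0,
+           };
+           struct ktask_node any_node = {
+                   .kn_start = start, .kn_task_size = task_size,
+                   .kn_nid = NUMA_NO_NODE,
+           };
+           int err;
+
+           /* Threads on the calling thread's node. */
+           err = ktask_run(start, task_size, ctl);
+           if (err != KTASK_RETURN_SUCCESS)
+                   return err;
+
+           /* Threads on the node given in the ktask_node entry (node 0). */
+           err = ktask_run_numa(&one_node, 1, ctl);
+           if (err != KTASK_RETURN_SUCCESS)
+                   return err;
+
+           /* Threads on any node in the system. */
+           return ktask_run_numa(&any_node, 1, ctl);
+   }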
+
+To support these different ways of queueing work while maintaining an efficient
+concurrency level, we need both system-wide and per-node limits on the number
+of threads. Without per-node limits, a node might become oversubscribed
+despite ktask staying within the system-wide limit, and without a system-wide
+limit, we can't properly account for work that can run on any node.
+
+The system-wide limit is based on the total number of CPUs, and the per-node
+limit on the CPU count for each node. A per-node work item counts against the
+system-wide limit. Workqueue's max_active can't accommodate both types of
+limit, no matter how many workqueues are used, so ktask implements its own.
+
+If a per-node limit is reached, the work item is allowed to run anywhere on the
+machine to avoid overwhelming the node. If the global limit is also reached,
+ktask won't queue additional work items until we fall below the limit again.
+
+These limits apply only to workqueue items--that is, additional threads beyond
+the one starting the task. That way, one thread per task is always allowed to
+run.
+
+Within the resource limits, ktask uses a default maximum number of threads per
+task to avoid disturbing other processes on the system. Callers can change the
+limit with ktask_ctl_set_max_threads. For example, this might be used to raise
+the maximum number of threads for a boot-time initialization task when more
+CPUs than usual are idle.
+
+
+Backward Compatibility
+======================
+
+ktask is written so that existing calls to the API will remain backward
+compatible should the API gain new features in the future. This is
+accomplished by restricting API changes to members of struct ktask_ctl and
+having clients make an opaque initialization call (DEFINE_KTASK_CTL). This
+initialization can then be modified to include any new arguments so that
+existing call sites stay the same.
+
+
+Error Handling
+==============
+
+Calls to ktask fail only if the provided thread function fails. In particular,
+ktask avoids allocating memory internally during a task, so it's safe to use in
+sensitive contexts.
+
+To avoid adding features before they're used, ktask currently has only basic
+error handling. Each call to ktask_run and ktask_run_numa returns a simple
+error code, KTASK_RETURN_SUCCESS or KTASK_RETURN_ERROR. As usage of the
+framework expands, however, error handling will likely need to be enhanced in
+two ways.
+
+First, ktask may need client-specific error reporting. It's possible for tasks
+to fail for different reasons, so the framework should have a way to
+communicate client-specific error information. For this purpose, the client
+could be allowed to pass a pointer to its own error information in struct
+ktask_ctl.
+
+Second, tasks can fail midway through their work. To recover, the finished
+chunks of work need to be undone in a task-specific way, so ktask should allow
+clients to pass an "undo" callback that is responsible for undoing one chunk of
+work. To avoid multiple levels of error handling, this "undo" callback should
+not be allowed to fail. The iterator used for the original task can simply be
+reused for the undo operation.
--
2.15.0