* [PATCH] workqueue: add documentation From: Tejun Heo @ 2010-09-08 15:40 UTC To: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner, Florian Mickler Update copyright notice and add Documentation/workqueue.txt. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> --- Florian, I took the good parts from the previous overview document and tried to put them in a more compact form. It would be great if you could review this one too. Thanks. Documentation/workqueue.txt | 336 ++++++++++++++++++++++++++++++++++++++++++++ include/linux/workqueue.h | 4 kernel/workqueue.c | 27 ++- 3 files changed, 357 insertions(+), 10 deletions(-) Index: work/kernel/workqueue.c =================================================================== --- work.orig/kernel/workqueue.c +++ work/kernel/workqueue.c @@ -1,19 +1,26 @@ /* - * linux/kernel/workqueue.c + * kernel/workqueue.c - generic async execution with shared worker pool * - * Generic mechanism for defining kernel helper threads for running - * arbitrary tasks in process context. + * Copyright (C) 2002 Ingo Molnar * - * Started by Ingo Molnar, Copyright (C) 2002 + * Derived from the taskqueue/keventd code by: + * David Woodhouse <dwmw2@infradead.org> + * Andrew Morton + * Kai Petzke <wpp@marie.physik.tu-berlin.de> + * Theodore Ts'o <tytso@mit.edu> * - * Derived from the taskqueue/keventd code by: + * Made to use alloc_percpu by Christoph Lameter. * - * David Woodhouse <dwmw2@infradead.org> - * Andrew Morton - * Kai Petzke <wpp@marie.physik.tu-berlin.de> - * Theodore Ts'o <tytso@mit.edu> + * Copyright (C) 2010 SUSE Linux Products GmbH + * Copyright (C) 2010 Tejun Heo <tj@kernel.org> * - * Made to use alloc_percpu by Christoph Lameter. + * This is the generic async execution mechanism. Work items are + * executed in process context. 
The worker pool is shared and + * automatically managed. There is one worker pool for each CPU and + * one extra for works which are better served by workers which are + * not bound to any specific CPU. + * + * Please read Documentation/workqueue.txt for details. */ #include <linux/module.h> Index: work/include/linux/workqueue.h =================================================================== --- work.orig/include/linux/workqueue.h +++ work/include/linux/workqueue.h @@ -235,6 +235,10 @@ static inline unsigned int work_static(s #define work_clear_pending(work) \ clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) +/* + * Workqueue flags and constants. For details, please refer to + * Documentation/workqueue.txt. + */ enum { WQ_NON_REENTRANT = 1 << 0, /* guarantee non-reentrance */ WQ_UNBOUND = 1 << 1, /* not bound to any cpu */ Index: work/Documentation/workqueue.txt =================================================================== --- /dev/null +++ work/Documentation/workqueue.txt @@ -0,0 +1,336 @@ + +Concurrency Managed Workqueue (cmwq) + +September, 2010 Tejun Heo <tj@kernel.org> + +CONTENTS + +1. Why cmwq? +2. The Design +3. Workqueue Attributes +4. Example Execution Scenarios +5. Guidelines + + +1. Why cmwq? + +There are many cases where an asynchronous process execution context +is needed and the workqueue (wq) is the most commonly used mechanism +for such cases. A work item describing which function to execute is +queued on a workqueue which executes the work item in a process +context asynchronously. + +In the original wq implementation, a multi threaded (MT) wq had one +worker thread per CPU and a single threaded (ST) wq had one worker +thread system-wide. A single MT wq needed to keep around the same +number of workers as the number of CPUs. The kernel grew a lot of MT +wq users over the years and with the number of CPU cores continuously +rising, some systems saturated the default 32k PID space just booting +up. 
+ +Although MT wq wasted a lot of resource, the level of concurrency +provided was unsatisfactory. The limitation was common to both ST and +MT wq albeit less severe on MT. Each wq maintained its own separate +worker pool. A MT wq could provide only one execution context per CPU +while a ST wq one for the whole system. Work items had to compete for +those very limited execution contexts leading to various problems +including proneness to deadlocks around the single execution context. + +The tension between the provided level of concurrency and resource +usage also forced its users to make unnecessary tradeoffs like libata +choosing to use ST wq for polling PIOs and accepting an unnecessary +limitation that no two polling PIOs can progress at the same time. As +MT wq don't provide much better concurrency, users which require +higher level of concurrency, like async or fscache, had to implement +their own thread pool. + +Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with +focus on the following goals. + +* Maintain compatibility with the original workqueue API. + +* Use per-CPU unified worker pools shared by all wq to provide + flexible level of concurrency on demand without wasting a lot of + resource. + +* Automatically regulate worker pool and level of concurrency so that + the API users don't need to worry about such details. + + +2. The Design + +There's a single global cwq (gcwq) for each possible CPU and a pseudo +CPU for unbound wq. A gcwq manages and serves out all the execution +contexts on the associated CPU. cpu_workqueue's (cwq) of each wq are +mostly simple frontends to the associated gcwq. When a work item is +queued, it's queued to the unified worklist of the target gcwq. Each +gcwq maintains pool of workers used to process the worklist. + +For any worker pool implementation, managing the concurrency level (how +many execution contexts are active) is an important issue. 
cmwq tries +to keep the concurrency at minimal but sufficient level. + +Each gcwq bound to an actual CPU implements concurrency management by +hooking into the scheduler. The gcwq is notified whenever an active +worker wakes up or sleeps and keeps track of the number of the +currently runnable workers. Generally, work items are not expected to +hog CPU cycle and maintaining just enough concurrency to prevent work +processing from stalling should be optimal. As long as there is one +or more runnable workers on the CPU, the gcwq doesn't start execution +of a new work, but, when the last running worker goes to sleep, it +immediately schedules a new worker so that the CPU doesn't sit idle +while there are pending work items. This allows using minimal number +of workers without losing execution bandwidth. + +Keeping idle workers around doesn't cost other than the memory space +for kthreads, so cmwq holds onto idle ones for a while before killing +them. + +For an unbound wq, the above concurrency management doesn't apply and +the gcwq for the pseudo unbound CPU tries to start executing all work +items as soon as possible. The responsibility of regulating +concurrency level is on the users. There is also a flag to mark a +bound wq to ignore the concurrency management. Please refer to the +Workqueue Attributes section for details. + +Forward progress guarantee relies on that workers can be created when +more execution contexts are necessary, which in turn is guaranteed +through the use of rescue workers. All wq which might be used in +memory reclamation path are required to have a rescuer reserved for +execution of the wq under memory pressure so that memory reclamation +for worker creation doesn't deadlock waiting for execution contexts to +free up. + + +3. Workqueue Attributes + +alloc_workqueue() allocates a wq. The original create_*workqueue() +functions are deprecated and scheduled for removal. alloc_workqueue() +takes three arguments - @name, @flags and @max_active. 
@name is the +name of the wq and also used as the name of the rescuer thread if +there is one. + +A wq no longer manages execution resources but serves as a domain for +forward progress guarantee, flush and work item attributes. @flags +and @max_active control how work items are assigned execution +resources, scheduled and executed. + +@flags: + + WQ_NON_REENTRANT + + By default, a wq guarantees non-reentrance only on the same + CPU. A work may not be executed concurrently on the same CPU + by multiple workers but is allowed to be executed concurrently + on multiple CPUs. This flag makes sure non-reentrance is + enforced across all CPUs. Work items queued to a + non-reentrant wq are guaranteed to be executed by at most one + worker system-wide at any given time. + + WQ_UNBOUND + + Work items queued to an unbound wq are served by a special + gcwq which hosts workers which are not bound to any specific + CPU. This makes the wq behave as a simple execution context + provider without concurrency management. The unbound gcwq + tries to start execution of work items as soon as possible. + Unbound wq sacrifices locality but is useful for the following + cases. + + * Wide fluctuation in the concurrency level requirement is + expected and using bound wq may end up creating large number + of mostly unused workers across different CPUs as the issuer + hops through different CPUs. + + * Long running CPU intensive workloads which can be better + managed by the system scheduler. + + WQ_FREEZEABLE + + A freezeable wq participates in the freeze phase of the system + suspend operations. Work items on the wq are drained and no + new work item starts execution until thawed. + + WQ_RESCUER + + All wq which might be used in the memory reclamation paths + _MUST_ have this flag set. This reserves one worker + exclusively for the execution of this wq under memory + pressure. 
+ + WQ_HIGHPRI + + Work items of a highpri wq are queued at the head of the + worklist of the target gcwq and start execution regardless of + the current concurrency level. In other words, highpri work + items will always start execution as soon as execution + resource is available. + + Ordering among highpri work items is preserved - a highpri + work item queued after another highpri work item will start + execution after the earlier highpri work item starts. + + Although highpri work items are not held back by other + runnable work items, they still contribute to the concurrency + level. Highpri work items in runnable state will prevent + non-highpri work items from starting execution. + + This flag is meaningless for unbound wq. + + WQ_CPU_INTENSIVE + + Work items of a CPU intensive wq do not contribute to the + concurrency level. In other words, Runnable CPU intensive + work items will not prevent other work items from starting + execution. This is useful for bound work items which are + expected to hog CPU cycles so that their execution is + regulated by the system scheduler. + + Although CPU intensive work items don't contribute to the + concurrency level, start of their executions is still + regulated by the concurrency management and runnable + non-CPU-intensive work items can delay execution of CPU + intensive work items. + + This flag is meaningless for unbound wq. + + WQ_HIGHPRI | WQ_CPU_INTENSIVE + + This combination makes the wq avoid interaction with + concurrency management completely and behave as a simple + per-CPU execution context provider. Work items queued on a + highpri CPU-intensive wq start execution as soon as resources + are available and don't affect execution of other work items. + +@max_active: + +@max_active determines the maximum number of execution contexts per +CPU which can be assigned to the work items of a wq. For example, +with @max_active of 16, at most 16 work items of the wq can be +executing at the same time per CPU. 
+ +Currently, for a bound wq, the maximum limit for @max_active is 512 +and the default value used when 0 is specified is 256. For an unbound +wq, the limit is higher of 512 and 4 * num_possible_cpus(). These +values are chosen sufficiently high such that they are not the +limiting factor while providing protection in runaway cases. + +The number of active work items of a wq is usually regulated by the +users of the wq, more specifically, by how many work items the users +may queue at the same time. Unless there is a specific need for +throttling the number of active work items, specifying '0' is +recommended. + +Some users depend on the strict execution ordering of ST wq. The +combination of @max_active of 1 and WQ_UNBOUND is used to achieve this +behavior. Work items on such wq are always queued to the unbound gcwq +and only one work item can be active at any given time thus achieving +the same ordering property as ST wq. + + +4. Example Execution Scenarios + +The following example execution scenarios try to illustrate how cmwq +behave under different configurations. + + Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU. + w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms + again before finishing. w1 and w2 burn CPU for 5ms then sleep for + 10ms. + +Ignoring all other tasks, works and processing overhead, and assuming +simple FIFO scheduling, the following is one highly simplified version +of possible sequences of events with the original wq. 
+ + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 starts and burns CPU + 25 w1 sleeps + 35 w1 wakes up and finishes + 35 w2 starts and burns CPU + 40 w2 sleeps + 50 w2 wakes up and finishes + +And with cmwq with @max_active >= 3, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 10 w2 starts and burns CPU + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + +If @max_active == 2, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 20 w2 starts and burns CPU + 25 w2 sleeps + 35 w2 wakes up and finishes + +Now, let's assume w1 and w2 are queued to a different wq q1 which has +WQ_HIGHPRI set, + + TIME IN MSECS EVENT + 0 w1 and w2 start and burn CPU + 5 w1 sleeps + 10 w2 sleeps + 10 w0 starts and burns CPU + 15 w0 sleeps + 15 w1 wakes up and finishes + 20 w2 wakes up and finishes + 25 w0 wakes up and burns CPU + 30 w0 finishes + +If q1 has WQ_CPU_INTENSIVE set, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 and w2 start and burn CPU + 10 w1 sleeps + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + + +5. Guidelines + +* Do not forget to use WQ_RESCUER if a wq may process work items which + are used during memory reclamation. Each wq with WQ_RESCUER set has + one rescuer thread reserved for it. If there is dependency among + multiple work items used during memory reclamation, they should be + queued to separate wq each with WQ_RESCUER. + +* Unless strict ordering is required, there is no need to use ST wq. + +* Unless there is a specific need, using 0 for @max_active is + recommended. 
In most use cases, concurrency level usually stays + well under the default limit. + +* A wq serves as a domain for forward progress guarantee (WQ_RESCUER), + flush and work item attributes. Work items which are not involved + in memory reclamation and don't need to be flushed as a part of a + group of work items, and don't require any special attribute, can + use one of the system wq. There is no difference in execution + characteristics between using a dedicated wq and a system wq. + +* Unless work items are expected to consume a huge amount of CPU cycles, + using bound wq is usually beneficial due to increased level of + locality in wq operations and work item execution.
* [PATCH UPDATED] workqueue: add documentation From: Tejun Heo @ 2010-09-08 15:51 UTC To: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner, Florian Mickler Update copyright notice and add Documentation/workqueue.txt. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> --- Forgot to run ispell. Here's the ispell'd version. Thanks. Documentation/workqueue.txt | 336 ++++++++++++++++++++++++++++++++++++++++++++ include/linux/workqueue.h | 4 kernel/workqueue.c | 27 ++- 3 files changed, 357 insertions(+), 10 deletions(-) Index: work/kernel/workqueue.c =================================================================== --- work.orig/kernel/workqueue.c +++ work/kernel/workqueue.c @@ -1,19 +1,26 @@ /* - * linux/kernel/workqueue.c + * kernel/workqueue.c - generic async execution with shared worker pool * - * Generic mechanism for defining kernel helper threads for running - * arbitrary tasks in process context. + * Copyright (C) 2002 Ingo Molnar * - * Started by Ingo Molnar, Copyright (C) 2002 + * Derived from the taskqueue/keventd code by: + * David Woodhouse <dwmw2@infradead.org> + * Andrew Morton + * Kai Petzke <wpp@marie.physik.tu-berlin.de> + * Theodore Ts'o <tytso@mit.edu> * - * Derived from the taskqueue/keventd code by: + * Made to use alloc_percpu by Christoph Lameter. * - * David Woodhouse <dwmw2@infradead.org> - * Andrew Morton - * Kai Petzke <wpp@marie.physik.tu-berlin.de> - * Theodore Ts'o <tytso@mit.edu> + * Copyright (C) 2010 SUSE Linux Products GmbH + * Copyright (C) 2010 Tejun Heo <tj@kernel.org> * - * Made to use alloc_percpu by Christoph Lameter. + * This is the generic async execution mechanism. Work items are + * executed in process context. 
The worker pool is shared and + * automatically managed. There is one worker pool for each CPU and + * one extra for works which are better served by workers which are + * not bound to any specific CPU. + * + * Please read Documentation/workqueue.txt for details. */ #include <linux/module.h> Index: work/include/linux/workqueue.h =================================================================== --- work.orig/include/linux/workqueue.h +++ work/include/linux/workqueue.h @@ -235,6 +235,10 @@ static inline unsigned int work_static(s #define work_clear_pending(work) \ clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) +/* + * Workqueue flags and constants. For details, please refer to + * Documentation/workqueue.txt. + */ enum { WQ_NON_REENTRANT = 1 << 0, /* guarantee non-reentrance */ WQ_UNBOUND = 1 << 1, /* not bound to any cpu */ Index: work/Documentation/workqueue.txt =================================================================== --- /dev/null +++ work/Documentation/workqueue.txt @@ -0,0 +1,336 @@ + +Concurrency Managed Workqueue (cmwq) + +September, 2010 Tejun Heo <tj@kernel.org> + +CONTENTS + +1. Why cmwq? +2. The Design +3. Workqueue Attributes +4. Example Execution Scenarios +5. Guidelines + + +1. Why cmwq? + +There are many cases where an asynchronous process execution context +is needed and the workqueue (wq) is the most commonly used mechanism +for such cases. A work item describing which function to execute is +queued on a workqueue which executes the work item in a process +context asynchronously. + +In the original wq implementation, a multi threaded (MT) wq had one +worker thread per CPU and a single threaded (ST) wq had one worker +thread system-wide. A single MT wq needed to keep around the same +number of workers as the number of CPUs. The kernel grew a lot of MT +wq users over the years and with the number of CPU cores continuously +rising, some systems saturated the default 32k PID space just booting +up. 
+ +Although MT wq wasted a lot of resource, the level of concurrency +provided was unsatisfactory. The limitation was common to both ST and +MT wq albeit less severe on MT. Each wq maintained its own separate +worker pool. A MT wq could provide only one execution context per CPU +while a ST wq one for the whole system. Work items had to compete for +those very limited execution contexts leading to various problems +including proneness to deadlocks around the single execution context. + +The tension between the provided level of concurrency and resource +usage also forced its users to make unnecessary tradeoffs like libata +choosing to use ST wq for polling PIOs and accepting an unnecessary +limitation that no two polling PIOs can progress at the same time. As +MT wq don't provide much better concurrency, users which require +higher level of concurrency, like async or fscache, had to implement +their own thread pool. + +Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with +focus on the following goals. + +* Maintain compatibility with the original workqueue API. + +* Use per-CPU unified worker pools shared by all wq to provide + flexible level of concurrency on demand without wasting a lot of + resource. + +* Automatically regulate worker pool and level of concurrency so that + the API users don't need to worry about such details. + + +2. The Design + +There's a single global cwq (gcwq) for each possible CPU and a pseudo +CPU for unbound wq. A gcwq manages and serves out all the execution +contexts on the associated CPU. cpu_workqueue's (cwq) of each wq are +mostly simple frontends to the associated gcwq. When a work item is +queued, it's queued to the unified worklist of the target gcwq. Each +gcwq maintains pool of workers used to process the worklist. + +For any worker pool implementation, managing the concurrency level (how +many execution contexts are active) is an important issue. 
cmwq tries +to keep the concurrency at minimal but sufficient level. + +Each gcwq bound to an actual CPU implements concurrency management by +hooking into the scheduler. The gcwq is notified whenever an active +worker wakes up or sleeps and keeps track of the number of the +currently runnable workers. Generally, work items are not expected to +hog CPU cycle and maintaining just enough concurrency to prevent work +processing from stalling should be optimal. As long as there is one +or more runnable workers on the CPU, the gcwq doesn't start execution +of a new work, but, when the last running worker goes to sleep, it +immediately schedules a new worker so that the CPU doesn't sit idle +while there are pending work items. This allows using minimal number +of workers without losing execution bandwidth. + +Keeping idle workers around doesn't cost other than the memory space +for kthreads, so cmwq holds onto idle ones for a while before killing +them. + +For an unbound wq, the above concurrency management doesn't apply and +the gcwq for the pseudo unbound CPU tries to start executing all work +items as soon as possible. The responsibility of regulating +concurrency level is on the users. There is also a flag to mark a +bound wq to ignore the concurrency management. Please refer to the +Workqueue Attributes section for details. + +Forward progress guarantee relies on that workers can be created when +more execution contexts are necessary, which in turn is guaranteed +through the use of rescue workers. All wq which might be used in +memory reclamation path are required to have a rescuer reserved for +execution of the wq under memory pressure so that memory reclamation +for worker creation doesn't deadlock waiting for execution contexts to +free up. + + +3. Workqueue Attributes + +alloc_workqueue() allocates a wq. The original create_*workqueue() +functions are deprecated and scheduled for removal. alloc_workqueue() +takes three arguments - @name, @flags and @max_active. 
@name is the +name of the wq and also used as the name of the rescuer thread if +there is one. + +A wq no longer manages execution resources but serves as a domain for +forward progress guarantee, flush and work item attributes. @flags +and @max_active control how work items are assigned execution +resources, scheduled and executed. + +@flags: + + WQ_NON_REENTRANT + + By default, a wq guarantees non-reentrance only on the same + CPU. A work may not be executed concurrently on the same CPU + by multiple workers but is allowed to be executed concurrently + on multiple CPUs. This flag makes sure non-reentrance is + enforced across all CPUs. Work items queued to a + non-reentrant wq are guaranteed to be executed by at most one + worker system-wide at any given time. + + WQ_UNBOUND + + Work items queued to an unbound wq are served by a special + gcwq which hosts workers which are not bound to any specific + CPU. This makes the wq behave as a simple execution context + provider without concurrency management. The unbound gcwq + tries to start execution of work items as soon as possible. + Unbound wq sacrifices locality but is useful for the following + cases. + + * Wide fluctuation in the concurrency level requirement is + expected and using bound wq may end up creating large number + of mostly unused workers across different CPUs as the issuer + hops through different CPUs. + + * Long running CPU intensive workloads which can be better + managed by the system scheduler. + + WQ_FREEZEABLE + + A freezeable wq participates in the freeze phase of the system + suspend operations. Work items on the wq are drained and no + new work item starts execution until thawed. + + WQ_RESCUER + + All wq which might be used in the memory reclamation paths + _MUST_ have this flag set. This reserves one worker + exclusively for the execution of this wq under memory + pressure. 
+ + WQ_HIGHPRI + + Work items of a highpri wq are queued at the head of the + worklist of the target gcwq and start execution regardless of + the current concurrency level. In other words, highpri work + items will always start execution as soon as execution + resource is available. + + Ordering among highpri work items is preserved - a highpri + work item queued after another highpri work item will start + execution after the earlier highpri work item starts. + + Although highpri work items are not held back by other + runnable work items, they still contribute to the concurrency + level. Highpri work items in runnable state will prevent + non-highpri work items from starting execution. + + This flag is meaningless for unbound wq. + + WQ_CPU_INTENSIVE + + Work items of a CPU intensive wq do not contribute to the + concurrency level. In other words, Runnable CPU intensive + work items will not prevent other work items from starting + execution. This is useful for bound work items which are + expected to hog CPU cycles so that their execution is + regulated by the system scheduler. + + Although CPU intensive work items don't contribute to the + concurrency level, start of their executions is still + regulated by the concurrency management and runnable + non-CPU-intensive work items can delay execution of CPU + intensive work items. + + This flag is meaningless for unbound wq. + + WQ_HIGHPRI | WQ_CPU_INTENSIVE + + This combination makes the wq avoid interaction with + concurrency management completely and behave as a simple + per-CPU execution context provider. Work items queued on a + highpri CPU-intensive wq start execution as soon as resources + are available and don't affect execution of other work items. + +@max_active: + +@max_active determines the maximum number of execution contexts per +CPU which can be assigned to the work items of a wq. For example, +with @max_active of 16, at most 16 work items of the wq can be +executing at the same time per CPU. 
+ +Currently, for a bound wq, the maximum limit for @max_active is 512 +and the default value used when 0 is specified is 256. For an unbound +wq, the limit is higher of 512 and 4 * num_possible_cpus(). These +values are chosen sufficiently high such that they are not the +limiting factor while providing protection in runaway cases. + +The number of active work items of a wq is usually regulated by the +users of the wq, more specifically, by how many work items the users +may queue at the same time. Unless there is a specific need for +throttling the number of active work items, specifying '0' is +recommended. + +Some users depend on the strict execution ordering of ST wq. The +combination of @max_active of 1 and WQ_UNBOUND is used to achieve this +behavior. Work items on such wq are always queued to the unbound gcwq +and only one work item can be active at any given time thus achieving +the same ordering property as ST wq. + + +4. Example Execution Scenarios + +The following example execution scenarios try to illustrate how cmwq +behave under different configurations. + + Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU. + w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms + again before finishing. w1 and w2 burn CPU for 5ms then sleep for + 10ms. + +Ignoring all other tasks, works and processing overhead, and assuming +simple FIFO scheduling, the following is one highly simplified version +of possible sequences of events with the original wq. 
+ + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 starts and burns CPU + 25 w1 sleeps + 35 w1 wakes up and finishes + 35 w2 starts and burns CPU + 40 w2 sleeps + 50 w2 wakes up and finishes + +And with cmwq with @max_active >= 3, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 10 w2 starts and burns CPU + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + +If @max_active == 2, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 20 w2 starts and burns CPU + 25 w2 sleeps + 35 w2 wakes up and finishes + +Now, let's assume w1 and w2 are queued to a different wq q1 which has +WQ_HIGHPRI set, + + TIME IN MSECS EVENT + 0 w1 and w2 start and burn CPU + 5 w1 sleeps + 10 w2 sleeps + 10 w0 starts and burns CPU + 15 w0 sleeps + 15 w1 wakes up and finishes + 20 w2 wakes up and finishes + 25 w0 wakes up and burns CPU + 30 w0 finishes + +If q1 has WQ_CPU_INTENSIVE set, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 and w2 start and burn CPU + 10 w1 sleeps + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + + +5. Guidelines + +* Do not forget to use WQ_RESCUER if a wq may process work items which + are used during memory reclamation. Each wq with WQ_RESCUER set has + one rescuer thread reserved for it. If there is dependency among + multiple work items used during memory reclamation, they should be + queued to separate wq each with WQ_RESCUER. + +* Unless strict ordering is required, there is no need to use ST wq. + +* Unless there is a specific need, using 0 for @max_active is + recommended. 
In most use cases, concurrency level usually stays + well under the default limit. + +* A wq serves as a domain for forward progress guarantee (WQ_RESCUER), + flush and work item attributes. Work items which are not involved + in memory reclamation and don't need to be flushed as a part of a + group of work items, and don't require any special attribute, can + use one of the system wq. There is no difference in execution + characteristics between using a dedicated wq and a system wq. + +* Unless work items are expected to consume a huge amount of CPU cycles, + using bound wq is usually beneficial due to increased level of + locality in wq operations and work item execution.
* Re: [PATCH UPDATED] workqueue: add documentation From: Florian Mickler @ 2010-09-09 8:02 UTC To: Tejun Heo; +Cc: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner Hi Tejun! Perfect timing. Just enough for the details to get a little foggy, while still knowing a little bit what you want to talk about. :-) On Wed, 08 Sep 2010 17:40:02 +0200 Tejun Heo <tj@kernel.org> wrote: > + > +1. Why cmwq? Perhaps better to begin with an introduction: 1. Introduction > + > +There are many cases where an asynchronous process execution context > +is needed and the workqueue (wq) is the most commonly used mechanism > +for such cases. There are many cases where an asynchronous process execution context is needed and the workqueue (wq) API is the most commonly used mechanism for such cases. > A work item describing which function to execute is > +queued on a workqueue which executes the work item in a process > +context asynchronously. When such an asynchronous execution context is needed, a work item describing which function to execute is put on a queue. An independent thread serves as the asynchronous execution context. The queue is called workqueue and the thread is called worker. While there are work items on the workqueue the worker executes the functions associated with the work items one after the other. When there is no work item left on the workqueue the worker becomes idle. When a new work item gets queued, the worker begins executing again. 2. Why cmwq? > + > +In the original wq implementation, a multi threaded (MT) wq had one > +worker thread per CPU and a single threaded (ST) wq had one worker > +thread system-wide. A single MT wq needed to keep around the same > +number of workers as the number of CPUs. 
The kernel grew a lot of MT > +wq users over the years and with the number of CPU cores continuously > +rising, some systems saturated the default 32k PID space just booting > +up. > + > +Although MT wq wasted a lot of resource, the level of concurrency > +provided was unsatisfactory. The limitation was common to both ST and > +MT wq albeit less severe on MT. Each wq maintained its own separate > +worker pool. A MT wq could provide only one execution context per CPU > +while a ST wq one for the whole system. Work items had to compete for > +those very limited execution contexts leading to various problems > +including proneness to deadlocks around the single execution context. > + > +The tension between the provided level of concurrency and resource > +usage also forced its users to make unnecessary tradeoffs like libata > +choosing to use ST wq for polling PIOs and accepting an unnecessary > +limitation that no two polling PIOs can progress at the same time. As > +MT wq don't provide much better concurrency, users which require > +higher level of concurrency, like async or fscache, had to implement > +their own thread pool. > + > +Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with > +focus on the following goals. > + > +* Maintain compatibility with the original workqueue API. > + > +* Use per-CPU unified worker pools shared by all wq to provide > + flexible level of concurrency on demand without wasting a lot of > + resource. > + > +* Automatically regulate worker pool and level of concurrency so that > + the API users don't need to worry about such details. > + > + > +2. The Design Now it get's a little bit rougher: > + > +There's a single global cwq (gcwq) for each possible CPU and a pseudo > +CPU for unbound wq. A gcwq manages and serves out all the execution > +contexts on the associated CPU. cpu_workqueue's (cwq) of each wq are > +mostly simple frontends to the associated gcwq. 
When a work item is > +queued, it's queued to the unified worklist of the target gcwq. Each > +gcwq maintains pool of workers used to process the worklist. Hm. That hurt my brain a little. :) What about smth along the lines: In order to ease the asynchronous execution of functions a new abstraction, the work item, is introduced. A work item is a simple struct that holds a pointer to the function that is to be executed asynchronously. Whenever a driver or subsystem wants a function to be executed asynchronously it has to set up a work item pointing to that function and queue that work item on a workqueue. Special purpose threads, called worker threads, execute the functions off of the queue, one after the other. If no work is queued, the worker threads become idle. These worker threads are managed in so called thread-pools. The cmwq design differentiates between the user-facing workqueues that subsystems and drivers queue work items on and what queues the thread-pools actually work on. There are worker-thread-pools for each possible CPU and one worker-thread-pool whose threads are not bound to any specific CPU. Each worker-thread-pool has it's own queue (called gcwq) from which it executes work-items. When a driver or subsystem creates a workqueue it is automatically associated with one of the gcwq's. For CPU-bound workqueues they are associated to that specific CPU's gcwq. For unbound workqueues, they are queued to the gcwq of the global thread-pool. [Btw, I realized, now that I read the guidelines below, that this last paragraph is probably incorrect? Is there an association or does the enqueue-API automatically determine the CPU it needs to queue the work item on?] > +For any worker pool implementation, managing the concurrency level (how > +many execution contexts are active) is an important issue. cmwq tries > +to keep the concurrency at minimal but sufficient level. 
> + > +Each gcwq bound to an actual CPU implements concurrency management by > +hooking into the scheduler. The gcwq is notified whenever an active > +worker wakes up or sleeps and keeps track of the number of the > +currently runnable workers. Generally, work items are not expected to > +hog CPU cycle and maintaining just enough concurrency to prevent work > +processing from stalling should be optimal. As long as there is one > +or more runnable workers on the CPU, the gcwq doesn't start execution > +of a new work, but, when the last running worker goes to sleep, it > +immediately schedules a new worker so that the CPU doesn't sit idle > +while there are pending work items. This allows using minimal number > +of workers without losing execution bandwidth. > + > +Keeping idle workers around doesn't cost other than the memory space > +for kthreads, so cmwq holds onto idle ones for a while before killing > +them. > + > +For an unbound wq, the above concurrency management doesn't apply and > +the gcwq for the pseudo unbound CPU tries to start executing all work > +items as soon as possible. The responsibility of regulating > +concurrency level is on the users. There is also a flag to mark a > +bound wq to ignore the concurrency management. Please refer to the > +Workqueue Attributes section for details. > + > +Forward progress guarantee relies on that workers can be created when > +more execution contexts are necessary, which in turn is guaranteed > +through the use of rescue workers. > +All wq which might be used in > +memory reclamation path are required to have a rescuer reserved for > +execution of the wq under memory pressure so that memory reclamation > +for worker creation doesn't deadlock waiting for execution contexts to > +free up. All work items which might be used on code paths that handle memory reclaim are required to be queued on wq's that have a rescue-worker reserved for execution under memory pressure. 
Else it is possible that the thread-pool deadlocks waiting for execution contexts to free up. > + > + > +3. Workqueue Attributes > + 3. Application Programming Interface (API) > +alloc_workqueue() allocates a wq. The original create_*workqueue() > +functions are deprecated and scheduled for removal. alloc_workqueue() > +takes three arguments - @name, @flags and @max_active. @name is the > +name of the wq and also used as the name of the rescuer thread if > +there is one. > + > +A wq no longer manages execution resources but serves as a domain for > +forward progress guarantee, flush and work item attributes. @flags > +and @max_active control how work items are assigned execution > +resources, scheduled and executed. [snip] I think it is worth mentioning all functions that are considered to be part of the API here. [snip] > +5. Guidelines > + > +* Do not forget to use WQ_RESCUER if a wq may process work items which > + are used during memory reclamation. Each wq with WQ_RESCUER set has hmm.. it's not "reclamation". But I can't say the correct term either. I'd say: ".. are used during memory reclaim." > + one rescuer thread reserved for it. If there is dependency among > + multiple work items used during memory reclamation, they should be "during memory reclaim" > + queued to separate wq each with WQ_RESCUER. > + > +* Unless strict ordering is required, there is no need to use ST wq. > + > +* Unless there is a specific need, using 0 for @nr_active is > + recommended. In most use cases, concurrency level usually stays > + well under the default limit. > + > +* A wq serves as a domain for forward progress guarantee (WQ_RESCUER), > + flush and work item attributes. Work items which are not involved > + in memory reclamation and don't need to be flushed as a part of a see above (-> memory reclaim) > + group of work items, and don't require any special attribute, can > + use one of the system wq. 
There is no difference in execution + characteristics between using a dedicated wq and a system wq. + +* Unless work items are expected to consume huge amount of CPU cycles, + using bound wq is usually beneficial due to increased level of + locality in wq operations and work item execution. "Unless work items are expected to consume a huge amount of CPU cycles, using a bound wq is usually beneficial due to the increased level of locality in wq operations and work item execution." Btw, it is not clear to me, what you mean with "wq operations". Do the enqueuing API functions automatically determine the cpu they are executed on and queue the workitem to the corresponding gcwq? Or do you need to explicitly queue to a specific CPU? Either you mean the operations that lead to the enqueueing of the work-item, or you mean the operations done by the thread-pool? ... after thinking a bit, the wq implementation should obviously do the automatic enqueuing on the nearest gcwq thingy... But that should probably be mentioned in the API description. Although I have to admit I only skimmed over the flag description above it seems you only mention the UNBOUND case and not the default one? Cheers, Flo
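For reference, the bound/unbound distinction under discussion, as a hedged sketch (the wq names and the setup function are illustrative; WQ_UNBOUND is the flag from the patch's enum):

```c
#include <linux/workqueue.h>

static struct workqueue_struct *bound_wq, *unbound_wq;

static int my_wq_setup(void)
{
	/*
	 * Bound (the default): work items execute on the CPU they were
	 * queued from, keeping wq operations and execution cache-local.
	 */
	bound_wq = alloc_workqueue("my_bound", 0, 0);

	/*
	 * Unbound: served by workers not tied to any CPU and exempt from
	 * concurrency management.  Better suited to long-running CPU hogs.
	 */
	unbound_wq = alloc_workqueue("my_unbound", WQ_UNBOUND, 0);

	if (!bound_wq || !unbound_wq)
		return -ENOMEM;
	return 0;
}
```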
* Re: [PATCH UPDATED] workqueue: add documentation 2010-09-09 8:02 ` Florian Mickler @ 2010-09-09 10:22 ` Tejun Heo 2010-09-09 18:50 ` Florian Mickler 0 siblings, 1 reply; 14+ messages in thread From: Tejun Heo @ 2010-09-09 10:22 UTC (permalink / raw) To: Florian Mickler; +Cc: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner Hello, On 09/09/2010 10:02 AM, Florian Mickler wrote: > Perfect timing. Just enough for the details to get a little foggy, > while still knowing a little bit what you want to talk about. > :-) :-) Added Introduction and updated Why cmwq section as suggested. >> +2. The Design > > Now it get's a little bit rougher: > >> + >> +There's a single global cwq (gcwq) for each possible CPU and a pseudo >> +CPU for unbound wq. A gcwq manages and serves out all the execution >> +contexts on the associated CPU. cpu_workqueue's (cwq) of each wq are >> +mostly simple frontends to the associated gcwq. When a work item is >> +queued, it's queued to the unified worklist of the target gcwq. Each >> +gcwq maintains pool of workers used to process the worklist. > > Hm. That hurt my brain a little. :) Yeap, that's a lot of overly compressed information there. > What about smth along the lines: > > In order to ease the asynchronous execution of functions a new > abstraction, the work item, is introduced. > > A work item is a simple struct that holds a pointer to the > function that is to be executed asynchronously. Whenever a driver or > subsystem wants a function to be executed asynchronously it has to set > up a work item pointing to that function and queue that work item on a > workqueue. > > Special purpose threads, called worker threads, execute the functions > off of the queue, one after the other. If no work is queued, the worker > threads become idle. > > These worker threads are managed in so called thread-pools. 
> > The cmwq design differentiates between the user-facing workqueues that > subsystems and drivers queue work items on and what queues the > thread-pools actually work on. > > There are worker-thread-pools for each possible CPU and one > worker-thread-pool whose threads are not bound to any specific CPU. Each > worker-thread-pool has it's own queue (called gcwq) from which it > executes work-items. > > When a driver or subsystem creates a workqueue it is > automatically associated with one of the gcwq's. For CPU-bound > workqueues they are associated to that specific CPU's gcwq. For > unbound workqueues, they are queued to the gcwq of the global > thread-pool. > > [Btw, I realized, now that I read the guidelines below, that this last > paragraph is probably incorrect? Is there an association or does the > enqueue-API automatically determine the CPU it needs to queue the work > item on?] Bound workqueues are per-cpu and by default work items will be queued and processed on the same cpu as the issuer. Unbound ones are system-wide. How about something like the following? In order to ease the asynchronous execution of functions a new abstraction, the work item, is introduced. A work item is a simple struct that holds a pointer to the function that is to be executed asynchronously. Whenever a driver or subsystem wants a function to be executed asynchronously it has to set up a work item pointing to that function and queue that work item on a workqueue. Special purpose threads, called worker threads, execute the functions off of the queue, one after the other. If no work is queued, the worker threads become idle. These worker threads are managed in so called thread-pools. Subsystems and drivers can create and queue work items on workqueues as they see fit. By default, workqueues are per-cpu. Work items are queued and executed on the same CPU as the issuer. These workqueues and work items are said to be "bound". 
A workqueue can be specifically configured to be "unbound" in which case work items queued on the workqueue are executed by worker threads not bound to any specific CPU. The cmwq design differentiates between the user-facing workqueues that subsystems and drivers queue work items on and the backend mechanism which manages thread-pool and processes the queued work items. The backend mechanism is called Global CPU Workqueue (gcwq). There is one gcwq for each possible CPU and one gcwq to serve work items queued on unbound workqueues. When a work item is queued to a workqueue, the target gcwq is determined according to the queue parameters and workqueue attributes and queued on the shared worklist of the gcwq. For example, unless specifically overridden, a work item of a bound workqueue will be queued on the worklist of the gcwq of the CPU the issuer is running on. >> +All wq which might be used in >> +memory reclamation path are required to have a rescuer reserved for >> +execution of the wq under memory pressure so that memory reclamation >> +for worker creation doesn't deadlock waiting for execution contexts to >> +free up. > > All work items which might be used on code paths that handle memory > reclaim are required to be queued on wq's that have a rescue-worker > reserved for execution under memory pressure. Else it is possible that > the thread-pool deadlocks waiting for execution contexts to free up. Updated as suggested. >> + >> + >> +3. Workqueue Attributes >> + > > 3. Application Programming Interface (API) > >> +alloc_workqueue() allocates a wq. The original create_*workqueue() >> +functions are deprecated and scheduled for removal. alloc_workqueue() >> +takes three arguments - @name, @flags and @max_active. @name is the >> +name of the wq and also used as the name of the rescuer thread if >> +there is one. >> + >> +A wq no longer manages execution resources but serves as a domain for >> +forward progress guarantee, flush and work item attributes. 
@flags >> +and @max_active control how work items are assigned execution >> +resources, scheduled and executed. > [snip] > > I think it is worth mentioning all functions that are considered to be > part of the API here. Yeah, that would be nice but a slightly larger task that I would like to postpone at this point. :-) > "Unless work items are expected to consume a huge amount of CPU > cycles, using a bound wq is usually beneficial due to the increased > level of locality in wq operations and work item execution." So updated. > Btw, it is not clear to me, what you mean with "wq operations". Queueing, dispatching and other book keeping operations. > Do the enqueuing API functions automatically determine the cpu they are > executed on and queue the workitem to the corresponding gcwq? Or do you > need to explicitly queue to a specific CPU? > > Either you mean the operations that lead to the enqueueing of the > work-item, or you mean the operations done by the thread-pool? > > ... after thinking a bit, the wq implementation should obviously do the > automatic enqueuing on the nearest gcwq thingy... But that should > probably be mentioned in the API description. > Although I have to admit I only skimmed over the flag description > above it seems you only mention the UNBOUND case and not the default > one? Yeah, queue_work() queues works on the gcwq of the local CPU. It can be overridden by queue_work_on(). The unbound is special case where the workqueue always sends works to the unbound gcwq which is served by unbound workers. Did the update in the design section explain enough or do you think there needs to be more explanation? Thanks. -- tejun
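The queueing rules described above can be sketched as follows (queue_work() and queue_work_on() are the real API; my_wq, my_work and the wrapper functions are illustrative):

```c
#include <linux/workqueue.h>

static void my_work_fn(struct work_struct *work)
{
}

static DECLARE_WORK(my_work, my_work_fn);
static struct workqueue_struct *my_wq;	/* assumed allocated elsewhere */

static void my_queue(void)
{
	/* Default: queued on the gcwq of the CPU this code is running on. */
	queue_work(my_wq, &my_work);
}

static void my_queue_on(int cpu)
{
	/* Explicit override: target a specific CPU's gcwq. */
	queue_work_on(cpu, my_wq, &my_work);
}
```

(Note that a work item which is already pending is not queued again; queue_work() returns zero in that case.)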
* Re: [PATCH UPDATED] workqueue: add documentation 2010-09-09 10:22 ` Tejun Heo @ 2010-09-09 18:50 ` Florian Mickler 2010-09-10 10:25 ` Tejun Heo 0 siblings, 1 reply; 14+ messages in thread From: Florian Mickler @ 2010-09-09 18:50 UTC (permalink / raw) To: Tejun Heo; +Cc: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner On Thu, 09 Sep 2010 12:22:22 +0200 Tejun Heo <tj@kernel.org> wrote: > The backend mechanism is called Global CPU Workqueue (gcwq). There is I tried to avoid that name. It somehow is confusing to me. Global/Local in context of CPU is somehow associated with CPU locality in my mind. Also the name doesn't fit for the unbound gcwq. I know what you mean by it, but I don't think it's a self explanatory name. That was why I just said "they are called gcwq". But I'm ok with it either way. After all, that _is_ how they are called. :) > > > > I think it is worth mentioning all functions that are considered to be > > part of the API here. > > Yeah, that would be nice but a slightly larger task that I would like > to postpone at this point. :-) Ah well, I may just give it a go then... > > > "Unless work items are expected to consume a huge amount of CPU > > cycles, using a bound wq is usually beneficial due to the increased > > level of locality in wq operations and work item execution." > > So updated. > > > Btw, it is not clear to me, what you mean with "wq operations". > > Queueing, dispatching and other book keeping operations. Yes. That makes sense. > > > Do the enqueuing API functions automatically determine the cpu they are > > executed on and queue the workitem to the corresponding gcwq? Or do you > > need to explicitly queue to a specific CPU? > > > > Either you mean the operations that lead to the enqueueing of the > > work-item, or you mean the operations done by the thread-pool? > > > > ... after thinking a bit, the wq implementation should obviously do the > > automatic enqueuing on the nearest gcwq thingy... 
But that should > > probably be mentioned in the API description. > > Although I have to admit I only skimmed over the flag description > > above it seems you only mention the UNBOUND case and not the default > > one? > > Yeah, queue_work() queues works on the gcwq of the local CPU. It can > be overridden by queue_work_on(). The unbound is special case where > the workqueue always sends works to the unbound gcwq which is served > by unbound workers. Did the update in the design section explain > enough or do you think there needs to be more explanation? I'm looking forward to reading the new version en bloc, but if I can trust my gut feeling, I'm ok with it now. :) Let's see if someone else with more kernel-experience has something to add, but here you've got my Reviewed-By: Florian Mickler <florian@mickler.org> in any case. Cheers, Flo
* Re: [PATCH UPDATED] workqueue: add documentation 2010-09-09 18:50 ` Florian Mickler @ 2010-09-10 10:25 ` Tejun Heo 2010-09-10 14:26 ` Florian Mickler 0 siblings, 1 reply; 14+ messages in thread From: Tejun Heo @ 2010-09-10 10:25 UTC (permalink / raw) To: Florian Mickler; +Cc: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner Hello, On 09/09/2010 08:50 PM, Florian Mickler wrote: >> The backend mechanism is called Global CPU Workqueue (gcwq). There is > > I tried to avoid that name. It somehow is confusing to me . Global/Local > in context of CPU is somehow associated with CPU locality in my mind. > Also the name doesn't fit for the unbound gcwq. Hmm... yeah, it makes sense from the implementation POV as they're global to a CPU and the unbound gcwq is bound to pseudo unbound CPU. I just dropped the expanded version and just used gcwq as you suggested. >> Yeah, that would be nice but a slightly larger task that I would like >> to postpone at this point. :-) > > Ah well, I may just give it a go then... That would be great. >> Yeah, queue_work() queues works on the gcwq of the local CPU. It can >> be overridden by queue_work_on(). The unbound is special case where >> the workqueue always sends works to the unbound gcwq which is served >> by unbound workers. Did the update in the design section explain >> enough or do you think there needs to be more explanation? > > I'm looking forward to reading the new version en > bloc, but if I can trust my gut feeling, I'm ok with it now. :) > > Let's see if someone else with more kernel-experience has something to > add, but here you've got my > > Reviewed-By: Florian Mickler <florian@mickler.org> Here's the current version. If it looks good to you, I'll push it upstream. Thanks. Subject: workqueue: add documentation Update copyright notice and add Documentation/workqueue.txt. 
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-By: Florian Mickler <florian@mickler.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> --- Documentation/workqueue.txt | 380 ++++++++++++++++++++++++++++++++++++++++++++ include/linux/workqueue.h | 4 kernel/workqueue.c | 27 +-- 3 files changed, 401 insertions(+), 10 deletions(-) Index: work/kernel/workqueue.c =================================================================== --- work.orig/kernel/workqueue.c +++ work/kernel/workqueue.c @@ -1,19 +1,26 @@ /* - * linux/kernel/workqueue.c + * kernel/workqueue.c - generic async execution with shared worker pool * - * Generic mechanism for defining kernel helper threads for running - * arbitrary tasks in process context. + * Copyright (C) 2002 Ingo Molnar * - * Started by Ingo Molnar, Copyright (C) 2002 + * Derived from the taskqueue/keventd code by: + * David Woodhouse <dwmw2@infradead.org> + * Andrew Morton + * Kai Petzke <wpp@marie.physik.tu-berlin.de> + * Theodore Ts'o <tytso@mit.edu> * - * Derived from the taskqueue/keventd code by: + * Made to use alloc_percpu by Christoph Lameter. * - * David Woodhouse <dwmw2@infradead.org> - * Andrew Morton - * Kai Petzke <wpp@marie.physik.tu-berlin.de> - * Theodore Ts'o <tytso@mit.edu> + * Copyright (C) 2010 SUSE Linux Products GmbH + * Copyright (C) 2010 Tejun Heo <tj@kernel.org> * - * Made to use alloc_percpu by Christoph Lameter. + * This is the generic async execution mechanism. Work items as are + * executed in process context. The worker pool is shared and + * automatically managed. There is one worker pool for each CPU and + * one extra for works which are better served by workers which are + * not bound to any specific CPU. + * + * Please read Documentation/workqueue.txt for details. 
*/ #include <linux/module.h> Index: work/include/linux/workqueue.h =================================================================== --- work.orig/include/linux/workqueue.h +++ work/include/linux/workqueue.h @@ -235,6 +235,10 @@ static inline unsigned int work_static(s #define work_clear_pending(work) \ clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) +/* + * Workqueue flags and constants. For details, please refer to + * Documentation/workqueue.txt. + */ enum { WQ_NON_REENTRANT = 1 << 0, /* guarantee non-reentrance */ WQ_UNBOUND = 1 << 1, /* not bound to any cpu */ Index: work/Documentation/workqueue.txt =================================================================== --- /dev/null +++ work/Documentation/workqueue.txt @@ -0,0 +1,380 @@ + +Concurrency Managed Workqueue (cmwq) + +September, 2010 Tejun Heo <tj@kernel.org> + Florian Mickler <florian@mickler.org> + +CONTENTS + +1. Introduction +2. Why cmwq? +3. The Design +4. Application Programming Interface (API) +5. Example Execution Scenarios +6. Guidelines + + +1. Introduction + +There are many cases where an asynchronous process execution context +is needed and the workqueue (wq) API is the most commonly used +mechanism for such cases. + +When such an asynchronous execution context is needed, a work item +describing which function to execute is put on a queue. An +independent thread serves as the asynchronous execution context. The +queue is called workqueue and the thread is called worker. + +While there are work items on the workqueue the worker executes the +functions associated with the work items one after the other. When +there is no work item left on the workqueue the worker becomes idle. +When a new work item gets queued, the worker begins executing again. + + +2. Why cmwq? + +In the original wq implementation, a multi threaded (MT) wq had one +worker thread per CPU and a single threaded (ST) wq had one worker +thread system-wide. 
A single MT wq needed to keep around the same +number of workers as the number of CPUs. The kernel grew a lot of MT +wq users over the years and with the number of CPU cores continuously +rising, some systems saturated the default 32k PID space just booting +up. + +Although MT wq wasted a lot of resource, the level of concurrency +provided was unsatisfactory. The limitation was common to both ST and +MT wq albeit less severe on MT. Each wq maintained its own separate +worker pool. A MT wq could provide only one execution context per CPU +while a ST wq one for the whole system. Work items had to compete for +those very limited execution contexts leading to various problems +including proneness to deadlocks around the single execution context. + +The tension between the provided level of concurrency and resource +usage also forced its users to make unnecessary tradeoffs like libata +choosing to use ST wq for polling PIOs and accepting an unnecessary +limitation that no two polling PIOs can progress at the same time. As +MT wq don't provide much better concurrency, users which require +higher level of concurrency, like async or fscache, had to implement +their own thread pool. + +Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with +focus on the following goals. + +* Maintain compatibility with the original workqueue API. + +* Use per-CPU unified worker pools shared by all wq to provide + flexible level of concurrency on demand without wasting a lot of + resource. + +* Automatically regulate worker pool and level of concurrency so that + the API users don't need to worry about such details. + + +3. The Design + +In order to ease the asynchronous execution of functions a new +abstraction, the work item, is introduced. + +A work item is a simple struct that holds a pointer to the function +that is to be executed asynchronously. 
Whenever a driver or subsystem +wants a function to be executed asynchronously it has to set up a work +item pointing to that function and queue that work item on a +workqueue. + +Special purpose threads, called worker threads, execute the functions +off of the queue, one after the other. If no work is queued, the +worker threads become idle. These worker threads are managed in so +called thread-pools. + +Subsystems and drivers can create and queue work items on workqueues +as they see fit. + +By default, workqueues are per-cpu. Work items are queued and +executed on the same CPU as the issuer. These workqueues and work +items are said to be "bound". A workqueue can be specifically +configured to be "unbound" in which case work items queued on the +workqueue are executed by worker threads not bound to any specific +CPU. + +The cmwq design differentiates between the user-facing workqueues that +subsystems and drivers queue work items on and the backend mechanism +which manages thread-pool and processes the queued work items. + +The backend mechanism is called gcwq. There is one gcwq for each +possible CPU and one gcwq to serve work items queued on unbound +workqueues. + +When a work item is queued to a workqueue, the target gcwq is +determined according to the queue parameters and workqueue attributes +and queued on the shared worklist of the gcwq. For example, unless +specifically overridden, a work item of a bound workqueue will be +queued on the worklist of the gcwq of the CPU the issuer is running +on. + +For any worker pool implementation, managing the concurrency level (how +many execution contexts are active) is an important issue. cmwq tries +to keep the concurrency at minimal but sufficient level. + +Each gcwq bound to an actual CPU implements concurrency management by +hooking into the scheduler. The gcwq is notified whenever an active +worker wakes up or sleeps and keeps track of the number of the +currently runnable workers. 
Generally, work items are not expected to +hog CPU cycle and maintaining just enough concurrency to prevent work +processing from stalling should be optimal. As long as there is one +or more runnable workers on the CPU, the gcwq doesn't start execution +of a new work, but, when the last running worker goes to sleep, it +immediately schedules a new worker so that the CPU doesn't sit idle +while there are pending work items. This allows using minimal number +of workers without losing execution bandwidth. + +Keeping idle workers around doesn't cost other than the memory space +for kthreads, so cmwq holds onto idle ones for a while before killing +them. + +For an unbound wq, the above concurrency management doesn't apply and +the gcwq for the pseudo unbound CPU tries to start executing all work +items as soon as possible. The responsibility of regulating +concurrency level is on the users. There is also a flag to mark a +bound wq to ignore the concurrency management. Please refer to the +Workqueue Attributes section for details. + +Forward progress guarantee relies on that workers can be created when +more execution contexts are necessary, which in turn is guaranteed +through the use of rescue workers. All work items which might be used +on code paths that handle memory reclaim are required to be queued on +wq's that have a rescue-worker reserved for execution under memory +pressure. Else it is possible that the thread-pool deadlocks waiting +for execution contexts to free up. + + +4. Application Programming Interface (API) + +alloc_workqueue() allocates a wq. The original create_*workqueue() +functions are deprecated and scheduled for removal. alloc_workqueue() +takes three arguments - @name, @flags and @max_active. @name is the +name of the wq and also used as the name of the rescuer thread if +there is one. + +A wq no longer manages execution resources but serves as a domain for +forward progress guarantee, flush and work item attributes. 
@flags +and @max_active control how work items are assigned execution +resources, scheduled and executed. + +@flags: + + WQ_NON_REENTRANT + + By default, a wq guarantees non-reentrance only on the same + CPU. A work may not be executed concurrently on the same CPU + by multiple workers but is allowed to be executed concurrently + on multiple CPUs. This flag makes sure non-reentrance is + enforced across all CPUs. Work items queued to a + non-reentrant wq are guaranteed to be executed by at most one + worker system-wide at any given time. + + WQ_UNBOUND + + Work items queued to an unbound wq are served by a special + gcwq which hosts workers which are not bound to any specific + CPU. This makes the wq behave as a simple execution context + provider without concurrency management. The unbound gcwq + tries to start execution of work items as soon as possible. + Unbound wq sacrifices locality but is useful for the following + cases. + + * Wide fluctuation in the concurrency level requirement is + expected and using bound wq may end up creating large number + of mostly unused workers across different CPUs as the issuer + hops through different CPUs. + + * Long running CPU intensive workloads which can be better + managed by the system scheduler. + + WQ_FREEZEABLE + + A freezeable wq participates in the freeze phase of the system + suspend operations. Work items on the wq are drained and no + new work item starts execution until thawed. + + WQ_RESCUER + + All wq which might be used in the memory reclaim paths _MUST_ + have this flag set. This reserves one worker exclusively for + the execution of this wq under memory pressure. + + WQ_HIGHPRI + + Work items of a highpri wq are queued at the head of the + worklist of the target gcwq and start execution regardless of + the current concurrency level. In other words, highpri work + items will always start execution as soon as execution + resource is available. 
+ + Ordering among highpri work items is preserved - a highpri + work item queued after another highpri work item will start + execution after the earlier highpri work item starts. + + Although highpri work items are not held back by other + runnable work items, they still contribute to the concurrency + level. Highpri work items in runnable state will prevent + non-highpri work items from starting execution. + + This flag is meaningless for unbound wq. + + WQ_CPU_INTENSIVE + + Work items of a CPU intensive wq do not contribute to the + concurrency level. In other words, Runnable CPU intensive + work items will not prevent other work items from starting + execution. This is useful for bound work items which are + expected to hog CPU cycles so that their execution is + regulated by the system scheduler. + + Although CPU intensive work items don't contribute to the + concurrency level, start of their executions is still + regulated by the concurrency management and runnable + non-CPU-intensive work items can delay execution of CPU + intensive work items. + + This flag is meaningless for unbound wq. + + WQ_HIGHPRI | WQ_CPU_INTENSIVE + + This combination makes the wq avoid interaction with + concurrency management completely and behave as a simple + per-CPU execution context provider. Work items queued on a + highpri CPU-intensive wq start execution as soon as resources + are available and don't affect execution of other work items. + +@max_active: + +@max_active determines the maximum number of execution contexts per +CPU which can be assigned to the work items of a wq. For example, +with @max_active of 16, at most 16 work items of the wq can be +executing at the same time per CPU. + +Currently, for a bound wq, the maximum limit for @max_active is 512 +and the default value used when 0 is specified is 256. For an unbound +wq, the limit is higher of 512 and 4 * num_possible_cpus(). 
These +values are chosen sufficiently high such that they are not the +limiting factor while providing protection in runaway cases. + +The number of active work items of a wq is usually regulated by the +users of the wq, more specifically, by how many work items the users +may queue at the same time. Unless there is a specific need for +throttling the number of active work items, specifying '0' is +recommended. + +Some users depend on the strict execution ordering of ST wq. The +combination of @max_active of 1 and WQ_UNBOUND is used to achieve this +behavior. Work items on such wq are always queued to the unbound gcwq +and only one work item can be active at any given time thus achieving +the same ordering property as ST wq. + + +5. Example Execution Scenarios + +The following example execution scenarios try to illustrate how cmwq +behave under different configurations. + + Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU. + w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms + again before finishing. w1 and w2 burn CPU for 5ms then sleep for + 10ms. + +Ignoring all other tasks, works and processing overhead, and assuming +simple FIFO scheduling, the following is one highly simplified version +of possible sequences of events with the original wq. 
+ + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 starts and burns CPU + 25 w1 sleeps + 35 w1 wakes up and finishes + 35 w2 starts and burns CPU + 40 w2 sleeps + 50 w2 wakes up and finishes + +And with cmwq with @max_active >= 3, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 10 w2 starts and burns CPU + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + +If @max_active == 2, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 20 w2 starts and burns CPU + 25 w2 sleeps + 35 w2 wakes up and finishes + +Now, let's assume w1 and w2 are queued to a different wq q1 which has +WQ_HIGHPRI set, + + TIME IN MSECS EVENT + 0 w1 and w2 start and burn CPU + 5 w1 sleeps + 10 w2 sleeps + 10 w0 starts and burns CPU + 15 w0 sleeps + 15 w1 wakes up and finishes + 20 w2 wakes up and finishes + 25 w0 wakes up and burns CPU + 30 w0 finishes + +If q1 has WQ_CPU_INTENSIVE set, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 and w2 start and burn CPU + 10 w1 sleeps + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + + +6. Guidelines + +* Do not forget to use WQ_RESCUER if a wq may process work items which + are used during memory reclaim. Each wq with WQ_RESCUER set has one + rescuer thread reserved for it. If there is dependency among + multiple work items used during memory reclaim, they should be + queued to separate wq each with WQ_RESCUER. + +* Unless strict ordering is required, there is no need to use ST wq. + +* Unless there is a specific need, using 0 for @nr_active is + recommended. 
In most use cases, concurrency level usually stays + well under the default limit. + +* A wq serves as a domain for forward progress guarantee (WQ_RESCUER), + flush and work item attributes. Work items which are not involved + in memory reclaim and don't need to be flushed as a part of a group + of work items, and don't require any special attribute, can use one + of the system wq. There is no difference in execution + characteristics between using a dedicated wq and a system wq. + +* Unless work items are expected to consume a huge amount of CPU + cycles, using a bound wq is usually beneficial due to the increased + level of locality in wq operations and work item execution. ^ permalink raw reply [flat|nested] 14+ messages in thread
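The @max_active rules in the patch above (a hard cap of 512 for a bound wq, a default of 256 when 0 is specified, and the higher of 512 and 4 * num_possible_cpus() for an unbound wq) can be modeled with a small clamp helper. The following is a userspace sketch only; clamp_max_active(), maxi() and the constant names are invented for illustration and are not the kernel's actual implementation.

```c
#include <assert.h>

#define WQ_MAX_ACTIVE 512   /* hard cap for a bound wq, per the text */
#define WQ_DFL_ACTIVE 256   /* default used when 0 is specified */

static int maxi(int a, int b) { return a > b ? a : b; }

/* Model of how a requested @max_active is turned into the effective
 * per-CPU limit described in the document. */
static int clamp_max_active(int max_active, int unbound, int ncpus)
{
	/* unbound wq: limit is the higher of 512 and 4 * ncpus */
	int lim = unbound ? maxi(WQ_MAX_ACTIVE, 4 * ncpus) : WQ_MAX_ACTIVE;

	if (max_active == 0)            /* 0 selects the default */
		max_active = WQ_DFL_ACTIVE;
	return max_active > lim ? lim : max_active;
}
```

As the guidelines recommend, most callers would pass 0 and rely on the default, which this model maps to 256 for a bound wq.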
* Re: [PATCH UPDATED] workqueue: add documentation 2010-09-10 10:25 ` Tejun Heo @ 2010-09-10 14:26 ` Florian Mickler 2010-09-10 14:55 ` Tejun Heo 0 siblings, 1 reply; 14+ messages in thread From: Florian Mickler @ 2010-09-10 14:26 UTC (permalink / raw) To: Tejun Heo; +Cc: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner On Fri, 10 Sep 2010 12:25:55 +0200 Tejun Heo <tj@kernel.org> wrote: > +Concurrency Managed Workqueue (cmwq) > + > +September, 2010 Tejun Heo <tj@kernel.org> > + Florian Mickler <florian@mickler.org> > + > +CONTENTS Thx. I fumbled a bit with the ordering in the design description.. ok so? Cheers, Flo diff --git a/Documentation/workqueue.txt b/Documentation/workqueue.txt index 5317229..3d22821 100644 --- a/Documentation/workqueue.txt +++ b/Documentation/workqueue.txt @@ -86,45 +86,44 @@ off of the queue, one after the other. If no work is queued, the worker threads become idle. These worker threads are managed in so called thread-pools. -Subsystems and drivers can create and queue work items on workqueues -as they see fit. - -By default, workqueues are per-cpu. Work items are queued and -executed on the same CPU as the issuer. These workqueues and work -items are said to be "bound". A workqueue can be specifically -configured to be "unbound" in which case work items queued on the -workqueue are executed by worker threads not bound to any specific -CPU. - The cmwq design differentiates between the user-facing workqueues that subsystems and drivers queue work items on and the backend mechanism which manages thread-pool and processes the queued work items. -The backend mechanism is called gcwq. There is one gcwq for each +The backend is called gcwq. There is one gcwq for each possible CPU and one gcwq to serve work items queued on unbound workqueues. +Subsystems and drivers can create and queue work items through special +workqueue API functions as they see fit. 
They can influence some +aspects of the way the work items are executed by setting flags on the +workqueue they are putting the work item on. These flags include +things like cpu locality, reentrancy, concurrency limits and more. To +get a detailed overview refer to the API description of +alloc_workqueue() below. + When a work item is queued to a workqueue, the target gcwq is determined according to the queue parameters and workqueue attributes -and queued on the shared worklist of the gcwq. For example, unless +and appended to the shared worklist of that gcwq. For example, unless specifically overridden, a work item of a bound workqueue will be -queued on the worklist of the gcwq of the CPU the issuer is running -on. +queued on the worklist of exactly that gcwq that is associated to the +CPU the issuer is running on. For any worker pool implementation, managing the concurrency level (how many execution contexts are active) is an important issue. cmwq tries -to keep the concurrency at minimal but sufficient level. +to keep the concurrency at a minimal but sufficient level. Minimal to save +resources and sufficient in that the system is used at it's full capacity. Each gcwq bound to an actual CPU implements concurrency management by hooking into the scheduler. The gcwq is notified whenever an active worker wakes up or sleeps and keeps track of the number of the currently runnable workers. Generally, work items are not expected to -hog CPU cycle and maintaining just enough concurrency to prevent work -processing from stalling should be optimal. As long as there is one -or more runnable workers on the CPU, the gcwq doesn't start execution -of a new work, but, when the last running worker goes to sleep, it -immediately schedules a new worker so that the CPU doesn't sit idle -while there are pending work items. This allows using minimal number +hog a CPU and consume many cycles. 
That means maintaining just enough +concurrency to prevent work processing from stalling should be optimal. +As long as there is one or more runnable workers on the CPU, the gcwq +doesn't start execution of a new work, but, when the last running worker goes +to sleep, it immediately schedules a new worker so that the CPU doesn't sit +idle while there are pending work items. This allows using a minimal number of workers without losing execution bandwidth. Keeping idle workers around doesn't cost other than the memory space ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH UPDATED] workqueue: add documentation 2010-09-10 14:26 ` Florian Mickler @ 2010-09-10 14:55 ` Tejun Heo 2010-09-10 17:43 ` Randy Dunlap 2010-09-13 0:51 ` Dave Chinner 0 siblings, 2 replies; 14+ messages in thread From: Tejun Heo @ 2010-09-10 14:55 UTC (permalink / raw) To: Florian Mickler; +Cc: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner >From e9818ca0cd087229a3665a9ec186ee3da4a046bb Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Fri, 10 Sep 2010 16:51:36 +0200 Update copyright notice and add Documentation/workqueue.txt. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-By: Florian Mickler <florian@mickler.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> --- Applied to wq#for-linus branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-linus Thanks a lot for helping with the documentation. Much appreciated. Documentation/workqueue.txt | 380 +++++++++++++++++++++++++++++++++++++++++++ include/linux/workqueue.h | 4 + kernel/workqueue.c | 27 ++- 3 files changed, 401 insertions(+), 10 deletions(-) create mode 100644 Documentation/workqueue.txt diff --git a/Documentation/workqueue.txt b/Documentation/workqueue.txt new file mode 100644 index 0000000..6d1bcd3 --- /dev/null +++ b/Documentation/workqueue.txt @@ -0,0 +1,380 @@ + +Concurrency Managed Workqueue (cmwq) + +September, 2010 Tejun Heo <tj@kernel.org> + Florian Mickler <florian@mickler.org> + +CONTENTS + +1. Introduction +2. Why cmwq? +3. The Design +4. Application Programming Interface (API) +5. Example Execution Scenarios +6. Guidelines + + +1. Introduction + +There are many cases where an asynchronous process execution context +is needed and the workqueue (wq) API is the most commonly used +mechanism for such cases. + +When such an asynchronous execution context is needed, a work item +describing which function to execute is put on a queue. An +independent thread serves as the asynchronous execution context. 
The +queue is called workqueue and the thread is called worker. + +While there are work items on the workqueue the worker executes the +functions associated with the work items one after the other. When +there is no work item left on the workqueue the worker becomes idle. +When a new work item gets queued, the worker begins executing again. + + +2. Why cmwq? + +In the original wq implementation, a multi threaded (MT) wq had one +worker thread per CPU and a single threaded (ST) wq had one worker +thread system-wide. A single MT wq needed to keep around the same +number of workers as the number of CPUs. The kernel grew a lot of MT +wq users over the years and with the number of CPU cores continuously +rising, some systems saturated the default 32k PID space just booting +up. + +Although MT wq wasted a lot of resource, the level of concurrency +provided was unsatisfactory. The limitation was common to both ST and +MT wq albeit less severe on MT. Each wq maintained its own separate +worker pool. A MT wq could provide only one execution context per CPU +while a ST wq one for the whole system. Work items had to compete for +those very limited execution contexts leading to various problems +including proneness to deadlocks around the single execution context. + +The tension between the provided level of concurrency and resource +usage also forced its users to make unnecessary tradeoffs like libata +choosing to use ST wq for polling PIOs and accepting an unnecessary +limitation that no two polling PIOs can progress at the same time. As +MT wq don't provide much better concurrency, users which require +higher level of concurrency, like async or fscache, had to implement +their own thread pool. + +Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with +focus on the following goals. + +* Maintain compatibility with the original workqueue API. 
+ +* Use per-CPU unified worker pools shared by all wq to provide + flexible level of concurrency on demand without wasting a lot of + resource. + +* Automatically regulate worker pool and level of concurrency so that + the API users don't need to worry about such details. + + +3. The Design + +In order to ease the asynchronous execution of functions a new +abstraction, the work item, is introduced. + +A work item is a simple struct that holds a pointer to the function +that is to be executed asynchronously. Whenever a driver or subsystem +wants a function to be executed asynchronously it has to set up a work +item pointing to that function and queue that work item on a +workqueue. + +Special purpose threads, called worker threads, execute the functions +off of the queue, one after the other. If no work is queued, the +worker threads become idle. These worker threads are managed in so +called thread-pools. + +The cmwq design differentiates between the user-facing workqueues that +subsystems and drivers queue work items on and the backend mechanism +which manages thread-pool and processes the queued work items. + +The backend is called gcwq. There is one gcwq for each possible CPU +and one gcwq to serve work items queued on unbound workqueues. + +Subsystems and drivers can create and queue work items through special +workqueue API functions as they see fit. They can influence some +aspects of the way the work items are executed by setting flags on the +workqueue they are putting the work item on. These flags include +things like cpu locality, reentrancy, concurrency limits and more. To +get a detailed overview refer to the API description of +alloc_workqueue() below. + +When a work item is queued to a workqueue, the target gcwq is +determined according to the queue parameters and workqueue attributes +and appended on the shared worklist of the gcwq. 
For example, unless +specifically overridden, a work item of a bound workqueue will be +queued on the worklist of exactly that gcwq that is associated to the +CPU the issuer is running on. + +For any worker pool implementation, managing the concurrency level +(how many execution contexts are active) is an important issue. cmwq +tries to keep the concurrency at a minimal but sufficient level. +Minimal to save resources and sufficient in that the system is used at +its full capacity. + +Each gcwq bound to an actual CPU implements concurrency management by +hooking into the scheduler. The gcwq is notified whenever an active +worker wakes up or sleeps and keeps track of the number of the +currently runnable workers. Generally, work items are not expected to +hog a CPU and consume many cycles. That means maintaining just enough +concurrency to prevent work processing from stalling should be +optimal. As long as there are one or more runnable workers on the +CPU, the gcwq doesn't start execution of a new work, but, when the +last running worker goes to sleep, it immediately schedules a new +worker so that the CPU doesn't sit idle while there are pending work +items. This allows using a minimal number of workers without losing +execution bandwidth. + +Keeping idle workers around doesn't cost other than the memory space +for kthreads, so cmwq holds onto idle ones for a while before killing +them. + +For an unbound wq, the above concurrency management doesn't apply and +the gcwq for the pseudo unbound CPU tries to start executing all work +items as soon as possible. The responsibility of regulating +concurrency level is on the users. There is also a flag to mark a +bound wq to ignore the concurrency management. Please refer to the +Workqueue Attributes section for details. + +Forward progress guarantee relies on that workers can be created when +more execution contexts are necessary, which in turn is guaranteed +through the use of rescue workers. 
All work items which might be used +on code paths that handle memory reclaim are required to be queued on +wq's that have a rescue-worker reserved for execution under memory +pressure. Else it is possible that the thread-pool deadlocks waiting +for execution contexts to free up. + + +4. Application Programming Interface (API) + +alloc_workqueue() allocates a wq. The original create_*workqueue() +functions are deprecated and scheduled for removal. alloc_workqueue() +takes three arguments - @name, @flags and @max_active. @name is the +name of the wq and also used as the name of the rescuer thread if +there is one. + +A wq no longer manages execution resources but serves as a domain for +forward progress guarantee, flush and work item attributes. @flags +and @max_active control how work items are assigned execution +resources, scheduled and executed. + +@flags: + + WQ_NON_REENTRANT + + By default, a wq guarantees non-reentrance only on the same + CPU. A work may not be executed concurrently on the same CPU + by multiple workers but is allowed to be executed concurrently + on multiple CPUs. This flag makes sure non-reentrance is + enforced across all CPUs. Work items queued to a + non-reentrant wq are guaranteed to be executed by at most one + worker system-wide at any given time. + + WQ_UNBOUND + + Work items queued to an unbound wq are served by a special + gcwq which hosts workers which are not bound to any specific + CPU. This makes the wq behave as a simple execution context + provider without concurrency management. The unbound gcwq + tries to start execution of work items as soon as possible. + Unbound wq sacrifices locality but is useful for the following + cases. + + * Wide fluctuation in the concurrency level requirement is + expected and using bound wq may end up creating large number + of mostly unused workers across different CPUs as the issuer + hops through different CPUs. 
+ + * Long running CPU intensive workloads which can be better + managed by the system scheduler. + + WQ_FREEZEABLE + + A freezeable wq participates in the freeze phase of the system + suspend operations. Work items on the wq are drained and no + new work item starts execution until thawed. + + WQ_RESCUER + + All wq which might be used in the memory reclaim paths _MUST_ + have this flag set. This reserves one worker exclusively for + the execution of this wq under memory pressure. + + WQ_HIGHPRI + + Work items of a highpri wq are queued at the head of the + worklist of the target gcwq and start execution regardless of + the current concurrency level. In other words, highpri work + items will always start execution as soon as execution + resource is available. + + Ordering among highpri work items is preserved - a highpri + work item queued after another highpri work item will start + execution after the earlier highpri work item starts. + + Although highpri work items are not held back by other + runnable work items, they still contribute to the concurrency + level. Highpri work items in runnable state will prevent + non-highpri work items from starting execution. + + This flag is meaningless for unbound wq. + + WQ_CPU_INTENSIVE + + Work items of a CPU intensive wq do not contribute to the + concurrency level. In other words, Runnable CPU intensive + work items will not prevent other work items from starting + execution. This is useful for bound work items which are + expected to hog CPU cycles so that their execution is + regulated by the system scheduler. + + Although CPU intensive work items don't contribute to the + concurrency level, start of their executions is still + regulated by the concurrency management and runnable + non-CPU-intensive work items can delay execution of CPU + intensive work items. + + This flag is meaningless for unbound wq. 
+ + WQ_HIGHPRI | WQ_CPU_INTENSIVE + + This combination makes the wq avoid interaction with + concurrency management completely and behave as a simple + per-CPU execution context provider. Work items queued on a + highpri CPU-intensive wq start execution as soon as resources + are available and don't affect execution of other work items. + +@max_active: + +@max_active determines the maximum number of execution contexts per +CPU which can be assigned to the work items of a wq. For example, +with @max_active of 16, at most 16 work items of the wq can be +executing at the same time per CPU. + +Currently, for a bound wq, the maximum limit for @max_active is 512 +and the default value used when 0 is specified is 256. For an unbound +wq, the limit is higher of 512 and 4 * num_possible_cpus(). These +values are chosen sufficiently high such that they are not the +limiting factor while providing protection in runaway cases. + +The number of active work items of a wq is usually regulated by the +users of the wq, more specifically, by how many work items the users +may queue at the same time. Unless there is a specific need for +throttling the number of active work items, specifying '0' is +recommended. + +Some users depend on the strict execution ordering of ST wq. The +combination of @max_active of 1 and WQ_UNBOUND is used to achieve this +behavior. Work items on such wq are always queued to the unbound gcwq +and only one work item can be active at any given time thus achieving +the same ordering property as ST wq. + + +5. Example Execution Scenarios + +The following example execution scenarios try to illustrate how cmwq +behave under different configurations. + + Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU. + w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms + again before finishing. w1 and w2 burn CPU for 5ms then sleep for + 10ms. 
+ +Ignoring all other tasks, works and processing overhead, and assuming +simple FIFO scheduling, the following is one highly simplified version +of possible sequences of events with the original wq. + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 starts and burns CPU + 25 w1 sleeps + 35 w1 wakes up and finishes + 35 w2 starts and burns CPU + 40 w2 sleeps + 50 w2 wakes up and finishes + +And with cmwq with @max_active >= 3, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 10 w2 starts and burns CPU + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + +If @max_active == 2, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 20 w2 starts and burns CPU + 25 w2 sleeps + 35 w2 wakes up and finishes + +Now, let's assume w1 and w2 are queued to a different wq q1 which has +WQ_HIGHPRI set, + + TIME IN MSECS EVENT + 0 w1 and w2 start and burn CPU + 5 w1 sleeps + 10 w2 sleeps + 10 w0 starts and burns CPU + 15 w0 sleeps + 15 w1 wakes up and finishes + 20 w2 wakes up and finishes + 25 w0 wakes up and burns CPU + 30 w0 finishes + +If q1 has WQ_CPU_INTENSIVE set, + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 and w2 start and burn CPU + 10 w1 sleeps + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + + +6. Guidelines + +* Do not forget to use WQ_RESCUER if a wq may process work items which + are used during memory reclaim. Each wq with WQ_RESCUER set has one + rescuer thread reserved for it. If there is dependency among + multiple work items used during memory reclaim, they should be + queued to separate wq each with WQ_RESCUER. 
+ +* Unless strict ordering is required, there is no need to use ST wq. + +* Unless there is a specific need, using 0 for @nr_active is + recommended. In most use cases, concurrency level usually stays + well under the default limit. + +* A wq serves as a domain for forward progress guarantee (WQ_RESCUER), + flush and work item attributes. Work items which are not involved + in memory reclaim and don't need to be flushed as a part of a group + of work items, and don't require any special attribute, can use one + of the system wq. There is no difference in execution + characteristics between using a dedicated wq and a system wq. + +* Unless work items are expected to consume a huge amount of CPU + cycles, using a bound wq is usually beneficial due to the increased + level of locality in wq operations and work item execution. diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index f11100f..25e02c9 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -235,6 +235,10 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; } #define work_clear_pending(work) \ clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) +/* + * Workqueue flags and constants. For details, please refer to + * Documentation/workqueue.txt. + */ enum { WQ_NON_REENTRANT = 1 << 0, /* guarantee non-reentrance */ WQ_UNBOUND = 1 << 1, /* not bound to any cpu */ diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 727f24e..f77afd9 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1,19 +1,26 @@ /* - * linux/kernel/workqueue.c + * kernel/workqueue.c - generic async execution with shared worker pool * - * Generic mechanism for defining kernel helper threads for running - * arbitrary tasks in process context. 
+ * Copyright (C) 2002 Ingo Molnar * - * Started by Ingo Molnar, Copyright (C) 2002 + * Derived from the taskqueue/keventd code by: + * David Woodhouse <dwmw2@infradead.org> + * Andrew Morton + * Kai Petzke <wpp@marie.physik.tu-berlin.de> + * Theodore Ts'o <tytso@mit.edu> * - * Derived from the taskqueue/keventd code by: + * Made to use alloc_percpu by Christoph Lameter. * - * David Woodhouse <dwmw2@infradead.org> - * Andrew Morton - * Kai Petzke <wpp@marie.physik.tu-berlin.de> - * Theodore Ts'o <tytso@mit.edu> + * Copyright (C) 2010 SUSE Linux Products GmbH + * Copyright (C) 2010 Tejun Heo <tj@kernel.org> * - * Made to use alloc_percpu by Christoph Lameter. + * This is the generic async execution mechanism. Work items are + * executed in process context. The worker pool is shared and + * automatically managed. There is one worker pool for each CPU and + * one extra for works which are better served by workers which are + * not bound to any specific CPU. + * + * Please read Documentation/workqueue.txt for details. */ #include <linux/module.h> -- 1.7.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
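The example execution scenarios in the patch above can be checked mechanically with a tiny discrete-event model: one CPU, FIFO admission up to @max_active, exactly one burning work holding the CPU per millisecond while sleeping works progress in parallel. The sketch below is a userspace illustration of the documented timelines, not kernel code; all names (sim_work, simulate, scenario) are invented.

```c
#include <assert.h>
#include <string.h>

#define MAXP 4

/* A simulated work item: alternating burn/sleep phases in msecs.
 * Even-indexed phases burn CPU, odd-indexed ones sleep. */
struct sim_work {
	int phase[MAXP];
	int nphases;
	int cur;	/* current phase, == nphases once finished */
	int left;	/* msecs left in the current phase */
	int admitted;
	int finish;	/* completion time in msecs, -1 while unfinished */
};

/* One CPU, FIFO queueing, at most @max_active works admitted at once. */
static void simulate(struct sim_work *w, int n, int max_active)
{
	int t, i, active = 0, done = 0;

	for (t = 0; done < n; t++) {
		int burner = -1;

		/* admit queued works in FIFO order while slots are free */
		for (i = 0; i < n && active < max_active; i++)
			if (!w[i].admitted) {
				w[i].admitted = 1;
				w[i].left = w[i].phase[0];
				active++;
			}
		/* the earliest-queued work in a burn phase gets the CPU */
		for (i = 0; i < n; i++)
			if (w[i].admitted && w[i].finish < 0 && !(w[i].cur & 1)) {
				burner = i;
				break;
			}
		/* advance one msec: the burner burns, every sleeper sleeps */
		for (i = 0; i < n; i++) {
			if (!w[i].admitted || w[i].finish >= 0)
				continue;
			if (i == burner || (w[i].cur & 1))
				w[i].left--;
			if (w[i].left == 0) {
				if (++w[i].cur == w[i].nphases) {
					w[i].finish = t + 1;
					active--;
					done++;
				} else {
					w[i].left = w[i].phase[w[i].cur];
				}
			}
		}
	}
}

/* The document's scenario: w0 burns 5ms, sleeps 10ms, burns 5ms;
 * w1 and w2 each burn 5ms then sleep 10ms. */
static void scenario(struct sim_work *w)
{
	static const struct sim_work init[3] = {
		{ .phase = { 5, 10, 5 }, .nphases = 3, .finish = -1 },
		{ .phase = { 5, 10 },    .nphases = 2, .finish = -1 },
		{ .phase = { 5, 10 },    .nphases = 2, .finish = -1 },
	};
	memcpy(w, init, 3 * sizeof(*w));
}
```

Under this model, @max_active == 1 reproduces the original ST wq timeline (finishes at 20, 35, 50ms), @max_active == 2 gives 20, 20, 35ms, and @max_active >= 3 gives 20, 20, 25ms — matching the tables in the document.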
* Re: [PATCH UPDATED] workqueue: add documentation 2010-09-10 14:55 ` Tejun Heo @ 2010-09-10 17:43 ` Randy Dunlap 2010-09-12 10:50 ` Tejun Heo 2010-09-13 0:51 ` Dave Chinner 1 sibling, 1 reply; 14+ messages in thread From: Randy Dunlap @ 2010-09-10 17:43 UTC (permalink / raw) To: Tejun Heo Cc: Florian Mickler, lkml, Ingo Molnar, Christoph Lameter, Dave Chinner On Fri, 10 Sep 2010 16:55:21 +0200 Tejun Heo wrote: > +3. The Design > +Subsystems and drivers can create and queue work items through special > +workqueue API functions as they see fit. They can influence some > +aspects of the way the work items are executed by setting flags on the > +workqueue they are putting the work item on. These flags include > +things like cpu locality, reentrancy, concurrency limits and more. To CPU > +get a detailed overview refer to the API description of > +alloc_workqueue() below. > +4. Application Programming Interface (API) > +@flags: > + > + WQ_NON_REENTRANT > + > + By default, a wq guarantees non-reentrance only on the same > + CPU. A work may not be executed concurrently on the same CPU work item > + by multiple workers but is allowed to be executed concurrently > + on multiple CPUs. This flag makes sure non-reentrance is > + enforced across all CPUs. Work items queued to a > + non-reentrant wq are guaranteed to be executed by at most one > + worker system-wide at any given time. > + WQ_CPU_INTENSIVE > + > + Work items of a CPU intensive wq do not contribute to the > + concurrency level. In other words, Runnable CPU intensive runnable > + work items will not prevent other work items from starting > + execution. This is useful for bound work items which are > + expected to hog CPU cycles so that their execution is > + regulated by the system scheduler. --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH UPDATED] workqueue: add documentation 2010-09-10 17:43 ` Randy Dunlap @ 2010-09-12 10:50 ` Tejun Heo 0 siblings, 0 replies; 14+ messages in thread From: Tejun Heo @ 2010-09-12 10:50 UTC (permalink / raw) To: Randy Dunlap Cc: Florian Mickler, lkml, Ingo Molnar, Christoph Lameter, Dave Chinner Updated accordingly. Thanks. -- tejun ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH UPDATED] workqueue: add documentation 2010-09-10 14:55 ` Tejun Heo 2010-09-10 17:43 ` Randy Dunlap @ 2010-09-13 0:51 ` Dave Chinner 2010-09-13 8:08 ` Tejun Heo 1 sibling, 1 reply; 14+ messages in thread From: Dave Chinner @ 2010-09-13 0:51 UTC (permalink / raw) To: Tejun Heo; +Cc: Florian Mickler, lkml, Ingo Molnar, Christoph Lameter Hi Tejun, A couple more questions on cmwq. On Fri, Sep 10, 2010 at 04:55:21PM +0200, Tejun Heo wrote: ..... > + WQ_HIGHPRI > + > + Work items of a highpri wq are queued at the head of the > + worklist of the target gcwq and start execution regardless of > + the current concurrency level. In other words, highpri work > + items will always start execution as soon as execution > + resource is available. > + > + Ordering among highpri work items is preserved - a highpri > + work item queued after another highpri work item will start > + execution after the earlier highpri work item starts. > + > + Although highpri work items are not held back by other > + runnable work items, they still contribute to the concurrency > + level. Highpri work items in runnable state will prevent > + non-highpri work items from starting execution. > + > + This flag is meaningless for unbound wq. We talked about this for XFS w.r.t. the xfslogd IO completion work items to be promoted ahead of data IO completion items and that has worked fine. This appears to give us only two levels of priority, or from a user point of view, two levels of dependency between workqueue item execution. Thinking about the XFS situation more, we actually have three levels of dependency: xfslogd -> xfsdatad -> xfsconvertd. That is, we defer long running, blocking items from xfsdatad to xfsconvertd so we don't block the xfsdatad from continuing to process data IO completion items. How do we guarantee that the xfsconvertd work items won't prevent/excessively delay processing of xfsdatad items?
> +@max_active determines the maximum number of execution contexts per
> +CPU which can be assigned to the work items of a wq.  For example,
> +with @max_active of 16, at most 16 work items of the wq can be
> +executing at the same time per CPU.

I think the reason you were seeing XFS blow this out of the water is
that every IO completion for a write beyond EOF (i.e. every single
one for an extending streaming write) will require inode locking to
update the file size.  If the inode is locked, then the item will
delay(1), and the cmwq controller will run the next item in a new
worker.  That will then block in delay(1) because it can't get the
inode lock, and so on....

As such, I can't see that increasing the max_active count for XFS is
a good thing - all it will do is cause larger blockages to occur....

> +6. Guidelines
> +
> +* Do not forget to use WQ_RESCUER if a wq may process work items
> +  which are used during memory reclaim.  Each wq with WQ_RESCUER set
> +  has one rescuer thread reserved for it.  If there is a dependency
> +  among multiple work items used during memory reclaim, they should
> +  be queued to separate wqs, each with WQ_RESCUER.
> +
> +* Unless strict ordering is required, there is no need to use an ST wq.
> +
> +* Unless there is a specific need, using 0 for @nr_active is

max_active?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
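[The WQ_HIGHPRI queueing rule Dave quotes above - highpri items go to the head of the target gcwq's worklist, but ordering among highpri items themselves stays FIFO - can be modeled outside the kernel. The sketch below is a toy userspace simulation; the struct and function names are invented for illustration and are not the kernel's implementation.]

```c
#include <assert.h>
#include <string.h>

#define MAXQ 16

struct worklist {
	const char *items[MAXQ];
	int highpri[MAXQ];	/* 1 if the item in this slot is highpri */
	int len;
};

/*
 * Queue an item.  A highpri item is inserted after any earlier highpri
 * items but before all normal items; a normal item goes to the tail.
 */
static void sim_queue_work(struct worklist *wl, const char *name, int highpri)
{
	int pos = wl->len;

	assert(wl->len < MAXQ);
	if (highpri) {
		/* find the first non-highpri slot */
		for (pos = 0; pos < wl->len && wl->highpri[pos]; pos++)
			;
		memmove(&wl->items[pos + 1], &wl->items[pos],
			(wl->len - pos) * sizeof(wl->items[0]));
		memmove(&wl->highpri[pos + 1], &wl->highpri[pos],
			(wl->len - pos) * sizeof(wl->highpri[0]));
	}
	wl->items[pos] = name;
	wl->highpri[pos] = highpri;
	wl->len++;
}
```

[Queueing two normal items and then two highpri items leaves the worklist as log1, log2, data1, data2: the highpri items jump ahead of the normal ones but keep their own relative order, matching the "ordering among highpri work items is preserved" paragraph.]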
* Re: [PATCH UPDATED] workqueue: add documentation
From: Tejun Heo @ 2010-09-13 8:08 UTC
To: Dave Chinner
Cc: Florian Mickler, lkml, Ingo Molnar, Christoph Lameter

Hello,

On 09/13/2010 02:51 AM, Dave Chinner wrote:
> We talked about this for XFS w.r.t. promoting the xfslogd IO
> completion work items ahead of data IO completion items, and that
> has worked fine.  This appears to give us only two levels of
> priority or, from a user point of view, two levels of dependency
> between workqueue item execution.

It's not priority per se.  It's basically a bypass switch for the
workqueue work-deferring mechanism.

> Thinking about the XFS situation more, we actually have three levels
> of dependency: xfslogd -> xfsdatad -> xfsconvertd.  That is, we defer
> long running, blocking items from xfsdatad to xfsconvertd so we
> don't block xfsdatad from continuing to process data IO completion
> items.  How do we guarantee that the xfsconvertd work items won't
> prevent/excessively delay processing of xfsdatad items?

What do you mean by "long running"?  Do you mean it would consume a
lot of CPU cycles, or that it would block on locks and IOs a lot?
It's the latter, right?  There isn't much to worry about.

>> +@max_active determines the maximum number of execution contexts per
>> +CPU which can be assigned to the work items of a wq.  For example,
>> +with @max_active of 16, at most 16 work items of the wq can be
>> +executing at the same time per CPU.
>
> I think the reason you were seeing XFS blow this out of the water is
> that every IO completion for a write beyond EOF (i.e. every single
> one for an extending streaming write) will require inode locking to
> update the file size.  If the inode is locked, then the item will
> delay(1), and the cmwq controller will run the next item in a new
> worker.  That will then block in delay(1) because it can't get the
> inode lock, and so on....
>
> As such, I can't see that increasing the max_active count for XFS is
> a good thing - all it will do is cause larger blockages to occur....

From the description above, it looks like xfs developed its own way
of regulating work processing involving multiple workqueues and
yielding queue positions with delay.  For now, it probably would be
best to just keep things running as they are, but in the long run it
might be beneficial to replace those explicit mechanisms.

>> +6. Guidelines
>> +
>> +* Do not forget to use WQ_RESCUER if a wq may process work items
>> +  which are used during memory reclaim.  Each wq with WQ_RESCUER set
>> +  has one rescuer thread reserved for it.  If there is a dependency
>> +  among multiple work items used during memory reclaim, they should
>> +  be queued to separate wqs, each with WQ_RESCUER.
>> +
>> +* Unless strict ordering is required, there is no need to use an ST wq.
>> +
>> +* Unless there is a specific need, using 0 for @nr_active is
>
> max_active?

Oops, thanks.  Updated.

--
tejun
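[The @max_active semantics under discussion - at most @max_active work items of a wq executing at a time per CPU, with further items held back on a delayed list - amount to simple per-cwq bookkeeping. The following is a hypothetical userspace sketch of that bookkeeping; the struct and function names are invented and this is not the kernel's cwq code.]

```c
#include <assert.h>

/*
 * Per-CPU view of one workqueue, as described in the quoted text: at
 * most max_active items of the wq may execute concurrently; further
 * items wait on the delayed list until an executing item finishes.
 */
struct cwq_sim {
	int max_active;
	int nr_active;		/* items currently executing */
	int nr_delayed;		/* items queued but held back */
};

/* Queue an item; returns 1 if it starts executing now, 0 if delayed. */
static int sim_cwq_queue(struct cwq_sim *cwq)
{
	if (cwq->nr_active < cwq->max_active) {
		cwq->nr_active++;
		return 1;
	}
	cwq->nr_delayed++;
	return 0;
}

/* A finished item frees a slot; a delayed item, if any, takes it. */
static void sim_cwq_complete(struct cwq_sim *cwq)
{
	assert(cwq->nr_active > 0);
	if (cwq->nr_delayed > 0)
		cwq->nr_delayed--;	/* delayed item starts executing */
	else
		cwq->nr_active--;
}
```

[With max_active of 2, a third queued item is held back until one of the first two completes - which is also why Dave's delay(1) pattern keeps spilling into new slots until the limit is hit.]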
* Re: [PATCH UPDATED] workqueue: add documentation
From: Florian Mickler @ 2010-09-13 8:16 UTC
To: Tejun Heo
Cc: Dave Chinner, lkml, Ingo Molnar, Christoph Lameter

One more detail, seems like this never ends... sorry :)

On Mon, 13 Sep 2010 10:08:12 +0200 Tejun Heo <tj@kernel.org> wrote:
> +
> +For an unbound wq, the above concurrency management doesn't apply and
> +the gcwq for the pseudo unbound CPU tries to start executing all work
> +items as soon as possible.  The responsibility of regulating
> +concurrency level is on the users.  There is also a flag to mark a
> +bound wq to ignore the concurrency management.  Please refer to the
> +Workqueue Attributes section for details.

renamed to "API section"

regards,
Flo
* Re: [PATCH UPDATED] workqueue: add documentation
From: Tejun Heo @ 2010-09-13 8:27 UTC
To: Florian Mickler
Cc: Dave Chinner, lkml, Ingo Molnar, Christoph Lameter

On 09/13/2010 10:16 AM, Florian Mickler wrote:
>> +For an unbound wq, the above concurrency management doesn't apply and
>> +the gcwq for the pseudo unbound CPU tries to start executing all work
>> +items as soon as possible.  The responsibility of regulating
>> +concurrency level is on the users.  There is also a flag to mark a
>> +bound wq to ignore the concurrency management.  Please refer to the
>> +Workqueue Attributes section for details.
>
> renamed to "API section"

Updated, thanks.

--
tejun
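[The concurrency management that the quoted paragraph says does not apply to unbound wqs - and that makes blocking work items "not much to worry about" on bound wqs - hinges on counting only *runnable* workers toward the concurrency level. A toy model of that idea (invented names, not kernel code):]

```c
#include <assert.h>

/*
 * Toy model of a per-CPU worker pool: the pool tracks how many workers
 * are currently runnable.  A worker that blocks on a lock or IO stops
 * counting toward the concurrency level, so the pool brings in another
 * worker to keep work items flowing.
 */
struct gcwq_sim {
	int nr_workers;		/* workers in existence */
	int nr_running;		/* workers currently runnable */
};

/*
 * A work item is picked up: if nothing is runnable, wake or create
 * another worker to execute it.
 */
static void sim_start_item(struct gcwq_sim *p)
{
	if (p->nr_running == 0)
		p->nr_workers++;
	p->nr_running++;
}

/*
 * The executing worker blocks: it no longer counts toward the
 * concurrency level, so the next sim_start_item() adds a worker.
 */
static void sim_worker_blocks(struct gcwq_sim *p)
{
	assert(p->nr_running > 0);
	p->nr_running--;
}
```

[When the first worker blocks (say, on Dave's inode lock), the running count drops to zero and the next item brings in a second worker - so a blocking-heavy wq keeps making progress without any per-wq tuning.]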
End of thread [~2010-09-13 8:28 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)

2010-09-08 15:40 [PATCH] workqueue: add documentation  Tejun Heo
2010-09-08 15:51 ` [PATCH UPDATED] workqueue: add documentation  Tejun Heo
2010-09-09  8:02   ` Florian Mickler
2010-09-09 10:22     ` Tejun Heo
2010-09-09 18:50       ` Florian Mickler
2010-09-10 10:25         ` Tejun Heo
2010-09-10 14:26           ` Florian Mickler
2010-09-10 14:55             ` Tejun Heo
2010-09-10 17:43               ` Randy Dunlap
2010-09-12 10:50                 ` Tejun Heo
2010-09-13  0:51               ` Dave Chinner
2010-09-13  8:08                 ` Tejun Heo
2010-09-13  8:16                   ` Florian Mickler
2010-09-13  8:27                     ` Tejun Heo