* [PATCH] workqueue: add documentation
@ 2010-09-08 15:40 Tejun Heo
  2010-09-08 15:51 ` [PATCH UPDATED] " Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2010-09-08 15:40 UTC (permalink / raw)
  To: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner, Florian Mickler

Update copyright notice and add Documentation/workqueue.txt.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
---
Florian, I took the good parts from the previous overview document and
tried to put them in a more compact form.  It would be great if you
could review this one too.  Thanks.

 Documentation/workqueue.txt |  336 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/workqueue.h   |    4
 kernel/workqueue.c          |   27 ++-
 3 files changed, 357 insertions(+), 10 deletions(-)

Index: work/kernel/workqueue.c
===================================================================
--- work.orig/kernel/workqueue.c
+++ work/kernel/workqueue.c
@@ -1,19 +1,26 @@
 /*
- * linux/kernel/workqueue.c
+ * kernel/workqueue.c - generic async execution with shared worker pool
  *
- * Generic mechanism for defining kernel helper threads for running
- * arbitrary tasks in process context.
+ * Copyright (C) 2002		Ingo Molnar
  *
- * Started by Ingo Molnar, Copyright (C) 2002
+ *   Derived from the taskqueue/keventd code by:
+ *     David Woodhouse <dwmw2@infradead.org>
+ *     Andrew Morton
+ *     Kai Petzke <wpp@marie.physik.tu-berlin.de>
+ *     Theodore Ts'o <tytso@mit.edu>
  *
- * Derived from the taskqueue/keventd code by:
+ * Made to use alloc_percpu by Christoph Lameter.
  *
- *   David Woodhouse <dwmw2@infradead.org>
- *   Andrew Morton
- *   Kai Petzke <wpp@marie.physik.tu-berlin.de>
- *   Theodore Ts'o <tytso@mit.edu>
+ * Copyright (C) 2010		SUSE Linux Products GmbH
+ * Copyright (C) 2010		Tejun Heo <tj@kernel.org>
  *
- * Made to use alloc_percpu by Christoph Lameter.
+ * This is the generic async execution mechanism.  Work items are
+ * executed in process context.  The worker pool is shared and
+ * automatically managed.  There is one worker pool for each CPU and
+ * one extra for works which are better served by workers which are
+ * not bound to any specific CPU.
+ *
+ * Please read Documentation/workqueue.txt for details.
  */

 #include <linux/module.h>
Index: work/include/linux/workqueue.h
===================================================================
--- work.orig/include/linux/workqueue.h
+++ work/include/linux/workqueue.h
@@ -235,6 +235,10 @@ static inline unsigned int work_static(s
 #define work_clear_pending(work) \
 	clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))

+/*
+ * Workqueue flags and constants.  For details, please refer to
+ * Documentation/workqueue.txt.
+ */
 enum {
 	WQ_NON_REENTRANT	= 1 << 0, /* guarantee non-reentrance */
 	WQ_UNBOUND		= 1 << 1, /* not bound to any cpu */
Index: work/Documentation/workqueue.txt
===================================================================
--- /dev/null
+++ work/Documentation/workqueue.txt
@@ -0,0 +1,336 @@
+
+Concurrency Managed Workqueue (cmwq)
+
+September, 2010		Tejun Heo <tj@kernel.org>
+
+CONTENTS
+
+1. Why cmwq?
+2. The Design
+3. Workqueue Attributes
+4. Example Execution Scenarios
+5. Guidelines
+
+
+1. Why cmwq?
+
+There are many cases where an asynchronous process execution context
+is needed and the workqueue (wq) is the most commonly used mechanism
+for such cases.  A work item describing which function to execute is
+queued on a workqueue which executes the work item in a process
+context asynchronously.
+
+In the original wq implementation, a multi threaded (MT) wq had one
+worker thread per CPU and a single threaded (ST) wq had one worker
+thread system-wide.  A single MT wq needed to keep around the same
+number of workers as the number of CPUs.  The kernel grew a lot of MT
+wq users over the years and with the number of CPU cores continuously
+rising, some systems saturated the default 32k PID space just booting
+up.
+
+Although MT wq wasted a lot of resource, the level of concurrency
+provided was unsatisfactory.  The limitation was common to both ST and
+MT wq albeit less severe on MT.  Each wq maintained its own seprate
+worker pool.  A MT wq could provid only one execution context per CPU
+while a ST wq one for the whole system.  Work items had to compete for
+those very limited execution contexts leading to various problems
+including proneness to deadlocks around the single execution context.
+
+The tension between the provided level of concurrency and resource
+usage also forced its users to make unnecessary tradeoffs like libata
+choosing to use ST wq for polling PIOs and accepting an unnecessary
+limitation that no two polling PIOs can progress at the same time.  As
+MT wq don't provide much better concurrency, users which require
+higher level of concurrency, like async or fscache, had to implement
+their own thread pool.
+
+Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with
+focus on the following goals.
+
+* Maintain compatibility with the original workqueue API.
+
+* Use per-CPU unified worker pools shared by all wq to provide
+  flexible level of concurrency on demand without wasting a lot of
+  resource.
+
+* Automatically regulate worker pool and level of concurrency so that
+  the API users don't need to worry about such details.
+
+
+2. The Design
+
+There's a single global cwq (gcwq) for each possible CPU and a pseudo
+CPU for unbound wq.  A gcwq manages and serves out all the execution
+contexts on the associated CPU.  cpu_workqueue's (cwq) of each wq are
+mostly simple frontends to the associated gcwq.  When a work item is
+queued, it's queued to the unified worklist of the target gcwq.  Each
+gcwq maintains pool of workers used to process the worklist.
+
+For any worker pool implmentation, managing the concurrency level (how
+many execution contexts are active) is an important issue.  cmwq tries
+to keep the concurrency at minimal but sufficient level.
+
+Each gcwq bound to an actual CPU implements concurrency management by
+hooking into the scheduler.  The gcwq is notified whenever an active
+worker wakes up or sleeps and keeps track of the number of the
+currently runnable workers.  Generally, work items are not expected to
+hog CPU cycle and maintaining just enough concurrency to prevent work
+processing from stalling should be optimal.  As long as there is one
+or more runnable workers on the CPU, the gcwq doesn't start execution
+of a new work, but, when the last running worker goes to sleep, it
+immediately schedules a new worker so that the CPU doesn't sit idle
+while there are pending work items.  This allows using minimal number
+of workers without losing execution bandwidth.
+
+Keeping idle workers around doesn't cost other than the memory space
+for kthreads, so cmwq holds onto idle ones for a while before killing
+them.
+
+For an unbound wq, the above concurrency management doesn't apply and
+the gcwq for the pseudo unbound CPU tries to start executing all work
+items as soon as possible.  The responsibility of regulating
+concurrency level is on the users.  There is also a flag to mark a
+bound wq to ignore the concurrency management.  Please refer to the
+Workqueue Attributes section for details.
+
+Forward progress guarantee relies on that workers can be created when
+more execution contexts are necessary, which in turn is guaranteed
+through the use of rescue workers.  All wq which might be used in
+memory reclamation path are required to have a rescuer reserved for
+execution of the wq under memory pressure so that memory reclamation
+for worker creation doesn't deadlock waiting for execution contexts to
+free up.
+
+
+3. Workqueue Attributes
+
+alloc_workqueue() allocates a wq.  The original create_*workqueue()
+functions are deprecated and scheduled for removal.  alloc_workqueue()
+takes three arguments - @name, @flags and @max_active.  @name is the
+name of the wq and also used as the name of the rescuer thread if
+there is one.
+
+A wq no longer manages execution resources but serves as a domain for
+forward progress guarantee, flush and work item attributes.  @flags
+and @max_active control how work items are assigned execution
+resources, scheduled and executed.
+
+@flags:
+
+  WQ_NON_REENTRANT
+
+	By default, a wq guarantees non-reentrance only on the same
+	CPU.  A work may not be executed concurrently on the same CPU
+	by multiple workers but is allowed to be executed concurrently
+	on multiple CPUs.  This flag makes sure non-reentrance is
+	enforced across all CPUs.  Work items queued to a
+	non-reentrant wq are guaranteed to be executed by at most one
+	worker system-wide at any given time.
+
+  WQ_UNBOUND
+
+	Work items queued to an unbound wq are served by a special
+	gcwq which hosts workers which are not bound to any specific
+	CPU.  This makes the wq behave as a simple execution context
+	provider without concurrency management.  The unbound gcwq
+	tries to start execution of work items as soon as possible.
+	Unbound wq sacrifices locality but is useful for the following
+	cases.
+
+	* Wide fluctuation in the concurrency level requirement is
+	  expected and using bound wq may end up creating large number
+	  of mostly unused workers across different CPUs as the issuer
+	  hops through different CPUs.
+
+	* Long running CPU intensive workloads which can be better
+	  managed by the system scheduler.
+
+  WQ_FREEZEABLE
+
+	A freezeable wq participates in the freeze phase of the system
+	suspend operations.  Work items on the wq are drained and no
+	new work item starts execution until thawed.
+
+  WQ_RESCUER
+
+	All wq which might be used in the memory reclamation paths
+	_MUST_ have this flag set.  This reserves one worker
+	exclusively for the execution of this wq under memory
+	pressure.
+
+  WQ_HIGHPRI
+
+	Work items of a highpri wq are queued at the head of the
+	worklist of the target gcwq and start execution regardless of
+	the current concurrency level.  In other words, highpri work
+	items will always start execution as soon as execution
+	resource is available.
+
+	Ordering among highpri work items is preserved - a highpri
+	work item queued after another highpri work item will start
+	execution after the earlier highpri work item starts.
+
+	Although highpri work items are not held back by other
+	runnable work items, they still contribute to the concurrency
+	level.  Highpri work items in runnable state will prevent
+	non-highpri work items from starting execution.
+
+	This flag is meaningless for unbound wq.
+
+  WQ_CPU_INTENSIVE
+
+	Work items of a CPU intensive wq do not contribute to the
+	concurrency level.  In other words, runnable CPU intensive
+	work items will not prevent other work items from starting
+	execution.  This is useful for bound work items which are
+	expected to hog CPU cycles so that their execution is
+	regulated by the system scheduler.
+
+	Although CPU intensive work items don't contribute to the
+	concurrency level, start of their executions is still
+	regulated by the concurrency management and runnable
+	non-CPU-intensive work items can delay execution of CPU
+	intensive work items.
+
+	This flag is meaningless for unbound wq.
+
+  WQ_HIGHPRI | WQ_CPU_INTENSIVE
+
+	This combination makes the wq avoid interaction with
+	concurrency management completely and behave as a simple
+	per-CPU execution context provider.  Work items queued on a
+	highpri CPU-intensive wq start execution as soon as resources
+	are available and don't affect execution of other work items.
+
+@max_active:
+
+@max_active determines the maximum number of execution contexts per
+CPU which can be assigned to the work items of a wq.  For example,
+with @max_active of 16, at most 16 work items of the wq can be
+executing at the same time per CPU.
+
+Currently, for a bound wq, the maximum limit for @max_active is 512
+and the default value used when 0 is specified is 256.  For an unbound
+wq, the limit is the higher of 512 and 4 * num_possible_cpus().  These
+values are chosen sufficiently high such that they are not the
+limiting factor while providing protection in runaway cases.
+
+The number of active work items of a wq is usually regulated by the
+users of the wq, more specifically, by how many work items the users
+may queue at the same time.  Unless there is a specific need for
+throttling the number of active work items, specifying '0' is
+recommended.
+
+Some users depend on the strict execution ordering of ST wq.  The
+combination of @max_active of 1 and WQ_UNBOUND is used to achieve this
+behavior.  Work items on such wq are always queued to the unbound gcwq
+and only one work item can be active at any given time thus achieving
+the same ordering property as ST wq.
+
+
+4. Example Execution Scenarios
+
+The following example execution scenarios try to illustrate how cmwq
+behaves under different configurations.
+
+ Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU.
+ w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms
+ again before finishing.  w1 and w2 burn CPU for 5ms then sleep for
+ 10ms.
+
+Ignoring all other tasks, works and processing overhead, and assuming
+simple FIFO scheduling, the following is one highly simplified version
+of possible sequences of events with the original wq.
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 starts and burns CPU
+ 25		w1 sleeps
+ 35		w1 wakes up and finishes
+ 35		w2 starts and burns CPU
+ 40		w2 sleeps
+ 50		w2 wakes up and finishes
+
+And with cmwq with @max_active >= 3,
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 5		w1 starts and burns CPU
+ 10		w1 sleeps
+ 10		w2 starts and burns CPU
+ 15		w2 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 wakes up and finishes
+ 25		w2 wakes up and finishes
+
+If @max_active == 2,
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 5		w1 starts and burns CPU
+ 10		w1 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 wakes up and finishes
+ 20		w2 starts and burns CPU
+ 25		w2 sleeps
+ 35		w2 wakes up and finishes
+
+Now, let's assume w1 and w2 are queued to a different wq q1 which has
+WQ_HIGHPRI set,
+
+ TIME IN MSECS	EVENT
+ 0		w1 and w2 start and burn CPU
+ 5		w1 sleeps
+ 10		w2 sleeps
+ 10		w0 starts and burns CPU
+ 15		w0 sleeps
+ 15		w1 wakes up and finishes
+ 20		w2 wakes up and finishes
+ 25		w0 wakes up and burns CPU
+ 30		w0 finishes
+
+If q1 has WQ_CPU_INTENSIVE set,
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 5		w1 and w2 start and burn CPU
+ 10		w1 sleeps
+ 15		w2 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 wakes up and finishes
+ 25		w2 wakes up and finishes
+
+
+5. Guidelines
+
+* Do not forget to use WQ_RESCUER if a wq may process work items which
+  are used during memory reclamation.  Each wq with WQ_RESCUER set has
+  one rescuer thread reserved for it.  If there is dependency among
+  multiple work items used during memory reclamation, they should be
+  queued to separate wq each with WQ_RESCUER.
+
+* Unless strict ordering is required, there is no need to use ST wq.
+
+* Unless there is a specific need, using 0 for @max_active is
+  recommended.  In most use cases, concurrency level usually stays
+  well under the default limit.
+
+* A wq serves as a domain for forward progress guarantee (WQ_RESCUER),
+  flush and work item attributes.  Work items which are not involved
+  in memory reclamation and don't need to be flushed as a part of a
+  group of work items, and don't require any special attribute, can
+  use one of the system wq.  There is no difference in execution
+  characteristics between using a dedicated wq and a system wq.
+
+* Unless work items are expected to consume huge amount of CPU cycles,
+  using bound wq is usually beneficial due to increased level of
+  locality in wq operations and work item execution.


* [PATCH UPDATED] workqueue: add documentation
  2010-09-08 15:40 [PATCH] workqueue: add documentation Tejun Heo
@ 2010-09-08 15:51 ` Tejun Heo
  2010-09-09  8:02   ` Florian Mickler
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2010-09-08 15:51 UTC (permalink / raw)
  To: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner, Florian Mickler

Update copyright notice and add Documentation/workqueue.txt.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
---
Forgot to run ispell.  Here's the ispell'd version.

Thanks.

 Documentation/workqueue.txt |  336 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/workqueue.h   |    4
 kernel/workqueue.c          |   27 ++-
 3 files changed, 357 insertions(+), 10 deletions(-)

Index: work/kernel/workqueue.c
===================================================================
--- work.orig/kernel/workqueue.c
+++ work/kernel/workqueue.c
@@ -1,19 +1,26 @@
 /*
- * linux/kernel/workqueue.c
+ * kernel/workqueue.c - generic async execution with shared worker pool
  *
- * Generic mechanism for defining kernel helper threads for running
- * arbitrary tasks in process context.
+ * Copyright (C) 2002		Ingo Molnar
  *
- * Started by Ingo Molnar, Copyright (C) 2002
+ *   Derived from the taskqueue/keventd code by:
+ *     David Woodhouse <dwmw2@infradead.org>
+ *     Andrew Morton
+ *     Kai Petzke <wpp@marie.physik.tu-berlin.de>
+ *     Theodore Ts'o <tytso@mit.edu>
  *
- * Derived from the taskqueue/keventd code by:
+ * Made to use alloc_percpu by Christoph Lameter.
  *
- *   David Woodhouse <dwmw2@infradead.org>
- *   Andrew Morton
- *   Kai Petzke <wpp@marie.physik.tu-berlin.de>
- *   Theodore Ts'o <tytso@mit.edu>
+ * Copyright (C) 2010		SUSE Linux Products GmbH
+ * Copyright (C) 2010		Tejun Heo <tj@kernel.org>
  *
- * Made to use alloc_percpu by Christoph Lameter.
+ * This is the generic async execution mechanism.  Work items are
+ * executed in process context.  The worker pool is shared and
+ * automatically managed.  There is one worker pool for each CPU and
+ * one extra for works which are better served by workers which are
+ * not bound to any specific CPU.
+ *
+ * Please read Documentation/workqueue.txt for details.
  */

 #include <linux/module.h>
Index: work/include/linux/workqueue.h
===================================================================
--- work.orig/include/linux/workqueue.h
+++ work/include/linux/workqueue.h
@@ -235,6 +235,10 @@ static inline unsigned int work_static(s
 #define work_clear_pending(work) \
 	clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))

+/*
+ * Workqueue flags and constants.  For details, please refer to
+ * Documentation/workqueue.txt.
+ */
 enum {
 	WQ_NON_REENTRANT	= 1 << 0, /* guarantee non-reentrance */
 	WQ_UNBOUND		= 1 << 1, /* not bound to any cpu */
Index: work/Documentation/workqueue.txt
===================================================================
--- /dev/null
+++ work/Documentation/workqueue.txt
@@ -0,0 +1,336 @@
+
+Concurrency Managed Workqueue (cmwq)
+
+September, 2010		Tejun Heo <tj@kernel.org>
+
+CONTENTS
+
+1. Why cmwq?
+2. The Design
+3. Workqueue Attributes
+4. Example Execution Scenarios
+5. Guidelines
+
+
+1. Why cmwq?
+
+There are many cases where an asynchronous process execution context
+is needed and the workqueue (wq) is the most commonly used mechanism
+for such cases.  A work item describing which function to execute is
+queued on a workqueue which executes the work item in a process
+context asynchronously.
+
+In the original wq implementation, a multi threaded (MT) wq had one
+worker thread per CPU and a single threaded (ST) wq had one worker
+thread system-wide.  A single MT wq needed to keep around the same
+number of workers as the number of CPUs.  The kernel grew a lot of MT
+wq users over the years and with the number of CPU cores continuously
+rising, some systems saturated the default 32k PID space just booting
+up.
+
+Although MT wq wasted a lot of resource, the level of concurrency
+provided was unsatisfactory.  The limitation was common to both ST and
+MT wq albeit less severe on MT.  Each wq maintained its own separate
+worker pool.  A MT wq could provide only one execution context per CPU
+while a ST wq one for the whole system.  Work items had to compete for
+those very limited execution contexts leading to various problems
+including proneness to deadlocks around the single execution context.
+
+The tension between the provided level of concurrency and resource
+usage also forced its users to make unnecessary tradeoffs like libata
+choosing to use ST wq for polling PIOs and accepting an unnecessary
+limitation that no two polling PIOs can progress at the same time.  As
+MT wq don't provide much better concurrency, users which require
+higher level of concurrency, like async or fscache, had to implement
+their own thread pool.
+
+Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with
+focus on the following goals.
+
+* Maintain compatibility with the original workqueue API.
+
+* Use per-CPU unified worker pools shared by all wq to provide
+  flexible level of concurrency on demand without wasting a lot of
+  resource.
+
+* Automatically regulate worker pool and level of concurrency so that
+  the API users don't need to worry about such details.
+
+
+2. The Design
+
+There's a single global cwq (gcwq) for each possible CPU and a pseudo
+CPU for unbound wq.  A gcwq manages and serves out all the execution
+contexts on the associated CPU.  cpu_workqueue's (cwq) of each wq are
+mostly simple frontends to the associated gcwq.  When a work item is
+queued, it's queued to the unified worklist of the target gcwq.  Each
+gcwq maintains pool of workers used to process the worklist.
+
+For any worker pool implementation, managing the concurrency level (how
+many execution contexts are active) is an important issue.  cmwq tries
+to keep the concurrency at minimal but sufficient level.
+
+Each gcwq bound to an actual CPU implements concurrency management by
+hooking into the scheduler.  The gcwq is notified whenever an active
+worker wakes up or sleeps and keeps track of the number of the
+currently runnable workers.  Generally, work items are not expected to
+hog CPU cycle and maintaining just enough concurrency to prevent work
+processing from stalling should be optimal.  As long as there is one
+or more runnable workers on the CPU, the gcwq doesn't start execution
+of a new work, but, when the last running worker goes to sleep, it
+immediately schedules a new worker so that the CPU doesn't sit idle
+while there are pending work items.  This allows using minimal number
+of workers without losing execution bandwidth.
+
+Keeping idle workers around doesn't cost other than the memory space
+for kthreads, so cmwq holds onto idle ones for a while before killing
+them.
+
+For an unbound wq, the above concurrency management doesn't apply and
+the gcwq for the pseudo unbound CPU tries to start executing all work
+items as soon as possible.  The responsibility of regulating
+concurrency level is on the users.  There is also a flag to mark a
+bound wq to ignore the concurrency management.  Please refer to the
+Workqueue Attributes section for details.
+
+Forward progress guarantee relies on that workers can be created when
+more execution contexts are necessary, which in turn is guaranteed
+through the use of rescue workers.  All wq which might be used in
+memory reclamation path are required to have a rescuer reserved for
+execution of the wq under memory pressure so that memory reclamation
+for worker creation doesn't deadlock waiting for execution contexts to
+free up.
+
+
+3. Workqueue Attributes
+
+alloc_workqueue() allocates a wq.  The original create_*workqueue()
+functions are deprecated and scheduled for removal.  alloc_workqueue()
+takes three arguments - @name, @flags and @max_active.  @name is the
+name of the wq and also used as the name of the rescuer thread if
+there is one.
+
+A wq no longer manages execution resources but serves as a domain for
+forward progress guarantee, flush and work item attributes.  @flags
+and @max_active control how work items are assigned execution
+resources, scheduled and executed.
+
+@flags:
+
+  WQ_NON_REENTRANT
+
+	By default, a wq guarantees non-reentrance only on the same
+	CPU.  A work may not be executed concurrently on the same CPU
+	by multiple workers but is allowed to be executed concurrently
+	on multiple CPUs.  This flag makes sure non-reentrance is
+	enforced across all CPUs.  Work items queued to a
+	non-reentrant wq are guaranteed to be executed by at most one
+	worker system-wide at any given time.
+
+  WQ_UNBOUND
+
+	Work items queued to an unbound wq are served by a special
+	gcwq which hosts workers which are not bound to any specific
+	CPU.  This makes the wq behave as a simple execution context
+	provider without concurrency management.  The unbound gcwq
+	tries to start execution of work items as soon as possible.
+	Unbound wq sacrifices locality but is useful for the following
+	cases.
+
+	* Wide fluctuation in the concurrency level requirement is
+	  expected and using bound wq may end up creating large number
+	  of mostly unused workers across different CPUs as the issuer
+	  hops through different CPUs.
+
+	* Long running CPU intensive workloads which can be better
+	  managed by the system scheduler.
+
+  WQ_FREEZEABLE
+
+	A freezeable wq participates in the freeze phase of the system
+	suspend operations.  Work items on the wq are drained and no
+	new work item starts execution until thawed.
+
+  WQ_RESCUER
+
+	All wq which might be used in the memory reclamation paths
+	_MUST_ have this flag set.  This reserves one worker
+	exclusively for the execution of this wq under memory
+	pressure.
+
+  WQ_HIGHPRI
+
+	Work items of a highpri wq are queued at the head of the
+	worklist of the target gcwq and start execution regardless of
+	the current concurrency level.  In other words, highpri work
+	items will always start execution as soon as execution
+	resource is available.
+
+	Ordering among highpri work items is preserved - a highpri
+	work item queued after another highpri work item will start
+	execution after the earlier highpri work item starts.
+
+	Although highpri work items are not held back by other
+	runnable work items, they still contribute to the concurrency
+	level.  Highpri work items in runnable state will prevent
+	non-highpri work items from starting execution.
+
+	This flag is meaningless for unbound wq.
+
+  WQ_CPU_INTENSIVE
+
+	Work items of a CPU intensive wq do not contribute to the
+	concurrency level.  In other words, runnable CPU intensive
+	work items will not prevent other work items from starting
+	execution.  This is useful for bound work items which are
+	expected to hog CPU cycles so that their execution is
+	regulated by the system scheduler.
+
+	Although CPU intensive work items don't contribute to the
+	concurrency level, start of their executions is still
+	regulated by the concurrency management and runnable
+	non-CPU-intensive work items can delay execution of CPU
+	intensive work items.
+
+	This flag is meaningless for unbound wq.
+
+  WQ_HIGHPRI | WQ_CPU_INTENSIVE
+
+	This combination makes the wq avoid interaction with
+	concurrency management completely and behave as a simple
+	per-CPU execution context provider.  Work items queued on a
+	highpri CPU-intensive wq start execution as soon as resources
+	are available and don't affect execution of other work items.
+
+@max_active:
+
+@max_active determines the maximum number of execution contexts per
+CPU which can be assigned to the work items of a wq.  For example,
+with @max_active of 16, at most 16 work items of the wq can be
+executing at the same time per CPU.
+
+Currently, for a bound wq, the maximum limit for @max_active is 512
+and the default value used when 0 is specified is 256.  For an unbound
+wq, the limit is the higher of 512 and 4 * num_possible_cpus().  These
+values are chosen sufficiently high such that they are not the
+limiting factor while providing protection in runaway cases.
+
+The number of active work items of a wq is usually regulated by the
+users of the wq, more specifically, by how many work items the users
+may queue at the same time.  Unless there is a specific need for
+throttling the number of active work items, specifying '0' is
+recommended.
+
+Some users depend on the strict execution ordering of ST wq.  The
+combination of @max_active of 1 and WQ_UNBOUND is used to achieve this
+behavior.  Work items on such wq are always queued to the unbound gcwq
+and only one work item can be active at any given time thus achieving
+the same ordering property as ST wq.
+
+
+4. Example Execution Scenarios
+
+The following example execution scenarios try to illustrate how cmwq
+behaves under different configurations.
+
+ Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU.
+ w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms
+ again before finishing.  w1 and w2 burn CPU for 5ms then sleep for
+ 10ms.
+
+Ignoring all other tasks, works and processing overhead, and assuming
+simple FIFO scheduling, the following is one highly simplified version
+of possible sequences of events with the original wq.
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 starts and burns CPU
+ 25		w1 sleeps
+ 35		w1 wakes up and finishes
+ 35		w2 starts and burns CPU
+ 40		w2 sleeps
+ 50		w2 wakes up and finishes
+
+And with cmwq with @max_active >= 3,
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 5		w1 starts and burns CPU
+ 10		w1 sleeps
+ 10		w2 starts and burns CPU
+ 15		w2 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 wakes up and finishes
+ 25		w2 wakes up and finishes
+
+If @max_active == 2,
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 5		w1 starts and burns CPU
+ 10		w1 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 wakes up and finishes
+ 20		w2 starts and burns CPU
+ 25		w2 sleeps
+ 35		w2 wakes up and finishes
+
+Now, let's assume w1 and w2 are queued to a different wq q1 which has
+WQ_HIGHPRI set,
+
+ TIME IN MSECS	EVENT
+ 0		w1 and w2 start and burn CPU
+ 5		w1 sleeps
+ 10		w2 sleeps
+ 10		w0 starts and burns CPU
+ 15		w0 sleeps
+ 15		w1 wakes up and finishes
+ 20		w2 wakes up and finishes
+ 25		w0 wakes up and burns CPU
+ 30		w0 finishes
+
+If q1 has WQ_CPU_INTENSIVE set,
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 5		w1 and w2 start and burn CPU
+ 10		w1 sleeps
+ 15		w2 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 wakes up and finishes
+ 25		w2 wakes up and finishes
+
+
+5. Guidelines
+
+* Do not forget to use WQ_RESCUER if a wq may process work items which
+  are used during memory reclamation.  Each wq with WQ_RESCUER set has
+  one rescuer thread reserved for it.  If there is dependency among
+  multiple work items used during memory reclamation, they should be
+  queued to separate wq each with WQ_RESCUER.
+
+* Unless strict ordering is required, there is no need to use ST wq.
+
+* Unless there is a specific need, using 0 for @max_active is
+  recommended.  In most use cases, concurrency level usually stays
+  well under the default limit.
+
+* A wq serves as a domain for forward progress guarantee (WQ_RESCUER),
+  flush and work item attributes.  Work items which are not involved
+  in memory reclamation and don't need to be flushed as a part of a
+  group of work items, and don't require any special attribute, can
+  use one of the system wq.  There is no difference in execution
+  characteristics between using a dedicated wq and a system wq.
+
+* Unless work items are expected to consume huge amount of CPU cycles,
+  using bound wq is usually beneficial due to increased level of
+  locality in wq operations and work item execution.


* Re: [PATCH UPDATED] workqueue: add documentation
  2010-09-08 15:51 ` [PATCH UPDATED] " Tejun Heo
@ 2010-09-09  8:02   ` Florian Mickler
  2010-09-09 10:22     ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Florian Mickler @ 2010-09-09  8:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner

Hi Tejun!
Perfect timing. Just enough for the details to get a little foggy, 
while still knowing a little bit what you want to talk about. 
:-)

On Wed, 08 Sep 2010 17:40:02 +0200 Tejun Heo <tj@kernel.org> wrote:

> +
> +1. Why cmwq?

Perhaps better to begin with an introduction:

1. Introduction

> +
> +There are many cases where an asynchronous process execution context
> +is needed and the workqueue (wq)  is the most commonly used mechanism
> +for such cases.  

There are many cases where an asynchronous process execution context is
needed and the workqueue (wq) API is the most commonly used mechanism
for such cases. 

> A work item describing which function to execute is
> +queued on a workqueue which executes the work item in a process
> +context asynchronously.

When such an asynchronous execution context is needed, a work item
describing which function to execute is put on a queue. An independent
thread serves as the asynchronous execution context. The queue is
called workqueue and the thread is called worker. 

While there are work items on the workqueue the worker executes
the functions associated with the work items one after the other. 
When there is no work item left on the workqueue the worker
becomes idle. When a new work item gets queued, the worker begins
executing again.
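
In code, that flow is roughly the following minimal sketch (the
function and work item names here are made up and the default system
wq is used for brevity):

#include <linux/workqueue.h>

/* the function to be executed asynchronously */
static void my_fn(struct work_struct *work)
{
	/* runs in process context in a worker thread */
}

/* the work item describing which function to execute */
static DECLARE_WORK(my_item, my_fn);

static void kick_it_off(void)
{
	/* put the work item on the default workqueue; an idle worker
	 * (or a newly woken one) will pick it up and call my_fn() */
	schedule_work(&my_item);
}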

2. Why cmwq?

> +
> +In the original wq implementation, a multi threaded (MT) wq had one
> +worker thread per CPU and a single threaded (ST) wq had one worker
> +thread system-wide.  A single MT wq needed to keep around the same
> +number of workers as the number of CPUs.  The kernel grew a lot of MT
> +wq users over the years and with the number of CPU cores continuously
> +rising, some systems saturated the default 32k PID space just booting
> +up.
> +
> +Although MT wq wasted a lot of resource, the level of concurrency
> +provided was unsatisfactory.  The limitation was common to both ST and
> +MT wq albeit less severe on MT.  Each wq maintained its own separate
> +worker pool.  A MT wq could provide only one execution context per CPU
> +while a ST wq one for the whole system.  Work items had to compete for
> +those very limited execution contexts leading to various problems
> +including proneness to deadlocks around the single execution context.
> +
> +The tension between the provided level of concurrency and resource
> +usage also forced its users to make unnecessary tradeoffs like libata
> +choosing to use ST wq for polling PIOs and accepting an unnecessary
> +limitation that no two polling PIOs can progress at the same time.  As
> +MT wq don't provide much better concurrency, users which require
> +higher level of concurrency, like async or fscache, had to implement
> +their own thread pool.
> +
> +Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with
> +focus on the following goals.
> +
> +* Maintain compatibility with the original workqueue API.
> +
> +* Use per-CPU unified worker pools shared by all wq to provide
> +  flexible level of concurrency on demand without wasting a lot of
> +  resource.
> +
> +* Automatically regulate worker pool and level of concurrency so that
> +  the API users don't need to worry about such details.
> +
> +



> +2. The Design

Now it gets a little bit rougher:

> +
> +There's a single global cwq (gcwq) for each possible CPU and a pseudo
> +CPU for unbound wq.  A gcwq manages and serves out all the execution
> +contexts on the associated CPU.  cpu_workqueue's (cwq) of each wq are
> +mostly simple frontends to the associated gcwq.  When a work item is
> +queued, it's queued to the unified worklist of the target gcwq.  Each
> +gcwq maintains pool of workers used to process the worklist.

Hm. That hurt my brain a little. :) 
What about smth along the lines:

In order to ease the asynchronous execution of functions a new
abstraction, the work item, is introduced.

A work item is a simple struct that holds a pointer to the
function that is to be executed asynchronously. Whenever a driver or
subsystem wants a function to be executed asynchronously it has to set
up a work item pointing to that function and queue that work item on a
workqueue.

Special purpose threads, called worker threads,  execute the functions
off of the queue, one after the other. If no work is queued, the worker
threads become idle.

These worker threads are managed in so called thread-pools.

The cmwq design differentiates between the user-facing workqueues that
subsystems and drivers queue work items on and what queues the 
thread-pools actually work on.

There are worker-thread-pools for each possible CPU and one
worker-thread-pool whose threads are not bound to any specific CPU. Each
worker-thread-pool has its own queue (called gcwq) from which it
executes work-items.  

When a driver or subsystem creates a workqueue it is
automatically associated with one of the gcwq's. For CPU-bound
workqueues they are associated to that specific CPU's gcwq. For
unbound workqueues, they are queued to the gcwq of the global
thread-pool. 

[Btw, I realized, now that I read the guidelines below, that this last
paragraph is probably incorrect? Is there an association or does the
enqueue-API automatically determine the CPU it needs to queue the work
item on?]

> +For any worker pool implementation, managing the concurrency level (how
> +many execution contexts are active) is an important issue.  cmwq tries
> +to keep the concurrency at minimal but sufficient level.
> +
> +Each gcwq bound to an actual CPU implements concurrency management by
> +hooking into the scheduler.  The gcwq is notified whenever an active
> +worker wakes up or sleeps and keeps track of the number of the
> +currently runnable workers.  Generally, work items are not expected to
> +hog CPU cycle and maintaining just enough concurrency to prevent work
> +processing from stalling should be optimal.  As long as there is one
> +or more runnable workers on the CPU, the gcwq doesn't start execution
> +of a new work, but, when the last running worker goes to sleep, it
> +immediately schedules a new worker so that the CPU doesn't sit idle
> +while there are pending work items.  This allows using minimal number
> +of workers without losing execution bandwidth.
> +
> +Keeping idle workers around doesn't cost other than the memory space
> +for kthreads, so cmwq holds onto idle ones for a while before killing
> +them.
> +
> +For an unbound wq, the above concurrency management doesn't apply and
> +the gcwq for the pseudo unbound CPU tries to start executing all work
> +items as soon as possible.  The responsibility of regulating
> +concurrency level is on the users.  There is also a flag to mark a
> +bound wq to ignore the concurrency management.  Please refer to the
> +Workqueue Attributes section for details.
> +
> +Forward progress guarantee relies on that workers can be created when
> +more execution contexts are necessary, which in turn is guaranteed
> +through the use of rescue workers.  

> +All wq which might be used in
> +memory reclamation path are required to have a rescuer reserved for
> +execution of the wq under memory pressure so that memory reclamation
> +for worker creation doesn't deadlock waiting for execution contexts to
> +free up.

All work items which might be used on code paths that handle memory 
reclaim are required to be queued on wq's that have a rescue-worker 
reserved for execution under memory pressure. Else it is possible that 
the thread-pool deadlocks waiting for execution contexts to free up.
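
As a sketch of what that means at the allocation site (the wq name and
work function below are invented, not taken from the patch), any wq
feeding such work items would be created with WQ_RESCUER:

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *reclaim_wq;

/* hypothetical work function used on a memory reclaim path */
static void writeback_fn(struct work_struct *work)
{
	/* must be able to make forward progress under memory pressure */
}

static DECLARE_WORK(writeback_work, writeback_fn);

static int reclaim_setup(void)
{
	/* WQ_RESCUER reserves one rescuer thread so queued work items
	 * can still execute when new workers cannot be created */
	reclaim_wq = alloc_workqueue("reclaim_wq", WQ_RESCUER, 1);
	if (!reclaim_wq)
		return -ENOMEM;
	queue_work(reclaim_wq, &writeback_work);
	return 0;
}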


> +
> +
> +3. Workqueue Attributes
> +

3. Application Programming Interface (API)

> +alloc_workqueue() allocates a wq.  The original create_*workqueue()
> +functions are deprecated and scheduled for removal.  alloc_workqueue()
> +takes three arguments - @name, @flags and @max_active.  @name is the
> +name of the wq and also used as the name of the rescuer thread if
> +there is one.
> +
> +A wq no longer manages execution resources but serves as a domain for
> +forward progress guarantee, flush and work item attributes.  @flags
> +and @max_active control how work items are assigned execution
> +resources, scheduled and executed.
[snip]

I think it is worth mentioning all functions that are considered to be
part of the API here. 
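
For reference, here is a rough sketch of the calls most users touch,
using the alloc_workqueue() interface from the patch (the names are
made up and this is only an illustration, not the API section itself):

#include <linux/errno.h>
#include <linux/workqueue.h>

static void frob_fn(struct work_struct *work)
{
	/* asynchronous work goes here */
}

static DECLARE_WORK(frob_work, frob_fn);
static struct workqueue_struct *frob_wq;

static int frob_demo(void)
{
	/* @name, @flags, @max_active as described in section 3 */
	frob_wq = alloc_workqueue("frob", 0, 0);
	if (!frob_wq)
		return -ENOMEM;

	queue_work(frob_wq, &frob_work);	/* queue on the local CPU */
	flush_workqueue(frob_wq);		/* wait for queued items to finish */

	cancel_work_sync(&frob_work);		/* or cancel one item, waiting if running */
	destroy_workqueue(frob_wq);		/* release the wq */
	return 0;
}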

[snip]

> +5. Guidelines
> +
> +* Do not forget to use WQ_RESCUER if a wq may process work items which
> +  are used during memory reclamation.  Each wq with WQ_RESCUER set has

hmm.. it's not "reclamation". But I can't say the correct term either. 

I'd say:
".. are used during memory reclaim."

> +  one rescuer thread reserved for it.  If there is dependency among
> +  multiple work items used during memory reclamation, they should be

"during memory reclaim" 

> +  queued to separate wq each with WQ_RESCUER.
> +
> +* Unless strict ordering is required, there is no need to use ST wq.
> +
> +* Unless there is a specific need, using 0 for @max_active is
> +  recommended.  In most use cases, concurrency level usually stays
> +  well under the default limit.
> +
> +* A wq serves as a domain for forward progress guarantee (WQ_RESCUER),
> +  flush and work item attributes.  Work items which are not involved
> +  in memory reclamation and don't need to be flushed as a part of a

see above (-> memory reclaim)

> +  group of work items, and don't require any special attribute, can
> +  use one of the system wq.  There is no difference in execution
> +  characteristics between using a dedicated wq and a system wq.
> +
> +* Unless work items are expected to consume huge amount of CPU cycles,
> +  using bound wq is usually beneficial due to increased level of
> +  locality in wq operations and work item execution.

"Unless work items are expected to consume a huge amount of CPU
cycles, using a bound wq is usually beneficial due to the increased
level of locality in wq operations and work item execution. "

Btw, it is not clear to me, what you mean with "wq operations". 
Do the enqueuing API functions automatically determine the cpu they are
executed on and queue the workitem to the corresponding gcwq? Or do you
need to explicitly queue to a specific CPU?

Either you mean the operations that lead to the enqueueing of the
work-item, or you mean the operations done by the thread-pool?


... after thinking a bit, the wq implementation should obviously do the
automatic enqueuing on the nearest gcwq thingy... But that should
probably be mentioned in the API description. 
Although I have to admit I only skimmed over the flag description
above it seems you only mention the UNBOUND case and not the default
one?


Cheers,
Flo






* Re: [PATCH UPDATED] workqueue: add documentation
  2010-09-09  8:02   ` Florian Mickler
@ 2010-09-09 10:22     ` Tejun Heo
  2010-09-09 18:50       ` Florian Mickler
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2010-09-09 10:22 UTC (permalink / raw)
  To: Florian Mickler; +Cc: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner

Hello,

On 09/09/2010 10:02 AM, Florian Mickler wrote:
> Perfect timing. Just enough for the details to get a little foggy, 
> while still knowing a little bit what you want to talk about. 
> :-)

:-)

Added Introduction and updated Why cmwq section as suggested.

>> +2. The Design
> 
> Now it gets a little bit rougher:
> 
>> +
>> +There's a single global cwq (gcwq) for each possible CPU and a pseudo
>> +CPU for unbound wq.  A gcwq manages and serves out all the execution
>> +contexts on the associated CPU.  cpu_workqueue's (cwq) of each wq are
>> +mostly simple frontends to the associated gcwq.  When a work item is
>> +queued, it's queued to the unified worklist of the target gcwq.  Each
>> +gcwq maintains pool of workers used to process the worklist.
> 
> Hm. That hurt my brain a little. :) 

Yeap, that's a lot of overly compressed information there.

> What about smth along the lines:
> 
> In order to ease the asynchronous execution of functions a new
> abstraction, the work item, is introduced.
> 
> A work item is a simple struct that holds a pointer to the
> function that is to be executed asynchronously. Whenever a driver or
> subsystem wants a function to be executed asynchronously it has to set
> up a work item pointing to that function and queue that work item on a
> workqueue.
> 
> Special purpose threads, called worker threads,  execute the functions
> off of the queue, one after the other. If no work is queued, the worker
> threads become idle.
> 
> These worker threads are managed in so called thread-pools.
> 
> The cmwq design differentiates between the user-facing workqueues that
> subsystems and drivers queue work items on and what queues the 
> thread-pools actually work on.
> 
> There are worker-thread-pools for each possible CPU and one
> worker-thread-pool whose threads are not bound to any specific CPU. Each
> worker-thread-pool has its own queue (called gcwq) from which it
> executes work-items.  
> 
> When a driver or subsystem creates a workqueue it is
> automatically associated with one of the gcwq's. For CPU-bound
> workqueues they are associated to that specific CPU's gcwq. For
> unbound workqueues, they are queued to the gcwq of the global
> thread-pool. 
> 
> [Btw, I realized, now that I read the guidelines below, that this last
> paragraph is probably incorrect? Is there an association or does the
> enqueue-API automatically determine the CPU it needs to queue the work
> item on?]

Bound workqueues are per-cpu and by default work items will be queued
and processed on the same cpu as the issuer.  Unbound ones are
system-wide.  How about something like the following?


In order to ease the asynchronous execution of functions a new
abstraction, the work item, is introduced.

A work item is a simple struct that holds a pointer to the function
that is to be executed asynchronously.  Whenever a driver or subsystem
wants a function to be executed asynchronously it has to set up a work
item pointing to that function and queue that work item on a
workqueue.

Special purpose threads, called worker threads, execute the functions
off of the queue, one after the other.  If no work is queued, the
worker threads become idle.  These worker threads are managed in so
called thread-pools.

Subsystems and drivers can create and queue work items on workqueues
as they see fit.

By default, workqueues are per-cpu.  Work items are queued and
executed on the same CPU as the issuer.  These workqueues and work
items are said to be "bound".  A workqueue can be specifically
configured to be "unbound" in which case work items queued on the
workqueue are executed by worker threads not bound to any specific
CPU.

The cmwq design differentiates between the user-facing workqueues that
subsystems and drivers queue work items on and the backend mechanism
which manages thread-pool and processes the queued work items.

The backend mechanism is called Global CPU Workqueue (gcwq).  There is
one gcwq for each possible CPU and one gcwq to serve work items queued
on unbound workqueues.

When a work item is queued to a workqueue, the target gcwq is
determined according to the queue parameters and workqueue attributes
and queued on the shared worklist of the gcwq.  For example, unless
specifically overridden, a work item of a bound workqueue will be
queued on the worklist of the gcwq of the CPU the issuer is running
on.
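
To make that concrete, a bound and an unbound workqueue would be set
up and used roughly as below (a hedged sketch; the names are invented
and not part of the patch):

#include <linux/errno.h>
#include <linux/workqueue.h>

static void bg_fn(struct work_struct *work)
{
	/* executed in process context by a worker thread */
}

static DECLARE_WORK(bg_work, bg_fn);

static int bg_setup(void)
{
	struct workqueue_struct *bound_wq, *unbound_wq;

	/* bound (default): work items run on the issuing CPU's gcwq */
	bound_wq = alloc_workqueue("bg_bound", 0, 0);

	/* unbound: work items go to the gcwq served by workers which
	 * are not bound to any CPU; no concurrency management */
	unbound_wq = alloc_workqueue("bg_unbound", WQ_UNBOUND, 0);

	/* error unwinding omitted for brevity */
	if (!bound_wq || !unbound_wq)
		return -ENOMEM;

	/* queued on the worklist of the gcwq of the CPU we run on */
	queue_work(bound_wq, &bg_work);
	return 0;
}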

>> +All wq which might be used in
>> +memory reclamation path are required to have a rescuer reserved for
>> +execution of the wq under memory pressure so that memory reclamation
>> +for worker creation doesn't deadlock waiting for execution contexts to
>> +free up.
> 
> All work items which might be used on code paths that handle memory 
> reclaim are required to be queued on wq's that have a rescue-worker 
> reserved for execution under memory pressure. Else it is possible that 
> the thread-pool deadlocks waiting for execution contexts to free up.

Updated as suggested.

>> +
>> +
>> +3. Workqueue Attributes
>> +
> 
> 3. Application Programming Interface (API)
> 
>> +alloc_workqueue() allocates a wq.  The original create_*workqueue()
>> +functions are deprecated and scheduled for removal.  alloc_workqueue()
>> +takes three arguments - @name, @flags and @max_active.  @name is the
>> +name of the wq and also used as the name of the rescuer thread if
>> +there is one.
>> +
>> +A wq no longer manages execution resources but serves as a domain for
>> +forward progress guarantee, flush and work item attributes.  @flags
>> +and @max_active control how work items are assigned execution
>> +resources, scheduled and executed.
> [snip]
> 
> I think it is worth mentioning all functions that are considered to be
> part of the API here. 

Yeah, that would be nice but a slightly larger task that I would like
to postpone at this point.  :-)

> "Unless work items are expected to consume a huge amount of CPU
> cycles, using a bound wq is usually beneficial due to the increased
> level of locality in wq operations and work item execution. "

So updated.

> Btw, it is not clear to me, what you mean with "wq operations". 

Queueing, dispatching and other book keeping operations.

> Do the enqueuing API functions automatically determine the cpu they are
> executed on and queue the workitem to the corresponding gcwq? Or do you
> need to explicitly queue to a specific CPU?
> 
> Either you mean the operations that lead to the enqueueing of the
> work-item, or you mean the operations done by the thread-pool?
> 
> ... after thinking a bit, the wq implementation should obviously do the
> automatic enqueuing on the nearest gcwq thingy... But that should
> probably be mentioned in the API description. 
> Although I have to admit I only skimmed over the flag description
> above it seems you only mention the UNBOUND case and not the default
> one?

Yeah, queue_work() queues works on the gcwq of the local CPU.  It can
be overridden by queue_work_on().  The unbound is special case where
the workqueue always sends works to the unbound gcwq which is served
by unbound workers.  Did the update in the design section explain
enough or do you think there needs to be more explanation?
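
A small illustration of the default and the override (assuming a wq
and work item like the ones sketched earlier in the thread; this is
not from the patch, and a given work item can only be pending once,
so the two calls below are alternatives):

#include <linux/workqueue.h>

static void queue_it(struct workqueue_struct *wq, struct work_struct *work)
{
	/* default: queued on the gcwq of the CPU we are running on */
	queue_work(wq, work);

	/* explicit override: queued on CPU 2's gcwq regardless of issuer */
	queue_work_on(2, wq, work);
}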

Thanks.

-- 
tejun


* Re: [PATCH UPDATED] workqueue: add documentation
  2010-09-09 10:22     ` Tejun Heo
@ 2010-09-09 18:50       ` Florian Mickler
  2010-09-10 10:25         ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Florian Mickler @ 2010-09-09 18:50 UTC (permalink / raw)
  To: Tejun Heo; +Cc: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner

On Thu, 09 Sep 2010 12:22:22 +0200
Tejun Heo <tj@kernel.org> wrote:

> The backend mechanism is called Global CPU Workqueue (gcwq).  There is

I tried to avoid that name. It somehow is confusing to me . Global/Local
in context of CPU is somehow associated with CPU locality in my mind.
Also the name doesn't fit for the unbound gcwq. 
I know what you mean by it, but I don't think it's a self explanatory
name. That was why I just said "they are called gcwq". But I'm ok with
it either way. After all, that _is_ how they are called. :) 


> > 
> > I think it is worth mentioning all functions that are considered to be
> > part of the API here. 
> 
> Yeah, that would be nice but a slightly larger task that I would like
> to postpone at this point.  :-)

Ah well, I may just give it a go then... 

> 
> > "Unless work items are expected to consume a huge amount of CPU
> > cycles, using a bound wq is usually beneficial due to the increased
> > level of locality in wq operations and work item execution. "
> 
> So updated.
> 
> > Btw, it is not clear to me, what you mean with "wq operations". 
> 
> Queueing, dispatching and other book keeping operations.

Yes. That makes sense. 

> 
> > Do the enqueuing API functions automatically determine the cpu they are
> > executed on and queue the workitem to the corresponding gcwq? Or do you
> > need to explicitly queue to a specific CPU?
> > 
> > Either you mean the operations that lead to the enqueueing of the
> > work-item, or you mean the operations done by the thread-pool?
> > 
> > ... after thinking a bit, the wq implementation should obviously do the
> > automatic enqueuing on the nearest gcwq thingy... But that should
> > probably be mentioned in the API description. 
> > Although I have to admit I only skimmed over the flag description
> > above it seems you only mention the UNBOUND case and not the default
> > one?
> 
> Yeah, queue_work() queues works on the gcwq of the local CPU.  It can
> be overridden by queue_work_on().  The unbound is special case where
> the workqueue always sends works to the unbound gcwq which is served
> by unbound workers.  Did the update in the design section explain
> enough or do you think there needs to be more explanation?

I'm looking forward to reading the new version en
bloc, but if I can trust my gut feeling, I'm ok with it now. :)

Let's see if someone else with more kernel-experience has something to
add, but here you've got my

Reviewed-By: Florian Mickler <florian@mickler.org>

in any case.

Cheers,
Flo


* Re: [PATCH UPDATED] workqueue: add documentation
  2010-09-09 18:50       ` Florian Mickler
@ 2010-09-10 10:25         ` Tejun Heo
  2010-09-10 14:26           ` Florian Mickler
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2010-09-10 10:25 UTC (permalink / raw)
  To: Florian Mickler; +Cc: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner

Hello,

On 09/09/2010 08:50 PM, Florian Mickler wrote:
>> The backend mechanism is called Global CPU Workqueue (gcwq).  There is
> 
> I tried to avoid that name. It somehow is confusing to me . Global/Local
> in context of CPU is somehow associated with CPU locality in my mind.
> Also the name doesn't fit for the unbound gcwq.

Hmm... yeah, it makes sense from the implementation POV as they're
global to a CPU and the unbound gcwq is bound to the pseudo unbound
CPU.  I dropped the expanded version and just used gcwq as you
suggested.

>> Yeah, that would be nice but a slightly larger task that I would like
>> to postpone at this point.  :-)
> 
> Ah well, I may just give it a go then... 

That would be great.

>> Yeah, queue_work() queues works on the gcwq of the local CPU.  It can
>> be overridden by queue_work_on().  The unbound is special case where
>> the workqueue always sends works to the unbound gcwq which is served
>> by unbound workers.  Did the update in the design section explain
>> enough or do you think there needs to be more explanation?
> 
> I'm looking forward to reading the new version en
> bloc, but if I can trust my gut feeling, I'm ok with it now. :)
> 
> Let's see if someone else with more kernel-experience has something to
> add, but here you've got my
> 
> Reviewed-By: Florian Mickler <florian@mickler.org>

Here's the current version.  If it looks good to you, I'll push it
upstream.

Thanks.

Subject: workqueue: add documentation

Update copyright notice and add Documentation/workqueue.txt.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-By: Florian Mickler <florian@mickler.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
---
 Documentation/workqueue.txt |  380 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/workqueue.h   |    4
 kernel/workqueue.c          |   27 +--
 3 files changed, 401 insertions(+), 10 deletions(-)

Index: work/kernel/workqueue.c
===================================================================
--- work.orig/kernel/workqueue.c
+++ work/kernel/workqueue.c
@@ -1,19 +1,26 @@
 /*
- * linux/kernel/workqueue.c
+ * kernel/workqueue.c - generic async execution with shared worker pool
  *
- * Generic mechanism for defining kernel helper threads for running
- * arbitrary tasks in process context.
+ * Copyright (C) 2002		Ingo Molnar
  *
- * Started by Ingo Molnar, Copyright (C) 2002
+ *   Derived from the taskqueue/keventd code by:
+ *     David Woodhouse <dwmw2@infradead.org>
+ *     Andrew Morton
+ *     Kai Petzke <wpp@marie.physik.tu-berlin.de>
+ *     Theodore Ts'o <tytso@mit.edu>
  *
- * Derived from the taskqueue/keventd code by:
+ * Made to use alloc_percpu by Christoph Lameter.
  *
- *   David Woodhouse <dwmw2@infradead.org>
- *   Andrew Morton
- *   Kai Petzke <wpp@marie.physik.tu-berlin.de>
- *   Theodore Ts'o <tytso@mit.edu>
+ * Copyright (C) 2010		SUSE Linux Products GmbH
+ * Copyright (C) 2010		Tejun Heo <tj@kernel.org>
  *
- * Made to use alloc_percpu by Christoph Lameter.
+ * This is the generic async execution mechanism.  Work items are
+ * executed in process context.  The worker pool is shared and
+ * automatically managed.  There is one worker pool for each CPU and
+ * one extra for works which are better served by workers which are
+ * not bound to any specific CPU.
+ *
+ * Please read Documentation/workqueue.txt for details.
  */

 #include <linux/module.h>
Index: work/include/linux/workqueue.h
===================================================================
--- work.orig/include/linux/workqueue.h
+++ work/include/linux/workqueue.h
@@ -235,6 +235,10 @@ static inline unsigned int work_static(s
 #define work_clear_pending(work) \
 	clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))

+/*
+ * Workqueue flags and constants.  For details, please refer to
+ * Documentation/workqueue.txt.
+ */
 enum {
 	WQ_NON_REENTRANT	= 1 << 0, /* guarantee non-reentrance */
 	WQ_UNBOUND		= 1 << 1, /* not bound to any cpu */
Index: work/Documentation/workqueue.txt
===================================================================
--- /dev/null
+++ work/Documentation/workqueue.txt
@@ -0,0 +1,380 @@
+
+Concurrency Managed Workqueue (cmwq)
+
+September, 2010		Tejun Heo <tj@kernel.org>
+			Florian Mickler <florian@mickler.org>
+
+CONTENTS
+
+1. Introduction
+2. Why cmwq?
+3. The Design
+4. Application Programming Interface (API)
+5. Example Execution Scenarios
+6. Guidelines
+
+
+1. Introduction
+
+There are many cases where an asynchronous process execution context
+is needed and the workqueue (wq) API is the most commonly used
+mechanism for such cases.
+
+When such an asynchronous execution context is needed, a work item
+describing which function to execute is put on a queue.  An
+independent thread serves as the asynchronous execution context.  The
+queue is called a workqueue and the thread is called a worker.
+
+While there are work items on the workqueue the worker executes the
+functions associated with the work items one after the other.  When
+there is no work item left on the workqueue the worker becomes idle.
+When a new work item gets queued, the worker begins executing again.
+
+
+2. Why cmwq?
+
+In the original wq implementation, a multi threaded (MT) wq had one
+worker thread per CPU and a single threaded (ST) wq had one worker
+thread system-wide.  A single MT wq needed to keep around the same
+number of workers as the number of CPUs.  The kernel grew a lot of MT
+wq users over the years and with the number of CPU cores continuously
+rising, some systems saturated the default 32k PID space just booting
+up.
+
+Although MT wq wasted a lot of resource, the level of concurrency
+provided was unsatisfactory.  The limitation was common to both ST and
+MT wq albeit less severe on MT.  Each wq maintained its own separate
+worker pool.  A MT wq could provide only one execution context per CPU
+while a ST wq one for the whole system.  Work items had to compete for
+those very limited execution contexts leading to various problems
+including proneness to deadlocks around the single execution context.
+
+The tension between the provided level of concurrency and resource
+usage also forced its users to make unnecessary tradeoffs like libata
+choosing to use ST wq for polling PIOs and accepting an unnecessary
+limitation that no two polling PIOs can progress at the same time.  As
+MT wq don't provide much better concurrency, users which require
+higher level of concurrency, like async or fscache, had to implement
+their own thread pool.
+
+Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with
+focus on the following goals.
+
+* Maintain compatibility with the original workqueue API.
+
+* Use per-CPU unified worker pools shared by all wq to provide
+  flexible level of concurrency on demand without wasting a lot of
+  resource.
+
+* Automatically regulate worker pool and level of concurrency so that
+  the API users don't need to worry about such details.
+
+
+3. The Design
+
+In order to ease the asynchronous execution of functions a new
+abstraction, the work item, is introduced.
+
+A work item is a simple struct that holds a pointer to the function
+that is to be executed asynchronously.  Whenever a driver or subsystem
+wants a function to be executed asynchronously it has to set up a work
+item pointing to that function and queue that work item on a
+workqueue.
+
+Special purpose threads, called worker threads, execute the functions
+off of the queue, one after the other.  If no work is queued, the
+worker threads become idle.  These worker threads are managed in so
+called thread-pools.
+
+Subsystems and drivers can create and queue work items on workqueues
+as they see fit.
+
+By default, workqueues are per-cpu.  Work items are queued and
+executed on the same CPU as the issuer.  These workqueues and work
+items are said to be "bound".  A workqueue can be specifically
+configured to be "unbound" in which case work items queued on the
+workqueue are executed by worker threads not bound to any specific
+CPU.
+
+The cmwq design differentiates between the user-facing workqueues that
+subsystems and drivers queue work items on and the backend mechanism
+which manages thread-pool and processes the queued work items.
+
+The backend mechanism is called gcwq.  There is one gcwq for each
+possible CPU and one gcwq to serve work items queued on unbound
+workqueues.
+
+When a work item is queued to a workqueue, the target gcwq is
+determined according to the queue parameters and workqueue attributes
+and queued on the shared worklist of the gcwq.  For example, unless
+specifically overridden, a work item of a bound workqueue will be
+queued on the worklist of the gcwq of the CPU the issuer is running
+on.
+
+For any worker pool implementation, managing the concurrency level (how
+many execution contexts are active) is an important issue.  cmwq tries
+to keep the concurrency at minimal but sufficient level.
+
+Each gcwq bound to an actual CPU implements concurrency management by
+hooking into the scheduler.  The gcwq is notified whenever an active
+worker wakes up or sleeps and keeps track of the number of the
+currently runnable workers.  Generally, work items are not expected to
+hog CPU cycle and maintaining just enough concurrency to prevent work
+processing from stalling should be optimal.  As long as there is one
+or more runnable workers on the CPU, the gcwq doesn't start execution
+of a new work, but, when the last running worker goes to sleep, it
+immediately schedules a new worker so that the CPU doesn't sit idle
+while there are pending work items.  This allows using minimal number
+of workers without losing execution bandwidth.
+
+Keeping idle workers around doesn't cost other than the memory space
+for kthreads, so cmwq holds onto idle ones for a while before killing
+them.
+
+For an unbound wq, the above concurrency management doesn't apply and
+the gcwq for the pseudo unbound CPU tries to start executing all work
+items as soon as possible.  The responsibility of regulating
+concurrency level is on the users.  There is also a flag to mark a
+bound wq to ignore the concurrency management.  Please refer to the
+Workqueue Attributes section for details.
+
+The forward progress guarantee relies on workers being created when
+more execution contexts are necessary, which in turn is guaranteed
+through the use of rescue workers.  All work items which might be used
+on code paths that handle memory reclaim are required to be queued on
+wq's that have a rescue-worker reserved for execution under memory
+pressure.  Otherwise it is possible that the thread-pool deadlocks
+waiting for execution contexts to free up.
+
+
+4. Application Programming Interface (API)
+
+alloc_workqueue() allocates a wq.  The original create_*workqueue()
+functions are deprecated and scheduled for removal.  alloc_workqueue()
+takes three arguments - @name, @flags and @max_active.  @name is the
+name of the wq and also used as the name of the rescuer thread if
+there is one.
+
+A wq no longer manages execution resources but serves as a domain for
+forward progress guarantee, flush and work item attributes.  @flags
+and @max_active control how work items are assigned execution
+resources, scheduled and executed.
+
+@flags:
+
+  WQ_NON_REENTRANT
+
+	By default, a wq guarantees non-reentrance only on the same
+	CPU.  A work may not be executed concurrently on the same CPU
+	by multiple workers but is allowed to be executed concurrently
+	on multiple CPUs.  This flag makes sure non-reentrance is
+	enforced across all CPUs.  Work items queued to a
+	non-reentrant wq are guaranteed to be executed by at most one
+	worker system-wide at any given time.
+
+  WQ_UNBOUND
+
+	Work items queued to an unbound wq are served by a special
+	gcwq which hosts workers which are not bound to any specific
+	CPU.  This makes the wq behave as a simple execution context
+	provider without concurrency management.  The unbound gcwq
+	tries to start execution of work items as soon as possible.
+	Unbound wq sacrifices locality but is useful for the following
+	cases.
+
+	* Wide fluctuation in the concurrency level requirement is
+	  expected and using bound wq may end up creating large number
+	  of mostly unused workers across different CPUs as the issuer
+	  hops through different CPUs.
+
+	* Long running CPU intensive workloads which can be better
+	  managed by the system scheduler.
+
+  WQ_FREEZEABLE
+
+	A freezeable wq participates in the freeze phase of the system
+	suspend operations.  Work items on the wq are drained and no
+	new work item starts execution until thawed.
+
+  WQ_RESCUER
+
+	All wq which might be used in the memory reclaim paths _MUST_
+	have this flag set.  This reserves one worker exclusively for
+	the execution of this wq under memory pressure.
+
+  WQ_HIGHPRI
+
+	Work items of a highpri wq are queued at the head of the
+	worklist of the target gcwq and start execution regardless of
+	the current concurrency level.  In other words, highpri work
+	items will always start execution as soon as execution
+	resource is available.
+
+	Ordering among highpri work items is preserved - a highpri
+	work item queued after another highpri work item will start
+	execution after the earlier highpri work item starts.
+
+	Although highpri work items are not held back by other
+	runnable work items, they still contribute to the concurrency
+	level.  Highpri work items in runnable state will prevent
+	non-highpri work items from starting execution.
+
+	This flag is meaningless for unbound wq.
+
+  WQ_CPU_INTENSIVE
+
+	Work items of a CPU intensive wq do not contribute to the
+	concurrency level.  In other words, Runnable CPU intensive
+	work items will not prevent other work items from starting
+	execution.  This is useful for bound work items which are
+	expected to hog CPU cycles so that their execution is
+	regulated by the system scheduler.
+
+	Although CPU intensive work items don't contribute to the
+	concurrency level, start of their executions is still
+	regulated by the concurrency management and runnable
+	non-CPU-intensive work items can delay execution of CPU
+	intensive work items.
+
+	This flag is meaningless for unbound wq.
+
+  WQ_HIGHPRI | WQ_CPU_INTENSIVE
+
+	This combination makes the wq avoid interaction with
+	concurrency management completely and behave as a simple
+	per-CPU execution context provider.  Work items queued on a
+	highpri CPU-intensive wq start execution as soon as resources
+	are available and don't affect execution of other work items.
+
+@max_active:
+
+@max_active determines the maximum number of execution contexts per
+CPU which can be assigned to the work items of a wq.  For example,
+with @max_active of 16, at most 16 work items of the wq can be
+executing at the same time per CPU.
+
+Currently, for a bound wq, the maximum limit for @max_active is 512
+and the default value used when 0 is specified is 256.  For an unbound
+wq, the limit is the higher of 512 and 4 * num_possible_cpus().  These
+values are chosen sufficiently high such that they are not the
+limiting factor while providing protection in runaway cases.
+
+The number of active work items of a wq is usually regulated by the
+users of the wq, more specifically, by how many work items the users
+may queue at the same time.  Unless there is a specific need for
+throttling the number of active work items, specifying '0' is
+recommended.
+
+Some users depend on the strict execution ordering of ST wq.  The
+combination of @max_active of 1 and WQ_UNBOUND is used to achieve this
+behavior.  Work items on such wq are always queued to the unbound gcwq
+and only one work item can be active at any given time thus achieving
+the same ordering property as ST wq.
+
+
+5. Example Execution Scenarios
+
+The following example execution scenarios try to illustrate how cmwq
+behaves under different configurations.
+
+ Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU.
+ w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms
+ again before finishing.  w1 and w2 burn CPU for 5ms then sleep for
+ 10ms.
+
+Ignoring all other tasks, works and processing overhead, and assuming
+simple FIFO scheduling, the following is one highly simplified version
+of possible sequences of events with the original wq.
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 starts and burns CPU
+ 25		w1 sleeps
+ 35		w1 wakes up and finishes
+ 35		w2 starts and burns CPU
+ 40		w2 sleeps
+ 50		w2 wakes up and finishes
+
+And with cmwq with @max_active >= 3,
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 5		w1 starts and burns CPU
+ 10		w1 sleeps
+ 10		w2 starts and burns CPU
+ 15		w2 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 wakes up and finishes
+ 25		w2 wakes up and finishes
+
+If @max_active == 2,
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 5		w1 starts and burns CPU
+ 10		w1 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 wakes up and finishes
+ 20		w2 starts and burns CPU
+ 25		w2 sleeps
+ 35		w2 wakes up and finishes
+
+Now, let's assume w1 and w2 are queued to a different wq q1 which has
+WQ_HIGHPRI set,
+
+ TIME IN MSECS	EVENT
+ 0		w1 and w2 start and burn CPU
+ 5		w1 sleeps
+ 10		w2 sleeps
+ 10		w0 starts and burns CPU
+ 15		w0 sleeps
+ 15		w1 wakes up and finishes
+ 20		w2 wakes up and finishes
+ 25		w0 wakes up and burns CPU
+ 30		w0 finishes
+
+If q1 has WQ_CPU_INTENSIVE set,
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 5		w1 and w2 start and burn CPU
+ 10		w1 sleeps
+ 15		w2 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 wakes up and finishes
+ 25		w2 wakes up and finishes
+
+
+6. Guidelines
+
+* Do not forget to use WQ_RESCUER if a wq may process work items which
+  are used during memory reclaim.  Each wq with WQ_RESCUER set has one
+  rescuer thread reserved for it.  If there is dependency among
+  multiple work items used during memory reclaim, they should be
+  queued to separate wq each with WQ_RESCUER.
+
+* Unless strict ordering is required, there is no need to use ST wq.
+
+* Unless there is a specific need, using 0 for @nr_active is
+  recommended.  In most use cases, concurrency level usually stays
+  well under the default limit.
+
+* A wq serves as a domain for forward progress guarantee (WQ_RESCUER),
+  flush and work item attributes.  Work items which are not involved
+  in memory reclaim and don't need to be flushed as a part of a group
+  of work items, and don't require any special attribute, can use one
+  of the system wq.  There is no difference in execution
+  characteristics between using a dedicated wq and a system wq.
+
+* Unless work items are expected to consume a huge amount of CPU
+  cycles, using a bound wq is usually beneficial due to the increased
+  level of locality in wq operations and work item execution.
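
As a rough illustration of the API documented above, a caller might use
it along the following lines.  This sketch is not part of the patch;
all my_* names are invented and the chosen flags are just an assumed
example.

#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/workqueue.h>

struct my_dev {
	struct work_struct io_done_work;
	int status;
};

static struct workqueue_struct *my_wq;

/* executed in process context on one of the shared workers */
static void my_io_done(struct work_struct *work)
{
	struct my_dev *dev = container_of(work, struct my_dev, io_done_work);

	pr_info("io finished, status %d\n", dev->status);
}

static int my_init(struct my_dev *dev)
{
	/* @name, @flags, @max_active; 0 selects the default @max_active */
	my_wq = alloc_workqueue("my_wq", 0, 0);
	if (!my_wq)
		return -ENOMEM;

	INIT_WORK(&dev->io_done_work, my_io_done);
	return 0;
}

/* e.g. called from irq context to defer the completion work */
static void my_io_complete(struct my_dev *dev, int status)
{
	dev->status = status;
	queue_work(my_wq, &dev->io_done_work);
}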

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH UPDATED] workqueue: add documentation
  2010-09-10 10:25         ` Tejun Heo
@ 2010-09-10 14:26           ` Florian Mickler
  2010-09-10 14:55             ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Florian Mickler @ 2010-09-10 14:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner

On Fri, 10 Sep 2010 12:25:55 +0200
Tejun Heo <tj@kernel.org> wrote:

> +Concurrency Managed Workqueue (cmwq)
> +
> +September, 2010		Tejun Heo <tj@kernel.org>
> +			Florian Mickler <florian@mickler.org>
> +
> +CONTENTS

Thx.


I fumbled a bit with the ordering in the design
description.. ok so?

Cheers,
Flo

diff --git a/Documentation/workqueue.txt b/Documentation/workqueue.txt
index 5317229..3d22821 100644
--- a/Documentation/workqueue.txt
+++ b/Documentation/workqueue.txt
@@ -86,45 +86,44 @@ off of the queue, one after the other.  If no work is queued, the
 worker threads become idle.  These worker threads are managed in so
 called thread-pools.
 
-Subsystems and drivers can create and queue work items on workqueues
-as they see fit.
-
-By default, workqueues are per-cpu.  Work items are queued and
-executed on the same CPU as the issuer.  These workqueues and work
-items are said to be "bound".  A workqueue can be specifically
-configured to be "unbound" in which case work items queued on the
-workqueue are executed by worker threads not bound to any specific
-CPU.
-
 The cmwq design differentiates between the user-facing workqueues that
 subsystems and drivers queue work items on and the backend mechanism
 which manages thread-pool and processes the queued work items.
 
-The backend mechanism is called gcwq.  There is one gcwq for each
+The backend is called gcwq.  There is one gcwq for each
 possible CPU and one gcwq to serve work items queued on unbound
 workqueues.
 
+Subsystems and drivers can create and queue work items through special
+workqueue API functions as they see fit. They can influence some
+aspects of the way the work items are executed by setting flags on the
+workqueue they are putting the work item on. These flags include
+things like cpu locality, reentrancy, concurrency limits and more. To
+get a detailed overview refer to the API description of
+alloc_workqueue() below. 
+
 When a work item is queued to a workqueue, the target gcwq is
 determined according to the queue parameters and workqueue attributes
-and queued on the shared worklist of the gcwq.  For example, unless
+and appended to the shared worklist of that gcwq.  For example, unless
 specifically overridden, a work item of a bound workqueue will be
-queued on the worklist of the gcwq of the CPU the issuer is running
-on.
+queued on the worklist of exactly that gcwq that is associated to the 
+CPU the issuer is running on.
 
 For any worker pool implementation, managing the concurrency level (how
 many execution contexts are active) is an important issue.  cmwq tries
-to keep the concurrency at minimal but sufficient level.
+to keep the concurrency at a minimal but sufficient level. Minimal to save
+resources and sufficient in that the system is used at its full capacity.
 
 Each gcwq bound to an actual CPU implements concurrency management by
 hooking into the scheduler.  The gcwq is notified whenever an active
 worker wakes up or sleeps and keeps track of the number of the
 currently runnable workers.  Generally, work items are not expected to
-hog CPU cycle and maintaining just enough concurrency to prevent work
-processing from stalling should be optimal.  As long as there is one
-or more runnable workers on the CPU, the gcwq doesn't start execution
-of a new work, but, when the last running worker goes to sleep, it
-immediately schedules a new worker so that the CPU doesn't sit idle
-while there are pending work items.  This allows using minimal number
+hog a CPU and consume many cycles. That means maintaining just enough
+concurrency to prevent work processing from stalling should be optimal.
+As long as there is one or more runnable workers on the CPU, the gcwq
+doesn't start execution of a new work, but, when the last running worker goes
+to sleep, it immediately schedules a new worker so that the CPU doesn't sit
+idle while there are pending work items.  This allows using a minimal number
+of workers without losing execution bandwidth.
 
 Keeping idle workers around doesn't cost other than the memory space


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH UPDATED] workqueue: add documentation
  2010-09-10 14:26           ` Florian Mickler
@ 2010-09-10 14:55             ` Tejun Heo
  2010-09-10 17:43               ` Randy Dunlap
  2010-09-13  0:51               ` Dave Chinner
  0 siblings, 2 replies; 14+ messages in thread
From: Tejun Heo @ 2010-09-10 14:55 UTC (permalink / raw)
  To: Florian Mickler; +Cc: lkml, Ingo Molnar, Christoph Lameter, Dave Chinner

From e9818ca0cd087229a3665a9ec186ee3da4a046bb Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 10 Sep 2010 16:51:36 +0200

Update copyright notice and add Documentation/workqueue.txt.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-By: Florian Mickler <florian@mickler.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
---
Applied to wq#for-linus branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-linus

Thanks a lot for helping with the documentation.  Much appreciated.

 Documentation/workqueue.txt |  380 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/workqueue.h   |    4 +
 kernel/workqueue.c          |   27 ++-
 3 files changed, 401 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/workqueue.txt

diff --git a/Documentation/workqueue.txt b/Documentation/workqueue.txt
new file mode 100644
index 0000000..6d1bcd3
--- /dev/null
+++ b/Documentation/workqueue.txt
@@ -0,0 +1,380 @@
+
+Concurrency Managed Workqueue (cmwq)
+
+September, 2010		Tejun Heo <tj@kernel.org>
+			Florian Mickler <florian@mickler.org>
+
+CONTENTS
+
+1. Introduction
+2. Why cmwq?
+3. The Design
+4. Application Programming Interface (API)
+5. Example Execution Scenarios
+6. Guidelines
+
+
+1. Introduction
+
+There are many cases where an asynchronous process execution context
+is needed and the workqueue (wq) API is the most commonly used
+mechanism for such cases.
+
+When such an asynchronous execution context is needed, a work item
+describing which function to execute is put on a queue.  An
+independent thread serves as the asynchronous execution context.  The
+queue is called a workqueue and the thread is called a worker.
+
+While there are work items on the workqueue the worker executes the
+functions associated with the work items one after the other.  When
+there is no work item left on the workqueue the worker becomes idle.
+When a new work item gets queued, the worker begins executing again.
+
+
+2. Why cmwq?
+
+In the original wq implementation, a multi threaded (MT) wq had one
+worker thread per CPU and a single threaded (ST) wq had one worker
+thread system-wide.  A single MT wq needed to keep around the same
+number of workers as the number of CPUs.  The kernel grew a lot of MT
+wq users over the years and with the number of CPU cores continuously
+rising, some systems saturated the default 32k PID space just booting
+up.
+
+Although MT wq wasted a lot of resource, the level of concurrency
+provided was unsatisfactory.  The limitation was common to both ST and
+MT wq albeit less severe on MT.  Each wq maintained its own separate
+worker pool.  A MT wq could provide only one execution context per CPU
+while a ST wq one for the whole system.  Work items had to compete for
+those very limited execution contexts leading to various problems
+including proneness to deadlocks around the single execution context.
+
+The tension between the provided level of concurrency and resource
+usage also forced its users to make unnecessary tradeoffs like libata
+choosing to use ST wq for polling PIOs and accepting an unnecessary
+limitation that no two polling PIOs can progress at the same time.  As
+MT wq don't provide much better concurrency, users which require
+higher level of concurrency, like async or fscache, had to implement
+their own thread pool.
+
+Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with
+focus on the following goals.
+
+* Maintain compatibility with the original workqueue API.
+
+* Use per-CPU unified worker pools shared by all wq to provide
+  flexible level of concurrency on demand without wasting a lot of
+  resource.
+
+* Automatically regulate worker pool and level of concurrency so that
+  the API users don't need to worry about such details.
+
+
+3. The Design
+
+In order to ease the asynchronous execution of functions a new
+abstraction, the work item, is introduced.
+
+A work item is a simple struct that holds a pointer to the function
+that is to be executed asynchronously.  Whenever a driver or subsystem
+wants a function to be executed asynchronously it has to set up a work
+item pointing to that function and queue that work item on a
+workqueue.
+
+Special purpose threads, called worker threads, execute the functions
+off of the queue, one after the other.  If no work is queued, the
+worker threads become idle.  These worker threads are managed in so
+called thread-pools.
+
+The cmwq design differentiates between the user-facing workqueues that
+subsystems and drivers queue work items on and the backend mechanism
+which manages thread-pool and processes the queued work items.
+
+The backend is called gcwq.  There is one gcwq for each possible CPU
+and one gcwq to serve work items queued on unbound workqueues.
+
+Subsystems and drivers can create and queue work items through special
+workqueue API functions as they see fit. They can influence some
+aspects of the way the work items are executed by setting flags on the
+workqueue they are putting the work item on. These flags include
+things like cpu locality, reentrancy, concurrency limits and more. To
+get a detailed overview refer to the API description of
+alloc_workqueue() below.
+
+When a work item is queued to a workqueue, the target gcwq is
+determined according to the queue parameters and workqueue attributes
+and appended to the shared worklist of the gcwq.  For example, unless
+specifically overridden, a work item of a bound workqueue will be
+queued on the worklist of exactly that gcwq that is associated to the
+CPU the issuer is running on.
+
+For any worker pool implementation, managing the concurrency level
+(how many execution contexts are active) is an important issue.  cmwq
+tries to keep the concurrency at a minimal but sufficient level.
+Minimal to save resources and sufficient in that the system is used at
+its full capacity.
+
+Each gcwq bound to an actual CPU implements concurrency management by
+hooking into the scheduler.  The gcwq is notified whenever an active
+worker wakes up or sleeps and keeps track of the number of the
+currently runnable workers.  Generally, work items are not expected to
+hog a CPU and consume many cycles.  That means maintaining just enough
+concurrency to prevent work processing from stalling should be
+optimal.  As long as there are one or more runnable workers on the
+CPU, the gcwq doesn't start execution of a new work, but, when the
+last running worker goes to sleep, it immediately schedules a new
+worker so that the CPU doesn't sit idle while there are pending work
+items.  This allows using a minimal number of workers without losing
+execution bandwidth.
+
+Keeping idle workers around doesn't cost other than the memory space
+for kthreads, so cmwq holds onto idle ones for a while before killing
+them.
+
+For an unbound wq, the above concurrency management doesn't apply and
+the gcwq for the pseudo unbound CPU tries to start executing all work
+items as soon as possible.  The responsibility of regulating
+concurrency level is on the users.  There is also a flag to mark a
+bound wq to ignore the concurrency management.  Please refer to the
+Workqueue Attributes section for details.
+
+The forward progress guarantee relies on workers being created when
+more execution contexts are necessary, which in turn is guaranteed
+through the use of rescue workers.  All work items which might be used
+on code paths that handle memory reclaim are required to be queued on
+wq's that have a rescue-worker reserved for execution under memory
+pressure.  Otherwise it is possible that the thread-pool deadlocks
+waiting for execution contexts to free up.
+
+
+4. Application Programming Interface (API)
+
+alloc_workqueue() allocates a wq.  The original create_*workqueue()
+functions are deprecated and scheduled for removal.  alloc_workqueue()
+takes three arguments - @name, @flags and @max_active.  @name is the
+name of the wq and also used as the name of the rescuer thread if
+there is one.
+
+A wq no longer manages execution resources but serves as a domain for
+forward progress guarantee, flush and work item attributes.  @flags
+and @max_active control how work items are assigned execution
+resources, scheduled and executed.
+
+@flags:
+
+  WQ_NON_REENTRANT
+
+	By default, a wq guarantees non-reentrance only on the same
+	CPU.  A work may not be executed concurrently on the same CPU
+	by multiple workers but is allowed to be executed concurrently
+	on multiple CPUs.  This flag makes sure non-reentrance is
+	enforced across all CPUs.  Work items queued to a
+	non-reentrant wq are guaranteed to be executed by at most one
+	worker system-wide at any given time.
+
+  WQ_UNBOUND
+
+	Work items queued to an unbound wq are served by a special
+	gcwq which hosts workers which are not bound to any specific
+	CPU.  This makes the wq behave as a simple execution context
+	provider without concurrency management.  The unbound gcwq
+	tries to start execution of work items as soon as possible.
+	Unbound wq sacrifices locality but is useful for the following
+	cases.
+
+	* Wide fluctuation in the concurrency level requirement is
+	  expected and using bound wq may end up creating large number
+	  of mostly unused workers across different CPUs as the issuer
+	  hops through different CPUs.
+
+	* Long running CPU intensive workloads which can be better
+	  managed by the system scheduler.
+
+  WQ_FREEZEABLE
+
+	A freezeable wq participates in the freeze phase of the system
+	suspend operations.  Work items on the wq are drained and no
+	new work item starts execution until thawed.
+
+  WQ_RESCUER
+
+	All wq which might be used in the memory reclaim paths _MUST_
+	have this flag set.  This reserves one worker exclusively for
+	the execution of this wq under memory pressure.
+
+  WQ_HIGHPRI
+
+	Work items of a highpri wq are queued at the head of the
+	worklist of the target gcwq and start execution regardless of
+	the current concurrency level.  In other words, highpri work
+	items will always start execution as soon as execution
+	resource is available.
+
+	Ordering among highpri work items is preserved - a highpri
+	work item queued after another highpri work item will start
+	execution after the earlier highpri work item starts.
+
+	Although highpri work items are not held back by other
+	runnable work items, they still contribute to the concurrency
+	level.  Highpri work items in runnable state will prevent
+	non-highpri work items from starting execution.
+
+	This flag is meaningless for unbound wq.
+
+  WQ_CPU_INTENSIVE
+
+	Work items of a CPU intensive wq do not contribute to the
+	concurrency level.  In other words, Runnable CPU intensive
+	work items will not prevent other work items from starting
+	execution.  This is useful for bound work items which are
+	expected to hog CPU cycles so that their execution is
+	regulated by the system scheduler.
+
+	Although CPU intensive work items don't contribute to the
+	concurrency level, start of their executions is still
+	regulated by the concurrency management and runnable
+	non-CPU-intensive work items can delay execution of CPU
+	intensive work items.
+
+	This flag is meaningless for unbound wq.
+
+  WQ_HIGHPRI | WQ_CPU_INTENSIVE
+
+	This combination makes the wq avoid interaction with
+	concurrency management completely and behave as a simple
+	per-CPU execution context provider.  Work items queued on a
+	highpri CPU-intensive wq start execution as soon as resources
+	are available and don't affect execution of other work items.
+
+@max_active:
+
+@max_active determines the maximum number of execution contexts per
+CPU which can be assigned to the work items of a wq.  For example,
+with @max_active of 16, at most 16 work items of the wq can be
+executing at the same time per CPU.
+
+Currently, for a bound wq, the maximum limit for @max_active is 512
+and the default value used when 0 is specified is 256.  For an unbound
+wq, the limit is the higher of 512 and 4 * num_possible_cpus().  These
+values are chosen sufficiently high such that they are not the
+limiting factor while providing protection in runaway cases.
+
+The number of active work items of a wq is usually regulated by the
+users of the wq, more specifically, by how many work items the users
+may queue at the same time.  Unless there is a specific need for
+throttling the number of active work items, specifying '0' is
+recommended.
+
+Some users depend on the strict execution ordering of ST wq.  The
+combination of @max_active of 1 and WQ_UNBOUND is used to achieve this
+behavior.  Work items on such wq are always queued to the unbound gcwq
+and only one work item can be active at any given time thus achieving
+the same ordering property as ST wq.
+
+
+5. Example Execution Scenarios
+
+The following example execution scenarios try to illustrate how cmwq
+behaves under different configurations.
+
+ Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU.
+ w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms
+ again before finishing.  w1 and w2 burn CPU for 5ms then sleep for
+ 10ms.
+
+Ignoring all other tasks, works and processing overhead, and assuming
+simple FIFO scheduling, the following is one highly simplified version
+of possible sequences of events with the original wq.
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 starts and burns CPU
+ 25		w1 sleeps
+ 35		w1 wakes up and finishes
+ 35		w2 starts and burns CPU
+ 40		w2 sleeps
+ 50		w2 wakes up and finishes
+
+And with cmwq with @max_active >= 3,
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 5		w1 starts and burns CPU
+ 10		w1 sleeps
+ 10		w2 starts and burns CPU
+ 15		w2 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 wakes up and finishes
+ 25		w2 wakes up and finishes
+
+If @max_active == 2,
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 5		w1 starts and burns CPU
+ 10		w1 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 wakes up and finishes
+ 20		w2 starts and burns CPU
+ 25		w2 sleeps
+ 35		w2 wakes up and finishes
+
+Now, let's assume w1 and w2 are queued to a different wq q1 which has
+WQ_HIGHPRI set,
+
+ TIME IN MSECS	EVENT
+ 0		w1 and w2 start and burn CPU
+ 5		w1 sleeps
+ 10		w2 sleeps
+ 10		w0 starts and burns CPU
+ 15		w0 sleeps
+ 15		w1 wakes up and finishes
+ 20		w2 wakes up and finishes
+ 25		w0 wakes up and burns CPU
+ 30		w0 finishes
+
+If q1 has WQ_CPU_INTENSIVE set,
+
+ TIME IN MSECS	EVENT
+ 0		w0 starts and burns CPU
+ 5		w0 sleeps
+ 5		w1 and w2 start and burn CPU
+ 10		w1 sleeps
+ 15		w2 sleeps
+ 15		w0 wakes up and burns CPU
+ 20		w0 finishes
+ 20		w1 wakes up and finishes
+ 25		w2 wakes up and finishes
+
+
+6. Guidelines
+
+* Do not forget to use WQ_RESCUER if a wq may process work items which
+  are used during memory reclaim.  Each wq with WQ_RESCUER set has one
+  rescuer thread reserved for it.  If there is dependency among
+  multiple work items used during memory reclaim, they should be
+  queued to separate wq each with WQ_RESCUER.
+
+* Unless strict ordering is required, there is no need to use ST wq.
+
+* Unless there is a specific need, using 0 for @nr_active is
+  recommended.  In most use cases, concurrency level usually stays
+  well under the default limit.
+
+* A wq serves as a domain for forward progress guarantee (WQ_RESCUER),
+  flush and work item attributes.  Work items which are not involved
+  in memory reclaim and don't need to be flushed as a part of a group
+  of work items, and don't require any special attribute, can use one
+  of the system wq.  There is no difference in execution
+  characteristics between using a dedicated wq and a system wq.
+
+* Unless work items are expected to consume a huge amount of CPU
+  cycles, using a bound wq is usually beneficial due to the increased
+  level of locality in wq operations and work item execution.
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index f11100f..25e02c9 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -235,6 +235,10 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
 #define work_clear_pending(work) \
 	clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))

+/*
+ * Workqueue flags and constants.  For details, please refer to
+ * Documentation/workqueue.txt.
+ */
 enum {
 	WQ_NON_REENTRANT	= 1 << 0, /* guarantee non-reentrance */
 	WQ_UNBOUND		= 1 << 1, /* not bound to any cpu */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 727f24e..f77afd9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1,19 +1,26 @@
 /*
- * linux/kernel/workqueue.c
+ * kernel/workqueue.c - generic async execution with shared worker pool
  *
- * Generic mechanism for defining kernel helper threads for running
- * arbitrary tasks in process context.
+ * Copyright (C) 2002		Ingo Molnar
  *
- * Started by Ingo Molnar, Copyright (C) 2002
+ *   Derived from the taskqueue/keventd code by:
+ *     David Woodhouse <dwmw2@infradead.org>
+ *     Andrew Morton
+ *     Kai Petzke <wpp@marie.physik.tu-berlin.de>
+ *     Theodore Ts'o <tytso@mit.edu>
  *
- * Derived from the taskqueue/keventd code by:
+ * Made to use alloc_percpu by Christoph Lameter.
  *
- *   David Woodhouse <dwmw2@infradead.org>
- *   Andrew Morton
- *   Kai Petzke <wpp@marie.physik.tu-berlin.de>
- *   Theodore Ts'o <tytso@mit.edu>
+ * Copyright (C) 2010		SUSE Linux Products GmbH
+ * Copyright (C) 2010		Tejun Heo <tj@kernel.org>
  *
- * Made to use alloc_percpu by Christoph Lameter.
+ * This is the generic async execution mechanism.  Work items are
+ * executed in process context.  The worker pool is shared and
+ * automatically managed.  There is one worker pool for each CPU and
+ * one extra for works which are better served by workers which are
+ * not bound to any specific CPU.
+ *
+ * Please read Documentation/workqueue.txt for details.
  */

 #include <linux/module.h>
-- 
1.7.1
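
For illustration, the strict-ordering idiom described in the @max_active
section above (WQ_UNBOUND plus @max_active of 1) would look roughly like
the following.  The "my_ordered" name is made up for the example and the
snippet is not part of the patch.

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *my_ordered_wq;

static int my_setup(void)
{
	/*
	 * Unbound wq with @max_active of 1: work items always go to the
	 * unbound gcwq and at most one is active at a time, so they run
	 * strictly in queueing order like the old single threaded wq.
	 */
	my_ordered_wq = alloc_workqueue("my_ordered", WQ_UNBOUND, 1);
	return my_ordered_wq ? 0 : -ENOMEM;
}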


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH UPDATED] workqueue: add documentation
  2010-09-10 14:55             ` Tejun Heo
@ 2010-09-10 17:43               ` Randy Dunlap
  2010-09-12 10:50                 ` Tejun Heo
  2010-09-13  0:51               ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Randy Dunlap @ 2010-09-10 17:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Florian Mickler, lkml, Ingo Molnar, Christoph Lameter, Dave Chinner

On Fri, 10 Sep 2010 16:55:21 +0200 Tejun Heo wrote:

> +3. The Design

> +Subsystems and drivers can create and queue work items through special
> +workqueue API functions as they see fit. They can influence some
> +aspects of the way the work items are executed by setting flags on the
> +workqueue they are putting the work item on. These flags include
> +things like cpu locality, reentrancy, concurrency limits and more. To

               CPU

> +get a detailed overview refer to the API description of
> +alloc_workqueue() below.

> +4. Application Programming Interface (API)

> +@flags:
> +
> +  WQ_NON_REENTRANT
> +
> +	By default, a wq guarantees non-reentrance only on the same
> +	CPU.  A work may not be executed concurrently on the same CPU

	        work item

> +	by multiple workers but is allowed to be executed concurrently
> +	on multiple CPUs.  This flag makes sure non-reentrance is
> +	enforced across all CPUs.  Work items queued to a
> +	non-reentrant wq are guaranteed to be executed by at most one
> +	worker system-wide at any given time.

> +  WQ_CPU_INTENSIVE
> +
> +	Work items of a CPU intensive wq do not contribute to the
> +	concurrency level.  In other words, Runnable CPU intensive

	                                    runnable

> +	work items will not prevent other work items from starting
> +	execution.  This is useful for bound work items which are
> +	expected to hog CPU cycles so that their execution is
> +	regulated by the system scheduler.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH UPDATED] workqueue: add documentation
  2010-09-10 17:43               ` Randy Dunlap
@ 2010-09-12 10:50                 ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2010-09-12 10:50 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Florian Mickler, lkml, Ingo Molnar, Christoph Lameter, Dave Chinner

Updated accordingly.  Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH UPDATED] workqueue: add documentation
  2010-09-10 14:55             ` Tejun Heo
  2010-09-10 17:43               ` Randy Dunlap
@ 2010-09-13  0:51               ` Dave Chinner
  2010-09-13  8:08                 ` Tejun Heo
  1 sibling, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2010-09-13  0:51 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Florian Mickler, lkml, Ingo Molnar, Christoph Lameter

Hi Tejun,

A couple more questions on cmwq.

On Fri, Sep 10, 2010 at 04:55:21PM +0200, Tejun Heo wrote:
.....
> +  WQ_HIGHPRI
> +
> +	Work items of a highpri wq are queued at the head of the
> +	worklist of the target gcwq and start execution regardless of
> +	the current concurrency level.  In other words, highpri work
> +	items will always start execution as soon as execution
> +	resource is available.
> +
> +	Ordering among highpri work items is preserved - a highpri
> +	work item queued after another highpri work item will start
> +	execution after the earlier highpri work item starts.
> +
> +	Although highpri work items are not held back by other
> +	runnable work items, they still contribute to the concurrency
> +	level.  Highpri work items in runnable state will prevent
> +	non-highpri work items from starting execution.
> +
> +	This flag is meaningless for unbound wq.

We talked about this for XFS w.r.t. promoting the xfslogd IO
completion work items ahead of data IO completion items, and that has
worked fine. This appears to give us only two levels of priority, or
from a user point of view, two levels of dependency between workqueue
item execution.

Thinking about the XFS situation more, we actually have three levels
of dependency: xfslogd -> xfsdatad -> xfsconvertd. That is, we defer
long running, blocking items from xfsdatad to xfsconvertd so we
don't block the xfsdatad from continuing to process data IO
completion items. How do we guarantee that the xfsconvertd work
items won't prevent/excessively delay processing of xfsdatad items?


> +@max_active determines the maximum number of execution contexts per
> +CPU which can be assigned to the work items of a wq.  For example,
> +with @max_active of 16, at most 16 work items of the wq can be
> +executing at the same time per CPU.

I think the reason you were seeing XFS blow this out of the water is
that every IO completion for a write beyond EOF (i.e. every single
one for an extending streaming write) will require inode locking to
update file size. If the inode is locked, then the item will
delay(1), and the cmwq controller will run the next item in a new
worker. That will then block in delay(1) 'cause it can't get the
inode lock, and so on....

As such, I can't see that increasing the max_active count for XFS is
a good thing - all it will do is cause larger blockages to occur....

> +6. Guidelines
> +
> +* Do not forget to use WQ_RESCUER if a wq may process work items which
> +  are used during memory reclaim.  Each wq with WQ_RESCUER set has one
> +  rescuer thread reserved for it.  If there is dependency among
> +  multiple work items used during memory reclaim, they should be
> +  queued to separate wq each with WQ_RESCUER.
> +
> +* Unless strict ordering is required, there is no need to use ST wq.
> +
> +* Unless there is a specific need, using 0 for @nr_active is
                                                  max_active?


Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH UPDATED] workqueue: add documentation
  2010-09-13  0:51               ` Dave Chinner
@ 2010-09-13  8:08                 ` Tejun Heo
  2010-09-13  8:16                   ` Florian Mickler
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2010-09-13  8:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Florian Mickler, lkml, Ingo Molnar, Christoph Lameter

Hello,

On 09/13/2010 02:51 AM, Dave Chinner wrote:
> We talked about this for XFS w.r.t. the xfslogd IO completion
> work items to be promoted ahead of data IO completion items and
> that has worked fine. This appears to gives us only two
> levels of priority, or from an user point of view, two levels of
> dependency between workqueue item execution.

It's not priority per se.  It's basically a bypass switch for the
workqueue work deferring mechanism.
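
To make that concrete, here's a sketch of what the bypass looks like
from the caller's side (names are invented for illustration, not XFS's
actual workqueues):

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *urgent_wq;	/* completions that must not wait */
static struct workqueue_struct *normal_wq;

static int setup_wqs(void)
{
	/* highpri items go to the head of the gcwq worklist */
	urgent_wq = alloc_workqueue("urgent", WQ_HIGHPRI, 0);
	if (!urgent_wq)
		return -ENOMEM;

	normal_wq = alloc_workqueue("normal", 0, 0);
	if (!normal_wq) {
		destroy_workqueue(urgent_wq);
		return -ENOMEM;
	}
	return 0;
}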

> Thinking about the XFS situation more, we actually have three levels
> of dependency: xfslogd -> xfsdatad -> xfsconvertd. That is, we defer
> long running, blocking items from xfsdatad to xfsconvertd so we
> don't block the xfsdatad from continuing to process data IO
> completion items. How do we guarantee that the xfsconvertd work
> items won't prevent/excessively delay processing of xfsdatad items?

What do you mean by "long running"?  Do you mean that it would consume
a lot of CPU cycles or that it would block on locks and IO a lot?  It's
the latter, right?  There isn't much to worry about.

>> +@max_active determines the maximum number of execution contexts per
>> +CPU which can be assigned to the work items of a wq.  For example,
>> +with @max_active of 16, at most 16 work items of the wq can be
>> +executing at the same time per CPU.
> 
> I think the reason you were seeing XFS blow this out of the water is
> that every IO completion for a write beyond EOF (i.e. every single
> one for an extending streaming write) will require inode locking to
> update file size. If the inode is locked, then the item will
> delay(1), and the cmwq controller will run the next item in a new
> worker. That will then block in delay(1) 'cause it can't get the
> inode lock, as so on....
> 
> As such, I can't see that increasing the max_active count for XFS is
> a good thing - all it will do is cause larger blockages to occur....

From the description above, it looks like xfs developed its own way of
regulating work processing involving multiple workqueues and yielding
queue positions with delay.  For now, it probably would be best to
just keep things running as they are but in the long run it might be
beneficial to replace those explicit mechanisms.

>> +6. Guidelines
>> +
>> +* Do not forget to use WQ_RESCUER if a wq may process work items which
>> +  are used during memory reclaim.  Each wq with WQ_RESCUER set has one
>> +  rescuer thread reserved for it.  If there is dependency among
>> +  multiple work items used during memory reclaim, they should be
>> +  queued to separate wq each with WQ_RESCUER.
>> +
>> +* Unless strict ordering is required, there is no need to use ST wq.
>> +
>> +* Unless there is a specific need, using 0 for @nr_active is
>                                                   max_active?

Oops, thanks.  Updated.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH UPDATED] workqueue: add documentation
  2010-09-13  8:08                 ` Tejun Heo
@ 2010-09-13  8:16                   ` Florian Mickler
  2010-09-13  8:27                     ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Florian Mickler @ 2010-09-13  8:16 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Dave Chinner, lkml, Ingo Molnar, Christoph Lameter


one more detail, seems like this never ends... sorry :) 

On Mon, 13 Sep 2010 10:08:12 +0200
Tejun Heo <tj@kernel.org> wrote:




> +
> +For an unbound wq, the above concurrency management doesn't apply and
> +the gcwq for the pseudo unbound CPU tries to start executing all work
> +items as soon as possible.  The responsibility of regulating
> +concurrency level is on the users.  There is also a flag to mark a
> +bound wq to ignore the concurrency management.  Please refer to the
> +Workqueue Attributes section for details.

renamed to "API section" 

regards,
Flo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH UPDATED] workqueue: add documentation
  2010-09-13  8:16                   ` Florian Mickler
@ 2010-09-13  8:27                     ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2010-09-13  8:27 UTC (permalink / raw)
  To: Florian Mickler; +Cc: Dave Chinner, lkml, Ingo Molnar, Christoph Lameter

On 09/13/2010 10:16 AM, Florian Mickler wrote:
>> +For an unbound wq, the above concurrency management doesn't apply and
>> +the gcwq for the pseudo unbound CPU tries to start executing all work
>> +items as soon as possible.  The responsibility of regulating
>> +concurrency level is on the users.  There is also a flag to mark a
>> +bound wq to ignore the concurrency management.  Please refer to the
>> +Workqueue Attributes section for details.
> 
> renamed to "API section" 

Updated, thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2010-09-13  8:28 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-08 15:40 [PATCH] workqueue: add documentation Tejun Heo
2010-09-08 15:51 ` [PATCH UPDATED] " Tejun Heo
2010-09-09  8:02   ` Florian Mickler
2010-09-09 10:22     ` Tejun Heo
2010-09-09 18:50       ` Florian Mickler
2010-09-10 10:25         ` Tejun Heo
2010-09-10 14:26           ` Florian Mickler
2010-09-10 14:55             ` Tejun Heo
2010-09-10 17:43               ` Randy Dunlap
2010-09-12 10:50                 ` Tejun Heo
2010-09-13  0:51               ` Dave Chinner
2010-09-13  8:08                 ` Tejun Heo
2010-09-13  8:16                   ` Florian Mickler
2010-09-13  8:27                     ` Tejun Heo
