From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757802AbcBXFJ0 (ORCPT ); Wed, 24 Feb 2016 00:09:26 -0500 Received: from e19.ny.us.ibm.com ([129.33.205.209]:45420 "EHLO e19.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757860AbcBXFHp (ORCPT ); Wed, 24 Feb 2016 00:07:45 -0500 X-IBM-Helo: d01dlp01.pok.ibm.com X-IBM-MailFrom: paulmck@linux.vnet.ibm.com X-IBM-RcptTo: linux-kernel@vger.kernel.org From: "Paul E. McKenney" To: linux-kernel@vger.kernel.org Cc: mingo@kernel.org, jiangshanlai@gmail.com, dipankar@in.ibm.com, akpm@linux-foundation.org, mathieu.desnoyers@efficios.com, josh@joshtriplett.org, tglx@linutronix.de, peterz@infradead.org, rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com, dvhart@linux.intel.com, fweisbec@gmail.com, oleg@redhat.com, bobby.prani@gmail.com, "Paul E. McKenney" Subject: [PATCH tip/core/rcu 09/14] documentation: Add documentation for RCU's major data structures Date: Tue, 23 Feb 2016 21:00:42 -0800 Message-Id: <1456290047-16654-9-git-send-email-paulmck@linux.vnet.ibm.com> X-Mailer: git-send-email 2.5.2 In-Reply-To: <20160224050021.GA14616@linux.vnet.ibm.com> References: <20160224050021.GA14616@linux.vnet.ibm.com> X-TM-AS-MML: disable X-Content-Scanned: Fidelis XPS MAILER x-cbid: 16022405-0057-0000-0000-000003830814 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This commit adds documentation for RCU's major data structures, including rcu_state, rcu_node, rcu_data, rcu_dynticks, and rcu_head. Signed-off-by: Paul E. McKenney --- .../Design/Data-Structures/BigTreeClassicRCU.svg | 474 +++++++ .../Design/Data-Structures/BigTreeClassicRCUBH.svg | 499 +++++++ .../Data-Structures/BigTreeClassicRCUBHdyntick.svg | 695 ++++++++++ .../Data-Structures/BigTreePreemptRCUBHdyntick.svg | 741 +++++++++++ .../BigTreePreemptRCUBHdyntickCB.svg | 858 ++++++++++++ .../Design/Data-Structures/Data-Structures.html | 1372 ++++++++++++++++++++ .../Design/Data-Structures/Data-Structures.htmlx | 1272 ++++++++++++++++++ .../Design/Data-Structures/HugeTreeClassicRCU.svg | 939 ++++++++++++++ .../RCU/Design/Data-Structures/TreeLevel.svg | 828 ++++++++++++ .../RCU/Design/Data-Structures/TreeMapping.svg | 305 +++++ .../Design/Data-Structures/TreeMappingLevel.svg | 380 ++++++ .../RCU/Design/Data-Structures/blkd_task.svg | 843 ++++++++++++ .../RCU/Design/Data-Structures/nxtlist.svg | 396 ++++++ 13 files changed, 9602 insertions(+) create mode 100644 Documentation/RCU/Design/Data-Structures/BigTreeClassicRCU.svg create mode 100644 Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBH.svg create mode 100644 Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBHdyntick.svg create mode 100644 Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntick.svg create mode 100644 Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntickCB.svg create mode 100644 Documentation/RCU/Design/Data-Structures/Data-Structures.html create mode 100644 Documentation/RCU/Design/Data-Structures/Data-Structures.htmlx create mode 100644 Documentation/RCU/Design/Data-Structures/HugeTreeClassicRCU.svg create mode 100644 Documentation/RCU/Design/Data-Structures/TreeLevel.svg create mode 100644 Documentation/RCU/Design/Data-Structures/TreeMapping.svg create mode 100644 Documentation/RCU/Design/Data-Structures/TreeMappingLevel.svg create mode 100644 Documentation/RCU/Design/Data-Structures/blkd_task.svg create mode 100644 
Documentation/RCU/Design/Data-Structures/nxtlist.svg

diff --git a/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCU.svg b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCU.svg
new file mode 100644
index 000000000000..727e270b11e4
--- /dev/null
+++ b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCU.svg
@@ -0,0 +1,474 @@
+[SVG figure: an enclosing struct rcu_state containing a two-level
+rcu_node tree, with per-CPU rcu_data structures (CPU 0, CPU 15,
+CPU 1007, CPU 1023) attached to the leaves.]

diff --git a/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBH.svg b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBH.svg
new file mode 100644
index 000000000000..9bbb1944f962
--- /dev/null
+++ b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBH.svg
@@ -0,0 +1,499 @@
+[SVG figure: separate rcu_sched and rcu_bh rcu_state structures,
+each with its own rcu_node tree and rcu_data structures.]

diff --git a/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBHdyntick.svg b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBHdyntick.svg
new file mode 100644
index 000000000000..21ba7823479d
--- /dev/null
+++ b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBHdyntick.svg
@@ -0,0 +1,695 @@
+[SVG figure: rcu_sched and rcu_bh rcu_state structures whose
+per-CPU rcu_data structures share per-CPU rcu_dynticks structures.]

diff --git a/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntick.svg b/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntick.svg
new file mode 100644
index 000000000000..15adcac036c7
--- /dev/null
+++ b/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntick.svg
@@ -0,0 +1,741 @@
+[SVG figure: as above, with an rcu_preempt rcu_state structure
+added alongside rcu_sched and rcu_bh.]

diff --git a/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntickCB.svg b/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntickCB.svg
new file mode 100644
index 000000000000..bbc3801470d0
--- /dev/null
+++ b/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntickCB.svg
@@ -0,0 +1,858 @@
+[SVG figure: the full picture, adding queues of rcu_head
+structures to the rcu_data structures.]

diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html
new file mode 100644
index 000000000000..ba9fbb5177f6
--- /dev/null
+++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.html
@@ -0,0 +1,1372 @@
+
+A Tour Through TREE_RCU's Data Structures [LWN.net]
+

January 27, 2016

+

This article was contributed by Paul E. McKenney

+ +

Introduction

+ +This document describes RCU's major data structures and their relationship +to each other. + +
    +
  1. + Data-Structure Relationships +
  2. + The rcu_state Structure +
  3. + The rcu_node Structure +
  4. + The rcu_data Structure +
  5. + The rcu_dynticks Structure +
  6. + The rcu_head Structure +
  7. + RCU-Specific Fields in the task_struct Structure +
  8. + Accessor Functions +
+ +At the end we have the +answers to the quick quizzes. + +

Data-Structure Relationships

+ +

RCU is for all intents and purposes a large state machine, and its +data structures maintain the state in such a way as to allow RCU readers +to execute extremely quickly, while also processing the RCU grace periods +requested by updaters in an efficient and extremely scalable fashion. +The efficiency and scalability of RCU updaters is provided primarily +by a combining tree, as shown below: + +

BigTreeClassicRCU.svg + +

This diagram shows an enclosing rcu_state structure +containing a tree of rcu_node structures. +Each leaf node of the rcu_node tree has up to 16 +rcu_data structures associated with it, so that there +are NR_CPUS rcu_data structures, +one for each possible CPU. +This structure is adjusted at boot time, if needed, to handle the +common case where nr_cpu_ids is much less than +NR_CPUS. +For example, a number of Linux distributions set NR_CPUS=4096, +which results in a three-level rcu_node tree. +If the actual hardware has only 16 CPUs, RCU will adjust itself +at boot time, resulting in an rcu_node tree with only a single node. + +

The purpose of this combining tree is to allow per-CPU events +such as quiescent states, dyntick-idle transitions, +and CPU hotplug operations to be processed efficiently +and scalably. +Quiescent states are recorded by the per-CPU rcu_data structures, +and other events are recorded by the leaf-level rcu_node +structures. +All of these events are combined at each level of the tree until finally +grace periods are completed at the tree's root rcu_node +structure. +A grace period can be completed at the root once every CPU +(or, in the case of CONFIG_TREE_PREEMPT_RCU, task) +has passed through a quiescent state. +Once a grace period has completed, record of that fact is propagated +back down the tree. + +

As can be seen from the diagram, on a 64-bit system +a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout +of 64 at the root and a fanout of 16 at the leaves. + +

Quick Quiz 1: +Why isn't the fanout at the leaves also 64? +
Answer + +

If your system has more than 1,024 CPUs (or more than 512 CPUs on +a 32-bit system), then RCU will automatically add more levels to the +tree. +For example, if you are crazy enough to build a 64-bit system with 65,536 +CPUs, RCU would configure the rcu_node tree as follows: + +

HugeTreeClassicRCU.svg + +

RCU currently permits up to a four-level tree, which on a 64-bit system +accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for +32-bit systems. +On the other hand, you can set CONFIG_RCU_FANOUT to be +as small as 2 if you wish, which would permit only 16 CPUs, which +is useful for testing. + +

The Linux kernel actually supports multiple flavors of RCU +running concurrently, so RCU builds separate data structures for each +flavor. +For example, for CONFIG_TREE_RCU=y kernels, RCU provides +rcu_sched and rcu_bh, as shown below: + +

BigTreeClassicRCUBH.svg + +

Energy efficiency is increasingly important, and for that +reason the Linux kernel provides CONFIG_NO_HZ_IDLE, which +turns off the scheduling-clock interrupts on idle CPUs, which in +turn allows those CPUs to attain deeper sleep states and to consume +less energy. +CPUs whose scheduling-clock interrupts have been turned off are +said to be in dyntick-idle mode. +RCU must handle dyntick-idle CPUs specially +because RCU would otherwise wake up each CPU on every grace period, +which would defeat the whole purpose of CONFIG_NO_HZ_IDLE. +RCU uses the rcu_dynticks structure to track +which CPUs are in dyntick idle mode, as shown below: + +

BigTreeClassicRCUBHdyntick.svg + +

However, if a CPU is in dyntick-idle mode, it is in that mode +for all flavors of RCU. +Therefore, a single rcu_dynticks structure is allocated per +CPU, and all of a given CPU's rcu_data structures share +that rcu_dynticks, as shown in the figure. + +

Kernels built with CONFIG_TREE_PREEMPT_RCU support +rcu_preempt in addition to rcu_sched and rcu_bh, as shown below: + +

BigTreePreemptRCUBHdyntick.svg + +

RCU updaters wait for normal grace periods by registering +RCU callbacks, either directly via call_rcu() and +friends (namely call_rcu_bh() and call_rcu_sched()), +there being a separate interface per flavor of RCU) +or indirectly via synchronize_rcu() and friends. +RCU callbacks are represented by rcu_head structures, +which are queued on rcu_data structures while they are +waiting for a grace period to elapse, as shown in the following figure: + +

BigTreePreemptRCUBHdyntickCB.svg + +

This figure shows how TREE_RCU's and +TREE_PREEMPT_RCU's major data structures are related. +Lesser data structures will be introduced with the algorithms that +make use of them. + +
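+
+To make the relationship between these two families of update-side
+interfaces concrete, the following sketch shows how a blocking wait for
+a grace period can be layered on top of the callback interface.
+This is a simplified rendition of the approach taken by the kernel's
+wait_rcu_gp() and wakeme_after_rcu(), not a verbatim copy:
+
+  struct rcu_synchronize {
+    struct rcu_head head;
+    struct completion completion;
+  };
+
+  /* Awaken the task that registered this callback. */
+  static void wakeme_after_rcu(struct rcu_head *head)
+  {
+    struct rcu_synchronize *rs;
+
+    rs = container_of(head, struct rcu_synchronize, head);
+    complete(&rs->completion);
+  }
+
+  /* Simplified synchronize_rcu(): enqueue a callback, then block
+   * until a grace period has elapsed and the callback is invoked. */
+  static void synchronize_rcu_sketch(void)
+  {
+    struct rcu_synchronize rs;
+
+    init_completion(&rs.completion);
+    call_rcu(&rs.head, wakeme_after_rcu);
+    wait_for_completion(&rs.completion);
+  }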

Note that each of the data structures in the above figure has +its own synchronization: + +

    +
  1. Each rcu_state structure has a lock and a mutex, + and some fields are protected by the corresponding root + rcu_node structure's lock. +
  2. Each rcu_node structure has a spinlock. +
  3. The fields in rcu_data are private to the corresponding + CPU, although a few can be read and written by other CPUs. +
  4. Similarly, the fields in rcu_dynticks are private + to the corresponding CPU, although a few can be read by + other CPUs. +
+ +

It is important to note that different data structures can have +very different ideas about the state of RCU at any given time. +For but one example, awareness of the start or end of a given RCU +grace period propagates slowly through the data structures. +This slow propagation is absolutely necessary for RCU to have good +read-side performance. +If this balkanized implementation seems foreign to you, one useful +trick is to consider each instance of these data structures to be +a different person, each having the usual slightly different +view of reality. + +

The general role of each of these data structures is as +follows: + +

    +
  1. rcu_state: + This structure forms the interconnection between the + rcu_node and rcu_data structures, + tracks grace periods, serves as short-term repository + for callbacks orphaned by CPU-hotplug events, + maintains rcu_barrier() state, + tracks expedited grace-period state, + and maintains state used to force quiescent states when + grace periods extend too long. +
  2. rcu_node: This structure forms the combining + tree that propagates quiescent-state + information from the leaves to the root, and also propagates + grace-period information from the root to the leaves. + It provides local copies of the grace-period state in order + to allow this information to be accessed in a synchronized + manner without suffering the scalability limitations that + would otherwise be imposed by global locking. + In CONFIG_TREE_PREEMPT_RCU kernels, it manages the lists + of tasks that have blocked while in their current + RCU read-side critical section. + In CONFIG_TREE_PREEMPT_RCU with + CONFIG_RCU_BOOST, it manages the + per-rcu_node priority-boosting + kernel threads (kthreads) and state. + Finally, it records CPU-hotplug state in order to determine + which CPUs should be ignored during a given grace period. +
  3. rcu_data: This per-CPU structure is the + focus of quiescent-state detection and RCU callback queuing. + It also tracks its relationship to the corresponding leaf + rcu_node structure to allow more-efficient + propagation of quiescent states up the rcu_node + combining tree. + Like the rcu_node structure, it provides a local + copy of the grace-period information to allow synchronization-free + access to this information from the corresponding CPU. + Finally, this structure records past dyntick-idle state + for the corresponding CPU and also tracks statistics. +
  4. rcu_dynticks: + This per-CPU structure tracks the current dyntick-idle + state for the corresponding CPU. + Unlike the other three structures, the rcu_dynticks + structure is not replicated per RCU flavor. +
  5. rcu_head: + This structure represents RCU callbacks, and is the + only structure allocated and managed by RCU users. + The rcu_head structure is normally embedded + within the RCU-protected data structure. +
+ +

If all you wanted from this article was a general notion of how +RCU's data structures are related, you are done. +Otherwise, each of the following sections give more details on +the rcu_state, rcu_node, rcu_data, +and rcu_dynticks data structures. + +

+The rcu_state Structure

+ +

The rcu_state structure is the base structure that +represents a flavor of RCU. +This structure forms the interconnection between the +rcu_node and rcu_data structures, +tracks grace periods, contains the lock used to +synchronize with CPU-hotplug events, +and maintains state used to force quiescent states when +grace periods extend too long. +

A few of the rcu_state structure's fields are discussed, +singly and in groups, in the following sections. +The more specialized fields are covered in the discussion of their +use. + +

Relationship to rcu_node and rcu_data Structures
+ +This portion of the rcu_state structure is declared +as follows: + +
+  1   struct rcu_node node[NUM_RCU_NODES];
+  2   struct rcu_node *level[NUM_RCU_LVLS + 1];
+  3   struct rcu_data __percpu *rda;
+
+ +

Quick Quiz 2: +Wait a minute! +You said that the rcu_node structures formed a tree, +but they are declared as a flat array! +What gives? +
Answer + +

The rcu_node tree is embedded into the +->node[] array as shown in the following figure: + +

TreeMapping.svg + +

One interesting consequence of this mapping is that a +breadth-first traversal of the tree is implemented as a simple +linear scan of the array, which is in fact what the +rcu_for_each_node_breadth_first() macro does. +This macro is used at the beginnings and ends of grace periods. +

Each entry of the ->level array references +the first rcu_node structure on the corresponding level +of the tree, for example, as shown below: + +

TreeMappingLevel.svg + +

The zeroth element of the array references the root +rcu_node structure, the first element references the +first child of the root rcu_node, and finally the second +element references the first leaf rcu_node structure. + +

Quick Quiz 3: +Given that this array represents a tree, why can't the diagram that +includes the ->level array be planar? +
Answer + +

Finally, the ->rda field references a per-CPU +pointer to the corresponding CPU's rcu_data structure. + +

All of these fields are constant once initialization is complete, +and therefore need no protection. + +

Grace-Period Tracking
+ +

This portion of the rcu_state structure is declared +as follows: + +

+  1   unsigned long gpnum;
+  2   unsigned long completed;
+
+ +

RCU grace periods are numbered, and +the ->gpnum field contains the number of the grace +period that started most recently. +The ->completed field contains the number of the +grace period that completed most recently. +If the two fields are equal, the RCU grace period that most recently +started has already completed, and therefore the corresponding +flavor of RCU is idle. +If ->gpnum is one greater than ->completed, +then ->gpnum gives the number of the current RCU +grace period, which has not yet completed. +Any other combination of values indicates that something is broken. +These two fields are protected by the root rcu_node's +->lock field. + +
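+
+These rules can be seen in the kernel's rcu_gp_in_progress()
+helper, shown here in slightly simplified form:
+
+  /* Is there a grace period in progress for the flavor rsp? */
+  static int rcu_gp_in_progress(struct rcu_state *rsp)
+  {
+    return READ_ONCE(rsp->completed) != READ_ONCE(rsp->gpnum);
+  }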

There are ->gpnum and ->completed fields +in the rcu_node and rcu_data structures +as well. +The fields in the rcu_state structure represent the +most current values, and those of the other structures are compared +in order to detect the start of a new grace period in a distributed +fashion. +The values flow from rcu_state to rcu_node +(down the tree from the root to the leaves) to rcu_data. + +

Miscellaneous
+ +

This portion of the rcu_state structure is declared +as follows: + +

+  1   unsigned long gp_max;
+  2   char abbr;
+  3   char *name;
+
+ +

The ->gp_max field tracks the duration of the longest +grace period in jiffies. +It is protected by the root rcu_node's ->lock. + +

The ->name field points to the name of the RCU flavor +(for example, “rcu_sched”), and is constant. +The ->abbr field contains a one-character abbreviation, +for example, “s” for RCU-sched. + +

+The rcu_node Structure

+ +

The rcu_node structures form the combining +tree that propagates quiescent-state +information from the leaves to the root and also propagates +grace-period information from the root down to the leaves. +They provide local copies of the grace-period state in order +to allow this information to be accessed in a synchronized +manner without suffering the scalability limitations that +would otherwise be imposed by global locking. +In CONFIG_TREE_PREEMPT_RCU kernels, they manage the lists +of tasks that have blocked while in their current +RCU read-side critical section. +In CONFIG_TREE_PREEMPT_RCU with +CONFIG_RCU_BOOST, they manage the +per-rcu_node priority-boosting +kernel threads (kthreads) and state. +Finally, they record CPU-hotplug state in order to determine +which CPUs should be ignored during a given grace period. +

The rcu_node structure's fields are discussed, +singly and in groups, in the following sections. + +

Connection to Combining Tree
+ +

This portion of the rcu_node structure is declared +as follows: + +

+  1   struct rcu_node *parent;
+  2   u8 level;
+  3   u8 grpnum;
+  4   unsigned long grpmask;
+  5   int grplo;
+  6   int grphi;
+
+ +

The ->parent pointer references the rcu_node +one level up in the tree, and is NULL for the root +rcu_node. +The RCU implementation makes heavy use of this field to push quiescent +states up the tree. +The ->level field gives the level in the tree, with +the root being at level zero, its children at level one, and so on. +The ->grpnum field gives this node's position within +the children of its parent, so this number can range between 0 and 31 +on 32-bit systems and between 0 and 63 on 64-bit systems. +The ->level and ->grpnum fields are +used only during initialization and for tracing. +The ->grpmask field is the bitmask counterpart of +->grpnum, and therefore always has exactly one bit set. +This mask is used to clear the bit corresponding to this rcu_node +structure in its parent's bitmasks, which are described later. +Finally, the ->grplo and ->grphi fields +contain the lowest and highest numbered CPU served by this +rcu_node structure, respectively. + +

All of these fields are constant, and thus do not require any +synchronization. + +
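+
+The ->grpmask field earns its keep during quiescent-state
+propagation, as in the following sketch (the helper name is
+illustrative, and the required ->lock acquisitions are elided
+for clarity):
+
+  /* Sketch: the subtree headed by rnp no longer blocks the
+   * current grace period, so clear its bit in its parent. */
+  static void rcu_clear_bit_in_parent(struct rcu_node *rnp)
+  {
+    struct rcu_node *rnp_p = rnp->parent;
+
+    if (rnp_p != NULL)
+      rnp_p->qsmask &= ~rnp->grpmask;  /* Exactly one bit clears. */
+  }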

Synchronization
+ +

This field of the rcu_node structure is declared +as follows: + +

+  1   raw_spinlock_t lock;
+
+ +

This field is used to protect the remaining fields in this structure, +unless otherwise stated. +That said, all of the fields in this structure can be accessed without +locking for tracing purposes. +Yes, this can result in confusing traces, but better some tracing confusion +than to be heisenbugged out of existence. + +

Grace-Period Tracking
+ +

This portion of the rcu_node structure is declared +as follows: + +

+  1   unsigned long gpnum;
+  2   unsigned long completed;
+
+ +

These fields are the counterparts of the fields of the same name in +the rcu_state structure. +They each may lag up to one behind their rcu_state +counterparts. +If a given rcu_node structure's ->gpnum and +->completed fields are equal, then this rcu_node +structure believes that RCU is idle. +Otherwise, as with the rcu_state structure, +the ->gpnum field will be one greater than the +->completed field, with ->gpnum +indicating which grace period this rcu_node believes +is still being waited for. +

The ->gpnum field of each rcu_node +structure is updated at the beginning +of each grace period, and the ->completed fields are +updated at the end of each grace period. +

Quiescent-State Tracking
+ +

These fields manage the propagation of quiescent states up the +combining tree. + +

This portion of the rcu_node structure has fields +as follows: + +

+  1   unsigned long qsmask;
+  2   unsigned long expmask;
+  3   unsigned long qsmaskinit;
+  4   unsigned long expmaskinit;
+
+ +

The ->qsmask field tracks which of this +rcu_node structure's children still need to report +quiescent states for the current normal grace period. +Such children will have a value of 1 in their corresponding bit. +Note that the leaf rcu_node structures should be +thought of as having rcu_data structures as their +children. +Similarly, the ->expmask field tracks which +of this rcu_node structure's children still need to report +quiescent states for the current expedited grace period. +An expedited grace period has +the same conceptual properties as a normal grace period, but the +expedited implementation accepts extreme CPU overhead to obtain +much lower grace-period latency, for example, consuming a few +tens of microseconds worth of CPU time to reduce grace-period +duration from milliseconds to tens of microseconds. +The ->qsmaskinit field tracks which of this +rcu_node structure's children cover at least +one online CPU. +This mask is used to initialize ->qsmask, +and ->expmaskinit is used to initialize +->expmask, at the beginnings of the +normal and expedited grace periods, respectively. +
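+
+For example, at the start of a normal grace period, code roughly
+like the following (a simplification of rcu_gp_init()) seeds each
+rcu_node structure's ->qsmask from ->qsmaskinit while
+pushing the new grace-period number down the tree:
+
+  rcu_for_each_node_breadth_first(rsp, rnp) {
+    raw_spin_lock_irq(&rnp->lock);
+    rnp->qsmask = rnp->qsmaskinit;  /* Wait on all online children. */
+    rnp->gpnum = rsp->gpnum;        /* Publish the new GP number. */
+    raw_spin_unlock_irq(&rnp->lock);
+  }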

Quick Quiz 4: +Why are these bitmasks protected by locking? +Come on, haven't you heard of atomic instructions??? +
Answer + +

Blocked-Task Management
+ +

TREE_PREEMPT_RCU allows tasks to be preempted in the +midst of their RCU read-side critical sections, and these tasks +must be tracked explicitly. +The details of exactly why and how they are tracked will be covered +in a separate article on RCU read-side processing. +For now, it is enough to know that the rcu_node +structure tracks them. + +

+  1   struct list_head blkd_tasks;
+  2   struct list_head *gp_tasks;
+  3   struct list_head *exp_tasks;
+  4   bool wait_blkd_tasks;
+
+ +

The ->blkd_tasks field is a list header for +the list of blocked and preempted tasks. +As tasks undergo context switches within RCU read-side critical +sections, their task_struct structures are enqueued +(via the task_struct's ->rcu_node_entry +field) onto the head of the ->blkd_tasks list for the +leaf rcu_node structure corresponding to the CPU +on which the outgoing context switch executed. +As these tasks later exit their RCU read-side critical sections, +they remove themselves from the list. +This list is therefore in reverse time order, so that if one of the tasks +is blocking the current grace period, all subsequent tasks must +also be blocking that same grace period. +Therefore, a single pointer into this list suffices to track +all tasks blocking a given grace period. +That pointer is stored in ->gp_tasks for normal +grace periods and in ->exp_tasks for expedited +grace periods. +These last two fields are NULL if either there is +no grace period in flight or if there are no blocked tasks +preventing that grace period from completing. +If either of these two pointers is referencing a task that +removes itself from the ->blkd_tasks list, +then that task must advance the pointer to the next task on +the list, or set the pointer to NULL if there +are no subsequent tasks on the list. + +
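+
+The enqueue step is sketched below; this is a simplification of the
+bookkeeping done when a task is preempted within an RCU read-side
+critical section (locking and the expedited case are elided):
+
+  /* Task t was preempted in a read-side critical section on the
+   * CPU corresponding to rdp, whose leaf rcu_node is rnp. */
+  list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
+  t->rcu_blocked_node = rnp;
+  if (rnp->qsmask & rdp->grpmask)  /* GP waiting on this CPU? */
+    rnp->gp_tasks = &t->rcu_node_entry;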

For example, suppose that tasks T1, T2, and T3 are +all hard-affinitied to the largest-numbered CPU in the system. +Then if task T1 blocked in an RCU read-side +critical section, then an expedited grace period started, +then task T2 blocked in an RCU read-side critical section, +then a normal grace period started, and finally task T3 blocked +in an RCU read-side critical section, then the state of the +last leaf rcu_node structure's blocked-task list +would be as shown below: +

blkd_task.svg + +

Task T1 is blocking both grace periods, task T2 is +blocking only the normal grace period, and task T3 is blocking +neither grace period. +Note that these tasks will not remove themselves from this list +immediately upon resuming execution. +They will instead remain on the list until they execute the outermost +rcu_read_unlock() that ends their RCU read-side critical +section. + +

+The ->wait_blkd_tasks field indicates whether or not +the current grace period is waiting on a blocked task. + +

Sizing the rcu_node Array
+ +

The rcu_node array is sized via a series of +C-preprocessor expressions as follows: + +

+ 1 #ifdef CONFIG_RCU_FANOUT
+ 2 #define RCU_FANOUT CONFIG_RCU_FANOUT
+ 3 #else
+ 4 # ifdef CONFIG_64BIT
+ 5 # define RCU_FANOUT 64
+ 6 # else
+ 7 # define RCU_FANOUT 32
+ 8 # endif
+ 9 #endif
+10
+11 #ifdef CONFIG_RCU_FANOUT_LEAF
+12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
+13 #else
+14 # ifdef CONFIG_64BIT
+15 # define RCU_FANOUT_LEAF 64
+16 # else
+17 # define RCU_FANOUT_LEAF 32
+18 # endif
+19 #endif
+20
+21 #define RCU_FANOUT_1        (RCU_FANOUT_LEAF)
+22 #define RCU_FANOUT_2        (RCU_FANOUT_1 * RCU_FANOUT)
+23 #define RCU_FANOUT_3        (RCU_FANOUT_2 * RCU_FANOUT)
+24 #define RCU_FANOUT_4        (RCU_FANOUT_3 * RCU_FANOUT)
+25
+26 #if NR_CPUS <= RCU_FANOUT_1
+27 #  define RCU_NUM_LVLS        1
+28 #  define NUM_RCU_LVL_0        1
+29 #  define NUM_RCU_NODES        NUM_RCU_LVL_0
+30 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0 }
+31 #  define RCU_NODE_NAME_INIT  { "rcu_node_0" }
+32 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0" }
+33 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0" }
+34 #elif NR_CPUS <= RCU_FANOUT_2
+35 #  define RCU_NUM_LVLS        2
+36 #  define NUM_RCU_LVL_0        1
+37 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+38 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
+39 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
+40 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1" }
+41 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1" }
+42 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1" }
+43 #elif NR_CPUS <= RCU_FANOUT_3
+44 #  define RCU_NUM_LVLS        3
+45 #  define NUM_RCU_LVL_0        1
+46 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+47 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+48 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
+49 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
+50 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
+51 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
+52 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
+53 #elif NR_CPUS <= RCU_FANOUT_4
+54 #  define RCU_NUM_LVLS        4
+55 #  define NUM_RCU_LVL_0        1
+56 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
+57 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+58 #  define NUM_RCU_LVL_3        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+59 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
+60 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
+61 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
+62 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
+63 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
+64 #else
+65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
+66 #endif
+
+ +

The maximum number of levels in the rcu_node structure +is currently limited to four, as specified by lines 21-24 +and the structure of the subsequent “if” statement. +For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which +should be sufficient for the next few years at least. +For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which +should see us through the next decade or so. +This four-level tree also allows kernels built with +CONFIG_RCU_FANOUT=8 to support up to 4096 CPUs, +which might be useful in very large systems having eight CPUs per +socket (but please note that no one has yet shown any measurable +performance degradation due to misaligned socket and rcu_node +boundaries). +In addition, building kernels with a full four levels of rcu_node +tree permits better testing of RCU's combining-tree code. + +

The RCU_FANOUT symbol controls how many children +are permitted at each non-leaf level of the rcu_node tree. +If the CONFIG_RCU_FANOUT Kconfig option is not specified, +it is set based on the word size of the system, which is also +the Kconfig default. + +

The RCU_FANOUT_LEAF symbol controls how many CPUs are +handled by each leaf rcu_node structure. +Experience has shown that allowing a given leaf rcu_node +structure to handle 64 CPUs, as permitted by the number of bits in +the ->qsmask field on a 64-bit system, results in +excessive contention for the leaf rcu_node structures' +->lock fields. +The number of CPUs per leaf rcu_node structure is therefore +limited to 16 given the default value of CONFIG_RCU_FANOUT_LEAF. +If CONFIG_RCU_FANOUT_LEAF is unspecified, the value +selected is based on the word size of the system, just as for +CONFIG_RCU_FANOUT. +Lines 11-19 perform this computation. + +

Lines 21-24 compute the maximum number of CPUs supported by +a single-level (which contains a single rcu_node structure), +two-level, three-level, and four-level rcu_node tree, +respectively, given the fanout specified by RCU_FANOUT +and RCU_FANOUT_LEAF. +These numbers of CPUs are retained in the +RCU_FANOUT_1, +RCU_FANOUT_2, +RCU_FANOUT_3, and +RCU_FANOUT_4 +C-preprocessor variables, respectively. + +

These variables are used to control the C-preprocessor #if +statement spanning lines 26-66 that computes the number of +rcu_node structures required for each level of the tree, +as well as the number of levels required. +The number of levels is placed in the RCU_NUM_LVLS +C-preprocessor variable by lines 27, 35, 44, and 54. +The number of rcu_node structures for the topmost level +of the tree is always exactly one, and this value is unconditionally +placed into NUM_RCU_LVL_0 by lines 28, 36, 45, and 55. +The rest of the levels (if any) of the rcu_node tree +are computed by dividing the maximum number of CPUs by the +fanout supported by the number of levels from the current level down, +rounding up. This computation is performed by lines 37, +46-47, and 56-58. +Lines 31-33, 40-42, 50-52, and 61-63 create initializers +for lockdep lock-class names. +Finally, lines 64-66 produce an error if the maximum number of +CPUs is too large for the specified fanout. +
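+
+As a worked example, consider a hypothetical 64-bit kernel built with
+NR_CPUS=4096, CONFIG_RCU_FANOUT=64, and the default
+CONFIG_RCU_FANOUT_LEAF=16:
+
+  RCU_FANOUT_1 = 16                        (CPUs per leaf)
+  RCU_FANOUT_2 = 16 * 64    =  1,024
+  RCU_FANOUT_3 = 1,024 * 64 = 65,536
+
+  NR_CPUS (4096) <= RCU_FANOUT_3, so RCU_NUM_LVLS = 3
+  NUM_RCU_LVL_0 = 1
+  NUM_RCU_LVL_1 = DIV_ROUND_UP(4096, 1024) =   4
+  NUM_RCU_LVL_2 = DIV_ROUND_UP(4096,   16) = 256
+  NUM_RCU_NODES = 1 + 4 + 256 = 261
+
+This matches the three-level tree called out earlier for
+distribution kernels built with NR_CPUS=4096.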

+The rcu_data Structure

+ +

The rcu_data structure maintains the per-CPU state for the +corresponding flavor of RCU. +The fields in this structure may be accessed only from the corresponding +CPU (and from tracing) unless otherwise stated. +This structure is the +focus of quiescent-state detection and RCU callback queuing. +It also tracks its relationship to the corresponding leaf +rcu_node structure to allow more-efficient +propagation of quiescent states up the rcu_node +combining tree. +Like the rcu_node structure, it provides a local +copy of the grace-period information to allow synchronization-free +access to this information from the corresponding CPU. +Finally, this structure records past dyntick-idle state +for the corresponding CPU and also tracks statistics. +

The rcu_data structure's fields are discussed, +singly and in groups, in the following sections. + +

Connection to Other Data Structures
+ +

This portion of the rcu_data structure is declared +as follows: + +

+  1   int cpu;
+  2   struct rcu_state *rsp;
+  3   struct rcu_node *mynode;
+  4   struct rcu_dynticks *dynticks;
+  5   unsigned long grpmask;
+  6   bool beenonline;
+
+ +

The ->cpu field contains the number of the +corresponding CPU, the ->rsp pointer references +the corresponding rcu_state structure (and is most frequently +used to locate the name of the corresponding flavor of RCU for tracing), +and the ->mynode field references the corresponding +rcu_node structure. +The ->mynode field is used to propagate quiescent states +up the combining tree. +

The ->dynticks pointer references the +rcu_dynticks structure corresponding to this +CPU. +Recall that a single per-CPU instance of the rcu_dynticks +structure is shared among all flavors of RCU. +These first four fields are constant and therefore require no +synchronization. +

The ->grpmask field indicates the bit in +the ->mynode->qsmask corresponding to this +rcu_data structure, and is also used when propagating +quiescent states. +The ->beenonline flag is set whenever the corresponding +CPU comes online, which means that the debugfs tracing need not dump +out any rcu_data structure for which this flag is not set. + +

Quiescent-State and Grace-Period Tracking
+ +

This portion of the rcu_data structure is declared +as follows: + +

+  1   unsigned long completed;
+  2   unsigned long gpnum;
+  3   bool cpu_no_qs;
+  4   bool core_needs_qs;
+  5   bool gpwrap;
+  6   unsigned long rcu_qs_ctr_snap;
+
+ +

The completed and gpnum +fields are the counterparts of the fields of the same name +in the rcu_state and rcu_node structures. +They may each lag up to one behind their rcu_node +counterparts, but in CONFIG_NO_HZ_IDLE and +CONFIG_NO_HZ_FULL kernels can lag +arbitrarily far behind for CPUs in dyntick-idle mode (but these counters +will catch up upon exit from dyntick-idle mode). +If a given rcu_data structure's ->gpnum and +->completed fields are equal, then this rcu_data +structure believes that RCU is idle. +Otherwise, as with the rcu_state and rcu_node +structures, +the ->gpnum field will be one greater than the +->completed field, with ->gpnum +indicating which grace period this rcu_data believes +is still being waited for. +

Quick Quiz 5: +All this replication of the grace period numbers can only cause +massive confusion. +Why not just keep a global pair of counters and be done with it??? +
Answer + +

The ->cpu_no_qs flag indicates that the +CPU has not yet passed through a quiescent state, +while the ->core_needs_qs flag indicates that the +RCU core needs a quiescent state from the corresponding CPU. +The ->gpwrap field indicates that the corresponding +CPU has remained idle for so long that the completed +and gpnum counters are in danger of overflow, which +will cause the CPU to disregard the values of its counters on +its next exit from idle. +Finally, the rcu_qs_ctr_snap field is used to detect +cases where a given operation has resulted in a quiescent state +for all flavors of RCU, for example, cond_resched_rcu_qs(). + +

RCU Callback Handling
+ +

In the absence of CPU-hotplug events, RCU callbacks are invoked by +the same CPU that registered them. +This is strictly a cache-locality optimization: callbacks can and +do get invoked on CPUs other than the one that registered them. +After all, if the CPU that registered a given callback has gone +offline before the callback can be invoked, there really is no other +choice. + +

This portion of the rcu_data structure is declared +as follows: + +

+ 1 struct rcu_head *nxtlist;
+ 2 struct rcu_head **nxttail[RCU_NEXT_SIZE];
+ 3 unsigned long nxtcompleted[RCU_NEXT_SIZE];
+ 4 long qlen_lazy;
+ 5 long qlen;
+ 6 long qlen_last_fqs_check;
+ 7 unsigned long n_force_qs_snap;
+ 8 unsigned long n_cbs_invoked;
+ 9 unsigned long n_cbs_orphaned;
+10 unsigned long n_cbs_adopted;
+11 long blimit;
+
+ +

The ->nxtlist pointer and the +->nxttail[] array form a four-segment list with +older callbacks near the head and newer ones near the tail. +Each segment contains callbacks with the corresponding relationship +to the current grace period. +The pointer out of the end of each of the four segments is referenced +by the element of the ->nxttail[] array indexed by +RCU_DONE_TAIL (for callbacks handled by a prior grace period), +RCU_WAIT_TAIL (for callbacks waiting on the current grace period), +RCU_NEXT_READY_TAIL (for callbacks that will wait on the next +grace period), and +RCU_NEXT_TAIL (for callbacks that are not yet associated +with a specific grace period) +respectively, as shown in the following figure. + +

nxtlist.svg + +

In this figure, the ->nxtlist pointer references the +first +RCU callback in the list. +The ->nxttail[RCU_DONE_TAIL] array element references +the ->nxtlist pointer itself, indicating that none +of the callbacks is ready to invoke. +The ->nxttail[RCU_WAIT_TAIL] array element references callback +CB 2's ->next pointer, which indicates that +CB 1 and CB 2 are both waiting on the current grace period. +The ->nxttail[RCU_NEXT_READY_TAIL] array element +references the same RCU callback that ->nxttail[RCU_WAIT_TAIL] +does, which indicates that there are no callbacks waiting on the next +RCU grace period. +The ->nxttail[RCU_NEXT_TAIL] array element references +CB 4's ->next pointer, indicating that all the +remaining RCU callbacks have not yet been assigned to an RCU grace +period. +Note that the ->nxttail[RCU_NEXT_TAIL] array element +always references the last RCU callback's ->next pointer +unless the callback list is empty, in which case it references +the ->nxtlist pointer. + +
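+
+Given this layout, enqueuing a callback reduces to appending it at the
+RCU_NEXT_TAIL segment, as in this simplified excerpt of the
+list manipulation done by __call_rcu():
+
+  head->func = func;
+  head->next = NULL;
+  *rdp->nxttail[RCU_NEXT_TAIL] = head;       /* Append at the end... */
+  rdp->nxttail[RCU_NEXT_TAIL] = &head->next; /* ...and advance the tail. */
+  rdp->qlen++;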

CPUs advance their callbacks from the +RCU_NEXT_TAIL to the RCU_NEXT_READY_TAIL to the +RCU_WAIT_TAIL to the RCU_DONE_TAIL list segments +as grace periods advance. +The CPU advances the callbacks in its rcu_data structure +whenever it notices that another RCU grace period has completed. +The CPU detects the completion of an RCU grace period by noticing +that the value of its rcu_data structure's +->completed field differs from that of its leaf +rcu_node structure. +Recall that each rcu_node structure's +->completed field is updated at the end of each +grace period. + +

The ->nxtcompleted[] array records grace-period +numbers corresponding to the list segments. +This allows CPUs that go idle for extended periods to determine +which of their callbacks are ready to be invoked after reawakening. + +

The ->qlen counter contains the number of +callbacks in ->nxtlist, and the +->qlen_lazy counter contains the number of those callbacks that +are known to only free memory, and whose invocation can therefore +be safely deferred. +The ->qlen_last_fqs_check and +->n_force_qs_snap fields coordinate the forcing of quiescent +states from call_rcu() and friends when callback +lists grow excessively long. +

The ->n_cbs_invoked, +->n_cbs_orphaned, and ->n_cbs_adopted +fields count the number of callbacks invoked, +sent to other CPUs when this CPU goes offline, +and received from other CPUs when those other CPUs go offline. +Finally, the ->blimit counter is the maximum number of +RCU callbacks that may be invoked at a given time. + +

Dyntick-Idle Handling
+ +

This portion of the rcu_data structure is declared +as follows: + +

+  1   int dynticks_snap;
+  2   unsigned long dynticks_fqs;
+
+ +The ->dynticks_snap field is used to take a snapshot +of the corresponding CPU's dyntick-idle state when forcing +quiescent states, and is therefore accessed from other CPUs. +Finally, the ->dynticks_fqs field is used to +count the number of times this CPU is determined to be in +dyntick-idle state, and is used for tracing and debugging purposes. + +

+The rcu_dynticks Structure

+ +

The rcu_dynticks structure maintains the per-CPU dyntick-idle state +for the corresponding CPU. +Unlike the other structures, rcu_dynticks is not +replicated over the different flavors of RCU. +The fields in this structure may be accessed only from the corresponding +CPU (and from tracing) unless otherwise stated. +Its fields are as follows: +

+  1   int dynticks_nesting;
+  2   int dynticks_nmi_nesting;
+  3   atomic_t dynticks;
+
+ +

The ->dynticks_nesting field counts the +nesting depth of normal interrupts. +In addition, this counter is incremented when exiting dyntick-idle +mode and decremented when entering it. +This counter can therefore be thought of as counting the number +of reasons why this CPU cannot be permitted to enter dyntick-idle +mode, aside from non-maskable interrupts (NMIs). +NMIs are counted by the ->dynticks_nmi_nesting +field, except that NMIs that interrupt non-dyntick-idle execution +are not counted. + +

Finally, the ->dynticks field counts the corresponding +CPU's transitions to and from dyntick-idle mode, so that this counter +has an even value when the CPU is in dyntick-idle mode and an odd +value otherwise. + +
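+
+This even/odd convention makes the extended-quiescent-state check a
+simple test of the counter's bottom bit, as in the following sketch
+(the function name is illustrative; the actual logic lives in
+dyntick_save_progress_counter() and friends):
+
+  /* Sample the CPU's ->dynticks counter: an even value means the
+   * CPU is in dyntick-idle mode, an extended quiescent state. */
+  static bool rcu_dynticks_cpu_is_idle(struct rcu_dynticks *rdtp)
+  {
+    int snap = atomic_add_return(0, &rdtp->dynticks);
+
+    return (snap & 0x1) == 0;
+  }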

Quick Quiz 6: +Why not just count all NMIs? +Wouldn't that be simpler and less error prone? +
Answer + +

Additional fields are present for some special-purpose +builds, and are discussed separately. + +

+The rcu_head Structure

+ +

Each rcu_head structure represents an RCU callback. +These structures are normally embedded within RCU-protected data +structures whose algorithms use asynchronous grace periods. +In contrast, when using algorithms that block waiting for RCU grace periods, +RCU users need not provide rcu_head structures. + +

The rcu_head structure has fields as follows: + +

+  1   struct rcu_head *next;
+  2   void (*func)(struct rcu_head *head);
+
+ +

The ->next field is used +to link the rcu_head structures together in the +lists within the rcu_data structures. +The ->func field is a pointer to the function +to be called when the callback is ready to be invoked, and +this function is passed a pointer to the rcu_head +structure. +However, kfree_rcu() uses the ->func +field to record the offset of the rcu_head +structure within the enclosing RCU-protected data structure. + +

Both of these fields are used internally by RCU. +From the viewpoint of RCU users, this structure is an +opaque “cookie”. + +

Quick Quiz 7: +Given that the callback function ->func +is passed a pointer to the rcu_head structure, +how is that function supposed to find the beginning of the +enclosing RCU-protected data structure? +
Answer + +

+RCU-Specific Fields in the task_struct Structure

+ +

The CONFIG_TREE_PREEMPT_RCU and CONFIG_TASKS_RCU +implementations use some additional fields in the +task_struct structure: +

+ 1 #ifdef CONFIG_PREEMPT_RCU
+ 2   int rcu_read_lock_nesting;
+ 3   union rcu_special rcu_read_unlock_special;
+ 4   struct list_head rcu_node_entry;
+ 5   struct rcu_node *rcu_blocked_node;
+ 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
+ 7 #ifdef CONFIG_TASKS_RCU
+ 8   unsigned long rcu_tasks_nvcsw;
+ 9   bool rcu_tasks_holdout;
+10   struct list_head rcu_tasks_holdout_list;
+11   int rcu_tasks_idle_cpu;
+12 #endif /* #ifdef CONFIG_TASKS_RCU */
+
+ +

The ->rcu_read_lock_nesting field records the +nesting level for RCU read-side critical sections, and +the ->rcu_read_unlock_special field is a bitmask +that records special conditions that require rcu_read_unlock() +to do additional work. +The ->rcu_node_entry field is used to form lists of +tasks that have blocked within preemptible-RCU read-side critical +sections and the ->rcu_blocked_node field references +the rcu_node structure whose list this task is a member of, +or NULL if it is not blocked within a preemptible-RCU +read-side critical section. + +
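+
+The read-side entry and exit code is correspondingly lightweight.
+The following is close to the preemptible-RCU implementation, though
+simplified, most notably omitting the negative-nesting trickery that
+the real rcu_read_unlock() uses to avoid recursion:
+
+  void __rcu_read_lock(void)
+  {
+    current->rcu_read_lock_nesting++;
+    barrier();  /* Critical section after entry code. */
+  }
+
+  void __rcu_read_unlock(void)
+  {
+    struct task_struct *t = current;
+
+    barrier();  /* Critical section before exit code. */
+    if (--t->rcu_read_lock_nesting == 0 &&
+        READ_ONCE(t->rcu_read_unlock_special.s))
+      rcu_read_unlock_special(t);  /* Blocked or boosted. */
+  }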

The ->rcu_tasks_nvcsw field tracks the number of +voluntary context switches that this task had undergone at the +beginning of the current tasks-RCU grace period, +->rcu_tasks_holdout is set if the current tasks-RCU +grace period is waiting on this task, ->rcu_tasks_holdout_list +is a list element enqueuing this task on the holdout list, +and ->rcu_tasks_idle_cpu tracks which CPU this +idle task is running on, but only if the task is currently running, +that is, if the CPU is currently idle. +

Quick Quiz 8: +Why is ->rcu_boosted required, given that there is +a RCU_READ_UNLOCK_BOOSTED bit in +->rcu_read_unlock_special? +
Answer + +

+Accessor Functions

+ +

The following listing shows the +rcu_get_root(), rcu_for_each_node_breadth_first(), +rcu_for_each_nonleaf_node_breadth_first(), and +rcu_for_each_leaf_node() function and macros: +

+  1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
+  2 {
+  3   return &rsp->node[0];
+  4 }
+  5
+  6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
+  7   for ((rnp) = &(rsp)->node[0]; \
+  8        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
+  9
+ 10 #define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \
+ 11   for ((rnp) = &(rsp)->node[0]; \
+ 12        (rnp) < (rsp)->level[NUM_RCU_LVLS - 1]; (rnp)++)
+ 13
+ 14 #define rcu_for_each_leaf_node(rsp, rnp) \
+ 15   for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
+ 16        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
+
+ +

The rcu_get_root() function simply returns a pointer to the +first element of the specified rcu_state structure's +->node[] array, which is the root rcu_node +structure. +

As noted earlier, the rcu_for_each_node_breadth_first() +macro takes advantage of the layout of the rcu_node +structures in the rcu_state structure's +->node[] array, performing a breadth-first traversal by +simply traversing the array in order. +The rcu_for_each_nonleaf_node_breadth_first() macro operates +similarly, but traverses only the first part of the array, thus excluding +the leaf rcu_node structures. +Finally, the rcu_for_each_leaf_node() macro traverses only +the last part of the array, thus traversing only the leaf +rcu_node structures. + +
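+
+A hypothetical debugging function shows how these accessors are used
+in practice (the function itself is illustrative, not in the kernel):
+
+  /* Print the range of CPUs covered by each leaf rcu_node. */
+  static void rcu_dump_leaves(struct rcu_state *rsp)
+  {
+    struct rcu_node *rnp;
+
+    rcu_for_each_leaf_node(rsp, rnp)
+      pr_info("%s leaf: CPUs %d-%d\n",
+              rsp->name, rnp->grplo, rnp->grphi);
+  }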

Quick Quiz 9: +What do rcu_for_each_nonleaf_node_breadth_first() and +rcu_for_each_leaf_node() do if the rcu_node tree +contains only a single node? +
Answer + +

+Summary

+ +So each flavor of RCU is represented by an rcu_state structure, +which contains a combining tree of rcu_node and +rcu_data structures. +Finally, in CONFIG_NO_HZ_IDLE kernels, each CPU's dyntick-idle +state is tracked by an rcu_dynticks structure. + +If you made it this far, you are well prepared to read the code +walkthroughs in the other articles in this series. + +

+Acknowledgments

+ +I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul +Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn +for helping me get this document into a more human-readable state. + +

+Legal Statement

+ +

This work represents the view of the author and does not necessarily +represent the view of IBM. + +

Linux is a registered trademark of Linus Torvalds. + +

Other company, product, and service names may be trademarks or +service marks of others. + + +

+Answers to Quick Quizzes

+ + +

Quick Quiz 1: +Why isn't the fanout at the leaves also 64? + + +

Answer: +Because there are more types of events that affect the leaf-level +rcu_node structures than further up the tree. +Therefore, if the leaf rcu_node structures have +fanout of 64, the contention on these structures' +->lock fields becomes excessive. +Experimentation on a wide variety of systems has shown that a fanout +of 16 works well for the leaves of the rcu_node tree. +

Of course, further experience with systems having hundreds or +thousands of CPUs may demonstrate that the fanout for the non-leaf +rcu_node structures must also be reduced. +Such reduction can be easily carried out when and if it proves necessary. +In the meantime, if you are using such a system and running into +contention problems on the non-leaf rcu_node structures, +you may use the CONFIG_RCU_FANOUT kernel configuration +parameter to reduce the non-leaf fanout as needed. + +

Kernels built for systems with strong NUMA characteristics might +also need to adjust CONFIG_RCU_FANOUT so that the +domains of the rcu_node structures align with hardware +boundaries. +However, there has thus far been no need for this. + + +

Back to Quick Quiz 1. + + +

Quick Quiz 2: +Wait a minute! +You said that the rcu_node structures formed a tree, +but they are declared as a flat array! +What gives? + + +

Answer: +The tree is laid out in the array. +The first node in the array is the head, the next set of nodes in the +array are children of the head node, and so on until the last set of +nodes in the array are the leaves. +

See the following diagrams to see how this works. + + +

Back to Quick Quiz 2. + + +

Quick Quiz 3: +Given that this array represents a tree, why can't the diagram that +includes the ->level array be planar? + + +

Answer: +It can be planar, it is just that it looks uglier that way. +But don't take my word for it, draw it yourself! + +

But if you draw the tree to be tree-shaped rather than +array-shaped, it is easy to draw a planar representation: + +

TreeLevel.svg + + +

Back to Quick Quiz 3. + + +

Quick Quiz 4: +Why are these bitmasks protected by locking? +Come on, haven't you heard of atomic instructions??? + + +

Answer: +Lockless grace-period computation! Such a tantalizing possibility! + +

But consider the following sequence of events: + +

    +
  1. CPU 0 has been in dyntick-idle mode for quite + some time. + When it wakes up, it notices that the current RCU + grace period needs it to report in, so it sets a + flag where the scheduling clock interrupt will find it. +
  2. Meanwhile, CPU 1 is running force_quiescent_state(), + and notices that CPU 0 has been in dyntick idle mode, + which qualifies as an extended quiescent state. +
  3. CPU 0's scheduling clock interrupt fires in the + middle of an RCU read-side critical section, and notices + that the RCU core needs something, so commences RCU softirq + processing. +
  4. CPU 0's softirq handler executes and is just about ready + to report its quiescent state up the rcu_node + tree. +
  5. But CPU 1 beats it to the punch, completing the current + grace period and starting a new one. +
  6. CPU 0 now reports its quiescent state for the wrong + grace period. + That grace period might now end before the RCU read-side + critical section. + If that happens, disaster will ensue. +
+ +

So the locking is absolutely required in order to coordinate clearing +of the bits with the grace-period numbers in ->gpnum +and ->completed. + + +

Back to Quick Quiz 4. + + +

Quick Quiz 5: +All this replication of the grace period numbers can only cause +massive confusion. +Why not just keep a global pair of counters and be done with it??? + + +

Answer: +Because if there were only a single global pair of grace-period numbers, +there would need to be a single global lock to allow safely accessing +and updating them. +And if we are not going to have a single global lock, we need to carefully +manage the numbers on a per-node basis. +Recall from the answer to a previous Quick Quiz that the consequences +of applying a previously sampled quiescent state to the wrong +grace period are quite severe. +

Back to Quick Quiz 5. + + +

Quick Quiz 6: +Why not just count all NMIs? +Wouldn't that be simpler and less error prone? + + +

Answer: +It seems simpler only until you think hard about how to go about +updating the rcu_dynticks structure's +->dynticks field. + + +

Back to Quick Quiz 6. + + +

Quick Quiz 7: +Given that the callback function ->func +is passed a pointer to the rcu_head structure, +how is that function supposed to find the beginning of the +enclosing RCU-protected data structure? + + +

Answer: +In actual practice, there is a separate callback function per +type of RCU-protected data structure. +The callback function can therefore use the container_of() +macro in the Linux kernel (or other pointer-manipulation facilities +in other software environments) to find the beginning of the +enclosing structure. + + +
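+
+For example, a hypothetical RCU-protected structure foo
+might embed its rcu_head and recover the enclosing structure
+in its callback as follows (the type and function names are
+illustrative only):
+
+  struct foo {
+    int a;
+    struct rcu_head rh;  /* Embedded rcu_head. */
+  };
+
+  static void foo_reclaim(struct rcu_head *rhp)
+  {
+    struct foo *fp = container_of(rhp, struct foo, rh);
+
+    kfree(fp);
+  }
+
+  /* Updater, after unlinking fp from readers' view: */
+  call_rcu(&fp->rh, foo_reclaim);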

Back to Quick Quiz 7. + + +

Quick Quiz 8: +Why is ->rcu_boosted required, given that there is +a RCU_READ_UNLOCK_BOOSTED bit in +->rcu_read_unlock_special? + + +

Answer: +The ->rcu_read_unlock_special field may only be +updated by the task itself. +By definition, RCU priority boosting must be carried out by some +other task. +This other task cannot safely update the boosted task's +->rcu_read_unlock_special field without the use of +expensive atomic instructions. +The ->rcu_boosted field is therefore used by the +boosting task to let the boosted task know that it has been boosted. +The boosted task makes use of the +RCU_READ_UNLOCK_BOOSTED bit in +->rcu_read_unlock_special +when deboosting itself. + + +

Back to Quick Quiz 8. + + +

Quick Quiz 9: +What do rcu_for_each_nonleaf_node_breadth_first() and +rcu_for_each_leaf_node() do if the rcu_node tree +contains only a single node? + + +

Answer: +In the single-node case, +rcu_for_each_nonleaf_node_breadth_first() is a no-op +and rcu_for_each_leaf_node() traverses the single node. + + +

Back to Quick Quiz 9. + + + + diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.htmlx b/Documentation/RCU/Design/Data-Structures/Data-Structures.htmlx new file mode 100644 index 000000000000..c08fd8e9574a --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.htmlx @@ -0,0 +1,1272 @@ + + + A Tour Through TREE_RCU's Data Structures [LWN.net] + + +

January 27, 2016

+

This article was contributed by Paul E. McKenney

+ +

Introduction

+ +This document describes RCU's major data structures and their relationship +to each other. + +
    +
  1. + Data-Structure Relationships +
  2. + The rcu_state Structure +
  3. + The rcu_node Structure +
  4. + The rcu_data Structure +
  5. + The rcu_dynticks Structure +
  6. + The rcu_head Structure +
  7. + RCU-Specific Fields in the task_struct Structure +
  8. + Accessor Functions +
+ +At the end we have the +answers to the quick quizzes. + +

Data-Structure Relationships

+ +

RCU is for all intents and purposes a large state machine, and its +data structures maintain the state in such a way as to allow RCU readers +to execute extremely quickly, while also processing the RCU grace periods +requested by updaters in an efficient and extremely scalable fashion. +The efficiency and scalability of RCU updaters is provided primarily +by a combining tree, as shown below: + +

BigTreeClassicRCU.svg + +

This diagram shows an enclosing rcu_state structure
containing a tree of rcu_node structures.
Each leaf node of the rcu_node tree has up to 16
rcu_data structures associated with it, so that there
are NR_CPUS rcu_data structures,
one for each possible CPU.
This tree is adjusted at boot time, if needed, to handle the
common case where nr_cpu_ids is much less than
NR_CPUS.
For example, a number of Linux distributions set NR_CPUS=4096,
which results in a three-level rcu_node tree.
If the actual hardware has only 16 CPUs, RCU will adjust itself
at boot time, resulting in an rcu_node tree with only a single node.

The purpose of this combining tree is to allow per-CPU events +such as quiescent states, dyntick-idle transitions, +and CPU hotplug operations to be processed efficiently +and scalably. +Quiescent states are recorded by the per-CPU rcu_data structures, +and other events are recorded by the leaf-level rcu_node +structures. +All of these events are combined at each level of the tree until finally +grace periods are completed at the tree's root rcu_node +structure. +A grace period can be completed at the root once every CPU +(or, in the case of CONFIG_TREE_PREEMPT_RCU, task) +has passed through a quiescent state. +Once a grace period has completed, record of that fact is propagated +back down the tree. + +

As can be seen from the diagram, on a 64-bit system +a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout +of 64 at the root and a fanout of 16 at the leaves. + +

@@QQ@@ +Why isn't the fanout at the leaves also 64? +

@@QQA@@
Because there are more types of events that affect the leaf-level
rcu_node structures than further up the tree.
Therefore, if the leaf rcu_node structures have
a fanout of 64, the contention on these structures'
->lock fields becomes excessive.
Experimentation on a wide variety of systems has shown that a fanout
of 16 works well for the leaves of the rcu_node tree.

Of course, further experience with systems having hundreds or +thousands of CPUs may demonstrate that the fanout for the non-leaf +rcu_node structures must also be reduced. +Such reduction can be easily carried out when and if it proves necessary. +In the meantime, if you are using such a system and running into +contention problems on the non-leaf rcu_node structures, +you may use the CONFIG_RCU_FANOUT kernel configuration +parameter to reduce the non-leaf fanout as needed. + +

Kernels built for systems with strong NUMA characteristics might +also need to adjust CONFIG_RCU_FANOUT so that the +domains of the rcu_node structures align with hardware +boundaries. +However, there has thus far been no need for this. +

@@QQE@@ + +

If your system has more than 1,024 CPUs (or more than 512 CPUs on +a 32-bit system), then RCU will automatically add more levels to the +tree. +For example, if you are crazy enough to build a 64-bit system with 65,536 +CPUs, RCU would configure the rcu_node tree as follows: + +

HugeTreeClassicRCU.svg + +

RCU currently permits up to a four-level tree, which on a 64-bit system +accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for +32-bit systems. +On the other hand, you can set CONFIG_RCU_FANOUT to be +as small as 2 if you wish, which would permit only 16 CPUs, which +is useful for testing. + +

The Linux kernel actually supports multiple flavors of RCU +running concurrently, so RCU builds separate data structures for each +flavor. +For example, for CONFIG_TREE_RCU=y kernels, RCU provides +rcu_sched and rcu_bh, as shown below: + +

BigTreeClassicRCUBH.svg + +

Energy efficiency is increasingly important, and for that +reason the Linux kernel provides CONFIG_NO_HZ_IDLE, which +turns off the scheduling-clock interrupts on idle CPUs, which in +turn allows those CPUs to attain deeper sleep states and to consume +less energy. +CPUs whose scheduling-clock interrupts have been turned off are +said to be in dyntick-idle mode. +RCU must handle dyntick-idle CPUs specially +because RCU would otherwise wake up each CPU on every grace period, +which would defeat the whole purpose of CONFIG_NO_HZ_IDLE. +RCU uses the rcu_dynticks structure to track +which CPUs are in dyntick idle mode, as shown below: + +

BigTreeClassicRCUBHdyntick.svg + +

However, if a CPU is in dyntick-idle mode, it is in that mode +for all flavors of RCU. +Therefore, a single rcu_dynticks structure is allocated per +CPU, and all of a given CPU's rcu_data structures share +that rcu_dynticks, as shown in the figure. + +

Kernels built with CONFIG_TREE_PREEMPT_RCU support +rcu_preempt in addition to rcu_sched and rcu_bh, as shown below: + +

BigTreePreemptRCUBHdyntick.svg + +

RCU updaters wait for normal grace periods by registering
RCU callbacks, either directly via call_rcu() and
friends (namely call_rcu_bh() and call_rcu_sched(),
there being a separate interface per flavor of RCU)
or indirectly via synchronize_rcu() and friends.
RCU callbacks are represented by rcu_head structures,
which are queued on rcu_data structures while they are
waiting for a grace period to elapse, as shown in the following figure:

BigTreePreemptRCUBHdyntickCB.svg + +

This figure shows how TREE_RCU's and +TREE_PREEMPT_RCU's major data structures are related. +Lesser data structures will be introduced with the algorithms that +make use of them. + +

Note that each of the data structures in the above figure has +its own synchronization: + +

    +
  1. Each rcu_state structure has a lock and a mutex,
     and some fields are protected by the corresponding root
     rcu_node structure's lock.
  2. Each rcu_node structure has a spinlock. +
  3. The fields in rcu_data are private to the corresponding + CPU, although a few can be read and written by other CPUs. +
  4. Similarly, the fields in rcu_dynticks are private + to the corresponding CPU, although a few can be read by + other CPUs. +
+ +

It is important to note that different data structures can have +very different ideas about the state of RCU at any given time. +For but one example, awareness of the start or end of a given RCU +grace period propagates slowly through the data structures. +This slow propagation is absolutely necessary for RCU to have good +read-side performance. +If this balkanized implementation seems foreign to you, one useful +trick is to consider each instance of these data structures to be +a different person, each having the usual slightly different +view of reality. + +

The general role of each of these data structures is as +follows: + +

    +
  1. rcu_state:
     This structure forms the interconnection between the
     rcu_node and rcu_data structures,
     tracks grace periods, serves as a short-term repository
     for callbacks orphaned by CPU-hotplug events,
     maintains rcu_barrier() state,
     tracks expedited grace-period state,
     and maintains state used to force quiescent states when
     grace periods extend too long.
  2. rcu_node: This structure forms the combining + tree that propagates quiescent-state + information from the leaves to the root, and also propagates + grace-period information from the root to the leaves. + It provides local copies of the grace-period state in order + to allow this information to be accessed in a synchronized + manner without suffering the scalability limitations that + would otherwise be imposed by global locking. + In CONFIG_TREE_PREEMPT_RCU kernels, it manages the lists + of tasks that have blocked while in their current + RCU read-side critical section. + In CONFIG_TREE_PREEMPT_RCU with + CONFIG_RCU_BOOST, it manages the + per-rcu_node priority-boosting + kernel threads (kthreads) and state. + Finally, it records CPU-hotplug state in order to determine + which CPUs should be ignored during a given grace period. +
  3. rcu_data: This per-CPU structure is the
     focus of quiescent-state detection and RCU callback queuing.
     It also tracks its relationship to the corresponding leaf
     rcu_node structure to allow more-efficient
     propagation of quiescent states up the rcu_node
     combining tree.
     Like the rcu_node structure, it provides a local
     copy of the grace-period information to allow synchronized
     access to this information from the corresponding CPU
     at essentially no cost.
     Finally, this structure records past dyntick-idle state
     for the corresponding CPU and also tracks statistics.
  4. rcu_dynticks: + This per-CPU structure tracks the current dyntick-idle + state for the corresponding CPU. + Unlike the other three structures, the rcu_dynticks + structure is not replicated per RCU flavor. +
  5. rcu_head: + This structure represents RCU callbacks, and is the + only structure allocated and managed by RCU users. + The rcu_head structure is normally embedded + within the RCU-protected data structure. +
+ +

If all you wanted from this article was a general notion of how
RCU's data structures are related, you are done.
Otherwise, each of the following sections gives more details on
the rcu_state, rcu_node, rcu_data,
and rcu_dynticks data structures.

+The rcu_state Structure

+ +

The rcu_state structure is the base structure that
represents a flavor of RCU.
This structure forms the interconnection between the
rcu_node and rcu_data structures,
tracks grace periods, contains the lock used to
synchronize with CPU-hotplug events,
and maintains state used to force quiescent states when
grace periods extend too long.

A few of the rcu_state structure's fields are discussed, +singly and in groups, in the following sections. +The more specialized fields are covered in the discussion of their +use. + +

Relationship to rcu_node and rcu_data Structures
+ +This portion of the rcu_state structure is declared +as follows: + +
+  1   struct rcu_node node[NUM_RCU_NODES];
+  2   struct rcu_node *level[NUM_RCU_LVLS + 1];
+  3   struct rcu_data __percpu *rda;
+
+ +

@@QQ@@ +Wait a minute! +You said that the rcu_node structures formed a tree, +but they are declared as a flat array! +What gives? +

@@QQA@@
The tree is laid out in the array.
The first node in the array is the head, the next set of nodes in the
array are children of the head node, and so on until the last set of
nodes in the array are the leaves.

See the following diagrams to see how this works. +

@@QQE@@ + +

The rcu_node tree is embedded into the +->node[] array as shown in the following figure: + +

TreeMapping.svg + +

One interesting consequence of this mapping is that a
breadth-first traversal of the tree is implemented as a simple
linear scan of the array, which is in fact what the
rcu_for_each_node_breadth_first() macro does.
This macro is used at the beginnings and ends of grace periods.

Each entry of the ->level array references +the first rcu_node structure on the corresponding level +of the tree, for example, as shown below: + +

TreeMappingLevel.svg + +

The zeroth element of the array references the root +rcu_node structure, the first element references the +first child of the root rcu_node, and finally the second +element references the first leaf rcu_node structure. + +

@@QQ@@ +Given that this array represents a tree, why can't the diagram that +includes the ->level array be planar? +

@@QQA@@
It can be planar; it is just that it looks uglier that way.
But don't take my word for it, draw it yourself!

But if you draw the tree to be tree-shaped rather than +array-shaped, it is easy to draw a planar representation: + +

TreeLevel.svg +

@@QQE@@ + +

Finally, the ->rda field references a per-CPU +pointer to the corresponding CPU's rcu_data structure. + +

All of these fields are constant once initialization is complete, +and therefore need no protection. + +

Grace-Period Tracking
+ +

This portion of the rcu_state structure is declared +as follows: + +

+  1   unsigned long gpnum;
+  2   unsigned long completed;
+
+ +

RCU grace periods are numbered, and +the ->gpnum field contains the number of the grace +period that started most recently. +The ->completed field contains the number of the +grace period that completed most recently. +If the two fields are equal, the RCU grace period that most recently +started has already completed, and therefore the corresponding +flavor of RCU is idle. +If ->gpnum is one greater than ->completed, +then ->gpnum gives the number of the current RCU +grace period, which has not yet completed. +Any other combination of values indicates that something is broken. +These two fields are protected by the root rcu_node's +->lock field. + +

There are ->gpnum and ->completed fields +in the rcu_node and rcu_data structures +as well. +The fields in the rcu_state structure represent the +most current values, and those of the other structures are compared +in order to detect the start of a new grace period in a distributed +fashion. +The values flow from rcu_state to rcu_node +(down the tree from the root to the leaves) to rcu_data. + +
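As a concrete illustration of these conventions, here is a minimal
standalone C sketch (the structure and helper names are invented for
this example; the kernel open-codes these comparisons rather than
providing such helpers):

#include <stdbool.h>

/* Toy counterpart of the grace-period numbers carried by the
 * rcu_state, rcu_node, and rcu_data structures. */
struct gp_numbers {
        unsigned long gpnum;            /* most recently started grace period */
        unsigned long completed;        /* most recently completed grace period */
};

/* Equal numbers: the most recently started grace period has already
 * completed, so this flavor of RCU is idle. */
static bool gp_idle(const struct gp_numbers *gp)
{
        return gp->gpnum == gp->completed;
}

/* ->gpnum exactly one greater than ->completed: a grace period is
 * currently in progress. */
static bool gp_in_progress(const struct gp_numbers *gp)
{
        return gp->gpnum == gp->completed + 1;
}

/* A structure lower in the tree detects a newly started grace period
 * by noticing that its own ->gpnum lags the value it reads from the
 * structure above it. */
static bool gp_newly_started(const struct gp_numbers *upper,
                             const struct gp_numbers *mine)
{
        return upper->gpnum != mine->gpnum;
}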

Miscellaneous
+ +

This portion of the rcu_state structure is declared +as follows: + +

+  1   unsigned long gp_max;
+  2   char abbr;
+  3   char *name;
+
+ +

The ->gp_max field tracks the duration of the longest +grace period in jiffies. +It is protected by the root rcu_node's ->lock. + +

The ->name field points to the name of the RCU flavor +(for example, “rcu_sched”), and is constant. +The ->abbr field contains a one-character abbreviation, +for example, “s” for RCU-sched. + +

+The rcu_node Structure

+ +

The rcu_node structures form the combining
tree that propagates quiescent-state
information from the leaves to the root and grace-period
information from the root down to the leaves.
They provide local copies of the grace-period state in order
to allow this information to be accessed in a synchronized
manner without suffering the scalability limitations that
would otherwise be imposed by global locking.
In CONFIG_TREE_PREEMPT_RCU kernels, they manage the lists
of tasks that have blocked while in their current
RCU read-side critical section.
In CONFIG_TREE_PREEMPT_RCU with
CONFIG_RCU_BOOST, they manage the
per-rcu_node priority-boosting
kernel threads (kthreads) and state.
Finally, they record CPU-hotplug state in order to determine
which CPUs should be ignored during a given grace period.

The rcu_node structure's fields are discussed, +singly and in groups, in the following sections. + +

Connection to Combining Tree
+ +

This portion of the rcu_node structure is declared +as follows: + +

+  1   struct rcu_node *parent;
+  2   u8 level;
+  3   u8 grpnum;
+  4   unsigned long grpmask;
+  5   int grplo;
+  6   int grphi;
+
+ +

The ->parent pointer references the rcu_node +one level up in the tree, and is NULL for the root +rcu_node. +The RCU implementation makes heavy use of this field to push quiescent +states up the tree. +The ->level field gives the level in the tree, with +the root being at level zero, its children at level one, and so on. +The ->grpnum field gives this node's position within +the children of its parent, so this number can range between 0 and 31 +on 32-bit systems and between 0 and 63 on 64-bit systems. +The ->level and ->grpnum fields are +used only during initialization and for tracing. +The ->grpmask field is the bitmask counterpart of +->grpnum, and therefore always has exactly one bit set. +This mask is used to clear the bit corresponding to this rcu_node +structure in its parent's bitmasks, which are described later. +Finally, the ->grplo and ->grphi fields +contain the lowest and highest numbered CPU served by this +rcu_node structure, respectively. + +

All of these fields are constant, and thus do not require any +synchronization. + +
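For illustration, a one-line hedged sketch (not kernel code) of the
->grpnum/->grpmask relationship just described:

/* The bitmask counterpart of ->grpnum: exactly one bit set, at the
 * position this node occupies among its parent's children. */
static unsigned long grpmask_of(unsigned int grpnum)
{
        return 1UL << grpnum;   /* for example, grpnum 3 yields 0x8 */
}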

Synchronization
+ +

This field of the rcu_node structure is declared +as follows: + +

+  1   raw_spinlock_t lock;
+
+ +

This field is used to protect the remaining fields in this structure, +unless otherwise stated. +That said, all of the fields in this structure can be accessed without +locking for tracing purposes. +Yes, this can result in confusing traces, but better some tracing confusion +than to be heisenbugged out of existence. + +

Grace-Period Tracking
+ +

This portion of the rcu_node structure is declared +as follows: + +

+  1   unsigned long gpnum;
+  2   unsigned long completed;
+
+ +

These fields are the counterparts of the fields of the same name in
the rcu_state structure.
They each may lag up to one behind their rcu_state
counterparts.
If a given rcu_node structure's ->gpnum and
->completed fields are equal, then this rcu_node
structure believes that RCU is idle.
Otherwise, as with the rcu_state structure,
the ->gpnum field will be one greater than the
->completed field, with ->gpnum
indicating which grace period this rcu_node believes
is still being waited for.

The ->gpnum field of each rcu_node
structure is updated at the beginning
of each grace period, and the ->completed field is
updated at the end of each grace period.

Quiescent-State Tracking
+ +

These fields manage the propagation of quiescent states up the +combining tree. + +

This portion of the rcu_node structure has fields +as follows: + +

+  1   unsigned long qsmask;
+  2   unsigned long expmask;
+  3   unsigned long qsmaskinit;
+  4   unsigned long expmaskinit;
+
+ +

The ->qsmask field tracks which of this
rcu_node structure's children still need to report
quiescent states for the current normal grace period.
Such children will have a value of 1 in their corresponding bit.
Note that the leaf rcu_node structures should be
thought of as having rcu_data structures as their
children.
Similarly, the ->expmask field tracks which
of this rcu_node structure's children still need to report
quiescent states for the current expedited grace period.
An expedited grace period has
the same conceptual properties as a normal grace period, but the
expedited implementation accepts extreme CPU overhead to obtain
much lower grace-period latency, for example, consuming a few
tens of microseconds worth of CPU time to reduce grace-period
duration from milliseconds to tens of microseconds.
The ->qsmaskinit field tracks which of this
rcu_node structure's children cover at least
one online CPU.
This mask is used to initialize ->qsmask,
and ->expmaskinit is used to initialize
->expmask, at the beginning of the
normal and expedited grace periods, respectively.
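The following standalone sketch (invented names, and greatly
simplified relative to the kernel's actual quiescent-state reporting)
shows the basic bit-clearing operation these masks support; per the
quiz below, the caller must hold the rcu_node structure's ->lock:

#include <stdbool.h>

struct toy_rcu_node {
        unsigned long qsmask;   /* children still owing a quiescent state */
};

/* Clear the reporting child's bit; a zero result means every child
 * has reported, so the quiescent state can be propagated upward to
 * this node's parent. */
static bool toy_report_qs(struct toy_rcu_node *rnp, unsigned long grpmask)
{
        rnp->qsmask &= ~grpmask;
        return rnp->qsmask == 0;
}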

@@QQ@@ +Why are these bitmasks protected by locking? +Come on, haven't you heard of atomic instructions??? +

@@QQA@@ +Lockless grace-period computation! Such a tantalizing possibility! + +

But consider the following sequence of events: + +

    +
  1. CPU 0 has been in dyntick-idle mode for quite + some time. + When it wakes up, it notices that the current RCU + grace period needs it to report in, so it sets a + flag where the scheduling clock interrupt will find it. +
  2. Meanwhile, CPU 1 is running force_quiescent_state(),
     and notices that CPU 0 has been in dyntick-idle mode,
     which qualifies as an extended quiescent state.
  3. CPU 0's scheduling clock interrupt fires in the + middle of an RCU read-side critical section, and notices + that the RCU core needs something, so commences RCU softirq + processing. +
  4. CPU 0's softirq handler executes and is just about ready + to report its quiescent state up the rcu_node + tree. +
  5. But CPU 1 beats it to the punch, completing the current + grace period and starting a new one. +
  6. CPU 0 now reports its quiescent state for the wrong + grace period. + That grace period might now end before the RCU read-side + critical section. + If that happens, disaster will ensue. +
+ +

So the locking is absolutely required in order to coordinate clearing +of the bits with the grace-period numbers in ->gpnum +and ->completed. +

@@QQE@@ + +

Blocked-Task Management
+ +

TREE_PREEMPT_RCU allows tasks to be preempted in the +midst of their RCU read-side critical sections, and these tasks +must be tracked explicitly. +The details of exactly why and how they are tracked will be covered +in a separate article on RCU read-side processing. +For now, it is enough to know that the rcu_node +structure tracks them. + +

+  1   struct list_head blkd_tasks;
+  2   struct list_head *gp_tasks;
+  3   struct list_head *exp_tasks;
+  4   bool wait_blkd_tasks;
+
+ +

The ->blkd_tasks field is a list header for
the list of blocked and preempted tasks.
As tasks undergo context switches within RCU read-side critical
sections, their task_struct structures are enqueued
(via the task_struct's ->rcu_node_entry
field) onto the head of the ->blkd_tasks list for the
leaf rcu_node structure corresponding to the CPU
on which the outgoing context switch executed.
As these tasks later exit their RCU read-side critical sections,
they remove themselves from the list.
This list is therefore in reverse time order, so that if one of the tasks
is blocking the current grace period, all subsequent tasks must
also be blocking that same grace period.
Therefore, a single pointer into this list suffices to track
all tasks blocking a given grace period.
That pointer is stored in ->gp_tasks for normal
grace periods and in ->exp_tasks for expedited
grace periods.
These last two fields are NULL if there is
no grace period in flight or if there are no blocked tasks
preventing that grace period from completing.
If either of these two pointers is referencing a task that
removes itself from the ->blkd_tasks list,
then that task must advance the pointer to the next task on
the list, or set the pointer to NULL if there
are no subsequent tasks on the list.

For example, suppose that tasks T1, T2, and T3 are
all hard-affinitied to the largest-numbered CPU in the system.
Then if task T1 blocked in an RCU read-side
critical section, then an expedited grace period started,
then task T2 blocked in an RCU read-side critical section,
then a normal grace period started, and finally task T3 blocked
in an RCU read-side critical section, then the state of the
last leaf rcu_node structure's blocked-task list
would be as shown below:

blkd_task.svg + +

Task T1 is blocking both grace periods, task T2 is +blocking only the normal grace period, and task T3 is blocking +neither grace period. +Note that these tasks will not remove themselves from this list +immediately upon resuming execution. +They will instead remain on the list until they execute the outermost +rcu_read_unlock() that ends their RCU read-side critical +section. + +
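The pointer-advance rule described above can be sketched as follows
(a simplified singly linked list with invented names, not the kernel's
struct list_head machinery):

struct toy_task {
        struct toy_task *next;  /* next (older) task on ->blkd_tasks */
};

/* Called as task t removes itself from the blocked-task list: if t is
 * the task that ->gp_tasks (or ->exp_tasks) references, advance the
 * pointer to the next task, which becomes NULL at the end of the
 * list. */
static void toy_advance_blocked_ptr(struct toy_task **gpp,
                                    struct toy_task *t)
{
        if (*gpp == t)
                *gpp = t->next;
}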

+The ->wait_blkd_tasks field indicates whether or not +the current grace period is waiting on a blocked task. + +

Sizing the rcu_node Array
+ +

The rcu_node array is sized via a series of +C-preprocessor expressions as follows: + +

+ 1 #ifdef CONFIG_RCU_FANOUT
+ 2 #define RCU_FANOUT CONFIG_RCU_FANOUT
+ 3 #else
+ 4 # ifdef CONFIG_64BIT
+ 5 # define RCU_FANOUT 64
+ 6 # else
+ 7 # define RCU_FANOUT 32
+ 8 # endif
+ 9 #endif
+10
+11 #ifdef CONFIG_RCU_FANOUT_LEAF
+12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
+13 #else
+14 # ifdef CONFIG_64BIT
+15 # define RCU_FANOUT_LEAF 64
+16 # else
+17 # define RCU_FANOUT_LEAF 32
+18 # endif
+19 #endif
+20
+21 #define RCU_FANOUT_1        (RCU_FANOUT_LEAF)
+22 #define RCU_FANOUT_2        (RCU_FANOUT_1 * RCU_FANOUT)
+23 #define RCU_FANOUT_3        (RCU_FANOUT_2 * RCU_FANOUT)
+24 #define RCU_FANOUT_4        (RCU_FANOUT_3 * RCU_FANOUT)
+25
+26 #if NR_CPUS <= RCU_FANOUT_1
+27 #  define RCU_NUM_LVLS        1
+28 #  define NUM_RCU_LVL_0        1
+29 #  define NUM_RCU_NODES        NUM_RCU_LVL_0
+30 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0 }
+31 #  define RCU_NODE_NAME_INIT  { "rcu_node_0" }
+32 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0" }
+33 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0" }
+34 #elif NR_CPUS <= RCU_FANOUT_2
+35 #  define RCU_NUM_LVLS        2
+36 #  define NUM_RCU_LVL_0        1
+37 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+38 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
+39 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
+40 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1" }
+41 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1" }
+42 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1" }
+43 #elif NR_CPUS <= RCU_FANOUT_3
+44 #  define RCU_NUM_LVLS        3
+45 #  define NUM_RCU_LVL_0        1
+46 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+47 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+48 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
+49 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
+50 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
+51 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
+52 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
+53 #elif NR_CPUS <= RCU_FANOUT_4
+54 #  define RCU_NUM_LVLS        4
+55 #  define NUM_RCU_LVL_0        1
+56 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
+57 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+58 #  define NUM_RCU_LVL_3        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+59 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
+60 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
+61 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
+62 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
+63 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
+64 #else
+65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
+66 #endif
+
+ +

The maximum number of levels in the rcu_node structure +is currently limited to four, as specified by lines 21-24 +and the structure of the subsequent “if” statement. +For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which +should be sufficient for the next few years at least. +For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which +should see us through the next decade or so. +This four-level tree also allows kernels built with +CONFIG_RCU_FANOUT=8 to support up to 4096 CPUs, +which might be useful in very large systems having eight CPUs per +socket (but please note that no one has yet shown any measurable +performance degradation due to misaligned socket and rcu_node +boundaries). +In addition, building kernels with a full four levels of rcu_node +tree permits better testing of RCU's combining-tree code. + +

The RCU_FANOUT symbol controls how many children +are permitted at each non-leaf level of the rcu_node tree. +If the CONFIG_RCU_FANOUT Kconfig option is not specified, +it is set based on the word size of the system, which is also +the Kconfig default. + +

The RCU_FANOUT_LEAF symbol controls how many CPUs are +handled by each leaf rcu_node structure. +Experience has shown that allowing a given leaf rcu_node +structure to handle 64 CPUs, as permitted by the number of bits in +the ->qsmask field on a 64-bit system, results in +excessive contention for the leaf rcu_node structures' +->lock fields. +The number of CPUs per leaf rcu_node structure is therefore +limited to 16 given the default value of CONFIG_RCU_FANOUT_LEAF. +If CONFIG_RCU_FANOUT_LEAF is unspecified, the value +selected is based on the word size of the system, just as for +CONFIG_RCU_FANOUT. +Lines 11-19 perform this computation. + +

Lines 21-24 compute the maximum number of CPUs supported by +a single-level (which contains a single rcu_node structure), +two-level, three-level, and four-level rcu_node tree, +respectively, given the fanout specified by RCU_FANOUT +and RCU_FANOUT_LEAF. +These numbers of CPUs are retained in the +RCU_FANOUT_1, +RCU_FANOUT_2, +RCU_FANOUT_3, and +RCU_FANOUT_4 +C-preprocessor variables, respectively. + +

These variables are used to control the C-preprocessor #if +statement spanning lines 26-66 that computes the number of +rcu_node structures required for each level of the tree, +as well as the number of levels required. +The number of levels is placed in the NUM_RCU_LVLS +C-preprocessor variable by lines 27, 35, 44, and 54. +The number of rcu_node structures for the topmost level +of the tree is always exactly one, and this value is unconditionally +placed into NUM_RCU_LVL_0 by lines 28, 36, 45, and 55. +The rest of the levels (if any) of the rcu_node tree +are computed by dividing the maximum number of CPUs by the +fanout supported by the number of levels from the current level down, +rounding up. This computation is performed by lines 37, +46-47, and 56-58. +Lines 31-33, 40-42, 50-52, and 62-63 create initializers +for lockdep lock-class names. +Finally, lines 64-66 produce an error if the maximum number of +CPUs is too large for the specified fanout. + +
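To make the preceding computation concrete, here is a worked example
using the 64-bit defaults (RCU_FANOUT=64, RCU_FANOUT_LEAF=16); the
numbers follow directly from lines 21-24 and 43-48 above:

/*
 * RCU_FANOUT_1 = 16                      (one rcu_node, up to 16 CPUs)
 * RCU_FANOUT_2 = 16 * 64     = 1,024     (two levels)
 * RCU_FANOUT_3 = 1,024 * 64  = 65,536    (three levels)
 * RCU_FANOUT_4 = 65,536 * 64 = 4,194,304 (four levels)
 *
 * Thus NR_CPUS=4096 exceeds RCU_FANOUT_2 but not RCU_FANOUT_3, so the
 * three-level branch is selected, with:
 *
 * NUM_RCU_LVL_1 = DIV_ROUND_UP(4096, 1024) = 4
 * NUM_RCU_LVL_2 = DIV_ROUND_UP(4096, 16)   = 256
 * NUM_RCU_NODES = 1 + 4 + 256              = 261
 */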

+The rcu_data Structure

+ +

The rcu_data structure maintains the per-CPU state for the
corresponding flavor of RCU.
The fields in this structure may be accessed only from the corresponding
CPU (and from tracing) unless otherwise stated.
This structure is the
focus of quiescent-state detection and RCU callback queuing.
It also tracks its relationship to the corresponding leaf
rcu_node structure to allow more-efficient
propagation of quiescent states up the rcu_node
combining tree.
Like the rcu_node structure, it provides a local
copy of the grace-period information to allow synchronized
access to this information from the corresponding CPU
at essentially no cost.
Finally, this structure records past dyntick-idle state
for the corresponding CPU and also tracks statistics.

The rcu_data structure's fields are discussed, +singly and in groups, in the following sections. + +

Connection to Other Data Structures
+ +

This portion of the rcu_data structure is declared +as follows: + +

+  1   int cpu;
+  2   struct rcu_state *rsp;
+  3   struct rcu_node *mynode;
+  4   struct rcu_dynticks *dynticks;
+  5   unsigned long grpmask;
+  6   bool beenonline;
+
+ +

The ->cpu field contains the number of the
corresponding CPU, the ->rsp pointer references
the corresponding rcu_state structure (and is most frequently
used to locate the name of the corresponding flavor of RCU for tracing),
and the ->mynode field references the corresponding
rcu_node structure.
The ->mynode field is used to propagate quiescent states
up the combining tree.

The ->dynticks pointer references the
rcu_dynticks structure corresponding to this
CPU.
Recall that a single per-CPU instance of the rcu_dynticks
structure is shared among all flavors of RCU.
These first four fields are constant and therefore require no
synchronization.

The ->grpmask field indicates the bit in +the ->mynode->qsmask corresponding to this +rcu_data structure, and is also used when propagating +quiescent states. +The ->beenonline flag is set whenever the corresponding +CPU comes online, which means that the debugfs tracing need not dump +out any rcu_data structure for which this flag is not set. + +

Quiescent-State and Grace-Period Tracking
+ +

This portion of the rcu_data structure is declared +as follows: + +

+  1   unsigned long completed;
+  2   unsigned long gpnum;
+  3   bool cpu_no_qs;
+  4   bool core_needs_qs;
+  5   bool gpwrap;
+  6   unsigned long rcu_qs_ctr_snap;
+
+ +

The ->completed and ->gpnum
fields are the counterparts of the fields of the same name
in the rcu_state and rcu_node structures.
They may each lag up to one behind their rcu_node
counterparts, but in CONFIG_NO_HZ_IDLE and
CONFIG_NO_HZ_FULL kernels can lag
arbitrarily far behind for CPUs in dyntick-idle mode (but these counters
will catch up upon exit from dyntick-idle mode).
If a given rcu_data structure's ->gpnum and
->completed fields are equal, then this rcu_data
structure believes that RCU is idle.
Otherwise, as with the rcu_state and rcu_node
structures,
the ->gpnum field will be one greater than the
->completed field, with ->gpnum
indicating which grace period this rcu_data believes
is still being waited for.

@@QQ@@ +All this replication of the grace period numbers can only cause +massive confusion. +Why not just keep a global pair of counters and be done with it??? +

@@QQA@@
Because if there were only a single global pair of grace-period numbers,
there would need to be a single global lock to allow safely accessing
and updating them.
And if we are not going to have a single global lock, we need to carefully
manage the numbers on a per-node basis.
Recall from the answer to a previous Quick Quiz that the consequences
of applying a previously sampled quiescent state to the wrong
grace period are quite severe.

@@QQE@@ + +

The ->cpu_no_qs flag indicates that the
CPU has not yet passed through a quiescent state,
while the ->core_needs_qs flag indicates that the
RCU core needs a quiescent state from the corresponding CPU.
The ->gpwrap field indicates that the corresponding
CPU has remained idle for so long that the ->completed
and ->gpnum counters are in danger of overflow, which
will cause the CPU to disregard the values of its counters on
its next exit from idle.
Finally, the ->rcu_qs_ctr_snap field is used to detect
cases where a given operation has resulted in a quiescent state
for all flavors of RCU, for example, cond_resched_rcu_qs().

RCU Callback Handling
+ +

In the absence of CPU-hotplug events, RCU callbacks are invoked by +the same CPU that registered them. +This is strictly a cache-locality optimization: callbacks can and +do get invoked on CPUs other than the one that registered them. +After all, if the CPU that registered a given callback has gone +offline before the callback can be invoked, there really is no other +choice. + +

This portion of the rcu_data structure is declared +as follows: + +

+ 1 struct rcu_head *nxtlist;
+ 2 struct rcu_head **nxttail[RCU_NEXT_SIZE];
+ 3 unsigned long nxtcompleted[RCU_NEXT_SIZE];
+ 4 long qlen_lazy;
+ 5 long qlen;
+ 6 long qlen_last_fqs_check;
+ 7 unsigned long n_force_qs_snap;
+ 8 unsigned long n_cbs_invoked;
+ 9 unsigned long n_cbs_orphaned;
+10 unsigned long n_cbs_adopted;
+11 long blimit;
+
+ +

The ->nxtlist pointer and the +->nxttail[] array form a four-segment list with +older callbacks near the head and newer ones near the tail. +Each segment contains callbacks with the corresponding relationship +to the current grace period. +The pointer out of the end of each of the four segments is referenced +by the element of the ->nxttail[] array indexed by +RCU_DONE_TAIL (for callbacks handled by a prior grace period), +RCU_WAIT_TAIL (for callbacks waiting on the current grace period), +RCU_NEXT_READY_TAIL (for callbacks that will wait on the next +grace period), and +RCU_NEXT_TAIL (for callbacks that are not yet associated +with a specific grace period) +respectively, as shown in the following figure. + +

nxtlist.svg + +

In this figure, the ->nxtlist pointer references the +first +RCU callback in the list. +The ->nxttail[RCU_DONE_TAIL] array element references +the ->nxtlist pointer itself, indicating that none +of the callbacks is ready to invoke. +The ->nxttail[RCU_WAIT_TAIL] array element references callback +CB 2's ->next pointer, which indicates that +CB 1 and CB 2 are both waiting on the current grace period. +The ->nxttail[RCU_NEXT_READY_TAIL] array element +references the same RCU callback that ->nxttail[RCU_WAIT_TAIL] +does, which indicates that there are no callbacks waiting on the next +RCU grace period. +The ->nxttail[RCU_NEXT_TAIL] array element references +CB 4's ->next pointer, indicating that all the +remaining RCU callbacks have not yet been assigned to an RCU grace +period. +Note that the ->nxttail[RCU_NEXT_TAIL] array element +always references the last RCU callback's ->next pointer +unless the callback list is empty, in which case it references +the ->nxtlist pointer. + +
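Enqueuing a new callback onto this segmented list is a constant-time
operation.
The following standalone sketch (invented toy types mirroring the
fields described above, not the kernel's code) shows the idea:

#define RCU_DONE_TAIL           0
#define RCU_WAIT_TAIL           1
#define RCU_NEXT_READY_TAIL     2
#define RCU_NEXT_TAIL           3
#define RCU_NEXT_SIZE           4

struct toy_rcu_head {
        struct toy_rcu_head *next;
        void (*func)(struct toy_rcu_head *head);
};

struct toy_cblist {
        struct toy_rcu_head *nxtlist;
        struct toy_rcu_head **nxttail[RCU_NEXT_SIZE];
};

/* Empty list: every tail pointer references the ->nxtlist pointer. */
static void toy_cblist_init(struct toy_cblist *cl)
{
        int i;

        cl->nxtlist = NULL;
        for (i = 0; i < RCU_NEXT_SIZE; i++)
                cl->nxttail[i] = &cl->nxtlist;
}

/* Link the new callback where the RCU_NEXT_TAIL pointer points, then
 * advance that pointer to the new callback's ->next field. */
static void toy_enqueue_cb(struct toy_cblist *cl, struct toy_rcu_head *rhp)
{
        rhp->next = NULL;
        *cl->nxttail[RCU_NEXT_TAIL] = rhp;
        cl->nxttail[RCU_NEXT_TAIL] = &rhp->next;
}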

CPUs advance their callbacks from the +RCU_NEXT_TAIL to the RCU_NEXT_READY_TAIL to the +RCU_WAIT_TAIL to the RCU_DONE_TAIL list segments +as grace periods advance. +The CPU advances the callbacks in its rcu_data structure +whenever it notices that another RCU grace period has completed. +The CPU detects the completion of an RCU grace period by noticing +that the value of its rcu_data structure's +->completed field differs from that of its leaf +rcu_node structure. +Recall that each rcu_node structure's +->completed field is updated at the end of each +grace period. + +
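Continuing the toy list from the previous sketch, segment advancement
amounts to sliding tail pointers rather than relinking individual
callbacks; this hedged sketch ignores the per-segment
->nxtcompleted checks that the kernel performs:

/* After noticing that a grace period has completed: callbacks waiting
 * on that grace period become done, and the younger segments each
 * move up one slot. */
static void toy_advance_cbs(struct toy_cblist *cl)
{
        cl->nxttail[RCU_DONE_TAIL] = cl->nxttail[RCU_WAIT_TAIL];
        cl->nxttail[RCU_WAIT_TAIL] = cl->nxttail[RCU_NEXT_READY_TAIL];
        cl->nxttail[RCU_NEXT_READY_TAIL] = cl->nxttail[RCU_NEXT_TAIL];
}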

The ->nxtcompleted[] array records grace-period +numbers corresponding to the list segments. +This allows CPUs that go idle for extended periods to determine +which of their callbacks are ready to be invoked after reawakening. + +

The ->qlen counter contains the number of
callbacks in ->nxtlist, and the
->qlen_lazy counter contains the number of those callbacks that
are known to only free memory, and whose invocation can therefore
be safely deferred.
The ->qlen_last_fqs_check and
->n_force_qs_snap fields coordinate the forcing of quiescent
states from call_rcu() and friends when callback
lists grow excessively long.

The ->n_cbs_invoked,
->n_cbs_orphaned, and ->n_cbs_adopted
fields count the number of callbacks invoked,
sent to other CPUs when this CPU goes offline,
and received from other CPUs when those other CPUs go offline,
respectively.
Finally, the ->blimit counter is the maximum number of
RCU callbacks that may be invoked at a given time.

Dyntick-Idle Handling
+ +

This portion of the rcu_data structure is declared +as follows: + +

+  1   int dynticks_snap;
+  2   unsigned long dynticks_fqs;
+
+ +The ->dynticks_snap field is used to take a snapshot +of the corresponding CPU's dyntick-idle state when forcing +quiescent states, and is therefore accessed from other CPUs. +Finally, the ->dynticks_fqs field is used to +count the number of times this CPU is determined to be in +dyntick-idle state, and is used for tracing and debugging purposes. + +

+The rcu_dynticks Structure

+ +

The rcu_dynticks structure maintains the per-CPU dyntick-idle state
for the corresponding CPU.
Unlike the other structures, rcu_dynticks is not
replicated over the different flavors of RCU.
The fields in this structure may be accessed only from the corresponding
CPU (and from tracing) unless otherwise stated.
Its fields are as follows:

+  1   int dynticks_nesting;
+  2   int dynticks_nmi_nesting;
+  3   atomic_t dynticks;
+
+ +

The ->dynticks_nesting field counts the +nesting depth of normal interrupts. +In addition, this counter is incremented when exiting dyntick-idle +mode and decremented when entering it. +This counter can therefore be thought of as counting the number +of reasons why this CPU cannot be permitted to enter dyntick-idle +mode, aside from non-maskable interrupts (NMIs). +NMIs are counted by the ->dynticks_nmi_nesting +field, except that NMIs that interrupt non-dyntick-idle execution +are not counted. + +

Finally, the ->dynticks field counts the corresponding +CPU's transitions to and from dyntick-idle mode, so that this counter +has an even value when the CPU is in dyntick-idle mode and an odd +value otherwise. + +
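A hedged sketch of the even/odd convention just described (standalone
C; the kernel uses atomic operations and memory ordering that are
omitted here):

#include <stdbool.h>

/* An even ->dynticks snapshot means the CPU was in dyntick-idle mode
 * when sampled; observing the same even value across a waiting period
 * implies an extended quiescent state. */
static bool toy_in_dyntick_idle(int dynticks_snap)
{
        return (dynticks_snap & 0x1) == 0;
}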

@@QQ@@ +Why not just count all NMIs? +Wouldn't that be simpler and less error prone? +

@@QQA@@ +It seems simpler only until you think hard about how to go about +updating the rcu_dynticks structure's +->dynticks field. +

@@QQE@@ + +

Additional fields are present for some special-purpose +builds, and are discussed separately. + +

+The rcu_head Structure

+ +

Each rcu_head structure represents an RCU callback. +These structures are normally embedded within RCU-protected data +structures whose algorithms use asynchronous grace periods. +In contrast, when using algorithms that block waiting for RCU grace periods, +RCU users need not provide rcu_head structures. + +

The rcu_head structure has fields as follows: + +

+  1   struct rcu_head *next;
+  2   void (*func)(struct rcu_head *head);
+
+ +

The ->next field is used +to link the rcu_head structures together in the +lists within the rcu_data structures. +The ->func field is a pointer to the function +to be called when the callback is ready to be invoked, and +this function is passed a pointer to the rcu_head +structure. +However, kfree_rcu() uses the ->func +field to record the offset of the rcu_head +structure within the enclosing RCU-protected data structure. + +

Both of these fields are used internally by RCU. +From the viewpoint of RCU users, this structure is an +opaque “cookie”. + +

@@QQ@@ +Given that the callback function ->func +is passed a pointer to the rcu_head structure, +how is that function supposed to find the beginning of the +enclosing RCU-protected data structure? +

@@QQA@@ +In actual practice, there is a separate callback function per +type of RCU-protected data structure. +The callback function can therefore use the container_of() +macro in the Linux kernel (or other pointer-manipulation facilities +in other software environments) to find the beginning of the +enclosing structure. +

@@QQE@@ + +
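The container_of() pattern from the answer above looks roughly as
follows in practice (a userspace sketch with invented names; in the
kernel, container_of() is provided by the core headers and kfree()
would replace free()):

#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_rcu_head {
        struct toy_rcu_head *next;
        void (*func)(struct toy_rcu_head *head);
};

struct foo {
        int data;
        struct toy_rcu_head rh; /* embedded callback structure */
};

/* The callback receives a pointer to the embedded toy_rcu_head and
 * recovers the enclosing struct foo in order to free it. */
static void foo_reclaim(struct toy_rcu_head *head)
{
        struct foo *fp = container_of(head, struct foo, rh);

        free(fp);
}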

+RCU-Specific Fields in the task_struct Structure

+ +

The CONFIG_TREE_PREEMPT_RCU and CONFIG_TASKS_RCU
implementations use some
additional fields in the task_struct structure:

+ 1 #ifdef CONFIG_PREEMPT_RCU
+ 2   int rcu_read_lock_nesting;
+ 3   union rcu_special rcu_read_unlock_special;
+ 4   struct list_head rcu_node_entry;
+ 5   struct rcu_node *rcu_blocked_node;
+ 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
+ 7 #ifdef CONFIG_TASKS_RCU
+ 8   unsigned long rcu_tasks_nvcsw;
+ 9   bool rcu_tasks_holdout;
+10   struct list_head rcu_tasks_holdout_list;
+11   int rcu_tasks_idle_cpu;
+12 #endif /* #ifdef CONFIG_TASKS_RCU */
+
+ +

The ->rcu_read_lock_nesting field records the +nesting level for RCU read-side critical sections, and +the ->rcu_read_unlock_special field is a bitmask +that records special conditions that require rcu_read_unlock() +to do additional work. +The ->rcu_node_entry field is used to form lists of +tasks that have blocked within preemptible-RCU read-side critical +sections and the ->rcu_blocked_node field references +the rcu_node structure whose list this task is a member of, +or NULL if it is not blocked within a preemptible-RCU +read-side critical section. + +

The ->rcu_tasks_nvcsw field tracks the number of
voluntary context switches that this task had undergone at the
beginning of the current tasks-RCU grace period,
->rcu_tasks_holdout is set if the current tasks-RCU
grace period is waiting on this task, ->rcu_tasks_holdout_list
is a list element enqueuing this task on the holdout list,
and ->rcu_tasks_idle_cpu tracks which CPU this
idle task is running on, but only if the task is currently running,
that is, if the CPU is currently idle.

@@QQ@@
Why is ->rcu_boosted required, given that there is
an RCU_READ_UNLOCK_BOOSTED bit in
->rcu_read_unlock_special?

@@QQA@@ +The ->rcu_read_unlock_special field may only be +updated by the task itself. +By definition, RCU priority boosting must be carried out by some +other task. +This other task cannot safely update the boosted task's +->rcu_read_unlock_special field without the use of +expensive atomic instructions. +The ->rcu_boosted field is therefore used by the +boosting task to let the boosted task know that it has been boosted. +The boosted task makes use of the +RCU_READ_UNLOCK_BOOSTED bit in +->rcu_read_unlock_special +when deboosting itself. +

@@QQE@@ + +

+Accessor Functions

+ +

The following listing shows the
rcu_get_root(), rcu_for_each_node_breadth_first(),
rcu_for_each_nonleaf_node_breadth_first(), and
rcu_for_each_leaf_node() function and macros:

+  1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
+  2 {
+  3   return &rsp->node[0];
+  4 }
+  5
+  6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
+  7   for ((rnp) = &(rsp)->node[0]; \
+  8        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
+  9
+ 10 #define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \
+ 11   for ((rnp) = &(rsp)->node[0]; \
+ 12        (rnp) < (rsp)->level[NUM_RCU_LVLS - 1]; (rnp)++)
+ 13
+ 14 #define rcu_for_each_leaf_node(rsp, rnp) \
+ 15   for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
+ 16        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
+
+ +

The rcu_get_root() function simply returns a pointer to the
first element of the specified rcu_state structure's
->node[] array, which is the root rcu_node
structure.

As noted earlier, the rcu_for_each_node_breadth_first() +macro takes advantage of the layout of the rcu_node +structures in the rcu_state structure's +->node[] array, performing a breadth-first traversal by +simply traversing the array in order. +The rcu_for_each_nonleaf_node_breadth_first() macro operates +similarly, but traverses only the first part of the array, thus excluding +the leaf rcu_node structures. +Finally, the rcu_for_each_leaf_node() macro traverses only +the last part of the array, thus traversing only the leaf +rcu_node structures. + +
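For example, here is a hedged usage sketch of the leaf-traversal macro
(process_leaf() is a hypothetical per-leaf operation, not a kernel
function):

struct rcu_node *rnp;

rcu_for_each_leaf_node(rsp, rnp)
        process_leaf(rnp);      /* runs once per leaf rcu_node structure */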

@@QQ@@ +What do rcu_for_each_nonleaf_node_breadth_first() and +rcu_for_each_leaf_node() do if the rcu_node tree +contains only a single node? +

@@QQA@@ +In the single-node case, +rcu_for_each_nonleaf_node_breadth_first() is a no-op +and rcu_for_each_leaf_node() traverses the single node. +

@@QQE@@ + +

+Summary

+ +So each flavor of RCU is represented by an rcu_state structure, +which contains a combining tree of rcu_node and +rcu_data structures. +Finally, in CONFIG_NO_HZ_IDLE kernels, each CPU's dyntick-idle +state is tracked by an rcu_dynticks structure. + +If you made it this far, you are well prepared to read the code +walkthroughs in the other articles in this series. + +

+Acknowledgments

+ +I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul +Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn +for helping me get this document into a more human-readable state. + +

+Legal Statement

+ +

This work represents the view of the author and does not necessarily +represent the view of IBM. + +

Linux is a registered trademark of Linus Torvalds. + +

Other company, product, and service names may be trademarks or +service marks of others. + + +

@@QQAL@@ + + + diff --git a/Documentation/RCU/Design/Data-Structures/HugeTreeClassicRCU.svg b/Documentation/RCU/Design/Data-Structures/HugeTreeClassicRCU.svg new file mode 100644 index 000000000000..2bf12b468206 --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/HugeTreeClassicRCU.svg @@ -0,0 +1,939 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_node + + struct + + struct + + rcu_node + + struct + + rcu_node + + rcu_node + + struct + + rcu_node + + struct + + struct + + rcu_node + + CPU 0 + + struct + + rcu_data + + CPU 15 + + struct + + rcu_data + + struct + + rcu_data + + CPU 21823 + + CPU 21839 + + rcu_data + + struct + + struct + + rcu_data + + CPU 43679 + + CPU 43695 + + rcu_data + + struct + + struct + + rcu_data + + CPU 65519 + + CPU 65535 + + rcu_data + + struct + + struct rcu_state + + struct + + rcu_node + + diff --git a/Documentation/RCU/Design/Data-Structures/TreeLevel.svg b/Documentation/RCU/Design/Data-Structures/TreeLevel.svg new file mode 100644 index 000000000000..7a7eb3bac95c --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/TreeLevel.svg @@ -0,0 +1,828 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_node + + struct + + struct + + rcu_node + + struct + + rcu_node + + rcu_node + + struct + + rcu_node + + struct + + struct + + rcu_node + + ->level[0] + + ->level[1] + + ->level[2] + + struct + + rcu_node + + CPU 15 + + CPU 0 + + CPU 65535 + + CPU 65519 + + CPU 43695 + + CPU 43679 + + CPU 21839 + + CPU 21823 + + struct rcu_state + + diff --git a/Documentation/RCU/Design/Data-Structures/TreeMapping.svg b/Documentation/RCU/Design/Data-Structures/TreeMapping.svg new file mode 100644 index 000000000000..729cfa9e6cdb --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/TreeMapping.svg @@ -0,0 +1,305 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 0:7 + + 4:7 + + 0:1 + + 2:3 + + 4:5 + + 6:7 + + 0:3 + + struct rcu_state + + diff --git a/Documentation/RCU/Design/Data-Structures/TreeMappingLevel.svg b/Documentation/RCU/Design/Data-Structures/TreeMappingLevel.svg new file mode 100644 index 000000000000..5b416a4b8453 --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/TreeMappingLevel.svg @@ -0,0 +1,380 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->level[0] + + ->level[1] + + ->level[2] + + 0:7 + + 4:7 + + 0:1 + + 2:3 + + 4:5 + + 6:7 + + 0:3 + + struct rcu_state + + + + + + + + + diff --git a/Documentation/RCU/Design/Data-Structures/blkd_task.svg b/Documentation/RCU/Design/Data-Structures/blkd_task.svg new file mode 100644 index 000000000000..00e810bb8419 --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/blkd_task.svg @@ -0,0 +1,843 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_bh + + struct + + rcu_node + + struct + + rcu_node + + struct + + rcu_data + + struct + + rcu_data + + struct + + rcu_data + + struct + + rcu_data + + struct rcu_state + + struct + + rcu_dynticks + + struct + + rcu_dynticks + + struct + + rcu_dynticks + + struct + + rcu_dynticks + + rcu_sched + + T3 + + T2 + + T1 + + + + + + + + + + + + + rcu_node + + struct + + blkd_tasks + + gp_tasks + + exp_tasks + + diff --git a/Documentation/RCU/Design/Data-Structures/nxtlist.svg b/Documentation/RCU/Design/Data-Structures/nxtlist.svg new file mode 100644 index 000000000000..abc4cc73a097 --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/nxtlist.svg @@ -0,0 +1,396 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + nxtlist + + nxttail[RCU_DONE_TAIL] + + nxttail[RCU_WAIT_TAIL] + + nxttail[RCU_NEXT_READY_TAIL] + + nxttail[RCU_NEXT_TAIL] + + CB 1 + + next + + CB 3 + + next + + CB 4 + + next + + CB 2 + + next + + -- 2.5.2