From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: linux-kernel@vger.kernel.org
Cc: mingo@kernel.org, jiangshanlai@gmail.com, dipankar@in.ibm.com,
	akpm@linux-foundation.org, mathieu.desnoyers@efficios.com,
	josh@joshtriplett.org, tglx@linutronix.de, peterz@infradead.org,
	rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com,
	dvhart@linux.intel.com, fweisbec@gmail.com, oleg@redhat.com,
	bobby.prani@gmail.com, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: [PATCH tip/core/rcu 1/5] rcu: Design documentation for expedited grace periods
Date: Sat, 14 Jan 2017 00:55:35 -0800
Message-Id: <1484384139-19622-1-git-send-email-paulmck@linux.vnet.ibm.com>
In-Reply-To: <20170114085032.GA18485@linux.vnet.ibm.com>
References: <20170114085032.GA18485@linux.vnet.ibm.com>
X-Mailer: git-send-email 2.5.2
X-Mailing-List: linux-kernel@vger.kernel.org

This commit adds design documentation for expedited grace periods.
This documentation is in HTML rather than the new documentation format
because (1) I have prototype documentation already in HTML, and
(2) attempting to learn the new documentation format while creating the
design documentation seems likely to result in neither happening in a
timely fashion.  Once the design documentation is complete, we can
start a conversion effort.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 .../Design/Expedited-Grace-Periods/ExpRCUFlow.svg  | 830 +++++++++++++++++++++
 .../Expedited-Grace-Periods/ExpSchedFlow.svg       | 826 ++++++++++++++++++++
 .../Expedited-Grace-Periods.html                   | 626 ++++++++++++++++
 .../RCU/Design/Expedited-Grace-Periods/Funnel0.svg | 275 +++++++
 .../RCU/Design/Expedited-Grace-Periods/Funnel1.svg | 275 +++++++
 .../RCU/Design/Expedited-Grace-Periods/Funnel2.svg | 287 +++++++
 .../RCU/Design/Expedited-Grace-Periods/Funnel3.svg | 323 ++++++++
 .../RCU/Design/Expedited-Grace-Periods/Funnel4.svg | 323 ++++++++
 .../RCU/Design/Expedited-Grace-Periods/Funnel5.svg | 335 +++++++++
 .../RCU/Design/Expedited-Grace-Periods/Funnel6.svg | 335 +++++++++
 .../RCU/Design/Expedited-Grace-Periods/Funnel7.svg | 347 +++++++++
 .../RCU/Design/Expedited-Grace-Periods/Funnel8.svg | 311 ++++++++
 12 files changed, 5093 insertions(+)
 create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/ExpRCUFlow.svg
 create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg
 create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
 create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg
 create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg
 create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg
 create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg
 create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg
 create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg
 create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg
 create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg
 create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg

diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/ExpRCUFlow.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/ExpRCUFlow.svg
new file mode 100644
index 000000000000..7c6c90bd02c9
--- /dev/null
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/ExpRCUFlow.svg
@@ -0,0 +1,830 @@
[SVG flowchart elided; node labels: "CPU N Start" -> "Idle or offline?" -> (Y) "Done";
(N) "Send IPI to CPU N" -> "IPI Handler" -> "In RCU read-side critical section?" ->
(N) "Report CPU Quiescent State"; (Y) "rcu_read_unlock()" -> "Report CPU Quiescent State",
or "Context Switch" -> "Report CPU and Task Quiescent States" / "Enqueue Task" ->
"Report CPU Quiescent State".]

diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg
new file mode 100644
index 000000000000..e4233ac93c2b
--- /dev/null
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg
@@ -0,0 +1,826 @@
[SVG flowchart elided; node labels: "CPU N Start" -> "Idle or offline?" -> (Y) "Done";
(N) "Send IPI to CPU N" -> "IPI Handler" -> "CPU idle?" -> (Y) "Report CPU Quiescent State";
(N) "resched_cpu()" -> "Context Switch" -> "Report CPU Quiescent State", or
"CPU Offline" -> "Report CPU Quiescent State".]

diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
new file mode 100644
index 000000000000..7a3194c5559a
--- /dev/null
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
@@ -0,0 +1,626 @@
+A Tour Through TREE_RCU's Expedited Grace Periods

Introduction

+ +This document describes RCU's expedited grace periods. +Unlike RCU's normal grace periods, which accept long latencies to attain +high efficiency and minimal disturbance, expedited grace periods accept +lower efficiency and significant disturbance to attain shorter latencies. + +

+There are three flavors of RCU (RCU-bh, RCU-preempt, and RCU-sched), +but only two flavors of expedited grace periods because the RCU-bh +expedited grace period maps onto the RCU-sched expedited grace period. +Each of the remaining two implementations is covered in its own section. + +

    +
  1. + Expedited Grace Period Design +
  2. + RCU-preempt Expedited Grace Periods +
  3. + RCU-sched Expedited Grace Periods +
  4. + Expedited Grace Period and CPU Hotplug +
  5. + Expedited Grace Period Refinements +
+ +

+Expedited Grace Period Design

+ +

+The expedited RCU grace periods cannot be accused of being subtle, +given that they for all intents and purposes hammer every CPU that +has not yet provided a quiescent state for the current expedited +grace period. +The one saving grace is that the hammer has grown a bit smaller +over time: The old call to try_stop_cpus() has been +replaced with a set of calls to smp_call_function_single(), +each of which results in an IPI to the target CPU. +The corresponding handler function checks the CPU's state, motivating +a faster quiescent state where possible, and triggering a report +of that quiescent state. +As always for RCU, once everything has spent some time in a quiescent +state, the expedited grace period has completed. + +
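+The loop below is a hedged sketch, not the kernel's exact code, of the
+fan-out just described: IPI every CPU that has not yet reported a
+quiescent state.  The cpumask name and the handler argument are
+illustrative placeholders.

	#include <linux/cpumask.h>
	#include <linux/smp.h>

	/*
	 * Sketch: IPI each CPU still blocking the expedited grace period.
	 * "blocking_cpus" and "handler" are illustrative names, not the
	 * kernel's actual ones.
	 */
	static void exp_ipi_blocking_cpus(const struct cpumask *blocking_cpus,
					  smp_call_func_t handler)
	{
		int cpu;

		for_each_cpu(cpu, blocking_cpus)
			smp_call_function_single(cpu, handler, NULL, 0);
	}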

+The details of the smp_call_function_single() handler's +operation depend on the RCU flavor, as described in the following +sections. + +

+RCU-preempt Expedited Grace Periods

+ +

+The overall flow of the handling of a given CPU by an RCU-preempt +expedited grace period is shown in the following diagram: + +

ExpRCUFlow.svg + +

+The solid arrows denote direct action, for example, a function call. +The dotted arrows denote indirect action, for example, an IPI +or a state that is reached after some time. + +

+If a given CPU is offline or idle, synchronize_rcu_expedited() +will ignore it because idle and offline CPUs are already residing +in quiescent states. +Otherwise, the expedited grace period will use +smp_call_function_single() to send the CPU an IPI, which +is handled by sync_rcu_exp_handler(). + +

+However, because this is preemptible RCU, sync_rcu_exp_handler() +can check to see if the CPU is currently running in an RCU read-side +critical section. +If not, the handler can immediately report a quiescent state. +Otherwise, it sets flags so that the outermost rcu_read_unlock() +invocation will provide the needed quiescent-state report. +This flag-setting avoids the previous forced preemption of all +CPUs that might have RCU read-side critical sections. +In addition, this flag-setting is done so as to avoid increasing +the overhead of the common-case fastpath through the scheduler. + +
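+The decision just described can be sketched as follows.  This is a
+simplification using assumed v4.9-era task_struct field names, with
+locking and rcu_data bookkeeping omitted; rcu_report_exp_qs() is a
+hypothetical stand-in for the actual reporting path.

	#include <linux/sched.h>
	#include <linux/smp.h>

	/* Simplified sketch of sync_rcu_exp_handler()'s choice. */
	static void sync_rcu_exp_handler_sketch(void *info)
	{
		struct task_struct *t = current;

		if (!t->rcu_read_lock_nesting) {
			/* Not in a read-side critical section: report now. */
			rcu_report_exp_qs(smp_processor_id()); /* hypothetical */
		} else {
			/* Defer: outermost rcu_read_unlock() reports the QS. */
			t->rcu_read_unlock_special.b.exp_need_qs = true;
		}
	}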

+Again because this is preemptible RCU, an RCU read-side critical section
+can be preempted.
+When that happens, RCU will enqueue the task, which will then continue to
+block the current expedited grace period until it resumes and finds its
+outermost rcu_read_unlock().
+The CPU will report a quiescent state just after enqueuing the task because
+the CPU is no longer blocking the grace period.
+It is instead the preempted task doing the blocking.
+The list of blocked tasks is managed by rcu_preempt_ctxt_queue(),
+which is called from rcu_preempt_note_context_switch(), which
+in turn is called from rcu_note_context_switch(), which in
+turn is called from the scheduler.
 
Quick Quiz:
+ Why not just have the expedited grace period check the + state of all the CPUs? + After all, that would avoid all those real-time-unfriendly IPIs. +
Answer:
+	Because we want the RCU read-side critical sections to run fast,
+	which means no memory barriers.
+	Therefore, it is not possible to safely check the state from some
+	other CPU.
+	And even if it were possible to safely check the state, it would
+	still be necessary to IPI the CPU to safely interact with the
+	upcoming rcu_read_unlock() invocation, which means that
+	the remote state testing would not help the worst-case
+	latency that real-time applications care about.

One way to prevent your real-time + application from getting hit with these IPIs is to + build your kernel with CONFIG_NO_HZ_FULL=y. + RCU would then perceive the CPU running your application + as being idle, and it would be able to safely detect that + state without needing to IPI the CPU. +

 
+ +

+Please note that this is just the overall flow: +Additional complications can arise due to races with CPUs going idle +or offline, among other things. + +

+RCU-sched Expedited Grace Periods

+ +

+The overall flow of the handling of a given CPU by an RCU-sched +expedited grace period is shown in the following diagram: + +

ExpSchedFlow.svg + +

+As with RCU-preempt's synchronize_rcu_expedited(),
+synchronize_sched_expedited() ignores offline and
+idle CPUs, again because they are in remotely detectable
+quiescent states.
+However, synchronize_sched_expedited()'s handler
+is sync_sched_exp_handler(), and because
+rcu_read_lock_sched() and rcu_read_unlock_sched()
+leave no trace of their invocation, in general it is not possible to tell
+whether or not the current CPU is in an RCU read-side critical section.
+The best that sync_sched_exp_handler() can do is to check
+for idle, on the off-chance that the CPU went idle while the IPI
+was in flight.
+If the CPU is idle, then sync_sched_exp_handler() reports
+the quiescent state.

+Otherwise, the handler invokes resched_cpu(), which forces +a future context switch. +At the time of the context switch, the CPU reports the quiescent state. +Should the CPU go offline first, it will report the quiescent state +at that time. + +
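+The RCU-sched handler's two cases can be sketched as below.  This is a
+simplification: the real sync_sched_exp_handler() also updates per-CPU
+rcu_data state, and rcu_report_exp_qs() is a hypothetical stand-in for
+rcu_report_exp_rdp().

	#include <linux/sched.h>
	#include <linux/smp.h>

	/* Sketch: report the QS if idle, else force a context switch. */
	static void sync_sched_exp_handler_sketch(void *info)
	{
		int cpu = smp_processor_id();

		if (is_idle_task(current))
			rcu_report_exp_qs(cpu);	/* hypothetical helper */
		else
			resched_cpu(cpu);	/* QS reported at context switch. */
	}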

+Expedited Grace Period and CPU Hotplug

+ +

+The expedited nature of expedited grace periods requires a much tighter
+interaction with CPU hotplug operations than is required for normal
+grace periods.
+In addition, attempting to IPI offline CPUs will result in splats, but
+failing to IPI online CPUs can result in too-short grace periods.
+Neither option is acceptable in production kernels.

+The interaction between expedited grace periods and CPU hotplug operations +is carried out at several levels: + +

    +
  1. The number of CPUs that have ever been online is tracked + by the rcu_state structure's ->ncpus + field. + The rcu_state structure's ->ncpus_snap + field tracks the number of CPUs that have ever been online + at the beginning of an RCU expedited grace period. + Note that this number never decreases, at least in the absence + of a time machine. +
  2. The identities of the CPUs that have ever been online are
	tracked by the rcu_node structure's
	->expmaskinitnext field.
	The rcu_node structure's ->expmaskinit
	field tracks the identities of the CPUs that were online
	at least once at the beginning of the most recent RCU
	expedited grace period.
	The rcu_state structure's ->ncpus and
	->ncpus_snap fields are used to detect when
	new CPUs have come online for the first time, that is,
	when the rcu_node structure's ->expmaskinitnext
	field has changed since the beginning of the last RCU
	expedited grace period, which triggers an update of each
	rcu_node structure's ->expmaskinit
	field from its ->expmaskinitnext field.
	(A code sketch of this detection follows the list.)
  3. Each rcu_node structure's ->expmaskinit + field is used to initialize that structure's + ->expmask at the beginning of each RCU + expedited grace period. + This means that only those CPUs that have been online at least + once will be considered for a given grace period. +
  4. Any CPU that goes offline will clear its bit in its leaf + rcu_node structure's ->qsmaskinitnext + field, so any CPU with that bit clear can safely be ignored. + However, it is possible for a CPU coming online or going offline + to have this bit set for some time while cpu_online + returns false. +
  5. For each non-idle CPU that RCU believes is currently online, the grace + period invokes smp_call_function_single(). + If this succeeds, the CPU was fully online. + Failure indicates that the CPU is in the process of coming online + or going offline, in which case it is necessary to wait for a + short time period and try again. + The purpose of this wait (or series of waits, as the case may be) + is to permit a concurrent CPU-hotplug operation to complete. +
  6. In the case of RCU-sched, one of the last acts of an outgoing CPU + is to invoke rcu_report_dead(), which + reports a quiescent state for that CPU. + However, this is likely paranoia-induced redundancy. +
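+The first-time-online detection from steps 1 and 2 can be sketched as
+follows, assuming the ->ncpus and ->ncpus_snap fields described above;
+the per-rcu_node propagation of ->expmaskinitnext into ->expmaskinit
+is elided.

	/* Sketch: notice CPUs that came online since the last expedited GP. */
	static void exp_note_new_cpus(struct rcu_state *rsp)
	{
		int ncpus = smp_load_acquire(&rsp->ncpus); /* Order against onlining. */

		if (ncpus == rsp->ncpus_snap)
			return;	/* No first-ever-online CPU since last GP. */
		rsp->ncpus_snap = ncpus;
		/* Copy each rcu_node's ->expmaskinitnext to ->expmaskinit here. */
	}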
+ + + + + + + + +
 
Quick Quiz:
+ Why all the dancing around with multiple counters and masks + tracking CPUs that were once online? + Why not just have a single set of masks tracking the currently + online CPUs and be done with it? +
Answer:
+	Maintaining a single set of masks tracking the online CPUs sounds
+	easier, at least until you try working out all the race conditions
+	between grace-period initialization and CPU-hotplug operations.
+	For example, suppose initialization is progressing down the
+	tree while a CPU-offline operation is progressing up the tree.
+	This situation can result in bits set at the top of the tree
+	that have no counterparts at the bottom of the tree.
+	Those bits will never be cleared, which will result in
+	grace-period hangs.
+	In short, that way lies madness, to say nothing of a great many
+	bugs, hangs, and deadlocks.

+ In contrast, the current multi-mask multi-counter scheme ensures + that grace-period initialization will always see consistent masks + up and down the tree, which brings significant simplifications + over the single-mask method. + +

+ This is an instance of + + deferring work in order to avoid synchronization. + Lazily recording CPU-hotplug events at the beginning of the next + grace period greatly simplifies maintenance of the CPU-tracking + bitmasks in the rcu_node tree. +

 
+ +

+Expedited Grace Period Refinements

+ +
    +
  1. Idle-CPU checks. +
  2. + Batching via sequence counter. +
  3. + Funnel locking and wait/wakeup. +
  4. Use of Workqueues. +
  5. Stall warnings. +
+ +

Idle-CPU Checks

+ +

+Each expedited grace period checks for idle CPUs when initially forming +the mask of CPUs to be IPIed and again just before IPIing a CPU +(both checks are carried out by sync_rcu_exp_select_cpus()). +If the CPU is idle at any time between those two times, the CPU will +not be IPIed. +Instead, the task pushing the grace period forward will include the +idle CPUs in the mask passed to rcu_report_exp_cpu_mult(). + +

+For RCU-sched, there is an additional check for idle in the IPI +handler, sync_sched_exp_handler(). +If the IPI has interrupted the idle loop, then +sync_sched_exp_handler() invokes rcu_report_exp_rdp() +to report the corresponding quiescent state. + +

+For RCU-preempt, there is no specific check for idle in the
+IPI handler (sync_rcu_exp_handler()), but because
+RCU read-side critical sections are not permitted within the
+idle loop, if sync_rcu_exp_handler() sees that the CPU is within an
+RCU read-side critical section, the CPU cannot possibly be idle.
+Otherwise, sync_rcu_exp_handler() invokes
+rcu_report_exp_rdp() to report the corresponding quiescent
+state, regardless of whether or not that quiescent state was due to
+the CPU being idle.

+In summary, RCU expedited grace periods check for idle when building +the bitmask of CPUs that must be IPIed, just before sending each IPI, +and (either explicitly or implicitly) within the IPI handler. + +

+Batching via Sequence Counter

+ +

+If each grace-period request was carried out separately, expedited +grace periods would have abysmal scalability and +problematic high-load characteristics. +Because each grace-period operation can serve an unlimited number of +updates, it is important to batch requests, so that a single +expedited grace-period operation will cover all requests in the +corresponding batch. + +

+This batching is controlled by a sequence counter named +->expedited_sequence in the rcu_state structure. +This counter has an odd value when there is an expedited grace period +in progress and an even value otherwise, so that dividing the counter +value by two gives the number of completed grace periods. +During any given update request, the counter must transition from +even to odd and then back to even, thus indicating that a grace +period has elapsed. +Therefore, if the initial value of the counter is s, +the updater must wait until the counter reaches at least the +value (s+3)&~0x1. +This counter is managed by the following access functions: + +

    +
  1. rcu_exp_gp_seq_start(), which marks the start of + an expedited grace period. +
  2. rcu_exp_gp_seq_end(), which marks the end of an + expedited grace period. +
  3. rcu_exp_gp_seq_snap(), which obtains a snapshot of + the counter. +
  4. rcu_exp_gp_seq_done(), which returns true + if a full expedited grace period has elapsed since the + corresponding call to rcu_exp_gp_seq_snap(). +
+ +
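+The counter arithmetic just described can be sketched as simplified,
+assumed forms of rcu_exp_gp_seq_snap() and rcu_exp_gp_seq_done() (the
+real functions also supply memory ordering):

	#include <linux/rcupdate.h>	/* ULONG_CMP_GE() */

	/* Sketch: first counter value implying a full GP after "seq". */
	static unsigned long exp_seq_snap(unsigned long seq)
	{
		return (seq + 3) & ~0x1UL;
	}

	/* Sketch: has the counter reached the snapshotted value "s"? */
	static bool exp_seq_done(unsigned long seq, unsigned long s)
	{
		return ULONG_CMP_GE(seq, s);	/* Wrap-safe comparison. */
	}

+For example, starting from seq = 0, exp_seq_snap() returns 2; the
+counter then advances 0 -> 1 (grace period starts) -> 2 (grace period
+ends), at which point exp_seq_done(2, 2) becomes true.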

+Again, only one request in a given batch need actually carry out
+a grace-period operation, which means there must be an efficient
+way to identify which of many concurrent requests will initiate
+the grace period, and an efficient way for the
+remaining requests to wait for that grace period to complete.
+However, that is the topic of the next section.

+Funnel Locking and Wait/Wakeup

+ +

+The natural way to sort out which of a batch of updaters will initiate
+the expedited grace period is to use the rcu_node combining
+tree, as implemented by the exp_funnel_lock() function.
+The first updater corresponding to a given grace period arriving
+at a given rcu_node structure records its desired grace-period
+sequence number in the ->exp_seq_rq field and moves up
+to the next level in the tree.
+Otherwise, if the ->exp_seq_rq field already contains
+the sequence number for the desired grace period or some later one,
+the updater blocks on one of four wait queues in the
+->exp_wq[] array, using the second-from-bottom
+and third-from-bottom bits as an index.
+An ->exp_lock field in the rcu_node structure
+synchronizes access to these fields.
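+The wait-queue index computation can be sketched as follows, reusing
+the exp_seq_done() helper from the sequence-counter sketch; the
+function name is illustrative, and the kernel uses swait primitives
+rather than wait_event():

	/* Sketch: ->exp_wq[] slot for desired sequence number "s". */
	static int exp_wq_index(unsigned long s)
	{
		return (s >> 1) & 0x3;	/* Bits 1 and 2 of "s". */
	}

	/* Usage sketch: block until the desired grace period completes. */
	/* wait_event(rnp->exp_wq[exp_wq_index(s)], exp_seq_done(seq, s)); */

+For example, a desired sequence number of two yields index one, which
+matches the quick quiz later in this section.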

+An empty rcu_node tree is shown in the following diagram, +with the white cells representing the ->exp_seq_rq field +and the red cells representing the elements of the +->exp_wq[] array. + +

Funnel0.svg + +

+The next diagram shows the situation after the arrival of Task A +and Task B at the leftmost and rightmost leaf rcu_node +structures, respectively. +The current value of the rcu_state structure's +->expedited_sequence field is zero, so adding three and +clearing the bottom bit results in the value two, which both tasks +record in the ->exp_seq_rq field of their respective +rcu_node structures: + +

Funnel1.svg + +

+Each of Tasks A and B will move up to the root +rcu_node structure. +Suppose that Task A wins, recording its desired grace-period sequence +number and resulting in the state shown below: + +

Funnel2.svg + +

+Task A now advances to initiate a new grace period, while Task B +moves up to the root rcu_node structure, and, seeing that +its desired sequence number is already recorded, blocks on +->exp_wq[1]. + + + + + + + + +
 
Quick Quiz:
+	Why ->exp_wq[1]?
+	Given that the value of these tasks' desired sequence number is
+	two, shouldn't they instead block on ->exp_wq[2]?
Answer:
+ No. + +

+ Recall that the bottom bit of the desired sequence number indicates + whether or not a grace period is currently in progress. + It is therefore necessary to shift the sequence number right one + bit position to obtain the number of the grace period. + This results in ->exp_wq[1]. +

 
+ +

+If Tasks C and D also arrive at this point, they will compute the +same desired grace-period sequence number, and see that both leaf +rcu_node structures already have that value recorded. +They will therefore block on their respective rcu_node +structures' ->exp_wq[1] fields, as shown below: + +

Funnel3.svg + +

+Task A now acquires the rcu_state structure's +->exp_mutex and initiates the grace period, which +increments ->expedited_sequence. +Therefore, if Tasks E and F arrive, they will compute +a desired sequence number of 4 and will record this value as +shown below: + +

Funnel4.svg + +

+Tasks E and F will propagate up the rcu_node
+combining tree, with Task F blocking on the root rcu_node
+structure and Task E waiting for Task A to finish so that
+it can start the next grace period.
+The resulting state is as shown below:

Funnel5.svg + +

+Once the grace period completes, Task A +starts waking up the tasks waiting for this grace period to complete, +increments the ->expedited_sequence, +acquires the ->exp_wake_mutex and then releases the +->exp_mutex. +This results in the following state: + +

Funnel6.svg + +

+Task E can then acquire ->exp_mutex and increment
+->expedited_sequence to the value three.
+If new tasks G and H arrive and move up the combining tree at the
+same time, the state will be as follows:

Funnel7.svg + +

+Note that three of the root rcu_node structure's +waitqueues are now occupied. +However, at some point, Task A will wake up the +tasks blocked on the ->exp_wq waitqueues, resulting +in the following state: + +

Funnel8.svg + +

+Execution will continue with Tasks E and H completing +their grace periods and carrying out their wakeups. + + + + + + + + +
 
Quick Quiz:
+ What happens if Task A takes so long to do its wakeups + that Task E's grace period completes? +
Answer:
+ Then Task E will block on the ->exp_wake_mutex, + which will also prevent it from releasing ->exp_mutex, + which in turn will prevent the next grace period from starting. + This last is important in preventing overflow of the + ->exp_wq[] array. +
 
+ +

Use of Workqueues

+ +

+In earlier implementations, the task requesting the expedited
+grace period also drove it to completion.
+This straightforward approach had the disadvantage of needing to
+account for signals sent to user tasks,
+so more recent implementations use the Linux kernel's
+workqueues.

+The requesting task still does counter snapshotting and funnel-lock
+processing, but the task reaching the top of the funnel lock
+does a schedule_work() (from _synchronize_rcu_expedited())
+so that a workqueue kthread does the actual grace-period processing.
+Because workqueue kthreads do not accept signals, grace-period-wait
+processing need not allow for signals.
+In addition, this approach allows wakeups for the previous expedited
+grace period to be overlapped with processing for the next expedited
+grace period.
+Because there are only four sets of waitqueues, it is necessary to
+ensure that the previous grace period's wakeups complete before the
+next grace period's wakeups start.
+This is handled by having the ->exp_mutex
+guard expedited grace-period processing and the
+->exp_wake_mutex guard wakeups.
+The key point is that the ->exp_mutex is not released
+until the first wakeup is complete, which means that the
+->exp_wake_mutex has already been acquired at that point.
+This approach ensures that the previous grace period's wakeups can
+be carried out while the current grace period is in progress, but
+that these wakeups will complete before the next grace period starts.
+This means that only three waitqueues are required, guaranteeing that
+the four that are provided are sufficient.
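+The workqueue handoff can be sketched as follows.  All names here are
+illustrative: the kernel's _synchronize_rcu_expedited() uses the funnel
+lock and swait queues rather than the globals shown.

	#include <linux/wait.h>
	#include <linux/workqueue.h>
	#include <linux/rcupdate.h>	/* ULONG_CMP_GE() */

	static DECLARE_WAIT_QUEUE_HEAD(exp_wq_demo);
	static unsigned long exp_seq_demo;	/* Even: no GP in progress. */

	struct exp_gp_work {
		struct work_struct work;
		unsigned long s;		/* Snapshotted sequence number. */
	};

	static void exp_gp_workfn(struct work_struct *wp)
	{
		/* Drive the expedited grace period here, then: */
		WRITE_ONCE(exp_seq_demo, READ_ONCE(exp_seq_demo) + 2);
		wake_up_all(&exp_wq_demo);	/* Kthread context: no signals. */
	}

	static void request_exp_gp(unsigned long s)
	{
		struct exp_gp_work egw = { .s = s };

		INIT_WORK_ONSTACK(&egw.work, exp_gp_workfn);
		schedule_work(&egw.work);
		/* Waiting keeps the on-stack work item alive until done. */
		wait_event(exp_wq_demo, ULONG_CMP_GE(READ_ONCE(exp_seq_demo), s));
		destroy_work_on_stack(&egw.work);
	}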

Stall Warnings

+ +

+Expediting grace periods does nothing to speed things up when RCU +readers take too long, and therefore expedited grace periods check +for stalls just as normal grace periods do. + + + + + + + + +
 
Quick Quiz:
+ But why not just let the normal grace-period machinery + detect the stalls, given that a given reader must block + both normal and expedited grace periods? +
Answer:
+ Because it is quite possible that at a given time there + is no normal grace period in progress, in which case the + normal grace period cannot emit a stall warning. +
 
+ +The synchronize_sched_expedited_wait() function loops waiting +for the expedited grace period to end, but with a timeout set to the +current RCU CPU stall-warning time. +If this time is exceeded, any CPUs or rcu_node structures +blocking the current grace period are printed. +Each stall warning results in another pass through the loop, but the +second and subsequent passes use longer stall times. + +
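+The stall-checking wait loop can be sketched as below, reusing the
+illustrative names from the workqueue sketch above.  The timeout
+arithmetic follows the description here; rcu_jiffies_till_stall_check()
+is RCU's internal stall-timeout helper, and print_exp_stall_info() is a
+hypothetical stand-in for the actual reporting code.

	/* Sketch: wait for the expedited GP, warning on stalls. */
	static void wait_exp_gp_with_stall_checks(unsigned long s)
	{
		unsigned long jiffies_stall = rcu_jiffies_till_stall_check();

		for (;;) {
			if (wait_event_timeout(exp_wq_demo,
					       ULONG_CMP_GE(READ_ONCE(exp_seq_demo), s),
					       jiffies_stall))
				return;		/* Expedited GP completed. */
			print_exp_stall_info();	/* hypothetical: dump blockers */
			jiffies_stall = 3 * rcu_jiffies_till_stall_check() + 3;
		}
	}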

+Summary

+ +

+Expedited grace periods use a sequence-number approach to promote +batching, so that a single grace-period operation can serve numerous +requests. +A funnel lock is used to efficiently identify the one task out of +a concurrent group that will request the grace period. +All members of the group will block on waitqueues provided in +the rcu_node structure. +The actual grace-period processing is carried out by a workqueue. + +

+CPU-hotplug operations are noted lazily in order to prevent the need +for tight synchronization between expedited grace periods and +CPU-hotplug operations. +The dyntick-idle counters are used to avoid sending IPIs to idle CPUs, +at least in the common case. +RCU-preempt and RCU-sched use different IPI handlers and different +code to respond to the state changes carried out by those handlers, +but otherwise use common code. + +

+Quiescent states are tracked using the rcu_node tree, +and once all necessary quiescent states have been reported, +all tasks waiting on this expedited grace period are awakened. +A pair of mutexes are used to allow one grace period's wakeups +to proceed concurrently with the next grace period's processing. + +

+This combination of mechanisms allows expedited grace periods to
+run reasonably efficiently.
+However, for non-time-critical tasks, normal grace periods should be
+used instead because their longer duration permits much higher
+degrees of batching, and thus much lower per-request overheads.

diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg
new file mode 100644
index 000000000000..98af66557908
--- /dev/null
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg
@@ -0,0 +1,275 @@
[SVG diagram elided: empty rcu_node tree; ->expedited_sequence: 0; ->exp_seq_rq cells ":0".]

diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg
new file mode 100644
index 000000000000..e0184a37aec7
--- /dev/null
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg
@@ -0,0 +1,275 @@
[SVG diagram elided: ->expedited_sequence: 0; leaf cells "A:2" and "B:2".]

diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg
new file mode 100644
index 000000000000..1bc3fed54d58
--- /dev/null
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg
@@ -0,0 +1,287 @@
[SVG diagram elided: ->expedited_sequence: 0; root cell ":2"; leaf cells "A:2" and "B:2".]

diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg
new file mode 100644
index 000000000000..6d8a1bffb3e4
--- /dev/null
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg
@@ -0,0 +1,323 @@
[SVG diagram elided: ->expedited_sequence: 0, GP: A; root ":2" with Task B; leaves ":2" with Tasks C and D.]

diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg
new file mode 100644
index 000000000000..44018fd6342b
--- /dev/null
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg
@@ -0,0 +1,323 @@
[SVG diagram elided: ->expedited_sequence: 1, GP: A; root ":2" with Task B; leaves "E:4" with Task C and "F:4" with Task D.]

diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg
new file mode 100644
index 000000000000..e5eef50454fb
--- /dev/null
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg
@@ -0,0 +1,335 @@
[SVG diagram elided: ->expedited_sequence: 1, GP: A,E; root ":4" with Tasks B and F; leaves ":4" with Tasks C and D.]

diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg
new file mode 100644
index 000000000000..fbd2c1892886
--- /dev/null
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg
@@ -0,0 +1,335 @@
[SVG diagram elided: ->expedited_sequence: 2, GP: E, Wakeup: A; root ":4" with Tasks B and F; leaves ":4" with Tasks C and D.]

diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg
new file mode 100644
index 000000000000..502e159ed278
--- /dev/null
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg
@@ -0,0 +1,347 @@
[SVG diagram elided: ->expedited_sequence: 3, GP: E,H, Wakeup: A; root ":6" with Tasks B, F, and G; leaves ":4" with Task C and ":6" with Task D.]

diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg
new file mode 100644
index 000000000000..677401551c7d
--- /dev/null
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg
@@ -0,0 +1,311 @@
[SVG diagram elided: ->expedited_sequence: 3, GP: E,H; root ":6" with Tasks F and G; leaves ":4" and ":6".]

-- 
2.5.2