On Wed, 2017-08-09 at 12:38 +0100, Tim Deegan wrote:
> Hi,
> 
> At 10:01 +0200 on 27 Jul (1501149684), Dario Faggioli wrote:
> > In Xen, that is impossible, and that's particularly problematic
> > when the system is idle (or lightly loaded), as CPUs that
> > are idle may never have the chance to tell RCU about their
> > quiescence, and grace periods could extend indefinitely!
> > 
> [...]
> > The first step for fixing this situation is for RCU to record,
> > at the beginning of a grace period, which CPUs are already idle.
> > In fact, being idle, they can't be in the middle of any read-side
> > critical section, and we don't have to wait for them to declare
> > a grace period finished.
> 
> AIUI this patch fixes a bug where:
>  - a CPU is idle/asleep;
>  - it is added to the cpumask of a new RCU grace period; and
>  - because the CPU is asleep, the grace period never ends.
> Have I understood?
> 
> I think we might be left with a race condition:
>  - CPU A is about to idle.  It runs the RCU softirq and
>    clears itself from the current grace period.
>  - CPU B ends the grace period and starts a new one.
>  - CPU A calls rcu_idle_enter() and sleeps.
>  - The new grace period never ends.
> 
So, I could not make this happen artificially, but I put together a
possible sequence of events that depicts this situation.

I'm putting it here, but also pasting it at https://pastebin.com/PHu6Guq0,
in case the email body gets mangled.

Lines starting with a '|' are the output of tracing, triggered by the
execution of the code/functions reported on the other lines. Each line
is prefixed with the CPU on which the event happens:

CPU1>  call_rcu(...)
CPU1>  |rcu_call fn=xxx, rcp_cur=-300, rdp_quiescbatch=-300
CPU1>  |do_softirq
CPU1>  |rcu_pending? yes (ret=2): no pending entries but new entries
CPU1>  |raise_softirq nr 3
CPU1>  |softirq_handler nr 3
CPU1>  rcu_process_callbacks();
CPU1>    |rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: yes, rdp_donelist: null
CPU1>    rdp->batch = rcp->cur + 1;
CPU1>    smp_rmb;
CPU1>    |rcu_next_batch, rcp_cur=-300, rdp_batch=-299, rcp_next_pending? no
CPU1>    if (!rcp->next_pending) {
CPU1>      rcp->next_pending = 1;
CPU1>      rcu_start_batch();
CPU1>        if (rcp->next_pending && rcp->completed == rcp->cur) {
CPU1>          rcp->next_pending = 0;
CPU1>          smp_wmb;
CPU1>          rcp->cur++;
CPU1>          smp_mb;
CPU1>          cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
CPU1>          |rcu_start_batch, rcp_cur=-299, rcp_cpumask=0x00000006 //Active CPUs: 1,2
CPU1>        }
CPU1>    }
CPU1>    rcu_check_quiescent_state();
CPU1>      |rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-300, rdp_qs_pending: no
CPU1>      if (rdp->quiescbatch != rcp->cur) {
CPU1>        rdp->qs_pending = 1;
CPU1>        rdp->quiescbatch = rcp->cur;
CPU1>        |rcu_grace_start, rcp_cur=-299
CPU1>      }

CPU0>  call_rcu(...)
CPU0>  |rcu_call fn=xxx, rcp_cur=-299, rdp_quiescbatch=-300
CPU0>  |do_softirq
CPU0>  |rcu_pending? yes (ret=2): no pending entries but new entries
CPU0>  |raise_softirq nr 3
CPU0>  |softirq_handler nr 3
CPU0>  rcu_process_callbacks();
CPU0>    |rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: yes, rdp_donelist: null
CPU0>    rdp->batch = rcp->cur + 1;
CPU0>    smp_rmb;
CPU0>    |rcu_next_batch, rcp_cur=-299, rdp_batch=-298, rcp_next_pending? no
CPU0>    if (!rcp->next_pending) {
CPU0>      rcp->next_pending = 1;
CPU0>      rcu_start_batch();
CPU0>        if (rcp->next_pending && rcp->completed == rcp->cur) //NO!!!
CPU0>    }
CPU0>    rcu_check_quiescent_state();
CPU0>      |rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-300, rdp_qs_pending: no
CPU0>      if (rdp->quiescbatch != rcp->cur) {
CPU0>        rdp->qs_pending = 1;
CPU0>        rdp->quiescbatch = rcp->cur;
CPU0>        |rcu_grace_start, rcp_cur=-299
CPU0>      }
CPU2>  |vcpu_block dXvY
CPU2>  |raise_softirq nr 1
CPU2>  |rcu_pending? yes (ret=4): waiting for CPU to quiesce (rdp_qs_pending=0)
CPU2>  |raise_softirq nr 3
CPU2>  |softirq_handler nr 1
CPU2>  |//scheduler stuff
CPU2>  |runstate_change dXvY running->blocked
CPU2>  |runstate_change d32767v2 runnable->running
CPU2>  idle_loop()
CPU2>    pm_idle() //e.g., mwait_idle()
CPU2>      cpufreq_dbs_timer_suspend();
CPU2>        |timer_stop t=0xffff830319acebc0, fn=0xffff82d080252da9
CPU2>        |timer_rm_entry t=0xffff830319acebc0, cpu=2, status=0x3
CPU2>      sched_tick_suspend();
CPU2>        rcu_idle_timer_start()
CPU2>      process_pending_softirqs();
CPU2>        |do_softirq, pending = 0x00000008
CPU2>        |rcu_pending? yes (ret=4): waiting for CPU to quiesce (rdp_qs_pending=0)
CPU2>        |raise_softirq nr 3
CPU2>        |softirq_handler nr 3
CPU2>        rcu_process_callbacks();
CPU2>          |rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: null, rdp_donelist: null
CPU2>          rcu_check_quiescent_state();
CPU2>            |rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-300, rdp_qs_pending: no
CPU2>            if (rdp->quiescbatch != rcp->cur) {
CPU2>              rdp->qs_pending = 1;
CPU2>              rdp->quiescbatch = rcp->cur;
CPU2>              |rcu_grace_start, rcp_cur=-299
CPU2>            }
CPU2>        |do_softirq, pending = 0x00000000
CPU2>        |rcu_pending? yes (ret=5): waiting for CPU to quiesce (rdp_qs_pending=1)
CPU2>        |raise_softirq nr 3
CPU2>        |softirq_handler nr 3
CPU2>        rcu_process_callbacks();
CPU2>          |rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: null, rdp_donelist: null
CPU2>          rcu_check_quiescent_state();
CPU2>            |rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-299, rdp_qs_pending: yes
CPU2>            if (rdp->quiescbatch != rcp->cur) // NO!!!
CPU2>            if (!rdp->qs_pending) // NO!!!
CPU2>            rdp->qs_pending = 0;
CPU2>            |rcu_grace_done, rcp_cur=-299, rcp_completed=-300, rdp_quiescbatch=-299
CPU2>            if (likely(rdp->quiescbatch == rcp->cur)) {
CPU2>              cpu_quiet();
CPU2>                cpumask_clear_cpu(cpu, &rcp->cpumask);
CPU2>                |rcu_cpu_quiet, rcp_cur=-299, rcp_cpumask=0x00000002 // Active CPU: 1
CPU2>                if (cpumask_empty(&rcp->cpumask)) // NO!!!

CPU1>  |do_softirq, pending = 0x00000010
CPU1>  |rcu_pending? yes (ret=5): waiting for CPU to quiesce (rdp_qs_pending=1)
CPU1>  |raise_softirq nr 3
CPU1>  |softirq_handler nr 3
CPU1>  rcu_process_callbacks();
CPU1>    |rcu_process_callbacks, rcp_completed=-300, rdp_batch=-299, rdp_curlist: yes, rdp_nxtlist: null, rdp_donelist: null
CPU1>    rcu_check_quiescent_state();
CPU1>      |rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-299, rdp_qs_pending: yes
CPU1>      if (rdp->quiescbatch != rcp->cur) // NO!!!
CPU1>      if (!rdp->qs_pending) // NO!!!
CPU1>      |rcu_grace_done, rcp_cur=-299, rcp_completed=-300, rdp_quiescbatch=-299
CPU1>      if (likely(rdp->quiescbatch == rcp->cur)) {
CPU1>        cpu_quiet(rdp->cpu, rcp);
CPU1>          cpumask_clear_cpu(cpu, &rcp->cpumask);
CPU1>          |rcu_cpu_quiet, rcp_cur=-299, rcp_cpumask=0x00000000 //No more active CPU
CPU1>          if (cpumask_empty(&rcp->cpumask)) {
CPU1>            rcp->completed = rcp->cur;
CPU1>            rcu_start_batch();
CPU1>              if (rcp->next_pending && rcp->completed == rcp->cur) {
CPU1>                rcp->next_pending = 0;
CPU1>                smp_wmb;

CPU2>        |do_softirq, pending = 0x00000000
CPU2>        |rcu_pending? no
CPU2>      local_irq_disable();
CPU2>      if (!cpu_is_haltable(cpu)) //NO!!!
CPU2>      |rcu_pending? no
CPU2>      |pm_idle_start c2, pm_tick=3720475837, exp=1708us, pred=1024us
CPU2>      if (cpu_is_haltable(cpu)) { // <--- rcu_pending: quiescbatch=-299, rcp->cur=-299  [1]
CPU2>        |rcu_pending? no

CPU1>                rcp->cur++;                                                             [2]
CPU1>                smp_mb;
CPU1>                cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
CPU1>                |rcu_start_batch, rcp_cur=-298, rcp_cpumask=0x00000007
CPU1>              }
CPU1>            }                                                                           [3]

CPU2>        rcu_idle_enter(cpu);
CPU2>        mwait_idle_with_hints();
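To make the interleaving easier to reason about, here is a minimal,
single-threaded C toy that mimics the bookkeeping above. It is *not*
the actual xen/common/rcupdate.c code: it just borrows the field and
function names that appear in the trace (cur, completed, cpumask,
idle_cpumask, quiescbatch, qs_pending, rcu_start_batch(), cpu_quiet(),
rcu_pending()), replays the events in the order shown, and leaves out
everything else (locking, softirqs, callback lists):

/* Build with: gcc -Wall -o rcu-race rcu-race.c  (a model, not Xen code) */
#include <stdio.h>

/* Toy stand-ins for rcu_ctrlblk / rcu_data: only the fields involved in
 * the grace-period bookkeeping shown in the trace. */
struct ctrl { long cur, completed; int next_pending; unsigned cpumask, idle_cpumask; };
struct data { long quiescbatch; int qs_pending; };

static struct ctrl rcp = { .cur = -300, .completed = -300 };
static struct data rdp[3] = { { -300, 0 }, { -300, 0 }, { -300, 0 } };
static const unsigned online = 0x7;                /* CPUs 0, 1, 2 */

/* rcu_start_batch(): start the next batch, sampling the non-idle CPUs. */
static void start_batch(void)
{
    if (rcp.next_pending && rcp.completed == rcp.cur)
    {
        rcp.next_pending = 0;
        rcp.cur++;
        rcp.cpumask = online & ~rcp.idle_cpumask;
        printf("start_batch: cur=%ld cpumask=%#x\n", rcp.cur, rcp.cpumask);
    }
}

/* First half of rcu_check_quiescent_state(): notice a new grace period. */
static void note_new_gp(int cpu)
{
    if (rdp[cpu].quiescbatch != rcp.cur)
    {
        rdp[cpu].qs_pending = 1;
        rdp[cpu].quiescbatch = rcp.cur;
    }
}

/* Second half + cpu_quiet(): report quiescence, maybe end the batch. */
static void quiesce(int cpu)
{
    if (rdp[cpu].quiescbatch != rcp.cur || !rdp[cpu].qs_pending)
        return;
    rdp[cpu].qs_pending = 0;
    rcp.cpumask &= ~(1u << cpu);
    printf("cpu_quiet(%d): cpumask=%#x\n", cpu, rcp.cpumask);
    if (!rcp.cpumask)
    {
        rcp.completed = rcp.cur;
        start_batch();
    }
}

/* The quiescence part of rcu_pending(), i.e., the check quoted below. */
static int rcu_pending(int cpu)
{
    return rdp[cpu].quiescbatch != rcp.cur || rdp[cpu].qs_pending;
}

int main(void)
{
    rcp.idle_cpumask = 0x1;          /* CPU0 is idle to begin with */

    /* CPU1: call_rcu() -> queues a batch and starts it (cur: -300 -> -299) */
    rcp.next_pending = 1;
    start_batch();
    note_new_gp(1);

    /* CPU0: wakes up, call_rcu() -> can only queue the *next* batch,
     * since completed != cur (the //NO!!! in the trace). */
    rcp.idle_cpumask = 0x0;
    rcp.next_pending = 1;
    start_batch();                   /* does nothing yet */
    note_new_gp(0);

    /* CPU2: on its way to idle, quiesces for the current batch */
    note_new_gp(2);
    quiesce(2);

    /* [1]: CPU2's last rcu_pending() check before sleeping */
    int cpu2_sees_pending = rcu_pending(2);        /* == 0 */

    /* CPU1: quiesces too -> ends -299 and starts -298 ([2]/[3]); CPU2 has
     * not called rcu_idle_enter() yet, so it is sampled into the mask. */
    quiesce(1);

    /* CPU2: rcu_idle_enter() + mwait; too late, -298 already counts it. */
    rcp.idle_cpumask |= 1u << 2;

    printf("CPU2 saw rcu_pending()==%d and went to sleep, but batch %ld "
           "still waits for cpumask=%#x\n",
           cpu2_sees_pending, rcp.cur, rcp.cpumask);
    return 0;
}

Running it prints the same cpumask values as the trace, and its last
line is exactly the problem: the cpumask sampled for batch -298 still
contains CPU2, which is already asleep and never noticed the new grace
period.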
So, this is basically what's described above, with my CPU1 being Tim's
CPU B (the one that ends the grace period and starts a new one), and my
CPU2 being Tim's CPU A (the one that goes idle). There's the need for
CPU0 issuing a call_rcu() and queueing a new batch (or CPU1 won't start
any new one).

Basically, for the race to occur (in [3]), it is necessary that [2]
(CPU1 doing rcp->cur++) happens after [1] (CPU2's last call to
rcu_pending() before it goes idle, after having cleared itself from the
mask). In fact, if that is not the case, rcu_pending() on CPU2 will say
'yes', because of this check:

  if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)

which prevents the CPU from going idle (and will, at least, lead to it
going through rcu_check_quiescent_state() twice, clearing itself from
the new grace period too).

Therefore, it looks to me that the race can be avoided by making sure
that there is at least one check of rcu_pending() after a CPU has
called rcu_idle_enter(). This basically means moving rcu_idle_enter()
up; I was actually thinking of moving it inside the tick-stopping logic
too. This way, when CPU1 samples the cpumask for the new grace period,
one of the following holds:

 1) CPU2 will be in idle_cpumask already, and it actually goes idle;

 2) CPU2 will be in idle_cpumask, but then it does not go idle
    (cpu_is_haltable() returning false) for various reasons. This may
    look non-ideal, but if the CPU is here, trying to go idle, it is
    not holding references to read-side critical sections, or at least
    we can consider it to be quiescing, so it's ok to ignore it for
    this grace period;

 3) CPU2 will not be in idle_cpumask yet, and so it will be in the
    sampled mask; but then its first check of rcu_pending() will lead
    to it acknowledging quiescence and calling cpu_quiet().

Which looks fine to me. Or am I still missing something?

Thanks and Regards,
Dario
-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)