* [PATCH 0/5] xen: RCU: x86/ARM: Add support of rcu_idle_{enter, exit}
@ 2017-07-27  8:01 Dario Faggioli
  2017-07-27  8:01 ` [PATCH 1/5] xen: in do_softirq() sample smp_processor_id() once and for all Dario Faggioli
                   ` (4 more replies)
  0 siblings, 5 replies; 38+ messages in thread
From: Dario Faggioli @ 2017-07-27  8:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, Jan Beulich

Hello everyone,

It took some time and, honestly, it was a bit of a nightmare, but I think I
managed to find a way to get our RCU implementation into a state where it no
longer works only "by chance" (as it does right now! :-O).

A bit of background in this post:

 https://www.mail-archive.com/xen-devel@lists.xen.org/msg105388.html

About the actual patch series, apart from the first two preparatory patches
(patch 1 was part of another series, but it's more convenient for me to send it
in here), this is all about the following:

*) making sure that CPUs that are idle when an RCU grace period starts don't
   contribute to making that grace period longer than it needs to be. In fact,
   if they're idle, they can't hold references, and can safely be ignored. We
   do this in patch 3, using a cpumask, as Linux did on s390 (which was
   tickless already, back when we imported the RCU code) and in the early days
   of dyntick support on x86;

*) making sure that CPUs that have RCU-related work to do don't go idle
   without actually doing it. This is rather simple: we just have to start
   using a function that we imported but never call right now:
   rcu_needs_cpu(). However, given that Xen is tickless (i.e., we stop all the
   timers when a CPU becomes idle), there's the problem of how to deal with the
   CPUs that have RCU callbacks queued. In fact, just using rcu_needs_cpu(), as
   we do, means they'll spin until the end of the grace period. Although that
   is correct, we clearly don't want it, and yet, in patch 4, we let it
   happen;

*) finding a solution for the problem that using rcu_needs_cpu() (which is
   _correct_ and _necessary_) poses for CPUs with queued callbacks. This is what
   patch 5 does. Basically, we introduce a kind of periodic tick. This timer,
   however, is only started on those CPUs, and only when they enter an idle sleep.
   The idea is that, if they don't wake up soon enough by themselves (well, not
   that soon: the period is 10 msec), we poke them and give them a chance to
   check whether the grace period has actually ended, and to invoke the
   callbacks.

Without this series applied, this is the situation on x86, looking at when
complete_domain_destroy() is invoked via call_rcu, and at when it is actually
executed (printk added by me for the sake of this experiment; they're not
in the patches):

Idle:
(XEN) [  238.758551] XXX calling complete_domain_destroy(d5), from CPU 8, at 238758535948 ns
(XEN) [  239.338958] XXX executing complete_domain_destroy(d5), from CPU 8, at 239338943633 ns
With CPU load:
(XEN) [  322.740427] XXX calling complete_domain_destroy(d7), from CPU 13, at 322740410755 ns
(XEN) [  322.742598] XXX executing complete_domain_destroy(d7), from CPU 13, at 322742595289 ns

When the system is idle, the delay between the call and the execution is
significant, around 580 ms (which is counterintuitive, as the system is
idle!!). Moreover, the behavior differs a lot between the idle case and the
loaded case, where the delay is only around 2 ms.

With this series applied, here's what happen:

Idle:
(XEN) [  106.341590] XXX calling complete_domain_destroy(d1), from CPU 9, at 106341573553 ns
(XEN) [  106.344391] XXX executing complete_domain_destroy(d1), from CPU 9, at 106344387574 ns
With CPU load:
(XEN) [  176.166842] XXX calling complete_domain_destroy(d2), from CPU 13, at 176166826183 ns
(XEN) [  176.167571] XXX executing complete_domain_destroy(d2), from CPU 13, at 176167568269 ns

Which I call much better! :-)

This patch series addresses the XEN-27 issue, which I think Julien wants to
consider a blocker for 4.10:

 https://xenproject.atlassian.net/browse/XEN-27

There is a git branch available, with the series in, here:

git://xenbits.xen.org/people/dariof/xen.git rel/rcu/introduce-idle-enter-exit
http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/rel/rcu/introduce-idle-enter-exit
https://travis-ci.org/fdario/xen/builds/258044027

And, finally, Stefano and Julien: I have compile-tested, but not
runtime-tested, the patches on ARM.

Thanks and Regards,
Dario
---
Dario Faggioli (5):
      xen: in do_softirq() sample smp_processor_id() once and for all.
      xen: ARM: suspend the tick (if in use) when going idle.
      xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
      xen: RCU: don't let a CPU with a callback go idle.
      xen: RCU: avoid busy waiting until the end of grace period.

 xen/arch/arm/domain.c         |   33 ++++++++++++-----
 xen/arch/x86/acpi/cpu_idle.c  |   31 +++++++++++-----
 xen/arch/x86/cpu/mwait-idle.c |   15 ++++++--
 xen/arch/x86/domain.c         |    8 ++++
 xen/common/rcupdate.c         |   80 +++++++++++++++++++++++++++++++++++++++--
 xen/common/softirq.c          |    8 +---
 xen/include/xen/rcupdate.h    |    6 +++
 xen/include/xen/sched.h       |    6 ++-
 8 files changed, 153 insertions(+), 34 deletions(-)
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/5] xen: in do_softirq() sample smp_processor_id() once and for all.
  2017-07-27  8:01 [PATCH 0/5] xen: RCU: x86/ARM: Add support of rcu_idle_{enter, exit} Dario Faggioli
@ 2017-07-27  8:01 ` Dario Faggioli
  2017-07-27  8:01 ` [PATCH 2/5] xen: ARM: suspend the tick (if in use) when going idle Dario Faggioli
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 38+ messages in thread
From: Dario Faggioli @ 2017-07-27  8:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, George Dunlap, Andrew Cooper, Tim Deegan,
	Julien Grall, Jan Beulich

In fact, right now, we read it at every iteration of the loop.
The reason it's done like this lies in how context switching was
handled on IA64 (see commit ae9bfcdc, "[XEN] Various softirq cleanups" [1]).

However:
1) we don't have IA64 any longer, and all the architectures that
   we do support are fine with sampling once and for all;
2) sampling at every iteration (slightly) affects performance;
3) sampling at every iteration is misleading, as it makes people
   believe that it is currently possible that SCHEDULE_SOFTIRQ
   moves the execution flow to another CPU (and the comment,
   by reinforcing this belief, makes things even worse!).

Therefore, let's:
- do the sampling only once, and remove the comment;
- leave an ASSERT() around, so that, if context switching
  logic changes (in current or new arches), we will notice.

[1] Some more (historical) information here:
    http://old-list-archives.xenproject.org/archives/html/xen-devel/2006-06/msg01262.html

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Tim Deegan <tim@xen.org>
---
This has been submitted already, as a part of another series. Discussion is here:
 https://lists.xen.org/archives/html/xen-devel/2017-06/msg00102.html

For the super lazy, Jan's last words in that thread were these:
 "I've voiced my opinion, but I don't mean to block the patch. After
  all there's no active issue the change introduces."
 (https://lists.xen.org/archives/html/xen-devel/2017-06/msg00797.html)

Since then:
- changed "once and for all" to "only once", as requested by George (and
  applied his Reviewed-by, as he said I could).
---
 xen/common/softirq.c |    8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index ac12cf8..67c84ba 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -27,16 +27,12 @@ static DEFINE_PER_CPU(unsigned int, batching);
 
 static void __do_softirq(unsigned long ignore_mask)
 {
-    unsigned int i, cpu;
+    unsigned int i, cpu = smp_processor_id();
     unsigned long pending;
 
     for ( ; ; )
     {
-        /*
-         * Initialise @cpu on every iteration: SCHEDULE_SOFTIRQ may move
-         * us to another processor.
-         */
-        cpu = smp_processor_id();
+        ASSERT(cpu == smp_processor_id());
 
         if ( rcu_pending(cpu) )
             rcu_check_callbacks(cpu);



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 2/5] xen: ARM: suspend the tick (if in use) when going idle.
  2017-07-27  8:01 [PATCH 0/5] xen: RCU: x86/ARM: Add support of rcu_idle_{enter, exit} Dario Faggioli
  2017-07-27  8:01 ` [PATCH 1/5] xen: in do_softirq() sample smp_processor_id() once and for all Dario Faggioli
@ 2017-07-27  8:01 ` Dario Faggioli
  2017-07-31 20:59   ` Stefano Stabellini
  2017-07-27  8:01 ` [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started Dario Faggioli
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 38+ messages in thread
From: Dario Faggioli @ 2017-07-27  8:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Julien Grall, Stefano Stabellini

Since commit 964fae8ac ("cpuidle: suspend/resume scheduler
tick timer during cpu idle state entry/exit"), if a scheduler
has a periodic tick timer, we stop it when going idle.

This, however, is only true for x86. Make it true for ARM as
well.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
---
 xen/arch/arm/domain.c |   29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 2dc8b0a..fce29cb 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -39,6 +39,25 @@
 
 DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
 
+static void do_idle(void)
+{
+    unsigned int cpu = smp_processor_id();
+
+    sched_tick_suspend();
+    /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
+    process_pending_softirqs();
+
+    local_irq_disable();
+    if ( cpu_is_haltable(cpu) )
+    {
+        dsb(sy);
+        wfi();
+    }
+    local_irq_enable();
+
+    sched_tick_resume();
+}
+
 void idle_loop(void)
 {
     unsigned int cpu = smp_processor_id();
@@ -52,15 +71,7 @@ void idle_loop(void)
         if ( unlikely(tasklet_work_to_do(cpu)) )
             do_tasklet();
         else
-        {
-            local_irq_disable();
-            if ( cpu_is_haltable(cpu) )
-            {
-                dsb(sy);
-                wfi();
-            }
-            local_irq_enable();
-        }
+            do_idle();
 
         do_softirq();
         /*



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-07-27  8:01 [PATCH 0/5] xen: RCU: x86/ARM: Add support of rcu_idle_{enter, exit} Dario Faggioli
  2017-07-27  8:01 ` [PATCH 1/5] xen: in do_softirq() sample smp_processor_id() once and for all Dario Faggioli
  2017-07-27  8:01 ` [PATCH 2/5] xen: ARM: suspend the tick (if in use) when going idle Dario Faggioli
@ 2017-07-27  8:01 ` Dario Faggioli
  2017-07-31 21:17   ` Stefano Stabellini
                     ` (2 more replies)
  2017-07-27  8:01 ` [PATCH 4/5] xen: RCU: don't let a CPU with a callback go idle Dario Faggioli
  2017-07-27  8:01 ` [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period Dario Faggioli
  4 siblings, 3 replies; 38+ messages in thread
From: Dario Faggioli @ 2017-07-27  8:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Julien Grall, Stefano Stabellini, Jan Beulich

Xen is a tickless (micro-)kernel. This means that, when a CPU
becomes idle, we stop all the activity on it, including any
periodic tick or timer.

When we imported RCU from Linux, Linux (x86) was a ticking
kernel, i.e., there was a periodic timer tick always running,
even on totally idle CPUs. This was bad from a power efficiency
perspective, but it's what made it possible to monitor the
quiescent states of all the CPUs, and hence tell when an RCU
grace period ends.

In Xen, that is impossible, and this is particularly problematic
on idle (or lightly loaded) systems, as CPUs that are idle may
never get the chance to tell RCU about their quiescence, and
grace periods could extend indefinitely!

This has led, on x86, to long (and unpredictable) delays between
RCU callback queueing and invocation. On ARM, we actually see
infinite grace periods (e.g., complete_domain_destroy() may
never actually be invoked on an idle system). See here:

 https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg02454.html

The first step for fixing this situation is for RCU to record,
at the beginning of a grace period, which CPUs are already idle.
In fact, being idle, they can't be in the middle of any read-side
critical section, and we don't have to wait for them to declare
a grace period finished.

This is tracked in a cpumask, in a way very similar to how Linux
achieved the same on s390 --which indeed was tickless already,
even back then-- and to what it started doing for x86, from
2.6.21 onward (see commit 79bf2bb3 "tick-management:
dyntick / highres functionality").

While there, also adopt the memory barrier introduced by Linux
commit c3f59023 ("Fix RCU race in access of nohz_cpu_mask").

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/arm/domain.c         |    2 ++
 xen/arch/x86/acpi/cpu_idle.c  |   25 +++++++++++++++++--------
 xen/arch/x86/cpu/mwait-idle.c |    9 ++++++++-
 xen/arch/x86/domain.c         |    8 +++++++-
 xen/common/rcupdate.c         |   28 ++++++++++++++++++++++++++--
 xen/include/xen/rcupdate.h    |    3 +++
 6 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index fce29cb..666b7ef 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -50,8 +50,10 @@ static void do_idle(void)
     local_irq_disable();
     if ( cpu_is_haltable(cpu) )
     {
+        rcu_idle_enter(cpu);
         dsb(sy);
         wfi();
+        rcu_idle_exit(cpu);
     }
     local_irq_enable();
 
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 482b8a7..04c52e8 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -418,14 +418,16 @@ static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
     mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);
 }
 
-static void acpi_idle_do_entry(struct acpi_processor_cx *cx)
+static void acpi_idle_do_entry(unsigned int cpu, struct acpi_processor_cx *cx)
 {
+    rcu_idle_enter(cpu);
+
     switch ( cx->entry_method )
     {
     case ACPI_CSTATE_EM_FFH:
         /* Call into architectural FFH based C-state */
         acpi_processor_ffh_cstate_enter(cx);
-        return;
+        break;
     case ACPI_CSTATE_EM_SYSIO:
         /* IO port based C-state */
         inb(cx->address);
@@ -433,12 +435,14 @@ static void acpi_idle_do_entry(struct acpi_processor_cx *cx)
            because chipsets cannot guarantee that STPCLK# signal
            gets asserted in time to freeze execution properly. */
         inl(pmtmr_ioport);
-        return;
+        break;
     case ACPI_CSTATE_EM_HALT:
         safe_halt();
         local_irq_disable();
-        return;
+        break;
     }
+
+    rcu_idle_exit(cpu);
 }
 
 static int acpi_idle_bm_check(void)
@@ -540,7 +544,8 @@ void update_idle_stats(struct acpi_processor_power *power,
 
 static void acpi_processor_idle(void)
 {
-    struct acpi_processor_power *power = processor_powers[smp_processor_id()];
+    unsigned int cpu = smp_processor_id();
+    struct acpi_processor_power *power = processor_powers[cpu];
     struct acpi_processor_cx *cx = NULL;
     int next_state;
     uint64_t t1, t2 = 0;
@@ -563,7 +568,11 @@ static void acpi_processor_idle(void)
         if ( pm_idle_save )
             pm_idle_save();
         else
+        {
+            rcu_idle_enter(cpu);
             safe_halt();
+            rcu_idle_exit(cpu);
+        }
         return;
     }
 
@@ -579,7 +588,7 @@ static void acpi_processor_idle(void)
      */
     local_irq_disable();
 
-    if ( !cpu_is_haltable(smp_processor_id()) )
+    if ( !cpu_is_haltable(cpu) )
     {
         local_irq_enable();
         sched_tick_resume();
@@ -610,7 +619,7 @@ static void acpi_processor_idle(void)
             update_last_cx_stat(power, cx, t1);
 
             /* Invoke C2 */
-            acpi_idle_do_entry(cx);
+            acpi_idle_do_entry(cpu, cx);
             /* Get end time (ticks) */
             t2 = cpuidle_get_tick();
             trace_exit_reason(irq_traced);
@@ -672,7 +681,7 @@ static void acpi_processor_idle(void)
         }
 
         /* Invoke C3 */
-        acpi_idle_do_entry(cx);
+        acpi_idle_do_entry(cpu, cx);
 
         if ( (cx->type == ACPI_STATE_C3) &&
              power->flags.bm_check && power->flags.bm_control )
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index 762dff1..ae9e92b 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -735,8 +735,11 @@ static void mwait_idle(void)
 	if (!cx) {
 		if (pm_idle_save)
 			pm_idle_save();
-		else
+		else {
+			rcu_idle_enter(cpu);
 			safe_halt();
+			rcu_idle_exit(cpu);
+		}
 		return;
 	}
 
@@ -756,6 +759,8 @@ static void mwait_idle(void)
 		return;
 	}
 
+	rcu_idle_enter(cpu);
+
 	eax = cx->address;
 	cstate = ((eax >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
 
@@ -787,6 +792,8 @@ static void mwait_idle(void)
 		irq_traced[0], irq_traced[1], irq_traced[2], irq_traced[3]);
 
 	/* Now back in C0. */
+	rcu_idle_exit(cpu);
+
 	update_idle_stats(power, cx, before, after);
 	local_irq_enable();
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index dd8bf13..a6c0f66 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -73,9 +73,15 @@ void (*dead_idle) (void) __read_mostly = default_dead_idle;
 
 static void default_idle(void)
 {
+    unsigned int cpu = smp_processor_id();
+
     local_irq_disable();
-    if ( cpu_is_haltable(smp_processor_id()) )
+    if ( cpu_is_haltable(cpu) )
+    {
+        rcu_idle_enter(cpu);
         safe_halt();
+        rcu_idle_exit(cpu);
+    }
     else
         local_irq_enable();
 }
diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index 8cc5a82..f0fdc87 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -52,7 +52,8 @@ static struct rcu_ctrlblk {
     int  next_pending;  /* Is the next batch already waiting?         */
 
     spinlock_t  lock __cacheline_aligned;
-    cpumask_t   cpumask; /* CPUs that need to switch in order    */
+    cpumask_t   cpumask; /* CPUs that need to switch in order ... */
+    cpumask_t   idle_cpumask; /* ... unless they are already idle */
     /* for current batch to proceed.        */
 } __cacheline_aligned rcu_ctrlblk = {
     .cur = -300,
@@ -248,7 +249,14 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
         smp_wmb();
         rcp->cur++;
 
-        cpumask_copy(&rcp->cpumask, &cpu_online_map);
+        /*
+         * Accessing idle_cpumask before incrementing rcp->cur needs a
+         * barrier. Otherwise it can cause tickless idle CPUs to be
+         * included in rcp->cpumask, which will extend grace periods
+         * unnecessarily.
+         */
+        smp_mb();
+        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
     }
 }
 
@@ -474,7 +482,23 @@ static struct notifier_block cpu_nfb = {
 void __init rcu_init(void)
 {
     void *cpu = (void *)(long)smp_processor_id();
+
+    cpumask_setall(&rcu_ctrlblk.idle_cpumask);
     cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
     register_cpu_notifier(&cpu_nfb);
     open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
 }
+
+/*
+ * The CPU is becoming idle, so no more read side critical
+ * sections, and one more step toward grace period.
+ */
+void rcu_idle_enter(unsigned int cpu)
+{
+    cpumask_set_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+}
+
+void rcu_idle_exit(unsigned int cpu)
+{
+    cpumask_clear_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+}
diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index 557a7b1..561ac43 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -146,4 +146,7 @@ void call_rcu(struct rcu_head *head,
 
 int rcu_barrier(void);
 
+void rcu_idle_enter(unsigned int cpu);
+void rcu_idle_exit(unsigned int cpu);
+
 #endif /* __XEN_RCUPDATE_H */



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 4/5] xen: RCU: don't let a CPU with a callback go idle.
  2017-07-27  8:01 [PATCH 0/5] xen: RCU: x86/ARM: Add support of rcu_idle_{enter, exit} Dario Faggioli
                   ` (2 preceding siblings ...)
  2017-07-27  8:01 ` [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started Dario Faggioli
@ 2017-07-27  8:01 ` Dario Faggioli
  2017-08-07  8:38   ` Jan Beulich
  2017-07-27  8:01 ` [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period Dario Faggioli
  4 siblings, 1 reply; 38+ messages in thread
From: Dario Faggioli @ 2017-07-27  8:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, Jan Beulich

If a CPU has a callback queued, it must be ready to invoke
it as soon as all the other CPUs involved in the grace period
have gone through a quiescent state.

But if we let such a CPU go idle, we can't really tell when (if!)
it will realize that it is actually time to invoke the callback.
To solve this problem, a CPU that has a callback queued (and has
already gone through a quiescent state itself) will stay online
until the grace period ends, and the callback can be invoked.

This is similar to what Linux does, and is the second and last
step for fixing the overly long (or infinite!) grace periods.
The catch is that, within Linux, there is the tick, so all that
is necessary is to not stop the tick for the CPU (even if it
has gone idle). In Xen, there's no tick, so we must prevent the
CPU from going idle entirely, letting it spin on rcu_pending(),
consuming power and causing overhead.

In this commit, we implement the above, using rcu_needs_cpu(),
in a way similar to how it is used in Linux. This is correct,
useful and not wasteful for CPUs that participate in the grace
period but do not have a callback queued. For the ones that do
have callbacks, an optimization that avoids having to spin is
introduced in the next commit.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
 xen/include/xen/sched.h |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 6673b27..df93a86 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -842,7 +842,8 @@ uint64_t get_cpu_idle_time(unsigned int cpu);
 
 /*
  * Used by idle loop to decide whether there is work to do:
- *  (1) Run softirqs; or (2) Play dead; or (3) Run tasklets.
+ *  (1) Deal with RCU; (2) or run softirqs; or (3) Play dead;
+ *  or (4) Run tasklets.
  *
  * About (3), if a tasklet is enqueued, it will be scheduled
  * really really soon, and hence it's pointless to try to
@@ -850,7 +851,8 @@ uint64_t get_cpu_idle_time(unsigned int cpu);
  * the tasklet_work_to_do() helper).
  */
 #define cpu_is_haltable(cpu)                    \
-    (!softirq_pending(cpu) &&                   \
+    (!rcu_needs_cpu(cpu) &&                     \
+     !softirq_pending(cpu) &&                   \
      cpu_online(cpu) &&                         \
      !per_cpu(tasklet_work_to_do, cpu))
 



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-07-27  8:01 [PATCH 0/5] xen: RCU: x86/ARM: Add support of rcu_idle_{enter, exit} Dario Faggioli
                   ` (3 preceding siblings ...)
  2017-07-27  8:01 ` [PATCH 4/5] xen: RCU: don't let a CPU with a callback go idle Dario Faggioli
@ 2017-07-27  8:01 ` Dario Faggioli
  2017-07-31 21:20   ` Stefano Stabellini
                     ` (2 more replies)
  4 siblings, 3 replies; 38+ messages in thread
From: Dario Faggioli @ 2017-07-27  8:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Julien Grall, Stefano Stabellini, Jan Beulich

Instead of having the CPU where a callback is queued busy-loop
on rcu_pending(), use a timer.

In fact, we let the CPU go idle, but we program a timer that
will periodically wake it up, to check whether the grace
period has actually ended.

It is kind of similar to introducing a periodic tick, but
with a much more limited scope, and a lot less overhead. In
fact, this timer is:
- only active for the CPU(s) that have callbacks queued,
  waiting for the end of a grace period;
- only active when those CPU(s) are idle (and stopped as
  soon as they resume execution).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/arm/domain.c         |    4 ++-
 xen/arch/x86/acpi/cpu_idle.c  |    6 +++--
 xen/arch/x86/cpu/mwait-idle.c |    6 +++--
 xen/common/rcupdate.c         |   52 ++++++++++++++++++++++++++++++++++++++++-
 xen/include/xen/rcupdate.h    |    3 ++
 5 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 666b7ef..01da96e 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -43,8 +43,9 @@ static void do_idle(void)
 {
     unsigned int cpu = smp_processor_id();
 
+    rcu_idle_timer_start();
     sched_tick_suspend();
-    /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
+    /* Timer related operations can raise TIMER_SOFTIRQ. Process it now. */
     process_pending_softirqs();
 
     local_irq_disable();
@@ -58,6 +59,7 @@ static void do_idle(void)
     local_irq_enable();
 
     sched_tick_resume();
+    rcu_idle_timer_stop();
 }
 
 void idle_loop(void)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 04c52e8..b97986f 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -576,10 +576,10 @@ static void acpi_processor_idle(void)
         return;
     }
 
+    rcu_idle_timer_start();
     cpufreq_dbs_timer_suspend();
-
     sched_tick_suspend();
-    /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
+    /* Timer related operations can raise TIMER_SOFTIRQ. Process it now. */
     process_pending_softirqs();
 
     /*
@@ -593,6 +593,7 @@ static void acpi_processor_idle(void)
         local_irq_enable();
         sched_tick_resume();
         cpufreq_dbs_timer_resume();
+        rcu_idle_timer_stop();
         return;
     }
 
@@ -726,6 +727,7 @@ static void acpi_processor_idle(void)
 
     sched_tick_resume();
     cpufreq_dbs_timer_resume();
+    rcu_idle_timer_stop();
 
     if ( cpuidle_current_governor->reflect )
         cpuidle_current_governor->reflect(power);
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index ae9e92b..c426e41 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -743,10 +743,10 @@ static void mwait_idle(void)
 		return;
 	}
 
+	rcu_idle_timer_start();
 	cpufreq_dbs_timer_suspend();
-
 	sched_tick_suspend();
-	/* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
+	/* Timer related operations can raise TIMER_SOFTIRQ. Process it now. */
 	process_pending_softirqs();
 
 	/* Interrupts must be disabled for C2 and higher transitions. */
@@ -756,6 +756,7 @@ static void mwait_idle(void)
 		local_irq_enable();
 		sched_tick_resume();
 		cpufreq_dbs_timer_resume();
+		rcu_idle_timer_stop();
 		return;
 	}
 
@@ -802,6 +803,7 @@ static void mwait_idle(void)
 
 	sched_tick_resume();
 	cpufreq_dbs_timer_resume();
+	rcu_idle_timer_stop();
 
 	if ( cpuidle_current_governor->reflect )
 		cpuidle_current_governor->reflect(power);
diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index f0fdc87..4586f2a 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -84,8 +84,14 @@ struct rcu_data {
     int cpu;
     struct rcu_head barrier;
     long            last_rs_qlen;     /* qlen during the last resched */
+
+    /* 3) idle CPUs handling */
+    struct timer idle_timer;
+    bool idle_timer_active;
 };
 
+#define RCU_IDLE_TIMER_PERIOD MILLISECS(10)
+
 static DEFINE_PER_CPU(struct rcu_data, rcu_data);
 
 static int blimit = 10;
@@ -402,7 +408,48 @@ int rcu_needs_cpu(int cpu)
 {
     struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 
-    return (!!rdp->curlist || rcu_pending(cpu));
+    return (!!rdp->curlist || rcu_pending(cpu)) && !rdp->idle_timer_active;
+}
+
+/*
+ * Timer for making sure the CPU where a callback is queued
+ * periodically polls rcu_pending(), so that it invokes the callback
+ * not too long after the end of the grace period.
+ */
+void rcu_idle_timer_start(void)
+{
+    struct rcu_data *rdp = &this_cpu(rcu_data);
+
+    if (likely(!rdp->curlist))
+        return;
+
+    set_timer(&rdp->idle_timer, NOW() + RCU_IDLE_TIMER_PERIOD);
+    rdp->idle_timer_active = true;
+}
+
+void rcu_idle_timer_stop(void)
+{
+    struct rcu_data *rdp = &this_cpu(rcu_data);
+
+    if (likely(!rdp->idle_timer_active))
+        return;
+
+    rdp->idle_timer_active = false;
+    stop_timer(&rdp->idle_timer);
+}
+
+static void rcu_idle_timer_handler(void *data)
+{
+    /*
+     * Nothing, really... And in fact, we don't expect to ever get in here,
+     * as rcu_idle_timer_stop(), called while waking from idle, prevents
+     * that by stopping the timer before the TIMER_SOFTIRQ handler has
+     * a chance to run.
+     *
+     * But that's fine, because all we want is the CPU that needs to execute
+     * the callback to be periodically woken up and check rcu_pending().
+     */
+    ASSERT_UNREACHABLE();
 }
 
 void rcu_check_callbacks(int cpu)
@@ -423,6 +470,8 @@ static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
 static void rcu_offline_cpu(struct rcu_data *this_rdp,
                             struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
 {
+    kill_timer(&rdp->idle_timer);
+
     /* If the cpu going offline owns the grace period we can block
      * indefinitely waiting for it, so flush it here.
      */
@@ -451,6 +500,7 @@ static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
     rdp->qs_pending = 0;
     rdp->cpu = cpu;
     rdp->blimit = blimit;
+    init_timer(&rdp->idle_timer, rcu_idle_timer_handler, (void*) rdp, cpu);
 }
 
 static int cpu_callback(
diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index 561ac43..3402eb5 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -149,4 +149,7 @@ int rcu_barrier(void);
 void rcu_idle_enter(unsigned int cpu);
 void rcu_idle_exit(unsigned int cpu);
 
+void rcu_idle_timer_start(void);
+void rcu_idle_timer_stop(void);
+
 #endif /* __XEN_RCUPDATE_H */


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/5] xen: ARM: suspend the tick (if in use) when going idle.
  2017-07-27  8:01 ` [PATCH 2/5] xen: ARM: suspend the tick (if in use) when going idle Dario Faggioli
@ 2017-07-31 20:59   ` Stefano Stabellini
  2017-08-01  8:53     ` Julien Grall
  0 siblings, 1 reply; 38+ messages in thread
From: Stefano Stabellini @ 2017-07-31 20:59 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Julien Grall, Stefano Stabellini

On Thu, 27 Jul 2017, Dario Faggioli wrote:
> Since commit 964fae8ac ("cpuidle: suspend/resume scheduler
> tick timer during cpu idle state entry/exit"), if a scheduler
> has a periodic tick timer, we stop it when going idle.
> 
> This, however, is only true for x86. Make it true for ARM as
> well.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>


> ---
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Julien Grall <julien.grall@arm.com>
> ---
>  xen/arch/arm/domain.c |   29 ++++++++++++++++++++---------
>  1 file changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> index 2dc8b0a..fce29cb 100644
> --- a/xen/arch/arm/domain.c
> +++ b/xen/arch/arm/domain.c
> @@ -39,6 +39,25 @@
>  
>  DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
>  
> +static void do_idle(void)
> +{
> +    unsigned int cpu = smp_processor_id();
> +
> +    sched_tick_suspend();
> +    /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
> +    process_pending_softirqs();
> +
> +    local_irq_disable();
> +    if ( cpu_is_haltable(cpu) )
> +    {
> +        dsb(sy);
> +        wfi();
> +    }
> +    local_irq_enable();
> +
> +    sched_tick_resume();
> +}
> +
>  void idle_loop(void)
>  {
>      unsigned int cpu = smp_processor_id();
> @@ -52,15 +71,7 @@ void idle_loop(void)
>          if ( unlikely(tasklet_work_to_do(cpu)) )
>              do_tasklet();
>          else
> -        {
> -            local_irq_disable();
> -            if ( cpu_is_haltable(cpu) )
> -            {
> -                dsb(sy);
> -                wfi();
> -            }
> -            local_irq_enable();
> -        }
> +            do_idle();
>  
>          do_softirq();
>          /*
> 


* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-07-27  8:01 ` [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started Dario Faggioli
@ 2017-07-31 21:17   ` Stefano Stabellini
  2017-08-07  8:35   ` Jan Beulich
  2017-08-09 11:38   ` Tim Deegan
  2 siblings, 0 replies; 38+ messages in thread
From: Stefano Stabellini @ 2017-07-31 21:17 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Jan Beulich, Andrew Cooper

On Thu, 27 Jul 2017, Dario Faggioli wrote:
> Xen is a tickless (micro-)kernel. This means that, when a CPU
> becomes idle, we stop all the activity on it, including any
> periodic tick or timer.
> 
> When we imported RCU from Linux, Linux (x86) was a ticking
> kernel, i.e., there was a periodic timer tick always running,
> even on totally idle CPUs. This was bad from a power efficiency
> perspective, but it's what made it possible to monitor the
> quiescent states of all the CPUs, and hence tell when an RCU
> grace period ends.
> 
> In Xen, that is impossible, and that's particularly problematic
> on idle (or lightly loaded) systems, as CPUs that
> are idle may never have the chance to tell RCU about their
> quiescence, and grace periods could extend indefinitely!
> 
> This has led, on x86, to long (and unpredictable) delays between
> RCU callback queueing and invocation. On ARM, we actually see
> infinite grace periods (e.g., complete_domain_destroy() may
> never be actually invoked on an idle system). See here:
> 
>  https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg02454.html
> 
> The first step for fixing this situation is for RCU to record,
> at the beginning of a grace period, which CPUs are already idle.
> In fact, being idle, they can't be in the middle of any read-side
> critical section, and we don't have to wait for them to declare
> a grace period finished.
> 
> This is tracked in a cpumask, in a way that is very similar to
> how Linux also was achieving the same on s390 --which indeed was
> tickless already, even back then-- and to what it started to do
> for x86, from 2.6.21 on (see commit 79bf2bb3 "tick-management:
> dyntick / highres functionality").
> 
> While there, also adopt the memory barrier introduced by Linux
> commit c3f59023 ("Fix RCU race in access of nohz_cpu_mask").
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Julien Grall <julien.grall@arm.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  xen/arch/arm/domain.c         |    2 ++
>  xen/arch/x86/acpi/cpu_idle.c  |   25 +++++++++++++++++--------
>  xen/arch/x86/cpu/mwait-idle.c |    9 ++++++++-
>  xen/arch/x86/domain.c         |    8 +++++++-
>  xen/common/rcupdate.c         |   28 ++++++++++++++++++++++++++--
>  xen/include/xen/rcupdate.h    |    3 +++
>  6 files changed, 63 insertions(+), 12 deletions(-)
> 
> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> index fce29cb..666b7ef 100644
> --- a/xen/arch/arm/domain.c
> +++ b/xen/arch/arm/domain.c
> @@ -50,8 +50,10 @@ static void do_idle(void)
>      local_irq_disable();
>      if ( cpu_is_haltable(cpu) )
>      {
> +        rcu_idle_enter(cpu);
>          dsb(sy);
>          wfi();
> +        rcu_idle_exit(cpu);
>      }
>      local_irq_enable();
>  
> diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
> index 8cc5a82..f0fdc87 100644
> --- a/xen/common/rcupdate.c
> +++ b/xen/common/rcupdate.c
> @@ -52,7 +52,8 @@ static struct rcu_ctrlblk {
>      int  next_pending;  /* Is the next batch already waiting?         */
>  
>      spinlock_t  lock __cacheline_aligned;
> -    cpumask_t   cpumask; /* CPUs that need to switch in order    */
> +    cpumask_t   cpumask; /* CPUs that need to switch in order ... */
> +    cpumask_t   idle_cpumask; /* ... unless they are already idle */
>      /* for current batch to proceed.        */
>  } __cacheline_aligned rcu_ctrlblk = {
>      .cur = -300,
> @@ -248,7 +249,14 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
>          smp_wmb();
>          rcp->cur++;
>  
> -        cpumask_copy(&rcp->cpumask, &cpu_online_map);
> +       /*
> +        * Accessing idle_cpumask before incrementing rcp->cur needs a
> +        * Barrier  Otherwise it can cause tickless idle CPUs to be
                    ^ otherwise


> +        * included in rcp->cpumask, which will extend graceperiods
> +        * unnecessarily.
> +        */

It doesn't look like this comment applies to this code: we are accessing
idle_cpumask after rcp->cur here. Unless you meant "Accessing
idle_cpumask *after* incrementing rcp->cur."

Also could you please add a pointer to the other barrier in the pair
(barriers always go in pair, for example I think the smp_wmb() above in
rcu_start_batch is matched by the smp_rmb() in __rcu_process_callbacks.)


> +        smp_mb();
> +        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
>      }
>  }
>  
> @@ -474,7 +482,23 @@ static struct notifier_block cpu_nfb = {
>  void __init rcu_init(void)
>  {
>      void *cpu = (void *)(long)smp_processor_id();
> +
> +    cpumask_setall(&rcu_ctrlblk.idle_cpumask);
>      cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
>      register_cpu_notifier(&cpu_nfb);
>      open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
>  }
> +
> +/*
> + * The CPU is becoming idle, so no more read side critical
> + * sections, and one more step toward grace period.
> + */
> +void rcu_idle_enter(unsigned int cpu)
> +{
> +    cpumask_set_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
> +}
> +
> +void rcu_idle_exit(unsigned int cpu)
> +{
> +    cpumask_clear_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
> +}
> diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
> index 557a7b1..561ac43 100644
> --- a/xen/include/xen/rcupdate.h
> +++ b/xen/include/xen/rcupdate.h
> @@ -146,4 +146,7 @@ void call_rcu(struct rcu_head *head,
>  
>  int rcu_barrier(void);
>  
> +void rcu_idle_enter(unsigned int cpu);
> +void rcu_idle_exit(unsigned int cpu);
> +
>  #endif /* __XEN_RCUPDATE_H */
> 


* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-07-27  8:01 ` [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period Dario Faggioli
@ 2017-07-31 21:20   ` Stefano Stabellini
  2017-07-31 22:03     ` Dario Faggioli
  2017-08-01  8:54   ` Julien Grall
  2017-08-07  8:54   ` Jan Beulich
  2 siblings, 1 reply; 38+ messages in thread
From: Stefano Stabellini @ 2017-07-31 21:20 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Jan Beulich, Andrew Cooper

On Thu, 27 Jul 2017, Dario Faggioli wrote:
> Instead of having the CPU where a callback is queued, busy
> looping on rcu_pending(), use a timer.
> 
> In fact, we let the CPU go idla,e but we program a timer
                              ^ idle,


> that will periodically wake it up, for checking whether the
> grace period has actually ended.
> 
> It is kind of similar to introducing a periodic tick, but
> with a much more limited scope, and a lot less overhead. In
> fact, this timer is:
> - only active for the CPU(s) that have callbacks queued,
>   waiting for the end of a grace period;
> - only active when those CPU(s) are idle (and stopped as
>   soon as they resume execution).
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Julien Grall <julien.grall@arm.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  xen/arch/arm/domain.c         |    4 ++-
>  xen/arch/x86/acpi/cpu_idle.c  |    6 +++--
>  xen/arch/x86/cpu/mwait-idle.c |    6 +++--
>  xen/common/rcupdate.c         |   52 ++++++++++++++++++++++++++++++++++++++++-
>  xen/include/xen/rcupdate.h    |    3 ++
>  5 files changed, 65 insertions(+), 6 deletions(-)
> 
> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> index 666b7ef..01da96e 100644
> --- a/xen/arch/arm/domain.c
> +++ b/xen/arch/arm/domain.c
> @@ -43,8 +43,9 @@ static void do_idle(void)
>  {
>      unsigned int cpu = smp_processor_id();
>  
> +    rcu_idle_timer_start();
>      sched_tick_suspend();
> -    /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
> +    /* Timer related operations can raise TIMER_SOFTIRQ. Process it now. */
>      process_pending_softirqs();
>  
>      local_irq_disable();
> @@ -58,6 +59,7 @@ static void do_idle(void)
>      local_irq_enable();
>  
>      sched_tick_resume();
> +    rcu_idle_timer_stop();
>  }
>  
>  void idle_loop(void)
> diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
> index 04c52e8..b97986f 100644
> --- a/xen/arch/x86/acpi/cpu_idle.c
> +++ b/xen/arch/x86/acpi/cpu_idle.c
> @@ -576,10 +576,10 @@ static void acpi_processor_idle(void)
>          return;
>      }
>  
> +    rcu_idle_timer_start();
>      cpufreq_dbs_timer_suspend();
> -
>      sched_tick_suspend();
> -    /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
> +    /* Timer related operations can raise TIMER_SOFTIRQ. Process it now. */
>      process_pending_softirqs();
>  
>      /*
> @@ -593,6 +593,7 @@ static void acpi_processor_idle(void)
>          local_irq_enable();
>          sched_tick_resume();
>          cpufreq_dbs_timer_resume();
> +        rcu_idle_timer_stop();
>          return;
>      }
>  
> @@ -726,6 +727,7 @@ static void acpi_processor_idle(void)
>  
>      sched_tick_resume();
>      cpufreq_dbs_timer_resume();
> +    rcu_idle_timer_stop();
>  
>      if ( cpuidle_current_governor->reflect )
>          cpuidle_current_governor->reflect(power);
> diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
> index ae9e92b..c426e41 100644
> --- a/xen/arch/x86/cpu/mwait-idle.c
> +++ b/xen/arch/x86/cpu/mwait-idle.c
> @@ -743,10 +743,10 @@ static void mwait_idle(void)
>  		return;
>  	}
>  
> +	rcu_idle_timer_start();
>  	cpufreq_dbs_timer_suspend();
> -
>  	sched_tick_suspend();
> -	/* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
> +	/* Timer related operations can raise TIMER_SOFTIRQ. Process it now. */
>  	process_pending_softirqs();
>  
>  	/* Interrupts must be disabled for C2 and higher transitions. */
> @@ -756,6 +756,7 @@ static void mwait_idle(void)
>  		local_irq_enable();
>  		sched_tick_resume();
>  		cpufreq_dbs_timer_resume();
> +                rcu_idle_timer_stop();
>  		return;
>  	}
>  
> @@ -802,6 +803,7 @@ static void mwait_idle(void)
>  
>  	sched_tick_resume();
>  	cpufreq_dbs_timer_resume();
> +	rcu_idle_timer_stop();
>  
>  	if ( cpuidle_current_governor->reflect )
>  		cpuidle_current_governor->reflect(power);
> diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
> index f0fdc87..4586f2a 100644
> --- a/xen/common/rcupdate.c
> +++ b/xen/common/rcupdate.c
> @@ -84,8 +84,14 @@ struct rcu_data {
>      int cpu;
>      struct rcu_head barrier;
>      long            last_rs_qlen;     /* qlen during the last resched */
> +
> +    /* 3) idle CPUs handling */
> +    struct timer idle_timer;
> +    bool idle_timer_active;
>  };
>  
> +#define RCU_IDLE_TIMER_PERIOD MILLISECS(10)

Isn't this a bit too short? How is it chosen?


>  static DEFINE_PER_CPU(struct rcu_data, rcu_data);
>  
>  static int blimit = 10;
> @@ -402,7 +408,48 @@ int rcu_needs_cpu(int cpu)
>  {
>      struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
>  
> -    return (!!rdp->curlist || rcu_pending(cpu));
> +    return (!!rdp->curlist || rcu_pending(cpu)) && !rdp->idle_timer_active;
> +}
> +
> +/*
> + * Timer for making sure the CPU where a callback is queued
> + * periodically polls rcu_pending(), so that it will invoke the callback
> + * not too late after the end of the grace period.
> + */
> +void rcu_idle_timer_start()
> +{
> +    struct rcu_data *rdp = &this_cpu(rcu_data);
> +
> +    if (likely(!rdp->curlist))
> +        return;
> +
> +    set_timer(&rdp->idle_timer, NOW() + RCU_IDLE_TIMER_PERIOD);
> +    rdp->idle_timer_active = true;
> +}
> +
> +void rcu_idle_timer_stop()
> +{
> +    struct rcu_data *rdp = &this_cpu(rcu_data);
> +
> +    if (likely(!rdp->idle_timer_active))
> +        return;
> +
> +    rdp->idle_timer_active = false;
> +    stop_timer(&rdp->idle_timer);
> +}
> +
> +static void rcu_idle_timer_handler(void* data)
> +{
> +    /*
> +     * Nothing, really... And in fact, we don't expect to ever get in here,
> +     * as rcu_idle_timer_stop(), called while waking from idle, prevents
> +     * that from happening by stopping the timer before the TIMER_SOFTIRQ handler has
> +     * a chance to run.
> +     *
> +     * But that's fine, because all we want is the CPU that needs to execute
> +     * the callback to be periodically woken up and check rcu_pending().
> +     */
> +    ASSERT_UNREACHABLE();
>  }
>  
>  void rcu_check_callbacks(int cpu)
> @@ -423,6 +470,8 @@ static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
>  static void rcu_offline_cpu(struct rcu_data *this_rdp,
>                              struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
>  {
> +    kill_timer(&rdp->idle_timer);
> +
>      /* If the cpu going offline owns the grace period we can block
>       * indefinitely waiting for it, so flush it here.
>       */
> @@ -451,6 +500,7 @@ static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
>      rdp->qs_pending = 0;
>      rdp->cpu = cpu;
>      rdp->blimit = blimit;
> +    init_timer(&rdp->idle_timer, rcu_idle_timer_handler, (void*) rdp, cpu);
>  }
>  
>  static int cpu_callback(
> diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
> index 561ac43..3402eb5 100644
> --- a/xen/include/xen/rcupdate.h
> +++ b/xen/include/xen/rcupdate.h
> @@ -149,4 +149,7 @@ int rcu_barrier(void);
>  void rcu_idle_enter(unsigned int cpu);
>  void rcu_idle_exit(unsigned int cpu);
>  
> +void rcu_idle_timer_start(void);
> +void rcu_idle_timer_stop(void);
> +
>  #endif /* __XEN_RCUPDATE_H */
> 


* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-07-31 21:20   ` Stefano Stabellini
@ 2017-07-31 22:03     ` Dario Faggioli
  2017-07-31 23:58       ` Stefano Stabellini
  0 siblings, 1 reply; 38+ messages in thread
From: Dario Faggioli @ 2017-07-31 22:03 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Julien Grall, Jan Beulich, Andrew Cooper



On Mon, 2017-07-31 at 14:20 -0700, Stefano Stabellini wrote:
> On Thu, 27 Jul 2017, Dario Faggioli wrote:
> > 
> > diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
> > index f0fdc87..4586f2a 100644
> > --- a/xen/common/rcupdate.c
> > +++ b/xen/common/rcupdate.c
> > @@ -84,8 +84,14 @@ struct rcu_data {
> >      int cpu;
> >      struct rcu_head barrier;
> >      long            last_rs_qlen;     /* qlen during the last
> > resched */
> > +
> > +    /* 3) idle CPUs handling */
> > +    struct timer idle_timer;
> > +    bool idle_timer_active;
> >  };
> >  
> > +#define RCU_IDLE_TIMER_PERIOD MILLISECS(10)
> 
> Isn't this a bit too short? How is it chosen?
> 
It's totally arbitrary (and that would be the case for whatever value
we choose).

Basically, it's how long, at worst, after the actual end of a grace
period, a (batch of) callback(s) will be invoked. Currently, on Credit1
on ARM (without my patch, from this series, that suspends the tick)
that's (by chance) 30 ms (or whatever value is chosen for Credit1
timeslice). On Credit2 (on both ARM and x86), it's never, but on x86 it
(apparently) is 'however frequently time sync rendezvouses happen' (which
I don't recall, but it's longer), while on ARM it is (potentially) never.

I accept suggestions about alternatives values, and I'm certainly fine
with adding a comment, containing something along the lines of the
explanation above, but I fear it's going to be hard to figure out what
value is actually the "absolute best".

In Linux (which is where the same 'callback book-keeping' happens for
them), a tick with a frequency of 1000Hz (== 1ms) is considered 'low-
latency/Desktop/real-time'. For us, as said above, tick --when it's
there-- would be 30ms by default.

I just went with something in the middle.

Also, it's not that we'll have a 10ms periodic timer going on for
significant amount of time. In fact we expect it to actually fire just
once (for each grace period). It's not 100% guaranteed that it won't be
reprogrammed and fire a couple of times, but it should not, in the vast
majority of cases.

What makes you think it's short?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-07-31 22:03     ` Dario Faggioli
@ 2017-07-31 23:58       ` Stefano Stabellini
  2017-08-01  0:47         ` Dario Faggioli
  2017-08-01  8:45         ` Julien Grall
  0 siblings, 2 replies; 38+ messages in thread
From: Stefano Stabellini @ 2017-07-31 23:58 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Jan Beulich, Andrew Cooper


On Tue, 1 Aug 2017, Dario Faggioli wrote:
> On Mon, 2017-07-31 at 14:20 -0700, Stefano Stabellini wrote:
> > On Thu, 27 Jul 2017, Dario Faggioli wrote:
> > > 
> > > diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
> > > index f0fdc87..4586f2a 100644
> > > --- a/xen/common/rcupdate.c
> > > +++ b/xen/common/rcupdate.c
> > > @@ -84,8 +84,14 @@ struct rcu_data {
> > >      int cpu;
> > >      struct rcu_head barrier;
> > >      long            last_rs_qlen;     /* qlen during the last
> > > resched */
> > > +
> > > +    /* 3) idle CPUs handling */
> > > +    struct timer idle_timer;
> > > +    bool idle_timer_active;
> > >  };
> > >  
> > > +#define RCU_IDLE_TIMER_PERIOD MILLISECS(10)
> > 
> > Isn't this a bit too short? How is it chosen?
> > 
> It's totally arbitrary (and that would be the case for whatever value
> we choose).
> 
> Basically, it's how long, at worst, after the actual end of a grace
> period, a (batch of) callback(s) will be invoked. Currently, on Credit1
> on ARM (without my patch, from this series, that suspends the tick)
> that's (by chance) 30 ms (or whatever value is chosen for Credit1
> timeslice). On Credit2 (on both ARM and x86), it's never, but on x86 it
> (apparently) is 'however frequently time sync rendezvouses happen' (which
> I don't recall, but it's longer), while on ARM it is (potentially) never.
> 
> I accept suggestions about alternatives values, and I'm certainly fine
> with adding a comment, containing something along the lines of the
> explanation above, but I fear it's going to be hard to figure out what
> value is actually the "absolute best".
> 
> In Linux (which is where the same 'callback book-keeping' happens for
> them), a tick with a frequency of 1000Hz (== 1ms) is considered 'low-
> latency/Desktop/real-time'. For us, as said above, tick --when it's
> there-- would be 30ms by default.
> 
> I just went with something in the middle.
> 
> Also, it's not that we'll have a 10ms periodic timer going on for
> significant amount of time. In fact we expect it to actually fire just
> once (for each grace period). It's not 100% guaranteed that it won't be
> reprogrammed and fire a couple of times, but it should not, in the vast
> majority of cases.
> 
> What makes you think it's short?

In terms of power saving and CPU sleep states, 10ms is not much to sleep
for. I wonder if there are any power saving benefits in sleeping for
only 10ms (especially on x86 where entering and exiting CPU sleep states
takes longer, to be confirmed).  We might as well do the thing we need
to do immediately? I guess we cannot do that?


* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-07-31 23:58       ` Stefano Stabellini
@ 2017-08-01  0:47         ` Dario Faggioli
  2017-08-01 19:13           ` Stefano Stabellini
  2017-08-01  8:45         ` Julien Grall
  1 sibling, 1 reply; 38+ messages in thread
From: Dario Faggioli @ 2017-08-01  0:47 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Julien Grall, Jan Beulich, Andrew Cooper



On Mon, 2017-07-31 at 16:58 -0700, Stefano Stabellini wrote:
> On Tue, 1 Aug 2017, Dario Faggioli wrote:
> > On Mon, 2017-07-31 at 14:20 -0700, Stefano Stabellini wrote:
> > > On Thu, 27 Jul 2017, Dario Faggioli wrote:
> > > > 
> > > > diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
> > > > index f0fdc87..4586f2a 100644
> > > > --- a/xen/common/rcupdate.c
> > > > +++ b/xen/common/rcupdate.c
> > > > @@ -84,8 +84,14 @@ struct rcu_data {
> > > >      int cpu;
> > > >      struct rcu_head barrier;
> > > >      long            last_rs_qlen;     /* qlen during the last
> > > > resched */
> > > > +
> > > > +    /* 3) idle CPUs handling */
> > > > +    struct timer idle_timer;
> > > > +    bool idle_timer_active;
> > > >  };
> > > >  
> > > > +#define RCU_IDLE_TIMER_PERIOD MILLISECS(10)
> > > 
> > > Isn't this a bit too short? How is it chosen?
> > > 
> > 
> > What makes you think it's short?
> 
> In terms of power saving and CPU sleep states, 10ms is not much to
> sleep
> for. I wonder if there are any power saving benefits in sleeping for
> only 10ms (especially on x86 where entering and exiting CPU sleep
> states
> takes longer, to be confirmed).
>
I *think* we should be fine with, say, 100ms. But that's, again, a
guess/rule of thumb, nothing more than that. And, TBH, I'm not even
sure what a good experiment/benchmark would be, to assess whether a
particular value is good or bad. :-/

>   We might as well do the thing we need
> to do immediately? I guess we cannot do that?
>
You're guessing correctly, we can't. It's indeed a bit tricky, and it
took me a little while as well to figure it all out properly.

Basically, let's say that, at time t1, on CPU1, someone calls
call_rcu(). The situation on the other CPUs is: CPU0 busy; CPU2 idle
(no timer pending); CPU3 busy.

So, a new grace period starts, and its exact end will be when CPU0,
CPU1 and CPU3 have quiesced once (in Xen, quiescence means: "going
through do_softirq()").

But RCU is a passive mechanism, i.e., we rely on each CPU coming to
the RCU core logic, and tell <<hey, btw, I've quiesced>>.
Let's say that CPU0 quiesces at time t2 > t1. CPU1 quiesces at time
t3 > t2, and goes fully idle (no timer pending). CPU3 quiesces at time
t4 > t3. Now, at time t4, CPU1 can actually invoke the callback, queued
at time t1, from within call_rcu().

This patch series solves two problems, of our current RCU
implementation:

1) right now, we don't only wait for CPU0, CPU1 and CPU3, we also wait 
   for CPU2. But since CPU2 is fully idle, it just won't bother 
   telling RCU that it has quiesced (well, on x86, that would actually 
   happen at some point, while on ARM, it is really possible that this 
   never happens!). This is solved in patch 3, by introducing the 
   cpumask;

2) now (after patch 3) we know that we just can avoid waiting for 
   CPU2. Good. But at time t4, when CPU3 quiesces, marking the end of
   the grace period, how would CPU1 --which is fully idle-- know that
   it can now safely invoke the callback? Again, RCU is passive, so it
   relies on CPU1 to figure that out on its own, next time it wakes
   up, e.g., because of the periodic tick timer. But we don't have a
   periodic tick timer! Well, this again means that, on x86, CPU1 will
   actually figure that out at some (unpredictable) point in future.
   On ARM, not so much. The purpose of the timer in this patch is to
   make sure it always will.
   In fact, with patches 4 and 5 applied, at time t3, we let CPU1 go 
   idle, but we program the timer. It will fire at t3+T (with T=10ms, 
   right now). When this happens, if t3+T > t4, CPU1 invokes the
   callback, and we're done. If not (and CPU1 is still idle) we retry
   in another T.

So, when you say "immediately", *when* do you actually mean?

We can't invoke the callback at t3 (i.e., immediately after CPU1
quiesces), because we need to wait for CPU3 to do the same.

We can't invoke the callback when CPU3 quiesces, at t4 (i.e.,
immediately after the grace period ends) either, because the callback
it's on CPU1, not on CPU3.

Linux's solution is not to stop the tick for CPU1 at time t3. It will
be stopped only after the first firing of the tick itself at time
t > t4 (if CPU1 is still idle, of course). This timer thing is, IMO,
pretty similar. The only difference is that we don't have a tick to not
stop... So I had to make up a very special one. :-D

TL;DR, yes, I also would have loved the world (or even just this
problem) to be simple. It's not! ;-P

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-07-31 23:58       ` Stefano Stabellini
  2017-08-01  0:47         ` Dario Faggioli
@ 2017-08-01  8:45         ` Julien Grall
  1 sibling, 0 replies; 38+ messages in thread
From: Julien Grall @ 2017-08-01  8:45 UTC (permalink / raw)
  To: Stefano Stabellini, Dario Faggioli
  Cc: xen-devel, nd, Jan Beulich, Andrew Cooper



On 01/08/2017 00:58, Stefano Stabellini wrote:
> On Tue, 1 Aug 2017, Dario Faggioli wrote:
>> On Mon, 2017-07-31 at 14:20 -0700, Stefano Stabellini wrote:
>>> On Thu, 27 Jul 2017, Dario Faggioli wrote:
>>>>
>>>> diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
>>>> index f0fdc87..4586f2a 100644
>>>> --- a/xen/common/rcupdate.c
>>>> +++ b/xen/common/rcupdate.c
>>>> @@ -84,8 +84,14 @@ struct rcu_data {
>>>>      int cpu;
>>>>      struct rcu_head barrier;
>>>>      long            last_rs_qlen;     /* qlen during the last
>>>> resched */
>>>> +
>>>> +    /* 3) idle CPUs handling */
>>>> +    struct timer idle_timer;
>>>> +    bool idle_timer_active;
>>>>  };
>>>>
>>>> +#define RCU_IDLE_TIMER_PERIOD MILLISECS(10)
>>>
>>> Isn't this a bit too short? How is it chosen?
>>>
>> It's totally arbitrary (and that would be the case for whatever value
>> we choose).
>>
>> Basically, it's how long, at worst, after the actual end of a grace
>> period, a (batch of) callback(s) will be invoked. Currently, on Credit1
>> on ARM (without my patch, from this series, that suspends the tick)
>> that's (by chance) 30 ms (or whatever value is chosen for Credit1
>> timeslice). On Credit2 (on both ARM and x86), it's never, but on x86 it
>> (apparently) is 'however frequently the time sync rendezvous happens' (which
>> I don't recall, but it's longer), while on ARM it is (potentially) never.
>>
>> I accept suggestions about alternatives values, and I'm certainly fine
>> with adding a comment, containing something along the lines of the
>> explanation above, but I fear it's going to be hard to figure out what
>> value is actually the "absolute best".
>>
>> In Linux (which is where the same 'callback book-keeping' happens for
>> them), a tick with a frequency of 1000Hz (== 1ms) is considered 'low-
>> latency/Desktop/real-time'. For us, as said above, the tick --when it's
>> there-- would be 30ms by default.
>>
>> I just went with something in the middle.
>>
>> Also, it's not that we'll have a 10ms periodic timer going on for
>> significant amount of time. In fact we expect it to actually fire just
>> once (for each grace period). It's not 100% guaranteed that it won't be
>> reprogrammed and fire a couple of times, but it should not, in the vast
>> majority of cases.
>>
>> What makes you think it's short?
>
> In terms of power saving and CPU sleep states, 10ms is not much to sleep
> for. I wonder if there are any power saving benefits in sleeping for
> only 10ms (especially on x86 where entering and exiting CPU sleep states
> takes longer, to be confirmed).  We might as well do the thing we need
> to do immediately? I guess we cannot do that?

Given that on ARM we use WFI to sleep today, I think we can already 
get some power saving benefit even for 10ms. It will depend on the 
processor, though.

Cheers,

-- 
Julien Grall


* Re: [PATCH 2/5] xen: ARM: suspend the tick (if in use) when going idle.
  2017-07-31 20:59   ` Stefano Stabellini
@ 2017-08-01  8:53     ` Julien Grall
  2017-08-01  9:26       ` Dario Faggioli
  0 siblings, 1 reply; 38+ messages in thread
From: Julien Grall @ 2017-08-01  8:53 UTC (permalink / raw)
  To: Stefano Stabellini, Dario Faggioli; +Cc: xen-devel, nd

Hi Stefano,

On 31/07/2017 21:59, Stefano Stabellini wrote:
> On Thu, 27 Jul 2017, Dario Faggioli wrote:
>> Since commit 964fae8ac ("cpuidle: suspend/resume scheduler
>> tick timer during cpu idle state entry/exit"), if a scheduler
>> has a periodic tick timer, we stop it when going idle.
>>
>> This, however, is only true for x86. Make it true for ARM as
>> well.
>>
>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

This patch looks standalone, but please don't commit this patch without 
the rest of the series. Otherwise, we will introduce regression in Xen 
with credit1 as well.

Cheers,

-- 
Julien Grall


* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-07-27  8:01 ` [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period Dario Faggioli
  2017-07-31 21:20   ` Stefano Stabellini
@ 2017-08-01  8:54   ` Julien Grall
  2017-08-01  9:17     ` Dario Faggioli
  2017-08-07  8:54   ` Jan Beulich
  2 siblings, 1 reply; 38+ messages in thread
From: Julien Grall @ 2017-08-01  8:54 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Andrew Cooper, nd, Stefano Stabellini, Jan Beulich

Hi Dario,

On 27/07/2017 09:01, Dario Faggioli wrote:
> Instead of having the CPU where a callback is queued, busy
> looping on rcu_pending(), use a timer.
>
> In fact, we let the CPU go idle, but we program a timer
> that will periodically wake it up, for checking whether the
> grace period has actually ended.
>
> It is kind of similar to introducing a periodic tick, but
> with a much more limited scope, and a lot less overhead. In
> fact, this timer is:
> - only active for the CPU(s) that have callbacks queued,
>   waiting for the end of a grace period;
> - only active when those CPU(s) are idle (and stopped as
>   soon as they resume execution).

If I read this correctly, it means on ARM the idling will now get 
interrupted periodically. This is a bit unfortunate, given that if you 
have a CPU doing nothing, you would still interrupt it intermittently.

I was expecting that we could remove the CPU from the RCU whilst it is 
idle. Is there any reason for not doing that?

Cheers,

-- 
Julien Grall


* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-08-01  8:54   ` Julien Grall
@ 2017-08-01  9:17     ` Dario Faggioli
  2017-08-01 10:22       ` Julien Grall
  0 siblings, 1 reply; 38+ messages in thread
From: Dario Faggioli @ 2017-08-01  9:17 UTC (permalink / raw)
  To: Julien Grall, xen-devel
  Cc: Andrew Cooper, nd, Stefano Stabellini, Jan Beulich


On Tue, 2017-08-01 at 09:54 +0100, Julien Grall wrote:
> Hi Dario,
> 
> On 27/07/2017 09:01, Dario Faggioli wrote:
> > Instead of having the CPU where a callback is queued, busy
> > looping on rcu_pending(), use a timer.
> > 
> > In fact, we let the CPU go idle, but we program a timer
> > that will periodically wake it up, for checking whether the
> > grace period has actually ended.
> > 
> > It is kind of similar to introducing a periodic tick, but
> > with a much more limited scope, and a lot less overhead. In
> > fact, this timer is:
> > - only active for the CPU(s) that have callbacks queued,
> >   waiting for the end of a grace period;
> > - only active when those CPU(s) are idle (and stopped as
> >   soon as they resume execution).
> 
> If I read this correctly, it means on ARM the idling will now get 
> interrupted periodically. This is a bit unfortunate, given that if
> you 
> have a CPU doing nothing, you would still interrupt it
> intermittently.
> 
Not really periodically, not always, at least. What this really means
is that a CPU that is idle, *but* have pending RCU callbacks, will be
interrupted periodically to see if the grace period ended, so it can
invoke the callbacks.

As soon as this (callbacks being invoked) will have happened, we won't
interrupt it any longer.

And idle CPUs _without_ queued RCU callbacks, won't be interrupted at
all.
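The arming/stopping rule Dario describes can be sketched as a tiny, self-contained model. The names here (idle_enter/idle_exit, the flags) are hypothetical simplifications for illustration, not the actual Xen code from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of one CPU's RCU-related idle state. */
struct cpu_rcu_state {
    bool has_queued_callbacks;  /* callbacks waiting for the grace period to end */
    bool idle_timer_active;     /* the new periodic idle timer is armed */
};

/* On idle entry: arm the timer only if this CPU still has callbacks
 * queued, i.e. only if it must periodically re-check whether the
 * grace period has ended. */
static void idle_enter(struct cpu_rcu_state *c)
{
    if (c->has_queued_callbacks)
        c->idle_timer_active = true;
}

/* On idle exit (or once the callbacks have been invoked): stop it,
 * so a running or callback-free CPU is never interrupted by it. */
static void idle_exit(struct cpu_rcu_state *c)
{
    c->idle_timer_active = false;
}
```

An idle CPU without queued callbacks never gets the timer armed, which is exactly the clarification above.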

> I was expecting that we could remove the CPU from the RCU whilst it
> is 
> idle. Is there any reason for not doing that?
> 
I'm not sure I understand what you mean here. I tried to explain as
well as I could how this works, and why I think it can't work in other
ways, in this reply to Stefano: <1501548445.30551.5.camel@citrix.com>

Every CPU that participates in the grace period, and has already
quiesced, is "removed from RCU", and hence, when it becomes idle, it is
never interrupted (by this timer). With the only exception of the
CPU(s) that has queued callbacks.

We simply can't forget about these CPUs, even if they go idle. If we
do, the callbacks will never be invoked (or will only be invoked when
the CPU becomes active again, which may happen really really late,
which is part of the reason why we're seeing the bug we're seeing).

Linux does this, in the *exact* same way (well, actually, in a totally
different way, from an implementation point of view, but the effect is
indeed exactly the same):

See here:

http://elixir.free-electrons.com/linux/v2.6.21/source/kernel/time/tick-sched.c#L198

We're in tick_nohz_stop_sched_tick(), i.e., the CPU is going idle, and
the periodic timer tick is being stopped (no interruptions). But

	if (rcu_needs_cpu(cpu))
		delta_jiffies = 1;

Where, rcu_needs_cpu() means:

/*
 * Check to see if any future RCU-related work will need to be done
 * by the current CPU, even if none need be done immediately, returning
 * 1 if so.
 */
int rcu_needs_cpu(int cpu)

Then, again in tick_nohz_stop_sched_tick()

	/*
	 * Do not stop the tick, if we are only one off
	 * or if the cpu is required for rcu
	 */
	if (!ts->tick_stopped && delta_jiffies == 1)
		goto out;

And in fact, in my testing, without patches 3 and 5 applied, the bug is
still there.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH 2/5] xen: ARM: suspend the tick (if in use) when going idle.
  2017-08-01  8:53     ` Julien Grall
@ 2017-08-01  9:26       ` Dario Faggioli
  0 siblings, 0 replies; 38+ messages in thread
From: Dario Faggioli @ 2017-08-01  9:26 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini; +Cc: xen-devel, nd


On Tue, 2017-08-01 at 09:53 +0100, Julien Grall wrote:
> On 31/07/2017 21:59, Stefano Stabellini wrote:
> > On Thu, 27 Jul 2017, Dario Faggioli wrote:
> > > 
> > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > 
> > Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
> 
> This patch looks standalone, but please don't commit this patch
> without 
> the rest of the series. Otherwise, we will introduce regression in
> Xen 
> with credit1 as well.
> 
I'm not sure I'd call that a "regression". In fact, the current
situation is that the code has a bug, but the absence of this power-
efficiency optimization, on ARM, is covering the bug itself. :-D

Anyway, zero intention to turn this into a terminology bunfight, and,
overall, I even agree with you that it's probably better that this
series go in all-together. :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-08-01  9:17     ` Dario Faggioli
@ 2017-08-01 10:22       ` Julien Grall
  2017-08-01 10:33         ` Dario Faggioli
  0 siblings, 1 reply; 38+ messages in thread
From: Julien Grall @ 2017-08-01 10:22 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Andrew Cooper, nd, Stefano Stabellini, Jan Beulich

Hi Dario,

On 01/08/17 10:17, Dario Faggioli wrote:
> On Tue, 2017-08-01 at 09:54 +0100, Julien Grall wrote:
>> Hi Dario,
>>
>> On 27/07/2017 09:01, Dario Faggioli wrote:
>>> Instead of having the CPU where a callback is queued, busy
>>> looping on rcu_pending(), use a timer.
>>>
>>> In fact, we let the CPU go idle, but we program a timer
>>> that will periodically wake it up, for checking whether the
>>> grace period has actually ended.
>>>
>>> It is kind of similar to introducing a periodic tick, but
>>> with a much more limited scope, and a lot less overhead. In
>>> fact, this timer is:
>>> - only active for the CPU(s) that have callbacks queued,
>>>   waiting for the end of a grace period;
>>> - only active when those CPU(s) are idle (and stopped as
>>>   soon as they resume execution).
>>
>> If I read this correctly, it means on ARM the idling will now get
>> interrupted periodically. This is a bit unfortunate, given that if
>> you
>> have a CPU doing nothing, you would still interrupt it
>> intermittently.
>>
> Not really periodically, not always, at least. What this really means
> is that a CPU that is idle, *but* has pending RCU callbacks, will be
> interrupted periodically to see if the grace period ended, so it can
> invoke the callbacks.
>
> As soon as this (callbacks being invoked) will have happened, we won't
> interrupt it any longer.
>
> And idle CPUs _without_ queued RCU callbacks, won't be interrupted at
> all.

Oh, the commit message is not clear about it. The wording gives the 
impression the timer will always be there on every idle CPU. In the 
case of active CPUs you specifically mention the end of the grace period.

Thank you for the clarification and please disregard the rest of my 
e-mail then.

Cheers,

-- 
Julien Grall


* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-08-01 10:22       ` Julien Grall
@ 2017-08-01 10:33         ` Dario Faggioli
  0 siblings, 0 replies; 38+ messages in thread
From: Dario Faggioli @ 2017-08-01 10:33 UTC (permalink / raw)
  To: Julien Grall, xen-devel
  Cc: Andrew Cooper, nd, Stefano Stabellini, Jan Beulich


On Tue, 2017-08-01 at 11:22 +0100, Julien Grall wrote:
> On 01/08/17 10:17, Dario Faggioli wrote:
> > As soon as this (callbacks being invoked) will have happened, we
> > won't
> > interrupt it any longer.
> > 
> > And idle CPUs _without_ queued RCU callbacks, won't be interrupted
> > at
> > all.
> 
> Oh, the commit message is not clear about it. The wording gives the
> impression the timer will always be there on every idle CPU. In the
> case of active CPUs you specifically mention the end of the grace
> period.
> 
Ok, I'll try to improve the changelog and resend.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-08-01  0:47         ` Dario Faggioli
@ 2017-08-01 19:13           ` Stefano Stabellini
  2017-08-02 10:14             ` Dario Faggioli
  0 siblings, 1 reply; 38+ messages in thread
From: Stefano Stabellini @ 2017-08-01 19:13 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Jan Beulich, Andrew Cooper

On Tue, 1 Aug 2017, Dario Faggioli wrote:
> On Mon, 2017-07-31 at 16:58 -0700, Stefano Stabellini wrote:
> > On Tue, 1 Aug 2017, Dario Faggioli wrote:
> > > On Mon, 2017-07-31 at 14:20 -0700, Stefano Stabellini wrote:
> > > > On Thu, 27 Jul 2017, Dario Faggioli wrote:
> > > > > 
> > > > > diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
> > > > > index f0fdc87..4586f2a 100644
> > > > > --- a/xen/common/rcupdate.c
> > > > > +++ b/xen/common/rcupdate.c
> > > > > @@ -84,8 +84,14 @@ struct rcu_data {
> > > > >      int cpu;
> > > > >      struct rcu_head barrier;
> > > > >      long            last_rs_qlen;     /* qlen during the last
> > > > > resched */
> > > > > +
> > > > > +    /* 3) idle CPUs handling */
> > > > > +    struct timer idle_timer;
> > > > > +    bool idle_timer_active;
> > > > >  };
> > > > >  
> > > > > +#define RCU_IDLE_TIMER_PERIOD MILLISECS(10)
> > > > 
> > > > Isn't this a bit too short? How is it chosen?
> > > > 
> > > 
> > > What makes you think it's short?
> > 
> > In terms of power saving and CPU sleep states, 10ms is not much to
> > sleep
> > for. I wonder if there are any power saving benefits in sleeping for
> > only 10ms (especially on x86 where entering and exiting CPU sleep
> > states
> > takes longer, to be confirmed).
> >
> I *think* we should be fine with, say, 100ms. But that's, again, a
> guess/rule of thumb, nothing more than that. And, TBH, I'm not even
> sure what a good experiment/benchmark would be, to assess whether a
> particular value is good or bad. :-/

Given the description below, it's possible that the new timer has to
fire several times before the callback can be finally invoked, right?

I would make some measurements to check how many times the timer has to
fire (how long we actually need to wait before calling the callback) in
various scenarios. Then I would choose a value to minimize the number of
unnecessary wake-ups.


> >   We might as well do the thing we need
> > to do immediately? I guess we cannot do that?
> >
> You're guessing correct, we can't. It's indeed a bit tricky, and it
> took it a little bit to me as well to figure all of it out properly.
> 
> Basically, let's say that, at time t1, on CPU1, someone calls
> call_rcu(). The situation on the other CPUs is: CPU0 busy; CPU2 idle
> (no timer pending); CPU3 busy.
> 
> So, a new grace period starts, and its exact end will be when CPU0,
> CPU1 and CPU3 have quiesced once (in Xen, quiescence means: "going
> through do_softirq()").
> 
> But RCU is a passive mechanism, i.e., we rely on each CPU coming to
> the RCU core logic, and tell <<hey, btw, I've quiesced>>.
> Let's say that CPU0 quiesces at time t2 > t1. CPU1 quiesces at time
> t3 > t2, and goes fully idle (no timer pending). CPU3 quiesces at time
> t4 > t3. Now, at time t4, CPU1 can actually invoke the callback, queued
> at time t1, from within call_rcu().
> 
> This patch series solves two problems, of our current RCU
> implementation:
> 
> 1) right now, we don't only wait for CPU0, CPU1 and CPU3, we also wait 
>    for CPU2. But since CPU2 is fully idle, it just won't bother 
>    telling RCU that it has quiesced (well, on x86, that would actually 
>    happen at some point, while on ARM, it is really possible that this 
>    never happens!). This is solved in patch 3, by introducing the 
>    cpumask;
> 
> 2) now (after patch 3) we know that we just can avoid waiting for 
>    CPU2. Good. But at time t4, when CPU3 quiesces, marking the end of
>    the grace period, how would CPU1 --which is fully idle-- know that
>    it can now safely invoke the callback? Again, RCU is passive, so it
>    relies on CPU1 to figure that out on its own, next time it wakes
>    up, e.g., because of the periodic tick timer. But we don't have a
>    periodic tick timer! Well, this again means that, on x86, CPU1 will
>    actually figure that out at some (unpredictable) point in future.
>    On ARM, not so much. The purpose of the timer in this patch is to
>    make sure it always will.
>    In fact, with patches 4 and 5 applied, at time t3, we let CPU1 go 
>    idle, but we program the timer. It will fire at t3+T (with T=10ms, 
>    right now). When this happens, if t3+T > t4, CPU1 invokes the
>    callback, and we're done. If not (and CPU1 is still idle) we retry
>    in another T.
> 
> So, when you say "immediately", *when* do you actually mean?
> 
> We can't invoke the callback at t3 (i.e., immediately after CPU1
> quiesces), because we need to wait for CPU3 to do the same.
> 
> We can't invoke the callback when CPU3 quiesces, at t4 (i.e.,
> immediately after the grace period ends) either, because the callback
> it's on CPU1, not on CPU3.
> 
> Linux's solution is not to stop the tick for CPU1 at time t3. It will
> be stopped only after the first firing of the tick itself at time
> t > t4 (if CPU1 is still idle, of course). This timer thing is, IMO,
> pretty similar. The only difference is that we don't have a tick to not
> stop... So I had to make up a very special one. :-D
> 
> TL;DR, yes, I also would have loved the world (or even just this
> problem) to be simple. It's not! ;-P


* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-08-01 19:13           ` Stefano Stabellini
@ 2017-08-02 10:14             ` Dario Faggioli
  0 siblings, 0 replies; 38+ messages in thread
From: Dario Faggioli @ 2017-08-02 10:14 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Julien Grall, Jan Beulich, Andrew Cooper


Hey, apologies in advance, for the very long email! :-O

On Tue, 2017-08-01 at 12:13 -0700, Stefano Stabellini wrote:
> Given the description below, it's possible that the new timer has to
> fire several times before the callback can be finally invoked, right?
> 
> I would make some measurements to check how many times the timer has
> to
> fire (how long we actually need to wait before calling the callback)
> in
> various scenarios. Then I would choose a value to minimize the number
> of
> unnecessary wake-ups.
> 
You seem to be thinking of this timer as something that disturbs the
idleness of a CPU. It's not, because the CPU is not idle! It has a
callback waiting to be invoked, and it better do that as soon as
possible, in order to:

- free the memory/the resources as soon as practically possible. 
  E.g., if we wait for, say, 1 sec, destroying a VM with a "passed-
  through" I/O range, and then trying to create another one, with the 
  same resource assigned, would fail until that 1 sec expires. True,
  we save more power on that CPU, but it does not look neither fair 
  nor correct to me to strive toward achieving that, in this situation.
  Power should indeed be saved, and unnecessary wake-ups prevented, but
  when there isn't anything to do! If there is stuff to do, and this is
  the case, our goal should be to get it done as soon as possible, so
  that, afterwards, we can go into a (hopefully) long sleep.

- not delay the beginning of a new grace period: by delaying detection 
  of a finished grace period, we're also delaying the start of the next
  one. This means, in case there are RCU operations waiting for it
  (because they arrived too late for the current one, and have been
  queued), we're potentially significantly delaying them too. So,
  basically, this again looks like the same kind of paradox to me,
  where we have stuff to do, but we decide to try to save power.

RCU is not a mechanism for allowing CPUs to stay idle longer; it's a
lockless and asymmetric serialization primitive. And, as with any
serialization primitive --either lock-based or lockless-- keeping the
duration of critical sections small is rather important.

So, I think the goal here is not minimizing the wakeups; it is, much
rather, completing the grace period sooner rather than later. In theory, if
rcu_needs_cpu() returns true, we may very well not even let the CPU go
idle, and have it spin (which indeed is what happens without this
patch).
That would guarantee a prompt detection of CPU quiescence, and of the
ending of a grace period. Then, since we know that spinning consumes a
lot of power, and generates overhead that, potentially, may even affect
and slow down activities running on other CPUs, we let it go idle with a
timer armed. But that's a compromise, and can't be confused with the
primary goal. And for any nanosecond that we spend, on that CPU,
consuming less power than what we'd have consumed running a tight loop
around rcu_pending(), we should be thankful, instead of regretting that
there could have been more. :-)

In Linux, as said a few times, they do the same, and the role of this
new timer I'm introducing here, is played by the tick. The tick period
is controlled by the HZ config variable, possible values for which (in
2.6.21, where CONFIG_NO_HZ was introduced for first time for all
arches) are 100Hz, 250Hz, 300Hz and 1000Hz, which translates into 10ms,
 4ms, 3.33ms or 1ms. Default is 250HZ==4ms. The periodic tick is the
minimum time granularity of the kernel for a lot of subsystem
(especially in that kernel), so it's something that can be considered
pretty frequently running. As also said many times, we don't have any
such thing, but still I think we should use something with a similar
order of magnitude.

[NOTE that I'm not using Linux as an example because what is done in
Linux is always correct and is always what we also should do. However:
(1) RCU has been basically invented for Linux, or at least Linux is
where it has seen the widest adoption and development, and (2) our RCU
code has been (not in the best possible way, allow me to say) imported
from there. Therefore, looking at what's done in Linux seems reasonable
to me, in this case.]

About the kind of measurements you're suggesting: I think I
understand the rationale behind them, but, at the same time, I'm quite
sure that what we'd get would be pseudo-random numbers.

In fact, the length of this "wait time", although it has a relationship
with the workload being run, also depends on a lot of other things
within the *system* as a whole. It would be basically impossible to run
any meaningful statistical analysis on the results and come to
something that is good enough in general.

Think, for instance, of how this is not at all architecture-specific
code, and yet the behavior we're seeing is so different between x86 and
ARM. Think of how this doesn't really have anything to do with the
scheduler, and yet the behavior we observe varies depending on some
internal implementation details of a scheduler!

What follows is a more detailed explanation of how things works, in a
simple enough scenario.

+ == CPU is busy
. == CPU is idle
R == when call_rcu() is invoked (i.e., beginning of new grace
          period)
Q == CPU quiesces (when all have done this, grace period ends).
     This happens when do_softirq() is called on the CPU
C == pending callbacks are invoked

                      t1
                      |
                      |
CPU1 +++R+++++++++++++Q..............
CPU2 ++++++++++++++++++++++++++Q++...
                               |
                               |
                               t2

If CPU1 is where call_rcu() happens, and CPU2 is the only other busy
CPU in the system at the time, and T is the period of our timer, the
following cases are possible:

x) t2 < t1

   In this case, the timer never fires:

                         t1
                         |
                         |
   CPU1 +++R+++++++++++++QC.........
   CPU2 ++++++++++++++Q.............
                      |
                      |
                      t2

x) t2 - t1 < T

   In this case, the timer only fires once

                    t1     t1+T
                    |       |
                    |       |
   CPU1 +++R++++++++Q.......C......
   CPU2 +++++++++++++++++Q.........
                         |
                         |
                         t2

x) t2 - t1 > (n-1)*T

                t1     t1+T   t1+2T   t1+3T   t1+4T
                |       |       |       |       |
                |       |       |       |       |
   CPU1 +R++++++Q...............................C..
   CPU2 ++++++++++++++++++++++++++++++++++++Q......
                                            |
                                            |
                                            t2

   In this case, the timer will fire n times (with n=4 in the example).
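The three cases can be folded into one formula. This is a hypothetical helper purely for illustration (it assumes the timer fires exactly every T after t1, which the text notes is the intent):

```c
#include <assert.h>
#include <stdint.h>

/* Number of times the idle timer fires before the callbacks can run,
 * in the scenario above: CPU1 quiesces and goes idle at t1 (arming the
 * timer with period T), and the grace period ends at t2. The timer
 * fires at t1+T, t1+2T, ...; the first firing at or after t2 finds the
 * grace period over and invokes the callbacks. */
static unsigned int idle_timer_fires(int64_t t1, int64_t t2, int64_t T)
{
    if (t2 <= t1)
        return 0;  /* grace period already over when CPU1 goes idle */
    return (unsigned int)((t2 - t1 + T - 1) / T);  /* ceil((t2-t1)/T) */
}
```

This reproduces the cases above: 0 firings when t2 < t1, 1 firing when t2 - t1 < T, and n firings when (n-1)*T < t2 - t1 <= n*T.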

So, you're suggesting to measure, under different workloads, and then
do some kind of averaging/statistical analysis, t2-t1. What I'm saying
is that the occurrence of those events marked 'Q', on the appropriate
CPU, is really hard to predict.

In fact, t2-t1 is, basically, the difference between the times when the
two CPUs call do_softirq(). How much is the frequency of invocation of
do_softirq() related to the actual characteristics of the workload
we're running? Well, a relationship surely exists, but not one that I
think is easy (possible?) to generalize... And here we're also talking
about considering *different* workloads! :-O

For instance, because of either hard-affinity, soft-affinity, or
scheduling weights, there may be more or less preemptions, on the CPUs
in question (and preemptions means going through do_softirq())... And
that's not related to the load a particular workload produces, as
it's entirely under the user's control. So we'd have not only to measure
with different workloads, but for each workload, we'd have to test a
certain number of variations of the scheduling parameters of the vCPUs!
:-O

On x86, there's an upper bound to the time CPUs will stay idle, because
there's always (something like) a timer running (for time sync
rendezvous, said Andrew, and I haven't checked myself, but have no
reason to doubt that to be the case), while on ARM there's no such
thing (or we wouldn't be here! :-D). So we'd have to measure each workload,
with each combination of parameters, on each (present and future)
architecture we support! :-O

CPU onlining and offlining, or CPUs moving around cpupools, can cause
timers (and RCU callbacks even, in the online/offline case!) to be
moved around, which means they'd be firing (and a timer firing means
going through do_softirq()) on different CPUs from where we expected
them to. So, we'd have to measure each workload, with each combination
of parameters, on each architecture, while doing CPU online/offline,
and while moving CPUs around pools! :-O

And this was just to mention the few examples I could fish from the top
of my head...

No, IMO, there are too many variables and too many unrelated factors
involved, to even hope that any set of measurements we actually manage
to put together, can be representative of the general case.


*PERHAPS* we can think of a heuristic... Provided it's not too
complex, or at least, that it does not try to capture all this
variability (because that'd be equally impossible).

Something like this: when a CPU with queued callbacks quiesces and then
goes idle, we know how many other CPUs, among the ones that are
participating in the grace period, have not quiesced yet. Instead of
using, say, always 10ms, we can program the timer to fire at something
like 1ms*cpumask_weight(not_yet_quiesced_cpus).

In fact, I think it makes sense to assume that, if there are only a few
CPUs involved in the grace period that still have not quiesced, then
the grace period itself is about to finish, and we should check
frequently whether or not that has happened.
If, OTOH, there are a lot of CPUs involved in the grace period that
still haven't quiesced, we can check whether the grace period has ended
in a less eager way.

This will make the period of the timer vary, i.e., it may be different
between two subsequent timer_set(), but that is not a problem at all.
Basically, we're making the timer kind of adaptive.
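The adaptive period proposed above is a one-liner. In Xen the count would come from cpumask_weight() on the mask of not-yet-quiesced CPUs, and MILLISECS() is a real Xen helper (nanosecond-based); the function name and this standalone stub are hypothetical, for illustration only:

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t s_time_t;
/* Stub of Xen's MILLISECS(): the hypervisor's time unit is nanoseconds. */
#define MILLISECS(ms) ((s_time_t)(ms) * 1000000LL)

/* Adaptive idle-timer period: the fewer CPUs that still have to
 * quiesce, the sooner the grace period is likely to end, so poll
 * more eagerly; with many CPUs left, poll less often. */
static s_time_t rcu_idle_timer_period(unsigned int cpus_not_quiesced)
{
    return MILLISECS(1) * cpus_not_quiesced;
}
```

So with one CPU left the timer fires every 1ms, with ten CPUs left every 10ms, and the period can legitimately differ between two subsequent timer_set() calls.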

This does not work well in the case where there is, e.g., only 1 CPU
that has not quiesced yet (so we'd set the timer to fire very
frequently), but it takes a lot of time for such a CPU to do so. Well,
it's a heuristic, so it can't always work well... but I think this
isn't a common situation.

Would you be happy with something like this?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-07-27  8:01 ` [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started Dario Faggioli
  2017-07-31 21:17   ` Stefano Stabellini
@ 2017-08-07  8:35   ` Jan Beulich
  2017-08-09  8:48     ` Dario Faggioli
  2017-08-09 11:38   ` Tim Deegan
  2 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2017-08-07  8:35 UTC (permalink / raw)
  To: dario.faggioli; +Cc: andrew.cooper3, julien.grall, sstabellini, xen-devel

>>> Dario Faggioli <dario.faggioli@citrix.com> 07/27/17 10:01 AM >>>
>--- a/xen/arch/x86/acpi/cpu_idle.c
>+++ b/xen/arch/x86/acpi/cpu_idle.c
>@@ -418,14 +418,16 @@ static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
>mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);
>}
 >
>-static void acpi_idle_do_entry(struct acpi_processor_cx *cx)
>+static void acpi_idle_do_entry(unsigned int cpu, struct acpi_processor_cx *cx)
>{
>+    rcu_idle_enter(cpu);
>+
>switch ( cx->entry_method )
>{
>case ACPI_CSTATE_EM_FFH:
>/* Call into architectural FFH based C-state */
>acpi_processor_ffh_cstate_enter(cx);
>-        return;
>+        break;
>case ACPI_CSTATE_EM_SYSIO:
         >/* IO port based C-state */
>inb(cx->address);
>@@ -433,12 +435,14 @@ static void acpi_idle_do_entry(struct acpi_processor_cx *cx)
>because chipsets cannot guarantee that STPCLK# signal
>gets asserted in time to freeze execution properly. */
>inl(pmtmr_ioport);
>-        return;
>+        break;
     >case ACPI_CSTATE_EM_HALT:

Would be nice if you also added blank lines between the break-s and the
subsequent case labels.

>@@ -756,6 +759,8 @@ static void mwait_idle(void)
>return;
>}
 >
>+    rcu_idle_enter(cpu);
>+
>eax = cx->address;
>cstate = ((eax >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
 >
>@@ -787,6 +792,8 @@ static void mwait_idle(void)
>irq_traced[0], irq_traced[1], irq_traced[2], irq_traced[3]);
 >
>/* Now back in C0. */
>+    rcu_idle_exit(cpu);
>+
>update_idle_stats(power, cx, before, after);
>local_irq_enable();
 
Compared to the ACPI code, the window is quite a bit larger here. Any reason
you can't put these just around the mwait_idle_with_hints() invocation, which
would match what you do for the ACPI-based driver?

>@@ -248,7 +249,14 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
>smp_wmb();
>rcp->cur++;
 >
>-        cpumask_copy(&rcp->cpumask, &cpu_online_map);
>+       /*
>+        * Accessing idle_cpumask before incrementing rcp->cur needs a
>+        * Barrier  Otherwise it can cause tickless idle CPUs to be
>+        * included in rcp->cpumask, which will extend graceperiods
>+        * unnecessarily.
>+        */
>+        smp_mb();
>+        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);

I have some trouble with understanding the comment: You don't access
->idle_cpumask before you increment ->cur. (Also, as a general remark,
please go through patch description and comments again to correct
spelling and alike issues.)

>@@ -474,7 +482,23 @@ static struct notifier_block cpu_nfb = {
>void __init rcu_init(void)
>{
>void *cpu = (void *)(long)smp_processor_id();
>+
>+    cpumask_setall(&rcu_ctrlblk.idle_cpumask);

The CPU you're running on surely is not idle initially?

Jan



* Re: [PATCH 4/5] xen: RCU: don't let a CPU with a callback go idle.
  2017-07-27  8:01 ` [PATCH 4/5] xen: RCU: don't let a CPU with a callback go idle Dario Faggioli
@ 2017-08-07  8:38   ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2017-08-07  8:38 UTC (permalink / raw)
  To: dario.faggioli
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, tim, julien.grall, xen-devel

>>> Dario Faggioli <dario.faggioli@citrix.com> 07/27/17 10:01 AM >>>
>In this commit, we implement the above, using rcu_needs_cpu(),
>in a way similar to how it is used in Linux. This is correct,
>useful and not wasteful for CPUs that participate in the grace
>period, but do not have a callback queued. For the ones that
>have callbacks, an optimization that avoids having to spin is
>introduced in next commit.

As always, you can't know what will be the "next commit". Please use
e.g. "in a subsequent change". Other than that

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan



* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-07-27  8:01 ` [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period Dario Faggioli
  2017-07-31 21:20   ` Stefano Stabellini
  2017-08-01  8:54   ` Julien Grall
@ 2017-08-07  8:54   ` Jan Beulich
  2017-08-09 17:34     ` Dario Faggioli
  2 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2017-08-07  8:54 UTC (permalink / raw)
  To: dario.faggioli; +Cc: andrew.cooper3, julien.grall, sstabellini, xen-devel

>>> Dario Faggioli <dario.faggioli@citrix.com> 07/27/17 10:01 AM >>>
>Instead of having the CPU where a callback is queued, busy
>looping on rcu_pending(), use a timer.

Isn't this rcu_needs_cpu(), according to patch 4?

>--- a/xen/arch/x86/acpi/cpu_idle.c
>+++ b/xen/arch/x86/acpi/cpu_idle.c
>@@ -576,10 +576,10 @@ static void acpi_processor_idle(void)
>return;
>}
 >
>+    rcu_idle_timer_start();
>cpufreq_dbs_timer_suspend();

Is the ordering wrt cpufreq handling here arbitrary? If so, wouldn't it be
better to suspend cpufreq handling before starting the idle timer (and
also invert the ordering when going back to normal operation below)?

>-
>sched_tick_suspend();

At which point I'd then wonder whether this couldn't be integrated into
sched_tick_{suspend,resume}(), avoiding arch-specific code to be
altered.

>@@ -756,6 +756,7 @@ static void mwait_idle(void)
>        local_irq_enable();
>        sched_tick_resume();
>        cpufreq_dbs_timer_resume();
>+                rcu_idle_timer_stop();

Indentation looks odd here, but I can't exclude this being an effect produced
by my mail web frontend.

>--- a/xen/common/rcupdate.c
>+++ b/xen/common/rcupdate.c
>@@ -84,8 +84,14 @@ struct rcu_data {
>int cpu;
     >struct rcu_head barrier;
>long            last_rs_qlen;     /* qlen during the last resched */
>+
>+    /* 3) idle CPUs handling */
>+    struct timer idle_timer;
>+    bool idle_timer_active;
>};
 >
>+#define RCU_IDLE_TIMER_PERIOD MILLISECS(10)

If I'm not mistaken someone else had already commented on this: If this
is (mostly) arbitrary, please say so in a comment.

>@@ -402,7 +408,48 @@ int rcu_needs_cpu(int cpu)
>{
>struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 >
>-    return (!!rdp->curlist || rcu_pending(cpu));
>+    return (!!rdp->curlist || rcu_pending(cpu)) && !rdp->idle_timer_active;

Please take the opportunity and drop the pointless !! here (unless it's needed
for better matching up with the Linux original).

>+}
>+
>+/*
>+ * Timer for making sure the CPU where a callback is queued does
>+ * periodically poke rcu_pedning(), so that it will invoke the callback
>+ * not too late after the end of the grace period.
>+ */
>+void rcu_idle_timer_start()
>+{
>+    struct rcu_data *rdp = &this_cpu(rcu_data);
>+
>+    if (likely(!rdp->curlist))
>+        return;

I would have expected this to be the inverse of the original condition in
rcu_needs_cpu() - why is there no rcu_pending() invocation here?

>@@ -451,6 +500,7 @@ static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
>rdp->qs_pending = 0;
>rdp->cpu = cpu;
>rdp->blimit = blimit;
>+    init_timer(&rdp->idle_timer, rcu_idle_timer_handler, (void*) rdp, cpu);

Again, unless it is this bogus way in the Linux original, please drop the
pointless cast, or at least correct its style.

Jan



* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-08-07  8:35   ` Jan Beulich
@ 2017-08-09  8:48     ` Dario Faggioli
  2017-08-09  8:57       ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Dario Faggioli @ 2017-08-09  8:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, julien.grall, sstabellini, xen-devel



On Mon, 2017-08-07 at 02:35 -0600, Jan Beulich wrote:
> > > > Dario Faggioli <dario.faggioli@citrix.com> 07/27/17 10:01 AM
> > > > >>>
> > --- a/xen/arch/x86/acpi/cpu_idle.c
> > +++ b/xen/arch/x86/acpi/cpu_idle.c
> > 
> > @@ -433,12 +435,14 @@ static void acpi_idle_do_entry(struct
> > acpi_processor_cx *cx)
> > because chipsets cannot guarantee that STPCLK# signal
> > gets asserted in time to freeze execution properly. */
> > inl(pmtmr_ioport);
> > -        return;
> > +        break;
> 
>      >case ACPI_CSTATE_EM_HALT:
> 
> Wiuld be nice if you also added blank lines between the break-s and
> the
> subsequent case labels.
> 
Ok.

> > @@ -787,6 +792,8 @@ static void mwait_idle(void)
> > irq_traced[0], irq_traced[1], irq_traced[2], irq_traced[3]);
> 
>  >
> > /* Now back in C0. */
> > +    rcu_idle_exit(cpu);
> > +
> > update_idle_stats(power, cx, before, after);
> > local_irq_enable();
> 
>  
> Compared to the ACPI code, the window is quite a bit larger here. Any
> reason
> you can't put these just around the mwait_idle_with_hints()
> invocation, which
> would match what you do for the ACPI-based driver?
> 
No, no reasons not to do that. I'll do it.

> > @@ -248,7 +249,14 @@ static void rcu_start_batch(struct rcu_ctrlblk
> > *rcp)
> > smp_wmb();
> > rcp->cur++;
> 
>  >
> > -        cpumask_copy(&rcp->cpumask, &cpu_online_map);
> > +       /*
> > +        * Accessing idle_cpumask before incrementing rcp->cur
> > needs a
> > +        * Barrier  Otherwise it can cause tickless idle CPUs to be
> > +        * included in rcp->cpumask, which will extend graceperiods
> > +        * unnecessarily.
> > +        */
> > +        smp_mb();
> > +        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp-
> > >idle_cpumask);
> 
> I have some trouble with understanding the comment: You don't access
> ->idle_cpumask before you increment ->cur.
>
It comes verbatim from the Linux commit. You're not the first one that
finds it unclear, and I don't like it either.

So, this is the Linux patch:

        if (rcp->next_pending &&
                        rcp->completed == rcp->cur) {
-               /* Can't change, since spin lock held. */
-               cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
-
                rcp->next_pending = 0;
-               /* next_pending == 0 must be visible in __rcu_process_callbacks()
-                * before it can see new value of cur.
+               /*
+                * next_pending == 0 must be visible in
+                * __rcu_process_callbacks() before it can see new value of cur.
                 */
                smp_wmb();
                rcp->cur++;
+
+               /*
+                * Accessing nohz_cpu_mask before incrementing rcp->cur needs a
+                * Barrier  Otherwise it can cause tickless idle CPUs to be
+                * included in rsp->cpumask, which will extend graceperiods
+                * unnecessarily.
+                */
+               smp_mb();
+               cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
+

_I_think_ what the original author meant was something along the lines
of <<Accessing nohz_cpu_mask before incrementing rcp->cur is unsafe.
Therefore, let's access it afterwards, and put a barrier in between.>>

But yeah, as said, I don't like it myself. In fact, it's the exact same
wording used in the changelog of the patch (Linux commit
c3f5902325d3053986e7359f706581d8f032e72f), but while it is fine there,
here it is completely misleading, as it does not comment/describe the
final look of the code.

I'm going to change it.
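For reference, the intended final ordering can be sketched like this,
with C11 fences standing in for Xen's smp_wmb()/smp_mb() and a
single-word mask standing in for the real cpumask_t (a simplified toy
model of the quoted hunk, not the actual code):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy model of the relevant rcu_ctrlblk fields. */
struct rcu_ctrlblk {
    int cur;                /* current grace period number */
    int next_pending;
    uint64_t cpumask;       /* CPUs the grace period must wait for */
    uint64_t idle_cpumask;  /* CPUs known to be idle */
};

static void rcu_start_batch(struct rcu_ctrlblk *rcp, uint64_t online_map)
{
    rcp->next_pending = 0;
    /* next_pending == 0 must be visible before the new value of cur. */
    atomic_thread_fence(memory_order_release);      /* ~ smp_wmb() */
    rcp->cur++;
    /*
     * Full barrier: read idle_cpumask only after cur has been
     * incremented, or a CPU racing into idle could be missed and end
     * up in cpumask, extending the grace period unnecessarily.
     */
    atomic_thread_fence(memory_order_seq_cst);      /* ~ smp_mb() */
    rcp->cpumask = online_map & ~rcp->idle_cpumask;
}
```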

> (Also, as a general remark,
> please go through patch description and comments again to correct
> spelling and alike issues.)
> 
Sure.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-08-09  8:48     ` Dario Faggioli
@ 2017-08-09  8:57       ` Jan Beulich
  2017-08-09  9:20         ` Dario Faggioli
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2017-08-09  8:57 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: andrew.cooper3, julien.grall, sstabellini, xen-devel

>>> On 09.08.17 at 10:48, <dario.faggioli@citrix.com> wrote:
> On Mon, 2017-08-07 at 02:35 -0600, Jan Beulich wrote:
>> > > > Dario Faggioli <dario.faggioli@citrix.com> 07/27/17 10:01 AM
>> > @@ -248,7 +249,14 @@ static void rcu_start_batch(struct rcu_ctrlblk
>> > *rcp)
>> > smp_wmb();
>> > rcp->cur++;
>> 
>>  >
>> > -        cpumask_copy(&rcp->cpumask, &cpu_online_map);
>> > +       /*
>> > +        * Accessing idle_cpumask before incrementing rcp->cur
>> > needs a
>> > +        * Barrier  Otherwise it can cause tickless idle CPUs to be
>> > +        * included in rcp->cpumask, which will extend graceperiods
>> > +        * unnecessarily.
>> > +        */
>> > +        smp_mb();
>> > +        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp-
>> > >idle_cpumask);
>> 
>> I have some trouble with understanding the comment: You don't access
>> ->idle_cpumask before you increment ->cur.
>>
> It comes verbatim from the Linux commit. You're not the first one that
> finds it unclear, and I don't like it either.
> 
> So, this is the Linux patch:
> 
>         if (rcp->next_pending &&
>                         rcp->completed == rcp->cur) {
> -               /* Can't change, since spin lock held. */
> -               cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
> -
>                 rcp->next_pending = 0;
> -               /* next_pending == 0 must be visible in __rcu_process_callbacks()
> -                * before it can see new value of cur.
> +               /*
> +                * next_pending == 0 must be visible in
> +                * __rcu_process_callbacks() before it can see new value of cur.
>                  */
>                 smp_wmb();
>                 rcp->cur++;
> +
> +               /*
> +                * Accessing nohz_cpu_mask before incrementing rcp->cur needs 
> a
> +                * Barrier  Otherwise it can cause tickless idle CPUs to be
> +                * included in rsp->cpumask, which will extend graceperiods
> +                * unnecessarily.
> +                */
> +               smp_mb();
> +               cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
> +
> 
> _I_think_ what the original author meant was something along the line
> of <<Accessing nohz_cpu_mask before incrementing rcp->cur is unsafe.
> Therefore, let's access it afterwords, and put a barrier in between.>>
> 
> But yeah, as said, I don't like it myself. In fact, it's the same exact
> wording used in the changelog of the patch (Linux commit
> c3f5902325d3053986e7359f706581d8f032e72f), but while it is fine there,
> here is completely misleading, as it does not comment/describe the
> final look of the code.
> 
> I'm going to change it.

Perhaps worth submitting a Linux patch too then?

Jan



* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-08-09  8:57       ` Jan Beulich
@ 2017-08-09  9:20         ` Dario Faggioli
  2017-08-09  9:26           ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Dario Faggioli @ 2017-08-09  9:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, julien.grall, sstabellini, xen-devel



On Wed, 2017-08-09 at 02:57 -0600, Jan Beulich wrote:
> > > > On 09.08.17 at 10:48, <dario.faggioli@citrix.com> wrote:
> > 
> > _I_think_ what the original author meant was something along the
> > line
> > of <<Accessing nohz_cpu_mask before incrementing rcp->cur is
> > unsafe.
> > Therefore, let's access it afterwords, and put a barrier in
> > between.>>
> > 
> > But yeah, as said, I don't like it myself. In fact, it's the same
> > exact
> > wording used in the changelog of the patch (Linux commit
> > c3f5902325d3053986e7359f706581d8f032e72f), but while it is fine
> > there,
> > here is completely misleading, as it does not comment/describe the
> > final look of the code.
> > 
> > I'm going to change it.
> 
> Perhaps worth submitting a Linux patch too then?
> 
That code is long gone.

The file itself doesn't exist any longer, and the mask has been killed,
in favour of other mechanisms (depending on the variant of RCU, on the
degree of tickless-idleness, etc.). :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-08-09  9:20         ` Dario Faggioli
@ 2017-08-09  9:26           ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2017-08-09  9:26 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: andrew.cooper3, julien.grall, sstabellini, xen-devel

>>> On 09.08.17 at 11:20, <dario.faggioli@citrix.com> wrote:
> On Wed, 2017-08-09 at 02:57 -0600, Jan Beulich wrote:
>> > > > On 09.08.17 at 10:48, <dario.faggioli@citrix.com> wrote:
>> > 
>> > _I_think_ what the original author meant was something along the
>> > line
>> > of <<Accessing nohz_cpu_mask before incrementing rcp->cur is
>> > unsafe.
>> > Therefore, let's access it afterwords, and put a barrier in
>> > between.>>
>> > 
>> > But yeah, as said, I don't like it myself. In fact, it's the same
>> > exact
>> > wording used in the changelog of the patch (Linux commit
>> > c3f5902325d3053986e7359f706581d8f032e72f), but while it is fine
>> > there,
>> > here is completely misleading, as it does not comment/describe the
>> > final look of the code.
>> > 
>> > I'm going to change it.
>> 
>> Perhaps worth submitting a Linux patch too then?
>> 
> That code is long gone.
> 
> The file itself doesn't exist any longer, and the mask has been killed,
> in favour of other mechanisms (depending of the variant of RCU, of the
> degree of tickles-idleness, etc). :-)

Oh, I see - in that case we're fine with the comment being
whatever makes most sense for us.

Jan



* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-07-27  8:01 ` [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started Dario Faggioli
  2017-07-31 21:17   ` Stefano Stabellini
  2017-08-07  8:35   ` Jan Beulich
@ 2017-08-09 11:38   ` Tim Deegan
  2017-08-11 17:25     ` Dario Faggioli
  2017-08-14  9:19     ` Dario Faggioli
  2 siblings, 2 replies; 38+ messages in thread
From: Tim Deegan @ 2017-08-09 11:38 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Jan Beulich, Andrew Cooper

Hi,

At 10:01 +0200 on 27 Jul (1501149684), Dario Faggioli wrote:
> In Xen, that is impossible, and that's particularly problematic
> on idle (or lightly loaded) systems, as CPUs that
> are idle may never have the chance to tell RCU about their
> quiescence, and grace periods could extend indefinitely!
[...]
> The first step for fixing this situation is for RCU to record,
> at the beginning of a grace period, which CPUs are already idle.
> In fact, being idle, they can't be in the middle of any read-side
> critical section, and we don't have to wait for them to declare
> a grace period finished.

AIUI this patch fixes a bug where:
 - a CPU is idle/asleep;
 - it is added to the cpumask of a new RCU grace period; and
 - because the CPU is asleep, the grace period never ends. 
Have I understood?

I think we might be left with a race condition:
 - CPU A is about to idle.  It runs the RCU softirq and
   clears itself from the current grace period.
 - CPU B ends the grace period and starts a new one.
 - CPU A calls rcu_idle_enter() and sleeps.
 - The new grace period never ends.

Is that fixed by your later rcu_idle_timer patch?  AIUI that's only
invoked when the calling CPU has pending RCU callbacks.

Or can it be fixed here by something like this in rcu_idle_enter?
 - lock rcp->lock
 - add ourselves to the idle mask
 - if we are in rcp->cpumask:
     - cpu_quiet()
 - unlock rcp->lock
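The steps above could be sketched like this (a toy model: single-word
masks and comments replace the real cpumask_t and spinlock, and all
names are illustrative, not the actual Xen code):

```c
#include <stdint.h>

/* Toy model of the rcu_ctrlblk fields touched by the proposed fix. */
struct rcu_ctrlblk {
    /* spinlock_t lock; -- elided in this single-threaded sketch */
    uint64_t cpumask;       /* CPUs the current grace period waits on */
    uint64_t idle_cpumask;  /* CPUs that have gone idle */
};

/* Record that @cpu has quiesced for the current grace period. */
static void cpu_quiet(struct rcu_ctrlblk *rcp, unsigned int cpu)
{
    rcp->cpumask &= ~(1ULL << cpu);
}

static void rcu_idle_enter(struct rcu_ctrlblk *rcp, unsigned int cpu)
{
    /* spin_lock(&rcp->lock); */
    rcp->idle_cpumask |= 1ULL << cpu;
    /*
     * If a grace period started after our last quiescence, quiesce
     * now, so the period can't wait forever on a sleeping CPU.
     */
    if ( rcp->cpumask & (1ULL << cpu) )
        cpu_quiet(rcp, cpu);
    /* spin_unlock(&rcp->lock); */
}
```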

There's also the code at the top of rcu_check_quiescent_state() that
requires _two_ idle states per batch.  I don't know what race that's
protecting against so I don't know whether we need to worry about it
here as well. :)

Cheers,

Tim.



* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-08-07  8:54   ` Jan Beulich
@ 2017-08-09 17:34     ` Dario Faggioli
  2017-08-10 13:55       ` Dario Faggioli
  0 siblings, 1 reply; 38+ messages in thread
From: Dario Faggioli @ 2017-08-09 17:34 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, julien.grall, sstabellini, xen-devel



On Mon, 2017-08-07 at 02:54 -0600, Jan Beulich wrote:
> > > > Dario Faggioli <dario.faggioli@citrix.com> 07/27/17 10:01 AM
> > > > >>>
> > 
> > Instead of having the CPU where a callback is queued, busy
> > looping on rcu_pending(), use a timer.
> 
> Isn't this rcu_needs_cpu(), according to patch 4?
> 
I was referring to the rcu_pending() in do_softirq(). In fact, if this
(roughly) is idle_loop:

    for ( ; ; )
    {
        if ( unlikely(tasklet_work_to_do(cpu)) )
            do_tasklet();
        else
            pm_idle();
        do_softirq();

        check_for_livepatch_work();
    }

we don't have tasklet work to do, so we call pm_idle(). However, if we
have a callback queued, rcu_needs_cpu() returns true (without calling
rcu_pending()), which means cpu_is_haltable() returns false, and hence
we exit pm_idle() without actually going idle, and we call
do_softirq(), which does:

    if ( rcu_pending(cpu) )
        rcu_check_callbacks(cpu);

in a rather tight loop.

IAC, I certainly can rephrase the sentence above, and make all this
more clear.

> > --- a/xen/arch/x86/acpi/cpu_idle.c
> > +++ b/xen/arch/x86/acpi/cpu_idle.c
> > @@ -576,10 +576,10 @@ static void acpi_processor_idle(void)
> > return;
> > }
> 
>  >
> > +    rcu_idle_timer_start();
> > cpufreq_dbs_timer_suspend();
> 
> Is the ordering wrt cpufreq handling here arbitrary? 
>
Yes.

> If so, wouldn't it be
> better to suspend cpufreq handling before starting the idle timer
> (and
> also invert the ordering when going back to normal operation below)?
> 
Well, as said above, it's arbitrary. Therefore, yes, I'm ok with doing
things like you suggest.

> > -
> > sched_tick_suspend();
> 
> At which point I'd then wonder whether this couldn't be integrated
> into
> sched_tick_{suspend,resume}(), avoiding arch-specific code to be
> altered.
> 
Sure, I'm all for it. It's easier, cleaner, and makes a lot of sense!

The only problem may be if some new arch decides not to (or forgets
to) call sched_tick_suspend() and resume().

Perhaps I can put some checks in place, to make sure that this can't
happen (in debug builds only, of course).

> > @@ -756,6 +756,7 @@ static void mwait_idle(void)
> >        local_irq_enable();
> >        sched_tick_resume();
> >        cpufreq_dbs_timer_resume();
> > +                rcu_idle_timer_stop();
> 
> Indentation looks odd here, but I can't exclude this being an effect
> produced
> by my mail web frontend.
> 
Yep, it was indeed wrong. Fixed. Sorry. Thanks. :-)

> > --- a/xen/common/rcupdate.c
> > +++ b/xen/common/rcupdate.c
> > @@ -84,8 +84,14 @@ struct rcu_data {
> > 
> > +#define RCU_IDLE_TIMER_PERIOD MILLISECS(10)
> 
> If I'm not mistaken someone else had already commented on this: If
> this
> is (mostly) arbitrary, please say so in a comment.
> 
Indeed there has been a long 'debate'. :-)

For v2, I'm changing this to something more dynamic. And I'll sure add
a comment explaining the adopted solution.

> > @@ -402,7 +408,48 @@ int rcu_needs_cpu(int cpu)
> > {
> > struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> 
>  >
> > -    return (!!rdp->curlist || rcu_pending(cpu));
> > +    return (!!rdp->curlist || rcu_pending(cpu)) && !rdp-
> > >idle_timer_active;
> 
> Please take the opportunity and drop the pointless !! here (unless
> it's needed
> for better matching up with the Linux original).
> 
It's in the Linux code, indeed:
http://elixir.free-electrons.com/linux/v2.6.21/source/kernel/rcupdate.c#L517

However, this very line is already different (because Linux has
rdp_bh->curlist, which we don't, and we have rdp->idle_timer_active).
So I guess I can drop '!!'.

Any patch touching this line coming from Linux will need manual
adjustment anyway.

> > +/*
> > + * Timer for making sure the CPU where a callback is queued does
> > + * periodically poke rcu_pending(), so that it will invoke the
> > callback
> > + * not too late after the end of the grace period.
> > + */
> > +void rcu_idle_timer_start()
> > +{
> > +    struct rcu_data *rdp = &this_cpu(rcu_data);
> > +
> > +    if (likely(!rdp->curlist))
> > +        return;
> 
> I would have expected this to be the inverse of the original
> condition in
> rcu_needs_cpu() - why is there no rcu_pending() invocation here?
> 
The original rcu_needs_cpu() condition is:

 rcu->curlist || rcu_pending(cpu)

So, you're saying we need something like this:

 if (!rdp->curlist && !rcu_pending(cpu))
   return;

Well, if I do this, what happens is that the if evaluates to false (and
hence we don't exit the function, and we arm the timer), on CPUs that:
 - do not have a callback queued (curlist == NULL)
 - are in the process of quiescing, by going idle.

In fact, in that case, here's what happens (I compacted the trace, and
slightly edited some of the lines, for better readability and to
highlight the characteristics of the situation under investigation):

* complete_domain_destroy callback is queued on CPU 0. The CPU quiesces
  (goes through softirq), but stays busy, so no timer is armed (that's
  not super relevant, but it is what happens in this case):

x-|- d0v12 rcu_call fn=complete_domain_destroy
x-|- d0v12 rcu_pending? yes (ret=2): no pending entries but new entries
x-|- d0v12 raise_softirq nr 3
x-|- d0v12 rcu_process_callbacks, rdp_curlist: null, rdp_nxtlist: yes
x-|- d0v12 do_softirq
x-|- d0v12 rcu_pending? yes (ret=5): waiting for CPU to quiesce (rdp_qs_pending=1)
x-|- d0v12 raise_softirq nr 3
x-|- d0v12 rcu_process_callbacks, rdp_curlist: yes, rdp_nxtlist: null
x-|- d0v12 rcu_check_quiesc_state, rdp_qs_pending: yes
x-|- d0v12 rcu_cpu_quiet, rcp_cpumask=0x00004100
x-|- d0v12 rcu_pending? no

* the vCPU running on CPU 2, which participates in the grace period, 
  blocks, and CPU 2 goes idle. That means that CPU 2 quiesces (goes 
  through softirq), and we can forget about it. Surely we don't need 
  the timer to fire on it, as the callback is not queued there...

|-x- d0v7 vcpu_block d0v7
|-x- d0v7 raise_softirq nr 1
|-x- d0v7 do_softirq
|-x- d0v7 rcu_pending? yes (ret=4): waiting for CPU to quiesce (rdp_qs_pending=0)
|-x- d0v7 raise_softirq nr 3
|-x- d0v7 softirq_handler nr 1
|-x- d0v7 runstate_change d0v7 running->blocked
|-x- d?v? runstate_change d32767v14 runnable->running

  Now, this is cpufreq_dbs_timer_suspend():

|-x- d32767v14 timer_stop t=do_dbs_timer

  And this is sched_tick_suspend(), which now calls 
  rcu_idle_timer_start(), gated by the condition suggested above.

|-x- d32767v14 rcu_pending? yes (ret=4): waiting for CPU to quiesce (rdp_qs_pending=0)
|-x- d32767v14 timer_set t=rcu_idle_timer_handler, expires_in=9787us

  So, the CPU is actually quiescing, and since it has no callbacks
  queued, as we said, we wouldn't have needed the timer. But the
  condition is false, because, at this stage, rcu_pending() was still
  true. So, we don't bail early from rcu_idle_timer_start(), and we
  actually arm the timer.

|-x- d32767v14 rcu_pending? yes (ret=4): waiting for CPU to quiesce (rdp_qs_pending=0)
|-x- d32767v14 raise_softirq nr 3
|-x- d32767v14 rcu_process_callbacks, rdp_curlist: null, rdp_nxtlist: null
|-x- d32767v14 rcu_check_quiesc_state, rdp_qs_pending: no
|-x- d32767v14 rcu_process_callbacks, rdp_curlist: null, rdp_nxtlist: null
|-x- d32767v14 rcu_check_quiesc_state, rdp_qs_pending: yes
|-x- d32767v14 rcu_cpu_quiet, rcp_cpumask=0x00000100
|-x- d32767v14 pm_idle_start c2

* The timer fires, for no useful reason, and causes a spurious wakeup:

|-x- d32767v14 raise_softirq nr 0
|-x- d32767v14 pm_idle_end c2
|-x- d32767v14 irq_enter
|-x- d32767v14 irq_direct, vec=0xfa, handler=apic_timer_interrupt
|-x- d32767v14 raise_softirq nr 0
|-x- d32767v14 irq_exit, in_irq = 0
|-x- d32767v14 softirq_handler nr 0
|-x- d32767v14 timer_exec t=rcu_idle_timer_handler
|-x- d32767v14 timer_reprogr deadline=954.566us
|-x- d32767v14 rcu_pending? no
|-x- d32767v14 pm_idle_start c3

So, this is why I don't want rcu_pending() in that if. If we don't
like this, I can see about moving the timer start and stop helpers
around a bit (I've a couple of ideas in mind already, but I need to
try).

Actually, it's entirely possible that having rcu_pending(cpu) in
rcu_needs_cpu() is, for us, redundant. In fact, although it does make
sense in Linux, both code inspection and some investigation I've just
done make me believe that there won't be cases where a CPU is denied
going offline because it sees rcu_pending() returning 1.

In fact, when we call rcu_pending() within cpu_is_haltable(), we have
already gone through it before. If there was pending work, we raised
the softirq and dealt with it. If there wasn't, there is none now
either.

I'm therefore leaning toward removing rcu_pending() from the
rcu_needs_cpu() check as well. At that point, the check inside
rcu_idle_timer_start() will indeed be the opposite of the original
check in rcu_needs_cpu(). :-)

> > @@ -451,6 +500,7 @@ static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
> > rdp->qs_pending = 0;
> > rdp->cpu = cpu;
> > rdp->blimit = blimit;
> > +    init_timer(&rdp->idle_timer, rcu_idle_timer_handler, (void*) rdp, cpu);
> 
> Again, unless it is this bogus way in the Linux original, please drop
> the
> pointless cast, or at least correct its style.
> 
Ah, no, this one, I can kill it.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
  2017-08-09 17:34     ` Dario Faggioli
@ 2017-08-10 13:55       ` Dario Faggioli
  0 siblings, 0 replies; 38+ messages in thread
From: Dario Faggioli @ 2017-08-10 13:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, julien.grall, sstabellini, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 2384 bytes --]

On Wed, 2017-08-09 at 19:34 +0200, Dario Faggioli wrote:
> On Mon, 2017-08-07 at 02:54 -0600, Jan Beulich wrote:
> > > > > Dario Faggioli <dario.faggioli@citrix.com> 07/27/17 10:01 AM
> > > +/*
> > > + * Timer for making sure the CPU where a callback is queued does
> > > + * periodically poke rcu_pending(), so that it will invoke the callback
> > > + * not too late after the end of the grace period.
> > > + */
> > > +void rcu_idle_timer_start()
> > > +{
> > > +    struct rcu_data *rdp = &this_cpu(rcu_data);
> > > +
> > > +    if (likely(!rdp->curlist))
> > > +        return;
> > 
> > I would have expected this to be the inverse of the original
> > condition in
> > rcu_needs_cpu() - why is there no rcu_pending() invocation here?
> > 
> 
> [...]
>
> Actually, it's entirely possible that having rcu_pending(cpu) in
> rcu_needs_cpu() is, for us, redundant. In fact, although it does make
> sense in Linux, both code inspection and some investigation I've just
> done make me believe that there won't be cases where a CPU is denied
> going idle because it sees rcu_pending() returning 1.
> 
> In fact, when we call rcu_pending() within cpu_is_haltable(), we have
> already gone through it before. If there was pending work, we raised
> the softirq and dealt with it. If there wasn't, there is none now
> either.
> 
> I'm therefore leaning toward removing rcu_pending() from the
> rcu_needs_cpu() check as well. At that point, the check inside
> rcu_idle_timer_start() will indeed be the opposite of the original
> check in rcu_needs_cpu(). :-)
> 
> 
FTR, I'm not so sure of this last thing any longer. I mean, the
analysis I provided is still correct, but I'm investigating the other
possible race in the code that Tim has hinted at in his mail, and I
think it could be useful to have rcu_pending() checked in here, to
solve/avoid that one.

It's also possible that I'll actually remove it from rcu_needs_cpu(),
but move it somewhere else... As I said, I'm still looking into the
problem.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-08-09 11:38   ` Tim Deegan
@ 2017-08-11 17:25     ` Dario Faggioli
  2017-08-14 10:39       ` Tim Deegan
  2017-08-14  9:19     ` Dario Faggioli
  1 sibling, 1 reply; 38+ messages in thread
From: Dario Faggioli @ 2017-08-11 17:25 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Jan Beulich, Andrew Cooper


On Wed, 2017-08-09 at 12:38 +0100, Tim Deegan wrote:
> Hi,
> 
Hey! :-)

> At 10:01 +0200 on 27 Jul (1501149684), Dario Faggioli wrote:
> > In Xen, that is impossible, and that's particularly problematic
> > when the system is idle (or lightly loaded), as CPUs that
> > are idle may never have the chance to tell RCU about their
> > quiescence, and grace periods could extend indefinitely!
> 
> [...]
> > The first step for fixing this situation is for RCU to record,
> > at the beginning of a grace period, which CPUs are already idle.
> > In fact, being idle, they can't be in the middle of any read-side
> > critical section, and we don't have to wait for them to declare
> > a grace period finished.
> 
> AIUI this patch fixes a bug where:
>  - a CPU is idle/asleep;
>  - it is added to the cpumask of a new RCU grace period; and
>  - because the CPU is asleep, the grace period never ends. 
> Have I understood?
> 
Yes indeed.

Basically, without this patch, all the CPUs are always added/considered
for all the grace periods. And if some of them are idle/asleep, we need
to wait for them to wake up before declaring the grace period ended.

So the duration of the grace period itself is out of any control (in
fact, it happens to be reasonable on x86, even on a fully idle system,
while not so much on ARM). It can even be virtually infinite... :-O

> I think we might be left with a race condition:
>  - CPU A is about to idle.  It runs the RCU softirq and
>    clears itself from the current grace period.
>  - CPU B ends the grace period and starts a new one.
>  - CPU A calls rcu_idle_enter() and sleeps.
>  - The new grace period never ends.
> 
Yes, I think this is a thing.

I've given it some thought, and will post something about it in another
email.

> Is that fixed by your later rcu_idle_timer patch?  AIUI that's only
> invoked when the calling CPU has pending RCU callbacks.
> 
The timer is meant to deal with CPUs that have pending callbacks, yes.
I don't think we can solve this issue with the timer, even if we extend
its scope to all CPUs with whatever pending RCU processing... we'd
still have problems with grace periods that start (on CPU B) after the
check for whether or not we need the timer has happened.

> Or can it be fixed here by something like this in rcu_idle_enter?
>  - lock rcp->lock
>  - add ourselves to the idle mask
>  - if we are in rcp->cpumask:
>      - cpu_quiet()
>  - unlock rcp->lock
> 
If rcu_idle_enter() lives inside an IRQ-disabled section (e.g., in
mwait_idle(), as it does right now), I can't take rcp->lock (because of
spinlock IRQ safety).

If I move it outside of that, then yes, the above may work. Or,
actually, we may not even need it... But as I said, more on this in
another message.

> There's also the code at the top of rcu_check_quiescent_state() that
> requires _two_ idle states per batch.  I don't know what race that's
> protecting against so I don't know whether we need to worry about it
> here as well. :)
> 
Not sure I'm getting this part. It's not a race; it is that, literally,
the first time (after someone has started a new batch) a CPU executes
rcu_check_quiescent_state(), it realizes we're in a new grace period:

   6.681421066 -.-.-.|.--x-|--- d0v7 rcu_call fn=0xffff82d080207c4c, rcp_cur=-300, rdp_quiescbatch=-300
   6.681422601 -.-.-.|.--x-|--- d0v7 do_softirq, pending = 0x00000010
   6.681422889 -.-.-.|.--x-|--- d0v7 rcu_pending? yes (ret=2): no pending entries but new entries
   6.681423214 -.-.-.|.--x-|--- d0v7 raise_softirq nr 3
   6.681423494 -.-.-.|.--x-|--- d0v7 softirq_handler nr 3
   6.681423494 -.-.-.|.--x-|--- d0v7 rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: yes, rdp_donelist: null
   6.681423494 -.-.-.|.--x-|--- d0v7 rcu_next_batch, rcp_cur=-300, rdp_batch=-299, rcp_next_pending? no
   6.681423494 -.-.-.|.--x-|--- d0v7 rcu_start_batch, rcp_cur=-299, rcp_cpumask=0x00001442
1->6.681423494 -.-.-.|.--x-|--- d0v7 rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-300, rdp_qs_pending: no
   6.681423494 -.-.-.|.--x-|--- d0v7 rcu_grace_start, rcp_cur=-299

The second time, it realizes the grace period has ended:

   6.681425277 -.-.-.|.--x-|--- d0v7 do_softirq, pending = 0x00000010
   6.681425647 -.-.-.|.--x-|--- d0v7 rcu_pending? yes (ret=5): waiting for CPU to quiesce (rdp_qs_pending=1)
   6.681425872 -.-.-.|.--x-|--- d0v7 raise_softirq nr 3
   6.681426117 -.-.-.|.--x-|--- d0v7 softirq_handler nr 3
   6.681426117 -.-.-.|.--x-|--- d0v7 rcu_process_callbacks, rcp_completed=-300, rdp_batch=-299, rdp_curlist: yes, rdp_nxtlist: null, rdp_donelist: null
2->6.681426117 -.-.-.|.--x-|--- d0v7 rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-299, rdp_qs_pending: yes
   6.681426117 -.-.-.|.--x-|--- d0v7 rcu_grace_done, rcp_cur=-299, rcp_completed=-300, rdp_quiescbatch=-299
   6.681426117 -.-.-.|.--x-|--- d0v7 rcu_cpu_quiet, rcp_cur=-299, rcp_cpumask=0x00001042

The trace above was for the CPU which actually started the grace
period, and which has the callback queued. Here's the same for another
one:

   6.682213282 -.-.-.x.----|--- d0v2 vcpu_block d0v2
   6.682213549 -.-.-.x.----|--- d0v2 raise_softirq nr 1
   6.682213999 -.-.-.x.----|--- d0v2 do_softirq, pending = 0x00000002
   6.682214397 -.-.-.x.----|--- d0v2 rcu_pending? yes (ret=4): waiting for CPU to quiesce (rdp_qs_pending=0)
   6.682214667 -.-.-.x.----|--- d0v2 raise_softirq nr 3
   [...]
   6.682224404 -.-.-.x.----|--- d32767v6 softirq_handler nr 3
   6.682224404 -.-.-.x.----|--- d32767v6 rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: null, rdp_donelist: null
1->6.682224404 -.-.-.x.----|--- d32767v6 rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-300, rdp_qs_pending: no
   6.682224404 -.-.-.x.----|--- d32767v6 rcu_grace_start, rcp_cur=-299
   6.682225310 -.-.-.x.----|--- d32767v6 do_softirq, pending = 0x00000000
   6.682225570 -.-.-.x.----|--- d32767v6 rcu_pending? yes (ret=5): waiting for CPU to quiesce (rdp_qs_pending=1)
   6.682225790 -.-.-.x.----|--- d32767v6 raise_softirq nr 3
   6.682226032 -.-.-.x.----|--- d32767v6 softirq_handler nr 3
   6.682226032 -.-.-.x.----|--- d32767v6 rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: null, rdp_donelist: null
2->6.682226032 -.-.-.x.----|--- d32767v6 rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-299, rdp_qs_pending: yes
   6.682226032 -.-.-.x.----|--- d32767v6 rcu_grace_done, rcp_cur=-299, rcp_completed=-300, rdp_quiescbatch=-299
   6.682226032 -.-.-.x.----|--- d32767v6 rcu_cpu_quiet, rcp_cur=-299, rcp_cpumask=0x00000000

But maybe I've not understood properly what you meant...

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-08-09 11:38   ` Tim Deegan
  2017-08-11 17:25     ` Dario Faggioli
@ 2017-08-14  9:19     ` Dario Faggioli
  1 sibling, 0 replies; 38+ messages in thread
From: Dario Faggioli @ 2017-08-14  9:19 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Jan Beulich, Andrew Cooper


On Wed, 2017-08-09 at 12:38 +0100, Tim Deegan wrote:
> Hi,
> 
> At 10:01 +0200 on 27 Jul (1501149684), Dario Faggioli wrote:
> > In Xen, that is impossible, and that's particularly problematic
> > when the system is idle (or lightly loaded), as CPUs that
> > are idle may never have the chance to tell RCU about their
> > quiescence, and grace periods could extend indefinitely!
> 
> [...]
> > The first step for fixing this situation is for RCU to record,
> > at the beginning of a grace period, which CPUs are already idle.
> > In fact, being idle, they can't be in the middle of any read-side
> > critical section, and we don't have to wait for them to declare
> > a grace period finished.
> 
> AIUI this patch fixes a bug where:
>  - a CPU is idle/asleep;
>  - it is added to the cpumask of a new RCU grace period; and
>  - because the CPU is asleep, the grace period never ends. 
> Have I understood?
> 
> I think we might be left with a race condition:
>  - CPU A is about to idle.  It runs the RCU softirq and
>    clears itself from the current grace period.
>  - CPU B ends the grace period and starts a new one.
>  - CPU A calls rcu_idle_enter() and sleeps.
>  - The new grace period never ends.
> 
So, I could not make this happen artificially, but I put together a
possible sequence of events that depicts this situation. I'm putting it
here, but also pasting it at https://pastebin.com/PHu6Guq0 (in case the
email body is mangled).

Lines starting with a '|' are the output of tracing, triggered by the
execution of the code/functions reported in the other lines:

	CPU0				CPU1				CPU2
	.				.				.
	.				call_rcu(...)			.
	.				|rcu_call fn=xxx, rcp_cur=-300, rdp_quiescbatch=-300
	.				.				.
	.				|do_softirq			.
	.				|rcu_pending? yes (ret=2): no pending entries but new entries
	.				|raise_softirq nr 3		.
	.				|softirq_handler nr 3		.
	.				rcu_process_callbacks();	.
	.				|rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: yes, rdp_donelist: null
	.				  rdp->batch = rcp->cur + 1;	.
	.				  smp_rmb;			.
   	.				|rcu_next_batch, rcp_cur=-300, rdp_batch=-299, rcp_next_pending? no
	.				  if (!rcp->next_pending) {	.
	.				   rcp->next_pending = 1;	.
	.				   rcu_start_batch();		.
	.				     if (rcp->next_pending && rcp->completed == rcp->cur) {
	.				      rcp->next_pending = 0;	.
	.				      smp_wmb;			.
	.				      rcp->cur++;		.
	.				      smp_mb;			.
	.				      cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
	.				|rcu_start_batch, rcp_cur=-299, rcp_cpumask=0x00000006 //Active CPUs: 1,2
	.				  }				.
	.				  rcu_check_quiescent_state();	.
	.				|rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-300, rdp_qs_pending: no
	.				    if (rdp->quiescbatch != rcp->cur) {
	.				     rdp->qs_pending = 1;	.
	.				     rdp->quiescbatch = rcp->cur;
	.				|rcu_grace_start, rcp_cur=-299	.
	.				    }				.
	call_rcu(...)			.				.
	|rcu_call fn=xxx, rcp_cur=-299, rdp_quiescbatch=-300		.
	.				.				.
	|do_softirq			.				.
	|rcu_pending? yes (ret=2): no pending entries but new entries	.
	|raise_softirq nr 3		.				.
	|softirq_handler nr 3		.				.
	rcu_process_callbacks();	.				.
	|rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: yes, rdp_donelist: null
	  rdp->batch = rcp->cur + 1;	.				.
	  smp_rmb;			.				.
   	|rcu_next_batch, rcp_cur=-299, rdp_batch=-298, rcp_next_pending? no
	  if (!rcp->next_pending) {	.				.
	   rcp->next_pending = 1;	.				.
	   rcu_start_batch();		.				.
	     if (rcp->next_pending && rcp->completed == rcp->cur) //NO!!!
          }				.				.
	  rcu_check_quiescent_state();	.				.
	|rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-300, rdp_qs_pending: no
	  if (rdp->quiescbatch != rcp->cur) {				.
	   rdp->qs_pending = 1;		.				.
	   rdp->quiescbatch = rcp->cur;	.				.
	|rcu_grace_start, rcp_cur=-299	.				.
	  }				.				.
	.				.				.
	.				.				|vcpu_block dXvY
	.				.				|raise_softirq nr 1
	.				.				|rcu_pending? yes (ret=4): waiting for CPU to quiesce (rdp_qs_pending=0)
	.				.				|raise_softirq nr 3
	.				.				|softirq_handler nr 1
	.				.				|//scheduler stuff
	.				.				|runstate_change dXvY running->blocked
	.				.				|runstate_change d32767v2 runnable->running
	.				.				idle_loop()
	.				.				  pm_idle() //e.g., mwait_idle()
	.				.				    cpufreq_dbs_timer_suspend();
	.				.				|timer_stop t=0xffff830319acebc0, fn=0xffff82d080252da9
	.				.				|timer_rm_entry t=0xffff830319acebc0, cpu=2, status=0x3
	.				.				    sched_tick_suspend();
	.				.				      rcu_idle_timer_start()
	.				.				    process_pending_softirqs();
	.				.				|do_softirq, pending = 0x00000008
	.				.				|rcu_pending? yes (ret=4): waiting for CPU to quiesce (rdp_qs_pending=0)
	.				.				|raise_softirq nr 3
	.				.				|softirq_handler nr 3
	.				.				      rcu_process_callbacks();
	.				.				|rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: null, rdp_donelist: null
	.				.				        rcu_check_quiescent_state();
	.				.				|rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-300, rdp_qs_pending: no
	.				.				          if (rdp->quiescbatch != rcp->cur) {
	.				.				           rdp->qs_pending = 1;
	.				.				           rdp->quiescbatch = rcp->cur;
	.				.				|rcu_grace_start, rcp_cur=-299
	.				.				          }
	.				.				|do_softirq, pending = 0x00000000
	.				.				|rcu_pending? yes (ret=5): waiting for CPU to quiesce (rdp_qs_pending=1)
	.				.				|raise_softirq nr 3
	.				.				|softirq_handler nr 3
	.				.				      rcu_process_callbacks();
	.				.				|rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: null, rdp_donelist: null
	.				.				        rcu_check_quiescent_state();
	.				.				|rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-299, rdp_qs_pending: yes
	.				.				          if (rdp->quiescbatch != rcp->cur) // NO!!!
	.				.				          if (!rdp->qs_pending) // NO!!!
	.				.				          rdp->qs_pending = 0;
	.				.				|rcu_grace_done, rcp_cur=-299, rcp_completed=-300, rdp_quiescbatch=-299
	.				.				          if (likely(rdp->quiescbatch == rcp->cur)) {
	.				.				           cpu_quiet();
	.				.				             cpumask_clear_cpu(cpu, &rcp->cpumask);
	.				.				|rcu_cpu_quiet, rcp_cur=-299, rcp_cpumask=0x00000002 // Active CPU: 1
	.				.				             if (cpumask_empty(&rcp->cpumask)) // NO!!!
	.				|do_softirq, pending = 0x00000010
	.				|rcu_pending? yes (ret=5): waiting for CPU to quiesce (rdp_qs_pending=1)
	.				|raise_softirq nr 3		.
	.				|softirq_handler nr 3		.
	.				rcu_process_callbacks();	.
	.				|rcu_process_callbacks, rcp_completed=-300, rdp_batch=-299, rdp_curlist: yes, rdp_nxtlist: null, rdp_donelist: null
	.				  rcu_check_quiescent_state();	.
	.				|rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-299, rdp_qs_pending: yes
	.				    if (rdp->quiescbatch != rcp->cur) // NO!!!
	.				    if (!rdp->qs_pending) // NO!!!
	.				|rcu_grace_done, rcp_cur=-299, rcp_completed=-300, rdp_quiescbatch=-299
	.				  if (likely(rdp->quiescbatch == rcp->cur)) {
	.				   cpu_quiet(rdp->cpu, rcp);	.
	.				     cpumask_clear_cpu(cpu, &rcp->cpumask); 
	.				|rcu_cpu_quiet, rcp_cur=-299, rcp_cpumask=0x00000000 //No more active CPU
	.				     if (cpumask_empty(&rcp->cpumask)) {
	.				      rcp->completed = rcp->cur;.
	.				      rcu_start_batch();	.
	.				        if (rcp->next_pending && rcp->completed == rcp->cur) {
	.				         rcp->next_pending = 0;	.
	.				         smp_wmb;		.
	.				.				|do_softirq, pending = 0x00000000
	.				.				|rcu_pending? no
	.				.				    local_irq_disable();
	.				.				    if (!cpu_is_haltable(cpu)) //NO!!!
	.				.				|rcu_pending? no
	.				.				|pm_idle_start c2, pm_tick=3720475837, exp=1708us, pred=1024us
	.				.				    if (cpu_is_haltable(cpu)) {  // <--- rcu_pending: quiescbatch=-299, rcp->cur=-299
[1]	.				.				|rcu_pending? no
[2]	.				         rcp->cur++;		.
	.				         smp_mb;		.
	.				         cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
	.				|rcu_start_batch, rcp_cur=-298, rcp_cpumask=0x00000007
	.				        }
	.				     }
[3]	.				.				     rcu_idle_enter(cpu);
	.				.				     mwait_idle_with_hints();
	.				.				.

So, this is basically what's described above, with my CPU1 being Tim's
CPU B (the one that ends the grace period and starts a new one), and my
CPU2 being Tim's CPU A (the one that goes idle). CPU0 needs to issue a
call_rcu() and queue a new batch (or CPU1 won't start any new one).

Basically, for the race to occur (in [3]), it is necessary that [2]
(CPU1 doing rcp->cur++) happens after [1] (the last call to
rcu_pending() of CPU2, before clearing the mask and going idle). In
fact, if that is not the case, rcu_pending() in CPU2 will return true,
because of this check:

 if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)

which would prevent the CPU from going idle (and will, at least, lead
to it going through rcu_check_quiescent_state() twice, clearing itself
from the new grace period too).

Therefore, it looks to me that the race can be avoided by making sure
that there is at least one check of rcu_pending() after a CPU has
called rcu_idle_enter(). This basically means moving rcu_idle_enter()
up.

I was actually thinking of moving it inside the tick-stopping logic
too. This way, when CPU1 samples the cpumask for the new grace period,
either:
1) CPU2 will be in idle_cpumask already, and it actually goes idle;
2) CPU2 will be in idle_cpumask, but then it does not go idle
(cpu_is_haltable() returning false) for various reasons. This may look
non-ideal, but if it is here, trying to go idle, it is not holding
references to read-side critical sections, or at least we can consider
it to be quiescing, so it's ok to ignore it for the grace period;
3) CPU2 is not in idle_cpumask, and so it will not be in the sampled
mask, but the first check of rcu_pending() will lead to acknowledging
quiescence, and to calling cpu_quiet().

Which looks fine to me. Or am I still missing something?

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-08-11 17:25     ` Dario Faggioli
@ 2017-08-14 10:39       ` Tim Deegan
  2017-08-14 13:24         ` Dario Faggioli
  0 siblings, 1 reply; 38+ messages in thread
From: Tim Deegan @ 2017-08-14 10:39 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Jan Beulich, Andrew Cooper

Hi,

At 11:19 +0200 on 14 Aug (1502709563), Dario Faggioli wrote:
> Basically, for the race to occur (in [3]), it is necessary that [2]
> (CPU1 doing rcp->cur++) happens after [1] (last call to rcu_pending()
> of CPU2, before clearing the mask and going idle). In fact, it that is
> not the case, rcu_pending() in CPU2 will say 'no', because of this
> check:
> 
>  if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
> 
> which would prevent the CPU from going idle (and will, at least, lead
> to it going through check_quiescent_state() twice, clearing itself from
> the new grace period too).
> 
> Therefore, it looks to me that the race can be avoided by making sure
> that there is at least one check of rcu_pending(), after a CPU has
> called rcu_enter_idle(). This basically means moving rcu_idle_enter()
> up.

Sounds plausible.

> I was actually thinking to move it inside the tick-stopping logic too.
> This way, either, when CPU1 samples the cpumask for the new grace
> period:
> 1) CPU2 will be in idle_cpumask already, and it actually goes idle;

The system works!

> 2) CPU2 will be in idle_cpumask, but then it does not go idle
> (cpu_is_haltable() returning false) for various reasons. This may look
> unideal, but if it is here, trying to go idle, it is not holding
> references to read-side critical sections, or at least we can consider
> it to be quiescing, so it's ok to ignore it for the grace period;

Yes.  This is equivalent to CPU2 actually going idle and then waking
up quickly.  As you say, since at this point the RCU logic thinks the
CPU can idle, whatever stopped it actually idling can't have been
relevant to RCU.

> 3) CPU2 is not in idle_cpumask, and so it will not be in the sampled
> mask, but the first check of rcu_pending() will lead to acknowledging
> quiescence, and to calling cpu_quiet();

Yep.  And because it's not yet in idle_cpumask, it _will_ check
rcu_pending() before idling.  I think that needs an smp_mb() between
setting the idle_cpumask and checking rcp->cur, and likewise between
rcp->cur++ and cpumask_andnot() in rcu_start_batch().

Sounds good!

Cheers,

Tim.


* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-08-14 10:39       ` Tim Deegan
@ 2017-08-14 13:24         ` Dario Faggioli
  2017-08-14 13:54           ` Tim Deegan
  0 siblings, 1 reply; 38+ messages in thread
From: Dario Faggioli @ 2017-08-14 13:24 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Jan Beulich, Andrew Cooper


On Mon, 2017-08-14 at 11:39 +0100, Tim Deegan wrote:
> Hi,
> 
Hey,

> At 11:19 +0200 on 14 Aug (1502709563), Dario Faggioli wrote:
> > Therefore, it looks to me that the race can be avoided by making
> > sure
> > that there is at least one check of rcu_pending(), after a CPU has
> > called rcu_enter_idle(). This basically means moving
> > rcu_idle_enter()
> > up.
> 
> Sounds plausible.
> 
Great to hear you think that! :-D

> > 3) CPU2 is not in idle_cpumask, and so it will not be in the
> > sampled
> > mask, but the first check of rcu_pending() will lead acknowledge
> > quiescence, and calling of cpu_quiet();
> 
> Yep.  And because it's not yet in idle_cpumask, it _will_ check
> rcu_pending() before idling.  I think that needs an smp_mb() between
> setting the idle_cpumask and checking rcp->cur, and likewise between
> rcp->cur++ and cpumask_andnot() in rcu_start_batch().
> 
So, the latter I'm putting there already, by importing Linux's
c3f59023, "Fix RCU race in access of nohz_cpu_mask" (in this very
patch).

About the former... I'm not sure which check of rcp->cur you're
referring to. I think it's the one in rcu_check_quiescent_state(), but
then, I'm not sure where to actually put the barrier...

I'll keep looking, but any advice is welcome. Even after all these
years, barriers still give me a headache. :-P

Thanks for looking at the patches,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-08-14 13:24         ` Dario Faggioli
@ 2017-08-14 13:54           ` Tim Deegan
  2017-08-14 16:21             ` Dario Faggioli
  0 siblings, 1 reply; 38+ messages in thread
From: Tim Deegan @ 2017-08-14 13:54 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Jan Beulich, Andrew Cooper

Hi,

At 15:24 +0200 on 14 Aug (1502724279), Dario Faggioli wrote:
> On Mon, 2017-08-14 at 11:39 +0100, Tim Deegan wrote:
> > At 11:19 +0200 on 14 Aug (1502709563), Dario Faggioli wrote:
> > > 3) CPU2 is not in idle_cpumask, and so it will not be in the
> > > sampled
> > > mask, but the first check of rcu_pending() will lead acknowledge
> > > quiescence, and calling of cpu_quiet();
> > 
> > Yep.  And because it's not yet in idle_cpumask, it _will_ check
> > rcu_pending() before idling.  I think that needs an smp_mb() between
> > setting the idle_cpumask and checking rcp->cur, and likewise between
> > rcp->cur++ and cpumask_andnot() in rcu_start_batch().
> > 
> So, the latter I'm putting there already, by importing Linux's
> c3f59023, "Fix RCU race in access of nohz_cpu_mask" (in this very
> patch).

Yep, looks correct to me.

> About the former... I'm not sure which check of rcp->cur you're
> referring to. I think it's the one in rcu_check_quiescent_state(), but
> then, I'm not sure where to actually put the barrier...

I mean whatever one causes the CPU to DTRT about the new grace period.
AFAICT that's the one in __rcu_pending().  The important thing is that
that read mustn't be allowed to happen before the write to the
idle_cpumask.  I'd be inclined to put the barrier right after the
cpumask_set_cpu() in rcu_idle_enter().

> I'll keep looking, but any advice is welcome. Even after all these
> years, barriers still gives me headache. :-P

Me too! :)

Tim.


* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-08-14 13:54           ` Tim Deegan
@ 2017-08-14 16:21             ` Dario Faggioli
  2017-08-14 16:47               ` Tim Deegan
  0 siblings, 1 reply; 38+ messages in thread
From: Dario Faggioli @ 2017-08-14 16:21 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Jan Beulich, Andrew Cooper


On Mon, 2017-08-14 at 14:54 +0100, Tim Deegan wrote:
> At 15:24 +0200 on 14 Aug (1502724279), Dario Faggioli wrote:
> > About the former... I'm not sure which check of rcp->cur you're
> > referring to. I think it's the one in rcu_check_quiescent_state(),
> > but
> > then, I'm not sure where to actually put the barrier...
> 
> I mean whatever one causes the CPU to DTRT about the new grace
> period.
> AFAICT that's the one in __rcu_pending().  The important thing is
> that
> that read mustn't be allowed to happen before the write to the
> idle_cpumask.
>
Interestingly enough, someone seems to have had a very similar
discussion before:

https://lkml.org/lkml/2005/12/8/155

>   I'd be inclined to put the barrier right after the
> cpumask_set_cpu() in rcu_idle_enter().
> 
And they reached conclusions similar to yours: :-)

https://lkml.org/lkml/2005/12/8/308

I've no idea why they then didn't put an actual barrier in place. That
thread mentions s390's ordering, since at the time this mechanism was
only in use there, but they didn't add a barrier when making things
general.

IAC, I'm fine with putting one in. :-)

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
  2017-08-14 16:21             ` Dario Faggioli
@ 2017-08-14 16:47               ` Tim Deegan
  0 siblings, 0 replies; 38+ messages in thread
From: Tim Deegan @ 2017-08-14 16:47 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Jan Beulich, Andrew Cooper

At 18:21 +0200 on 14 Aug (1502734919), Dario Faggioli wrote:
> On Mon, 2017-08-14 at 14:54 +0100, Tim Deegan wrote:
> > At 15:24 +0200 on 14 Aug (1502724279), Dario Faggioli wrote:
> > > About the former... I'm not sure which check of rcp->cur you're
> > > referring to. I think it's the one in rcu_check_quiescent_state(),
> > > but
> > > then, I'm not sure where to actually put the barrier...
> > 
> > I mean whatever one causes the CPU to DTRT about the new grace
> > period.
> > AFAICT that's the one in __rcu_pending().  The important thing is
> > that
> > that read mustn't be allowed to happen before the write to the
> > idle_cpumask.
> >
> Interestingly enough, someone seems to have had a very similar
> discussion before:
> 
> https://lkml.org/lkml/2005/12/8/155
> 
> >   I'd be inclined to put the barrier right after the
> > cpumask_set_cpu() in rcu_idle_enter().
> > 
> And with conclusions similar to these: :-)
> 
> https://lkml.org/lkml/2005/12/8/308
> 
> I've no idea why they then didn't put an actual barrier in place. That
> thread mentions s390's ordering, since at the time this mechanism was
> only in use there, but they didn't add a barrier when making things
> general.
> 
> IAC, I'm fine with putting one in. :-)

Sounds good to me. :)  I think it's required (even on s390!) but of
course there may be other barriers on those paths that happen to DTRT.
E.g. cpumask_set_cpu() itself is implicitly a mb() on x86 (but not,
AFAICT, on ARM).

Cheers,

Tim.


end of thread [~2017-08-14 16:47 UTC | newest]

Thread overview: 38+ messages
2017-07-27  8:01 [PATCH 0/5] xen: RCU: x86/ARM: Add support of rcu_idle_{enter, exit} Dario Faggioli
2017-07-27  8:01 ` [PATCH 1/5] xen: in do_softirq() sample smp_processor_id() once and for all Dario Faggioli
2017-07-27  8:01 ` [PATCH 2/5] xen: ARM: suspend the tick (if in use) when going idle Dario Faggioli
2017-07-31 20:59   ` Stefano Stabellini
2017-08-01  8:53     ` Julien Grall
2017-08-01  9:26       ` Dario Faggioli
2017-07-27  8:01 ` [PATCH 3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started Dario Faggioli
2017-07-31 21:17   ` Stefano Stabellini
2017-08-07  8:35   ` Jan Beulich
2017-08-09  8:48     ` Dario Faggioli
2017-08-09  8:57       ` Jan Beulich
2017-08-09  9:20         ` Dario Faggioli
2017-08-09  9:26           ` Jan Beulich
2017-08-09 11:38   ` Tim Deegan
2017-08-11 17:25     ` Dario Faggioli
2017-08-14 10:39       ` Tim Deegan
2017-08-14 13:24         ` Dario Faggioli
2017-08-14 13:54           ` Tim Deegan
2017-08-14 16:21             ` Dario Faggioli
2017-08-14 16:47               ` Tim Deegan
2017-08-14  9:19     ` Dario Faggioli
2017-07-27  8:01 ` [PATCH 4/5] xen: RCU: don't let a CPU with a callback go idle Dario Faggioli
2017-08-07  8:38   ` Jan Beulich
2017-07-27  8:01 ` [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period Dario Faggioli
2017-07-31 21:20   ` Stefano Stabellini
2017-07-31 22:03     ` Dario Faggioli
2017-07-31 23:58       ` Stefano Stabellini
2017-08-01  0:47         ` Dario Faggioli
2017-08-01 19:13           ` Stefano Stabellini
2017-08-02 10:14             ` Dario Faggioli
2017-08-01  8:45         ` Julien Grall
2017-08-01  8:54   ` Julien Grall
2017-08-01  9:17     ` Dario Faggioli
2017-08-01 10:22       ` Julien Grall
2017-08-01 10:33         ` Dario Faggioli
2017-08-07  8:54   ` Jan Beulich
2017-08-09 17:34     ` Dario Faggioli
2017-08-10 13:55       ` Dario Faggioli
