* sched=null vwfi=native and call_rcu()
@ 2022-01-06  0:40 Stefano Stabellini
  2022-01-06  9:26 ` Julien Grall
  0 siblings, 1 reply; 9+ messages in thread
From: Stefano Stabellini @ 2022-01-06  0:40 UTC (permalink / raw)
  To: dfaggioli, george.dunlap
  Cc: sstabellini, julien, jbeulich, andrew.cooper3, roger.pau,
	bertrand.marquis, Volodymyr_Babchuk, jgross, xen-devel

Hi all,

As you might remember, we have an outstanding issue with call_rcu() when
sched=null and vwfi=native are used. That is because in that configuration
the CPU never goes idle, so rcu_idle_enter() never gets called.
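
(For reference, a minimal sketch of the Arm idle path, based on the
do_idle() hunk further down in this thread -- not the exact code: with
sched=null each vCPU is pinned to its own pCPU, and with vwfi=native the
guest's wfi does not trap, so the pCPU never switches to Xen's idle vCPU
and rcu_idle_enter() is never reached.)

    /* xen/arch/arm/domain.c, simplified sketch */
    static void do_idle(void)
    {
        unsigned int cpu = smp_processor_id();

        rcu_idle_enter(cpu);   /* the only place this CPU reports quiescence */
        process_pending_softirqs();

        local_irq_disable();
        if ( cpu_is_haltable(cpu) )
            wfi();             /* sleep until the next interrupt */
        local_irq_enable();

        rcu_idle_exit(cpu);
    }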

The issue was caught in the field and I managed to reproduce the problem
by doing the following:

xl destroy test
xl create ./test.cfg

Resulting in the following error:

# Parsing config from ./test.cfg
# (XEN) IRQ 54 is already used by domain 1

The test domain has 3 interrupts remapped to it and they don't get
released before the new domain creation is requested.
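
(As a side note: the reason the IRQs linger is that the final teardown is
deferred through call_rcu(). A rough sketch, from memory of
xen/common/domain.c -- details may differ -- is below; if no grace period
ever completes, the callback never runs and the IRQs stay assigned to the
dying domain.)

    /* xen/common/domain.c, rough sketch from memory */
    static void complete_domain_destroy(struct rcu_head *head)
    {
        struct domain *d = container_of(head, struct domain, rcu);

        /* Frees vCPUs and the remaining per-domain state; only after
         * this runs can resources such as the routed IRQs be reused. */
    }

    void domain_destroy(struct domain *d)
    {
        /* ... unlink the domain from the global list ... */

        /* Deferred until the current RCU grace period completes. */
        call_rcu(&d->rcu, complete_domain_destroy);
    }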

Just FYI, the hacky patch below seems to reliably work around the
problem in my environment.

Do you have any suggestions on what would be the right way to solve
this issue?

Cheers,

Stefano


diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index a5a27af3de..841a5cb3c9 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -289,6 +289,9 @@ void call_rcu(struct rcu_head *head,
         force_quiescent_state(rdp, &rcu_ctrlblk);
     }
     local_irq_restore(flags);
+    /* make sure that the CPU has a chance to check RCUs */
+    set_timer(&rdp->idle_timer, NOW() + SECONDS(1));
+    rdp->idle_timer_active = true;
 }
 
 /*




* Re: sched=null vwfi=native and call_rcu()
  2022-01-06  0:40 sched=null vwfi=native and call_rcu() Stefano Stabellini
@ 2022-01-06  9:26 ` Julien Grall
  2022-01-07  1:52   ` Stefano Stabellini
  0 siblings, 1 reply; 9+ messages in thread
From: Julien Grall @ 2022-01-06  9:26 UTC (permalink / raw)
  To: Stefano Stabellini, dfaggioli, george.dunlap
  Cc: jbeulich, andrew.cooper3, roger.pau, bertrand.marquis,
	Volodymyr_Babchuk, jgross, xen-devel



On 06/01/2022 00:40, Stefano Stabellini wrote:
> Hi all,

Hi,

> As you might remember, we have an outstanding issue with call_rcu() when
> sched=null and vwfi=native are used. That is because in that configuration
> the CPU never goes idle, so rcu_idle_enter() never gets called.
> 
> The issue was caught in the field and I managed to reproduce the problem
> by doing the following:
> 
> xl destroy test
> xl create ./test.cfg
> 
> Resulting in the following error:
> 
> # Parsing config from ./test.cfg
> # (XEN) IRQ 54 is already used by domain 1
> 
> The test domain has 3 interrupts remapped to it and they don't get
> released before the new domain creation is requested.
> 
> Just FYI, the hacky patch below seems to reliably work around the
> problem in my environment.
> 
> Do you have any suggestions on what would be the right way to solve
> this issue?

This issue and its solution were discussed numerous times on the ML. In 
short, we want to tell RCU that CPUs running in guest context are 
always quiescent. For more details, you can read the previous thread 
(which also contains a link to the one before):

https://lore.kernel.org/xen-devel/fe3dd9f0-b035-01fe-3e01-ddf065f182ab@codiax.se/
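
(In code terms, the idea is roughly the following; this is only a sketch,
the actual hooks are in the patches later in this thread:)

    /* Arm guest entry/exit paths, simplified */
    void leave_hypervisor_to_guest(void)
    {
        rcu_quiet_enter();  /* in guest context the CPU holds no RCU read-side refs */
        /* ... existing exit-to-guest work ... */
    }

    void enter_hypervisor_from_guest(void)
    {
        rcu_quiet_exit();   /* the CPU takes part in grace periods again */
        /* ... existing entry handling ... */
    }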

Cheers,

-- 
Julien Grall



* Re: sched=null vwfi=native and call_rcu()
  2022-01-06  9:26 ` Julien Grall
@ 2022-01-07  1:52   ` Stefano Stabellini
  2022-01-14 18:42     ` Dario Faggioli
  0 siblings, 1 reply; 9+ messages in thread
From: Stefano Stabellini @ 2022-01-07  1:52 UTC (permalink / raw)
  To: Julien Grall, dfaggioli
  Cc: Stefano Stabellini, george.dunlap, jbeulich, andrew.cooper3,
	roger.pau, bertrand.marquis, Volodymyr_Babchuk, jgross,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 1930 bytes --]

On Thu, 6 Jan 2022, Julien Grall wrote:
> On 06/01/2022 00:40, Stefano Stabellini wrote:
> > As you might remember, we have an outstanding issue with call_rcu() when
> > sched=null and vwfi=native are used. That is because in that configuration
> > the CPU never goes idle, so rcu_idle_enter() never gets called.
> > 
> > The issue was caught in the field and I managed to reproduce the problem
> > by doing the following:
> > 
> > xl destroy test
> > xl create ./test.cfg
> > 
> > Resulting in the following error:
> > 
> > # Parsing config from ./test.cfg
> > # (XEN) IRQ 54 is already used by domain 1
> > 
> > The test domain has 3 interrupts remapped to it and they don't get
> > released before the new domain creation is requested.
> > 
> > Just FYI, the hacky patch below seems to reliably work around the
> > problem in my environment.
> > 
> > Do you have any suggestions on what would be the right way to solve
> > this issue?
> 
> This issue and its solution were discussed numerous times on the ML. In short, we
> want to tell RCU that CPUs running in guest context are always quiescent.
> For more details, you can read the previous thread (which also contains a link
> to the one before):
> 
> https://lore.kernel.org/xen-devel/fe3dd9f0-b035-01fe-3e01-ddf065f182ab@codiax.se/

Thanks Julien for the pointer!

Dario, I forward-ported your three patches to staging:
https://gitlab.com/xen-project/people/sstabellini/xen/-/tree/rcu-quiet

I can confirm that they fix the bug. Note that I had to add a small
change on top to remove the ASSERT at the beginning of rcu_quiet_enter:
https://gitlab.com/xen-project/people/sstabellini/xen/-/commit/6fc02b90814d3fe630715e353d16f397a5b280f9

Would you be up for submitting them for upstreaming? I would prefer if
you sent out the patches because I cannot claim to understand them
completely (except for the one doing the renaming :-P )

I am also attaching the four patches for your convenience.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-diff; NAME=0001-xen-RCU-bootparam-to-force-quiescence-at-every-call.patch, Size: 2066 bytes --]

From 30ac483e3c3ffa63b9e92596fa25118e25116438 Mon Sep 17 00:00:00 2001
From: Dario Faggioli <dfaggioli@suse.com>
Date: Thu, 6 Jan 2022 16:54:18 -0800
Subject: [PATCH 1/4] xen: RCU: bootparam to force quiescence at every call.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
 xen/arch/arm/traps.c       |  3 +++
 xen/common/rcupdate.c      | 10 ++++++++++
 xen/include/xen/rcupdate.h |  2 ++
 3 files changed, 15 insertions(+)

diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 9339d12f58..e2842ba4db 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -87,7 +87,10 @@ static enum {
 static int __init parse_vwfi(const char *s)
 {
 	if ( !strcmp(s, "native") )
+	{
+		rcu_always_quiesc = true;
 		vwfi = NATIVE;
+	}
 	else
 		vwfi = TRAP;
 
diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index a5a27af3de..7316271da5 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -95,6 +95,9 @@ struct rcu_data {
     bool            barrier_active;
 };
 
+bool rcu_always_quiesc = false;
+boolean_param("rcu_force_quiesc", rcu_always_quiesc);
+
 /*
  * If a CPU with RCU callbacks queued goes idle, when the grace period is
  * not finished yet, how can we make sure that the callbacks will eventually
@@ -637,6 +640,13 @@ static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
     rdp->quiescbatch = rcp->completed;
     rdp->qs_pending = 0;
     rdp->cpu = cpu;
+    if ( rcu_always_quiesc )
+    {
+        blimit = INT_MAX;
+        qhimark = 0;
+        qlowmark = 0;
+        //rsinterval = 0;
+    }
     rdp->blimit = blimit;
     init_timer(&rdp->idle_timer, rcu_idle_timer_handler, rdp, cpu);
 }
diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index 6f2587058e..d279e39022 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -78,6 +78,8 @@ struct rcu_head {
 } while (0)
 
 
+extern bool rcu_always_quiesc;
+
 int rcu_pending(int cpu);
 int rcu_needs_cpu(int cpu);
 
-- 
2.25.1


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: Type: text/x-diff; NAME=0002-xen-rename-RCU-idle-timer-and-cpumask.patch, Size: 11194 bytes --]

From bc0463017106bb1c70a49609adaca5cb6b8357ab Mon Sep 17 00:00:00 2001
From: Dario Faggioli <dfaggioli@suse.com>
Date: Thu, 6 Jan 2022 17:17:54 -0800
Subject: [PATCH 2/4] xen: rename RCU idle timer and cpumask

Both the cpumask and the timer will be used in more generic
circumstances, not only for CPUs that go idle. Change their names to
reflect that.

No functional change.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
 xen/common/rcupdate.c        | 108 ++++++++++++++++++++---------------
 xen/include/xen/perfc_defn.h |   2 +-
 2 files changed, 62 insertions(+), 48 deletions(-)

diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index 7316271da5..96e8ca01c0 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -56,7 +56,7 @@ static struct rcu_ctrlblk {
 
     spinlock_t  lock __cacheline_aligned;
     cpumask_t   cpumask; /* CPUs that need to switch in order ... */
-    cpumask_t   idle_cpumask; /* ... unless they are already idle */
+    cpumask_t   ignore_cpumask; /* ... unless they are already idle */
     /* for current batch to proceed.        */
 } __cacheline_aligned rcu_ctrlblk = {
     .cur = -300,
@@ -88,8 +88,8 @@ struct rcu_data {
     long            last_rs_qlen;     /* qlen during the last resched */
 
     /* 3) idle CPUs handling */
-    struct timer idle_timer;
-    bool idle_timer_active;
+    struct timer cb_timer;
+    bool cb_timer_active;
 
     bool            process_callbacks;
     bool            barrier_active;
@@ -124,22 +124,22 @@ boolean_param("rcu_force_quiesc", rcu_always_quiesc);
  * CPU that is going idle. The user can change this, via a boot time
  * parameter, but only up to 100ms.
  */
-#define IDLE_TIMER_PERIOD_MAX     MILLISECS(100)
-#define IDLE_TIMER_PERIOD_DEFAULT MILLISECS(10)
-#define IDLE_TIMER_PERIOD_MIN     MICROSECS(100)
+#define CB_TIMER_PERIOD_MAX     MILLISECS(100)
+#define CB_TIMER_PERIOD_DEFAULT MILLISECS(10)
+#define CB_TIMER_PERIOD_MIN     MICROSECS(100)
 
-static s_time_t __read_mostly idle_timer_period;
+static s_time_t __read_mostly cb_timer_period;
 
 /*
  * Increment and decrement values for the idle timer handler. The algorithm
  * works as follows:
  * - if the timer actually fires, and it finds out that the grace period isn't
- *   over yet, we add IDLE_TIMER_PERIOD_INCR to the timer's period;
+ *   over yet, we add CB_TIMER_PERIOD_INCR to the timer's period;
  * - if the timer actually fires and it finds the grace period over, we
- *   subtract IDLE_TIMER_PERIOD_DECR from the timer's period.
+ *   subtract CB_TIMER_PERIOD_DECR from the timer's period.
  */
-#define IDLE_TIMER_PERIOD_INCR    MILLISECS(10)
-#define IDLE_TIMER_PERIOD_DECR    MICROSECS(100)
+#define CB_TIMER_PERIOD_INCR    MILLISECS(10)
+#define CB_TIMER_PERIOD_DECR    MICROSECS(100)
 
 static DEFINE_PER_CPU(struct rcu_data, rcu_data);
 
@@ -367,7 +367,7 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
         * This barrier is paired with the one in rcu_idle_enter().
         */
         smp_mb();
-        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
+        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->ignore_cpumask);
     }
 }
 
@@ -526,7 +526,7 @@ int rcu_needs_cpu(int cpu)
 {
     struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 
-    return (rdp->curlist && !rdp->idle_timer_active) || rcu_pending(cpu);
+    return (rdp->curlist && !rdp->cb_timer_active) || rcu_pending(cpu);
 }
 
 /*
@@ -534,7 +534,7 @@ int rcu_needs_cpu(int cpu)
  * periodically poke rcu_pedning(), so that it will invoke the callback
  * not too late after the end of the grace period.
  */
-static void rcu_idle_timer_start(void)
+static void cb_timer_start(void)
 {
     struct rcu_data *rdp = &this_cpu(rcu_data);
 
@@ -546,48 +546,48 @@ static void rcu_idle_timer_start(void)
     if (likely(!rdp->curlist))
         return;
 
-    set_timer(&rdp->idle_timer, NOW() + idle_timer_period);
-    rdp->idle_timer_active = true;
+    set_timer(&rdp->cb_timer, NOW() + cb_timer_period);
+    rdp->cb_timer_active = true;
 }
 
-static void rcu_idle_timer_stop(void)
+static void cb_timer_stop(void)
 {
     struct rcu_data *rdp = &this_cpu(rcu_data);
 
-    if (likely(!rdp->idle_timer_active))
+    if (likely(!rdp->cb_timer_active))
         return;
 
-    rdp->idle_timer_active = false;
+    rdp->cb_timer_active = false;
 
     /*
      * In general, as the CPU is becoming active again, we don't need the
      * idle timer, and so we want to stop it.
      *
-     * However, in case we are here because idle_timer has (just) fired and
+     * However, in case we are here because cb_timer has (just) fired and
      * has woken up the CPU, we skip stop_timer() now. In fact, when a CPU
      * wakes up from idle, this code always runs before do_softirq() has the
      * chance to check and deal with TIMER_SOFTIRQ. And if we stop the timer
      * now, the TIMER_SOFTIRQ handler will see it as inactive, and will not
-     * call rcu_idle_timer_handler().
+     * call cb_timer_handler().
      *
      * Therefore, if we see that the timer is expired already, we leave it
      * alone. The TIMER_SOFTIRQ handler will then run the timer routine, and
      * deactivate it.
      */
-    if ( !timer_is_expired(&rdp->idle_timer) )
-        stop_timer(&rdp->idle_timer);
+    if ( !timer_is_expired(&rdp->cb_timer) )
+        stop_timer(&rdp->cb_timer);
 }
 
-static void rcu_idle_timer_handler(void* data)
+static void cb_timer_handler(void* data)
 {
-    perfc_incr(rcu_idle_timer);
+    perfc_incr(rcu_callback_timer);
 
     if ( !cpumask_empty(&rcu_ctrlblk.cpumask) )
-        idle_timer_period = min(idle_timer_period + IDLE_TIMER_PERIOD_INCR,
-                                IDLE_TIMER_PERIOD_MAX);
+        cb_timer_period = min(cb_timer_period + CB_TIMER_PERIOD_INCR,
+                                CB_TIMER_PERIOD_MAX);
     else
-        idle_timer_period = max(idle_timer_period - IDLE_TIMER_PERIOD_DECR,
-                                IDLE_TIMER_PERIOD_MIN);
+        cb_timer_period = max(cb_timer_period - CB_TIMER_PERIOD_DECR,
+                                CB_TIMER_PERIOD_MIN);
 }
 
 void rcu_check_callbacks(int cpu)
@@ -611,7 +611,7 @@ static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
 static void rcu_offline_cpu(struct rcu_data *this_rdp,
                             struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
 {
-    kill_timer(&rdp->idle_timer);
+    kill_timer(&rdp->cb_timer);
 
     /* If the cpu going offline owns the grace period we can block
      * indefinitely waiting for it, so flush it here.
@@ -648,7 +648,7 @@ static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
         //rsinterval = 0;
     }
     rdp->blimit = blimit;
-    init_timer(&rdp->idle_timer, rcu_idle_timer_handler, rdp, cpu);
+    init_timer(&rdp->cb_timer, cb_timer_handler, rdp, cpu);
 }
 
 static int cpu_callback(
@@ -677,25 +677,39 @@ static struct notifier_block cpu_nfb = {
     .notifier_call = cpu_callback
 };
 
+/*
+ * We're changing the name of the parameter, to better reflect the fact that
+ * the timer is used for callbacks in general, when the CPU is either idle
+ * or executing guest code. We still accept the old parameter but, if both
+ * are specified, the new one ("rcu-callback-timer-period-ms") has priority.
+ */
+#define CB_TIMER_PERIOD_DEFAULT_MS ( CB_TIMER_PERIOD_DEFAULT / MILLISECS(1) )
+static unsigned int __initdata cb_timer_period_ms = CB_TIMER_PERIOD_DEFAULT_MS;
+integer_param("rcu-callback-timer-period-ms", cb_timer_period_ms);
+
+static unsigned int __initdata idle_timer_period_ms = CB_TIMER_PERIOD_DEFAULT_MS;
+integer_param("rcu-idle-timer-period-ms", idle_timer_period_ms);
+
 void __init rcu_init(void)
 {
     void *cpu = (void *)(long)smp_processor_id();
-    static unsigned int __initdata idle_timer_period_ms =
-                                    IDLE_TIMER_PERIOD_DEFAULT / MILLISECS(1);
-    integer_param("rcu-idle-timer-period-ms", idle_timer_period_ms);
 
-    /* We don't allow 0, or anything higher than IDLE_TIMER_PERIOD_MAX */
-    if ( idle_timer_period_ms == 0 ||
-         idle_timer_period_ms > IDLE_TIMER_PERIOD_MAX / MILLISECS(1) )
+    if (idle_timer_period_ms != CB_TIMER_PERIOD_DEFAULT_MS &&
+        cb_timer_period_ms == CB_TIMER_PERIOD_DEFAULT_MS)
+        cb_timer_period_ms = idle_timer_period_ms;
+ 
+    /* We don't allow 0, or anything higher than CB_TIMER_PERIOD_MAX */
+    if ( cb_timer_period_ms == 0 ||
+         cb_timer_period_ms > CB_TIMER_PERIOD_MAX / MILLISECS(1) )
     {
-        idle_timer_period_ms = IDLE_TIMER_PERIOD_DEFAULT / MILLISECS(1);
+        cb_timer_period_ms = CB_TIMER_PERIOD_DEFAULT / MILLISECS(1);
         printk("WARNING: rcu-idle-timer-period-ms outside of "
                "(0,%"PRI_stime"]. Resetting it to %u.\n",
-               IDLE_TIMER_PERIOD_MAX / MILLISECS(1), idle_timer_period_ms);
+               CB_TIMER_PERIOD_MAX / MILLISECS(1), cb_timer_period_ms);
     }
-    idle_timer_period = MILLISECS(idle_timer_period_ms);
+    cb_timer_period = MILLISECS(cb_timer_period_ms);
 
-    cpumask_clear(&rcu_ctrlblk.idle_cpumask);
+    cpumask_clear(&rcu_ctrlblk.ignore_cpumask);
     cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
     register_cpu_notifier(&cpu_nfb);
     open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
@@ -707,8 +721,8 @@ void __init rcu_init(void)
  */
 void rcu_idle_enter(unsigned int cpu)
 {
-    ASSERT(!cpumask_test_cpu(cpu, &rcu_ctrlblk.idle_cpumask));
-    cpumask_set_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+    ASSERT(!cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask));
+    cpumask_set_cpu(cpu, &rcu_ctrlblk.ignore_cpumask);
     /*
      * If some other CPU is starting a new grace period, we'll notice that
      * by seeing a new value in rcp->cur (different than our quiescbatch).
@@ -719,12 +733,12 @@ void rcu_idle_enter(unsigned int cpu)
      */
     smp_mb();
 
-    rcu_idle_timer_start();
+    cb_timer_start();
 }
 
 void rcu_idle_exit(unsigned int cpu)
 {
-    rcu_idle_timer_stop();
-    ASSERT(cpumask_test_cpu(cpu, &rcu_ctrlblk.idle_cpumask));
-    cpumask_clear_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+    cb_timer_stop();
+    ASSERT(cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask));
+    cpumask_clear_cpu(cpu, &rcu_ctrlblk.ignore_cpumask);
 }
diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
index 0027d95a60..89f56af53f 100644
--- a/xen/include/xen/perfc_defn.h
+++ b/xen/include/xen/perfc_defn.h
@@ -11,7 +11,7 @@ PERFCOUNTER(calls_from_multicall,       "calls from multicall")
 PERFCOUNTER(irqs,                   "#interrupts")
 PERFCOUNTER(ipis,                   "#IPIs")
 
-PERFCOUNTER(rcu_idle_timer,         "RCU: idle_timer")
+PERFCOUNTER(rcu_callback_timer,     "RCU: callback_timer")
 
 /* Generic scheduler counters (applicable to all schedulers) */
 PERFCOUNTER(sched_irq,              "sched: timer")
-- 
2.25.1


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #4: Type: text/x-diff; NAME=0003-xen-deal-with-vCPUs-that-do-not-yield-when-idle.patch, Size: 10226 bytes --]

From 1685dc2a2f811eb709cb62bffd8e58020ff79419 Mon Sep 17 00:00:00 2001
From: Dario Faggioli <dfaggioli@suse.com>
Date: Thu, 6 Jan 2022 17:26:48 -0800
Subject: [PATCH 3/4] xen: deal with vCPUs that do not yield when idle

Our RCU implementation needs a CPU to go through Xen, from time to
time, e.g., for a context switch, to properly mark the end of a grace
period. This usually happens often enough, and CPUs that go idle and stay
like that for a while are handled specially (so that they are recorded
as quiescent and "leave" the grace period before they start idling).

In principle, even a CPU that starts executing guest code may/should be
marked as quiescent (it certainly can't be in the middle of a read side
RCU critical section if it's leaving Xen and entering the guest!). This
isn't done and in general does not cause problems. However, if the NULL
scheduler is used and the guest is configured to not go back into Xen when
its vCPUs become idle (e.g., with "vwfi=native" on ARM), grace periods
may extend for a very long time and RCU callbacks may be delayed to the
point that, for instance, a domain is not properly destroyed.

To fix that, we must start marking a CPU as quiescent as soon as it
enters the guest (and, vice versa, register it back to the current grace
period when it enters Xen). In order to do that, some changes to the API
of rcu_idle_enter/exit were necessary (and the functions were renamed
too).

Note that, exactly like in the case where the CPU goes idle, we need to
arm the callback timer when we enter guest context. In fact, if a CPU
enters a guest with an RCU callback queued and then stays in that context
for long enough, we still risk not executing the callback for long enough
to cause problems.

XXX ARM only for now.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
 xen/arch/arm/domain.c         |  4 ++--
 xen/arch/arm/traps.c          |  3 +++
 xen/arch/x86/acpi/cpu_idle.c  |  8 +++----
 xen/arch/x86/cpu/mwait-idle.c |  8 +++----
 xen/common/rcupdate.c         | 41 +++++++++++++++++++++++------------
 xen/include/xen/rcupdate.h    |  4 ++--
 6 files changed, 42 insertions(+), 26 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 92a6c509e5..e47168e80d 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -48,7 +48,7 @@ static void do_idle(void)
 {
     unsigned int cpu = smp_processor_id();
 
-    rcu_idle_enter(cpu);
+    rcu_quiet_enter();
     /* rcu_idle_enter() can raise TIMER_SOFTIRQ. Process it now. */
     process_pending_softirqs();
 
@@ -60,7 +60,7 @@ static void do_idle(void)
     }
     local_irq_enable();
 
-    rcu_idle_exit(cpu);
+    rcu_quiet_exit();
 }
 
 void idle_loop(void)
diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index e2842ba4db..6bceb5b536 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -2047,6 +2047,8 @@ void enter_hypervisor_from_guest(void)
 {
     struct vcpu *v = current;
 
+    rcu_quiet_exit();
+
     /*
      * If we pended a virtual abort, preserve it until it gets cleared.
      * See ARM ARM DDI 0487A.j D1.14.3 (Virtual Interrupts) for details,
@@ -2337,6 +2339,7 @@ static bool check_for_vcpu_work(void)
  */
 void leave_hypervisor_to_guest(void)
 {
+    rcu_quiet_enter();
     local_irq_disable();
 
     /*
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index d788c8bffc..e3a5e67c68 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -716,7 +716,7 @@ static void acpi_processor_idle(void)
 
     cpufreq_dbs_timer_suspend();
 
-    rcu_idle_enter(cpu);
+    rcu_quiet_enter();
     /* rcu_idle_enter() can raise TIMER_SOFTIRQ. Process it now. */
     process_pending_softirqs();
 
@@ -729,7 +729,7 @@ static void acpi_processor_idle(void)
     if ( !cpu_is_haltable(cpu) )
     {
         local_irq_enable();
-        rcu_idle_exit(cpu);
+        rcu_quiet_exit();
         cpufreq_dbs_timer_resume();
         return;
     }
@@ -854,7 +854,7 @@ static void acpi_processor_idle(void)
         /* Now in C0 */
         power->last_state = &power->states[0];
         local_irq_enable();
-        rcu_idle_exit(cpu);
+        rcu_quiet_exit();
         cpufreq_dbs_timer_resume();
         return;
     }
@@ -862,7 +862,7 @@ static void acpi_processor_idle(void)
     /* Now in C0 */
     power->last_state = &power->states[0];
 
-    rcu_idle_exit(cpu);
+    rcu_quiet_exit();
     cpufreq_dbs_timer_resume();
 
     if ( cpuidle_current_governor->reflect )
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index d1739f6fc3..739f20e1fe 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -778,8 +778,8 @@ static void mwait_idle(void)
 
 	cpufreq_dbs_timer_suspend();
 
-	rcu_idle_enter(cpu);
-	/* rcu_idle_enter() can raise TIMER_SOFTIRQ. Process it now. */
+	rcu_quiet_enter();
+	/* rcu_quiet_enter() can raise TIMER_SOFTIRQ. Process it now. */
 	process_pending_softirqs();
 
 	/* Interrupts must be disabled for C2 and higher transitions. */
@@ -787,7 +787,7 @@ static void mwait_idle(void)
 
 	if (!cpu_is_haltable(cpu)) {
 		local_irq_enable();
-		rcu_idle_exit(cpu);
+		rcu_quiet_exit();
 		cpufreq_dbs_timer_resume();
 		return;
 	}
@@ -829,7 +829,7 @@ static void mwait_idle(void)
 	if (!(lapic_timer_reliable_states & (1 << cx->type)))
 		lapic_timer_on();
 
-	rcu_idle_exit(cpu);
+	rcu_quiet_exit();
 	cpufreq_dbs_timer_resume();
 
 	if ( cpuidle_current_governor->reflect )
diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index 96e8ca01c0..cfcc8bbc97 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -87,7 +87,7 @@ struct rcu_data {
     int cpu;
     long            last_rs_qlen;     /* qlen during the last resched */
 
-    /* 3) idle CPUs handling */
+    /* 3) idle (or in guest mode) CPUs handling */
     struct timer cb_timer;
     bool cb_timer_active;
 
@@ -115,6 +115,12 @@ boolean_param("rcu_force_quiesc", rcu_always_quiesc);
  * 3) it is stopped immediately, if the CPU wakes up from idle and
  *    resumes 'normal' execution.
  *
+ * Note also that the same happens if a CPU starts executing a guest that
+ * (almost) never comes back into the hypervisor. This may be the case if
+ * the guest uses "idle=poll" / "vwfi=native". Therefore, we need to handle
+ * guest entry events in the same way as the CPU going idle, i.e., consider
+ * it quiesced and arm the timer.
+ *
  * About how far in the future the timer should be programmed each time,
  * it's hard to tell (guess!!). Since this mimics Linux's periodic timer
  * tick, take values used there as an indication. In Linux 2.6.21, tick
@@ -362,9 +368,10 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
         * Make sure the increment of rcp->cur is visible so, even if a
         * CPU that is about to go idle, is captured inside rcp->cpumask,
         * rcu_pending() will return false, which then means cpu_quiet()
-        * will be invoked, before the CPU would actually enter idle.
+        * will be invoked, before the CPU would actually enter idle (or
+        * enter a guest).
         *
-        * This barrier is paired with the one in rcu_idle_enter().
+        * This barrier is paired with the one in rcu_quiet_enter().
         */
         smp_mb();
         cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->ignore_cpumask);
@@ -534,14 +541,15 @@ int rcu_needs_cpu(int cpu)
  * periodically poke rcu_pedning(), so that it will invoke the callback
  * not too late after the end of the grace period.
  */
-static void cb_timer_start(void)
+static void cb_timer_start(unsigned int cpu)
 {
-    struct rcu_data *rdp = &this_cpu(rcu_data);
+    struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 
     /*
      * Note that we don't check rcu_pending() here. In fact, we don't want
      * the timer armed on CPUs that are in the process of quiescing while
-     * going idle, unless they really are the ones with a queued callback.
+     * going idle or entering guest mode, unless they really have queued
+     * callbacks.
      */
     if (likely(!rdp->curlist))
         return;
@@ -550,9 +558,9 @@ static void cb_timer_start(void)
     rdp->cb_timer_active = true;
 }
 
-static void cb_timer_stop(void)
+static void cb_timer_stop(unsigned int cpu)
 {
-    struct rcu_data *rdp = &this_cpu(rcu_data);
+    struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 
     if (likely(!rdp->cb_timer_active))
         return;
@@ -716,11 +724,14 @@ void __init rcu_init(void)
 }
 
 /*
- * The CPU is becoming idle, so no more read side critical
- * sections, and one more step toward grace period.
+ * The CPU is about to either go idle or enter the guest. In either of
+ * these cases, it can't have any outstanding read side critical sections,
+ * so this is one step toward the end of the grace period.
  */
-void rcu_idle_enter(unsigned int cpu)
+void rcu_quiet_enter(void)
 {
+    unsigned int cpu = smp_processor_id();
+
     ASSERT(!cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask));
     cpumask_set_cpu(cpu, &rcu_ctrlblk.ignore_cpumask);
     /*
@@ -733,12 +744,14 @@ void rcu_idle_enter(unsigned int cpu)
      */
     smp_mb();
 
-    cb_timer_start();
+    cb_timer_start(cpu);
 }
 
-void rcu_idle_exit(unsigned int cpu)
+void rcu_quiet_exit(void)
 {
-    cb_timer_stop();
+    unsigned int cpu = smp_processor_id();
+
+    cb_timer_stop(cpu);
     ASSERT(cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask));
     cpumask_clear_cpu(cpu, &rcu_ctrlblk.ignore_cpumask);
 }
diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index d279e39022..7dd0b2d74b 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -179,7 +179,7 @@ void call_rcu(struct rcu_head *head,
 
 void rcu_barrier(void);
 
-void rcu_idle_enter(unsigned int cpu);
-void rcu_idle_exit(unsigned int cpu);
+void rcu_quiet_enter(void);
+void rcu_quiet_exit(void);
 
 #endif /* __XEN_RCUPDATE_H */
-- 
2.25.1


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #5: Type: text/x-diff; NAME=0004-xen-remove-ASSERT-in-rcu_quiet_enter.patch, Size: 1104 bytes --]

From 6fc02b90814d3fe630715e353d16f397a5b280f9 Mon Sep 17 00:00:00 2001
From: Stefano Stabellini <stefano.stabellini@xilinx.com>
Date: Thu, 6 Jan 2022 17:41:08 -0800
Subject: [PATCH 4/4] xen: remove ASSERT in rcu_quiet_enter

After the recent changes to the RCU subsystem and rcu_quiet_enter, the
ASSERT at the beginning of rcu_quiet_enter triggers. The condition is
harmless, so replace the ASSERT with an early return.

Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
---
 xen/common/rcupdate.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index cfcc8bbc97..7b9e4bdf15 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -732,7 +732,8 @@ void rcu_quiet_enter(void)
 {
     unsigned int cpu = smp_processor_id();
 
-    ASSERT(!cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask));
+    if (cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask))
+        return;
     cpumask_set_cpu(cpu, &rcu_ctrlblk.ignore_cpumask);
     /*
      * If some other CPU is starting a new grace period, we'll notice that
-- 
2.25.1



* Re: sched=null vwfi=native and call_rcu()
  2022-01-07  1:52   ` Stefano Stabellini
@ 2022-01-14 18:42     ` Dario Faggioli
  2022-01-14 21:01       ` Stefano Stabellini
  0 siblings, 1 reply; 9+ messages in thread
From: Dario Faggioli @ 2022-01-14 18:42 UTC (permalink / raw)
  To: sstabellini, julien
  Cc: Jan Beulich, Juergen Gross, bertrand.marquis, roger.pau,
	Volodymyr_Babchuk, george.dunlap, xen-devel, andrew.cooper3

[-- Attachment #1: Type: text/plain, Size: 2878 bytes --]

On Thu, 2022-01-06 at 17:52 -0800, Stefano Stabellini wrote:
> On Thu, 6 Jan 2022, Julien Grall wrote:
> > 
> > This issue and its solution were discussed numerous times on the ML. In
> > short, we
> > want to tell RCU that CPUs running in guest context are always
> > quiescent.
> > For more details, you can read the previous thread (which also
> > contains a link
> > to the one before):
> > 
> > https://lore.kernel.org/xen-devel/fe3dd9f0-b035-01fe-3e01-ddf065f182ab@codiax.se/
> 
> Thanks Julien for the pointer!
> 
> Dario, I forward-ported your three patches to staging:
> https://gitlab.com/xen-project/people/sstabellini/xen/-/tree/rcu-quiet
> 
Hi Stefano!

I definitely would like to see the end of this issue, so thanks a lot
for your interest and your help with the patches.

> I can confirm that they fix the bug. Note that I had to add a small
> change on top to remove the ASSERT at the beginning of
> rcu_quiet_enter:
> https://gitlab.com/xen-project/people/sstabellini/xen/-/commit/6fc02b90814d3fe630715e353d16f397a5b280f9
> 
Yeah, that should be fine.

> Would you be up for submitting them for upstreaming? I would prefer
> if
> you send out the patches because I cannot claim to understand them
> completely (except for the one doing renaming :-P )
> 
Haha! So, I am up for properly submitting them, but there's one problem. As
you've probably gathered, the idea here is to use transitions toward the
guest and back into the hypervisor as RCU quiescence and "activation"
points.

Now, on ARM, that just meant calling rcu_quiet_exit() in
enter_hypervisor_from_guest() and calling rcu_quiet_enter() in
leave_hypervisor_to_guest(). Nice and easy, and even myself --and I'm
definitely not an ARM person-- could understand it (although with some
help from Julien) and put the patches together.

However, the problem is really arch independent and, despite not
surfacing equally frequently, it affects x86 as well. And for x86 the
situation is far less straightforward when it comes to identifying all
the places from which to call rcu_quiet_{enter,exit}().

And finding out where to put them, among the various functions that we
have in the various entry.S variants, is where I stopped. The plan was
to get back to it but, as shameful as it sounds, I have not managed to
do that yet.

So, if anyone wants to help with this, e.g., by suggesting potentially
good spots, that would help a lot.

Alternatively, we can submit the series as ARM-only... But I fear that
the x86 side of things would then be easily forgotten. :-(

Thanks again and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: sched=null vwfi=native and call_rcu()
  2022-01-14 18:42     ` Dario Faggioli
@ 2022-01-14 21:01       ` Stefano Stabellini
  2022-01-17 11:05         ` George Dunlap
  0 siblings, 1 reply; 9+ messages in thread
From: Stefano Stabellini @ 2022-01-14 21:01 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: sstabellini, julien, Jan Beulich, Juergen Gross,
	bertrand.marquis, roger.pau, Volodymyr_Babchuk, george.dunlap,
	xen-devel, andrew.cooper3

On Fri, 14 Jan 2022, Dario Faggioli wrote:
> On Thu, 2022-01-06 at 17:52 -0800, Stefano Stabellini wrote:
> > On Thu, 6 Jan 2022, Julien Grall wrote:
> > > 
> > > This issue and solution were discussed numerous time on the ML. In
> > > short, we
> > > want to tell the RCU that CPU running in guest context are always
> > > quiesced.
> > > For more details, you can read the previous thread (which also
> > > contains a link
> > > to the one before):
> > > 
> > > https://lore.kernel.org/xen-devel/fe3dd9f0-b035-01fe-3e01-ddf065f182ab@codiax.se/
> > 
> > Thanks Julien for the pointer!
> > 
> > Dario, I forward-ported your three patches to staging:
> > https://gitlab.com/xen-project/people/sstabellini/xen/-/tree/rcu-quiet
> > 
> Hi Stefano!
> 
> I definitely would like to see the end of this issue, so thanks a lot
> for your interest and your help with the patches.
> 
> > I can confirm that they fix the bug. Note that I had to add a small
> > change on top to remove the ASSERT at the beginning of
> > rcu_quiet_enter:
> > https://gitlab.com/xen-project/people/sstabellini/xen/-/commit/6fc02b90814d3fe630715e353d16f397a5b280f9
> > 
> Yeah, that should be fine.
> 
> > Would you be up for submitting them for upstreaming? I would prefer
> > if
> > you send out the patches because I cannot claim to understand them
> > completely (except for the one doing renaming :-P )
> > 
> Haha! So, I am up for properly submitting, but there's one problem. As
> you've probably got, the idea here is to use transitions toward the
> guest and inside the hypervisor as RCU quiescence and "activation"
> points.
> 
> Now, on ARM, that just meant calling rcu_quiet_exit() in
> enter_hypervisor_from_guest() and calling rcu_quiet_enter() in
> leave_hypervisor_to_guest(). Nice and easy, and even myself --and I'm
> definitely not an ARM person-- could understand it (although with some
> help from Julien) and put the patches together.
> 
> However, the problem is really arch independent and, despite not
> surfacing equally frequently, it affects x86 as well. And for x86 the
> situation is by far not equally nice, when it comes to identifying all
> the places from where to call rcu_quiet_{enter,exit}().
> 
> And finding out where to put them, among the various functions that we
> have in the various entry.S variants is where I stopped. The plan was
> to get back to it, but as shamefully as it sounds, I could not do that
> yet.
> 
> So, if anyone wants to help with this, handing over suggestions for
> potential good spots, that would help a lot.

Unfortunately I cannot volunteer due to time and also because I wouldn't
know where to look and I don't have a reproducer or a test environment
on x86. I would be flying blind.


> Alternatively, we can submit the series as ARM-only... But I fear that
> the x86 side of things would then be easily forgotten. :-(

I agree with you on this, but at the same time we are having problems
with customers in the field -- it is not like we can wait any longer to
solve the problem on ARM. And the issue is certainly far less likely to
happen on x86 (there is no vwfi=native, right?). In other words, I think
it is better to have half of the solution now, to solve the worst part of
the problem, than to wait more months for a full solution.



* Re: sched=null vwfi=native and call_rcu()
  2022-01-14 21:01       ` Stefano Stabellini
@ 2022-01-17 11:05         ` George Dunlap
  2022-01-17 11:13           ` Juergen Gross
  0 siblings, 1 reply; 9+ messages in thread
From: George Dunlap @ 2022-01-17 11:05 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Dario Faggioli, julien, Jan Beulich, Juergen Gross,
	bertrand.marquis, Roger Pau Monne, Volodymyr_Babchuk, xen-devel,
	Andrew Cooper



> On Jan 14, 2022, at 9:01 PM, Stefano Stabellini <sstabellini@kernel.org> wrote:
> 
> On Fri, 14 Jan 2022, Dario Faggioli wrote:
>> On Thu, 2022-01-06 at 17:52 -0800, Stefano Stabellini wrote:
>>> On Thu, 6 Jan 2022, Julien Grall wrote:
>>>> 
>>>> This issue and solution were discussed numerous time on the ML. In
>>>> short, we
>>>> want to tell the RCU that CPU running in guest context are always
>>>> quiesced.
>>>> For more details, you can read the previous thread (which also
>>>> contains a link
>>>> to the one before):
>>>> 
>>>> https://lore.kernel.org/xen-devel/fe3dd9f0-b035-01fe-3e01-ddf065f182ab@codiax.se/
>>> 
>>> Thanks Julien for the pointer!
>>> 
>>> Dario, I forward-ported your three patches to staging:
>>> https://gitlab.com/xen-project/people/sstabellini/xen/-/tree/rcu-quiet
>>> 
>> Hi Stefano!
>> 
>> I definitely would like to see the end of this issue, so thanks a lot
>> for your interest and your help with the patches.
>> 
>>> I can confirm that they fix the bug. Note that I had to add a small
>>> change on top to remove the ASSERT at the beginning of
>>> rcu_quiet_enter:
>>> https://gitlab.com/xen-project/people/sstabellini/xen/-/commit/6fc02b90814d3fe630715e353d16f397a5b280f9
>>> 
>> Yeah, that should be fine.
>> 
>>> Would you be up for submitting them for upstreaming? I would prefer
>>> if
>>> you send out the patches because I cannot claim to understand them
>>> completely (except for the one doing renaming :-P )
>>> 
>> Haha! So, I am up for properly submitting, but there's one problem. As
>> you've probably got, the idea here is to use transitions toward the
>> guest and inside the hypervisor as RCU quiescence and "activation"
>> points.
>> 
>> Now, on ARM, that just meant calling rcu_quiet_exit() in
>> enter_hypervisor_from_guest() and calling rcu_quiet_enter() in
>> leave_hypervisor_to_guest(). Nice and easy, and even myself --and I'm
>> definitely not an ARM person-- could understand it (although with some
>> help from Julien) and put the patches together.
>> 
>> However, the problem is really arch independent and, despite not
>> surfacing equally frequently, it affects x86 as well. And for x86 the
>> situation is by far not equally nice, when it comes to identifying all
>> the places from where to call rcu_quiet_{enter,exit}().
>> 
>> And finding out where to put them, among the various functions that we
>> have in the various entry.S variants is where I stopped. The plan was
>> to get back to it, but as shamefully as it sounds, I could not do that
>> yet.
>> 
>> So, if anyone wants to help with this, handing over suggestions for
>> potential good spots, that would help a lot.
> 
> Unfortunately I cannot volunteer due to time and also because I wouldn't
> know where to look and I don't have a reproducer or a test environment
> on x86. I would be flying blind.
> 
> 
>> Alternatively, we can submit the series as ARM-only... But I fear that
>> the x86 side of things would then be easily forgotten. :-(
> 
> I agree with you on this, but at the same time we are having problems
> with customers in the field -- it is not like we can wait to solve the
> problem on ARM any longer. And the issue is certainly far less likely to
> happen on x86 (there is no vwfi=native, right?) In other words, I think
> it is better to have half of the solution now to solve the worst part of
> the problem than to wait more months for a full solution.

An x86 equivalent of vwfi=native could be implemented easily, but AFAIK nobody has asked for it yet.  I agree that we need to fix it for ARM, and so in the absence of someone with the time to fix up the x86 side, I think fixing ARM-only is the way to go.

It would be good if we could add appropriate comments warning anyone who implements `hlt=native` on x86 about the problems they’ll face and how to fix them.  Not sure of the best place to do that; in the VMX / SVM code that sets the exit for HLT &c?

 -George



* Re: sched=null vwfi=native and call_rcu()
  2022-01-17 11:05         ` George Dunlap
@ 2022-01-17 11:13           ` Juergen Gross
  2022-01-17 13:48             ` Julien Grall
  2022-01-18 17:49             ` Dario Faggioli
  0 siblings, 2 replies; 9+ messages in thread
From: Juergen Gross @ 2022-01-17 11:13 UTC (permalink / raw)
  To: George Dunlap, Stefano Stabellini
  Cc: Dario Faggioli, julien, Jan Beulich, bertrand.marquis,
	Roger Pau Monne, Volodymyr_Babchuk, xen-devel, Andrew Cooper


[-- Attachment #1.1.1: Type: text/plain, Size: 5436 bytes --]

On 17.01.22 12:05, George Dunlap wrote:
> 
> 
>> On Jan 14, 2022, at 9:01 PM, Stefano Stabellini <sstabellini@kernel.org> wrote:
>>
>> On Fri, 14 Jan 2022, Dario Faggioli wrote:
>>> On Thu, 2022-01-06 at 17:52 -0800, Stefano Stabellini wrote:
>>>> On Thu, 6 Jan 2022, Julien Grall wrote:
>>>>>
>>>>> This issue and solution were discussed numerous time on the ML. In
>>>>> short, we
>>>>> want to tell the RCU that CPU running in guest context are always
>>>>> quiesced.
>>>>> For more details, you can read the previous thread (which also
>>>>> contains a link
>>>>> to the one before):
>>>>>
>>>>> https://lore.kernel.org/xen-devel/fe3dd9f0-b035-01fe-3e01-ddf065f182ab@codiax.se/
>>>>
>>>> Thanks Julien for the pointer!
>>>>
>>>> Dario, I forward-ported your three patches to staging:
>>>> https://gitlab.com/xen-project/people/sstabellini/xen/-/tree/rcu-quiet
>>>>
>>> Hi Stefano!
>>>
>>> I definitely would like to see the end of this issue, so thanks a lot
>>> for your interest and your help with the patches.
>>>
>>>> I can confirm that they fix the bug. Note that I had to add a small
>>>> change on top to remove the ASSERT at the beginning of
>>>> rcu_quiet_enter:
>>>> https://gitlab.com/xen-project/people/sstabellini/xen/-/commit/6fc02b90814d3fe630715e353d16f397a5b280f9
>>>>
>>> Yeah, that should be fine.
>>>
>>>> Would you be up for submitting them for upstreaming? I would prefer
>>>> if
>>>> you send out the patches because I cannot claim to understand them
>>>> completely (except for the one doing renaming :-P )
>>>>
>>> Haha! So, I am up for properly submitting, but there's one problem. As
>>> you've probably got, the idea here is to use transitions toward the
>>> guest and inside the hypervisor as RCU quiescence and "activation"
>>> points.
>>>
>>> Now, on ARM, that just meant calling rcu_quiet_exit() in
>>> enter_hypervisor_from_guest() and calling rcu_quiet_enter() in
>>> leave_hypervisor_to_guest(). Nice and easy, and even myself --and I'm
>>> definitely not an ARM person-- could understand it (although with some
>>> help from Julien) and put the patches together.
>>>
>>> However, the problem is really arch independent and, despite not
>>> surfacing equally frequently, it affects x86 as well. And for x86 the
>>> situation is by far not equally nice, when it comes to identifying all
>>> the places from where to call rcu_quiet_{enter,exit}().
>>>
>>> And finding out where to put them, among the various functions that we
>>> have in the various entry.S variants is where I stopped. The plan was
>>> to get back to it, but as shamefully as it sounds, I could not do that
>>> yet.
>>>
>>> So, if anyone wants to help with this, handing over suggestions for
>>> potential good spots, that would help a lot.
>>
>> Unfortunately I cannot volunteer due to time and also because I wouldn't
>> know where to look and I don't have a reproducer or a test environment
>> on x86. I would be flying blind.
>>
>>
>>> Alternatively, we can submit the series as ARM-only... But I fear that
>>> the x86 side of things would then be easily forgotten. :-(
>>
>> I agree with you on this, but at the same time we are having problems
>> with customers in the field -- it is not like we can wait to solve the
>> problem on ARM any longer. And the issue is certainly far less likely to
>> happen on x86 (there is no vwfi=native, right?) In other words, I think
>> it is better to have half of the solution now to solve the worst part of
>> the problem than to wait more months for a full solution.
> 
> An x86 equivalent of vwfi=native could be implemented easily, but AFAIK nobody has asked for it yet.  I agree that we need to fix it for ARM, and so in the absence of someone with the time to fix up the x86 side, I think fixing ARM-only is the way to go.
> 
> It would be good if we could add appropriate comments warning anyone who implements `hlt=native` on x86 the problems they’ll face and how to fix them.  Not sure the best place to do that; in the VMX / SVM code that sets the exit for HLT &c?

But wouldn't a guest in a busy loop on x86 with the NULL scheduler suffer
from the same problem?

And wouldn't that be a problem for PV guests, too?


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3135 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]


* Re: sched=null vwfi=native and call_rcu()
  2022-01-17 11:13           ` Juergen Gross
@ 2022-01-17 13:48             ` Julien Grall
  2022-01-18 17:49             ` Dario Faggioli
  1 sibling, 0 replies; 9+ messages in thread
From: Julien Grall @ 2022-01-17 13:48 UTC (permalink / raw)
  To: Juergen Gross, George Dunlap, Stefano Stabellini
  Cc: Dario Faggioli, Jan Beulich, bertrand.marquis, Roger Pau Monne,
	Volodymyr_Babchuk, xen-devel, Andrew Cooper

Hi,

On 17/01/2022 15:13, Juergen Gross wrote:
> On 17.01.22 12:05, George Dunlap wrote:
>>> On Jan 14, 2022, at 9:01 PM, Stefano Stabellini 
>>> <sstabellini@kernel.org> wrote:
>>> On Fri, 14 Jan 2022, Dario Faggioli wrote:
>>>> On Thu, 2022-01-06 at 17:52 -0800, Stefano Stabellini wrote:
>>>>> On Thu, 6 Jan 2022, Julien Grall wrote:
>>>> Alternatively, we can submit the series as ARM-only... But I fear that
>>>> the x86 side of things would then be easily forgotten. :-(
>>>
>>> I agree with you on this, but at the same time we are having problems
>>> with customers in the field -- it is not like we can wait to solve the
>>> problem on ARM any longer. And the issue is certainly far less likely to
>>> happen on x86 (there is no vwfi=native, right?) In other words, I think
>>> it is better to have half of the solution now to solve the worst part of
>>> the problem than to wait more months for a full solution.

Well, it all depends on how your guest OS works. A "malicious" guest that 
configures its vCPU to busy loop without wfi will result in the same 
problem (this is one of the reasons why the NULL scheduler is not security 
supported).

>>
>> An x86 equivalent of vwfi=native could be implemented easily, but 
>> AFAIK nobody has asked for it yet.  I agree that we need to fix it for 
>> ARM, and so in the absence of someone with the time to fix up the x86 
>> side, I think fixing ARM-only is the way to go.
>>
>> It would be good if we could add appropriate comments warning anyone 
>> who implements `hlt=native` on x86 the problems they’ll face and how 
>> to fix them.  Not sure the best place to do that; in the VMX / SVM 
>> code that sets the exit for HLT &c?
> 
> But wouldn't a guest in a busy loop on x86 with NULL scheduler suffer
> from the same problem?

This is not a problem on x86 because there will always be an exit to the 
hypervisor at a timed interval (IIRC for some rendezvous?). On Arm, using 
the NULL scheduler will result in a completely tickless hypervisor.

Cheers,

-- 
Julien Grall



* Re: sched=null vwfi=native and call_rcu()
  2022-01-17 11:13           ` Juergen Gross
  2022-01-17 13:48             ` Julien Grall
@ 2022-01-18 17:49             ` Dario Faggioli
  1 sibling, 0 replies; 9+ messages in thread
From: Dario Faggioli @ 2022-01-18 17:49 UTC (permalink / raw)
  To: Juergen Gross, George Dunlap, Stefano Stabellini
  Cc: julien, Jan Beulich, bertrand.marquis, Roger Pau Monne,
	Volodymyr_Babchuk, xen-devel, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 1517 bytes --]

On Mon, 2022-01-17 at 12:13 +0100, Juergen Gross wrote:
> On 17.01.22 12:05, George Dunlap wrote:
> > 
> > An x86 equivalent of vwfi=native could be implemented easily, but
> > AFAIK nobody has asked for it yet.  I agree that we need to fix it
> > for ARM, and so in the absence of someone with the time to fix up
> > the x86 side, I think fixing ARM-only is the way to go.
> > 
> > It would be good if we could add appropriate comments warning
> > anyone who implements `hlt=native` on x86 the problems they’ll face
> > and how to fix them.  Not sure the best place to do that; in the
> > VMX / SVM code that sets the exit for HLT &c?
> 
> But wouldn't a guest in a busy loop on x86 with NULL scheduler suffer
> from the same problem?
> 
Right, and even more so 'idle=poll' as a _guest_ kernel command line
parameter, IMO.

That does not change what happens when the guest issues an HLT, but it
drastically reduces the frequency of it doing so (or at least, it did
the last time I tried it).

So it's not exactly like wfi=native on ARM, but on the other hand, it
can be under the guest's control.

> And wouldn't that be a problem for PV guests, too?
> 
Yeah, that's one of the things that makes it tricky.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


end of thread, other threads:[~2022-01-18 17:49 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-06  0:40 sched=null vwfi=native and call_rcu() Stefano Stabellini
2022-01-06  9:26 ` Julien Grall
2022-01-07  1:52   ` Stefano Stabellini
2022-01-14 18:42     ` Dario Faggioli
2022-01-14 21:01       ` Stefano Stabellini
2022-01-17 11:05         ` George Dunlap
2022-01-17 11:13           ` Juergen Gross
2022-01-17 13:48             ` Julien Grall
2022-01-18 17:49             ` Dario Faggioli
