[Xen-devel] [PATCH v7 0/5] xen/rcu: let rcu work better with core scheduling

xen-devel.lists.xenproject.org archive mirror

* [Xen-devel] [PATCH v7 0/5] xen/rcu: let rcu work better with core scheduling
@ 2020-03-25 10:55 Juergen Gross
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 1/5] xen: introduce smp_mb__[after|before]_atomic() barriers Juergen Gross
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Juergen Gross @ 2020-03-25 10:55 UTC
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Julien Grall, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Jan Beulich,
	Volodymyr Babchuk, Roger Pau Monné

Currently the RCU handling in Xen affects scheduling in several ways:
it raises sched softirqs without any real need, and it requires
tasklets for rcu_barrier(), which interacts badly with core scheduling.

This small series repairs those issues.

Additionally, some ASSERT()s are added to verify sane rcu handling.
To avoid these triggering right away, the obvious violations are
fixed. This includes making the rcu locking functions type safe.

Changes in V7:
- new patch 1
- added some barriers in patch 2

Changes in V6:
- added memory barrier in patch 1
- drop cpu_map_lock only at the end of rcu_barrier()
- re-add preempt_disable() in patch 3

Changes in V5:
- dropped already committed patches 1 and 4
- fixed race
- rework blocking of rcu processing with held rcu locks

Changes in V4:
- patch 5: use barrier()

Changes in V3:
- type safe locking functions (functions instead of macros)
- per-lock debug additions
- new patches 4 and 6
- fixed races

Changes in V2:
- use get_cpu_maps() in rcu_barrier() handling
- avoid recursion in rcu_barrier() handling
- new patches 3 and 4

Juergen Gross (5):
  xen: introduce smp_mb__[after|before]_atomic() barriers
  xen/rcu: don't use stop_machine_run() for rcu_barrier()
  xen: don't process rcu callbacks when holding a rcu_read_lock()
  xen/rcu: add assertions to debug build
  xen/rcu: add per-lock counter in debug builds

 xen/common/rcupdate.c        | 102 +++++++++++++++++++++++++++++++------------
 xen/common/softirq.c         |  14 +++++-
 xen/include/asm-arm/system.h |   3 ++
 xen/include/asm-x86/system.h |   3 ++
 xen/include/xen/rcupdate.h   |  77 ++++++++++++++++++++++++++------
 5 files changed, 157 insertions(+), 42 deletions(-)

-- 
2.16.4




* [Xen-devel] [PATCH v7 1/5] xen: introduce smp_mb__[after|before]_atomic() barriers
  2020-03-25 10:55 [Xen-devel] [PATCH v7 0/5] xen/rcu: let rcu work better with core scheduling Juergen Gross
@ 2020-03-25 10:55 ` Juergen Gross
  2020-03-25 13:17   ` Jan Beulich
  2020-03-25 16:20   ` Julien Grall
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 2/5] xen/rcu: don't use stop_machine_run() for rcu_barrier() Juergen Gross
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Juergen Gross @ 2020-03-25 10:55 UTC
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Julien Grall, Wei Liu,
	Andrew Cooper, Jan Beulich, Volodymyr Babchuk,
	Roger Pau Monné

When atomic variables are used for synchronization, barriers are
needed to ensure proper ordering of the surrounding memory accesses.
Introduce smp_mb__before_atomic() and smp_mb__after_atomic() for that
purpose, using the same definitions as in the Linux kernel.
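
A minimal user-space sketch of the intended usage, assuming C11
<stdatomic.h> as a stand-in for Xen's atomic_t. The macros are modelled
as sequentially consistent fences here (on x86 this patch makes them
empty, since locked read-modify-write instructions already act as full
barriers). The names results[], outstanding, worker_done() and
all_done() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical user-space stand-ins for the Xen/Linux macros, modelled
 * as full (sequentially consistent) fences.
 */
#define smp_mb__before_atomic()  atomic_thread_fence(memory_order_seq_cst)
#define smp_mb__after_atomic()   atomic_thread_fence(memory_order_seq_cst)

static int results[2];               /* plain data, one slot per worker */
static atomic_int outstanding = 2;   /* workers that have not finished yet */

/* Worker: publish its result, then signal completion via the counter. */
static void worker_done(int id, int value)
{
    results[id] = value;             /* ordinary store */
    smp_mb__before_atomic();         /* order the store before the decrement */
    atomic_fetch_sub_explicit(&outstanding, 1, memory_order_relaxed);
}

/* Waiter: once the counter reads 0, every worker's result is visible. */
static bool all_done(void)
{
    if (atomic_load_explicit(&outstanding, memory_order_relaxed) != 0)
        return false;
    atomic_thread_fence(memory_order_seq_cst);  /* pairs with the workers' fences */
    return true;
}
```

The waiter uses a plain fence after reading the counter rather than
smp_mb__after_atomic(), because the latter is meant to follow a
read-modify-write operation, not a simple atomic read.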

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
---
V7:
- new patch
---
 xen/include/asm-arm/system.h | 3 +++
 xen/include/asm-x86/system.h | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/xen/include/asm-arm/system.h b/xen/include/asm-arm/system.h
index e5d062667d..65d5c8e423 100644
--- a/xen/include/asm-arm/system.h
+++ b/xen/include/asm-arm/system.h
@@ -30,6 +30,9 @@
 
 #define smp_wmb()       dmb(ishst)
 
+#define smp_mb__before_atomic()    smp_mb()
+#define smp_mb__after_atomic()     smp_mb()
+
 /*
  * This is used to ensure the compiler did actually allocate the register we
  * asked it for some inline assembly sequences.  Apparently we can't trust
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index 069f422f0d..7e5891f3df 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -233,6 +233,9 @@ static always_inline unsigned long __xadd(
 #define set_mb(var, value) do { xchg(&var, value); } while (0)
 #define set_wmb(var, value) do { var = value; smp_wmb(); } while (0)
 
+#define smp_mb__before_atomic()    do { } while (0)
+#define smp_mb__after_atomic()     do { } while (0)
+
 /**
  * array_index_mask_nospec() - generate a mask that is ~0UL when the
  *      bounds check succeeds and 0 otherwise
-- 
2.16.4




* [Xen-devel] [PATCH v7 2/5] xen/rcu: don't use stop_machine_run() for rcu_barrier()
  2020-03-25 10:55 [Xen-devel] [PATCH v7 0/5] xen/rcu: let rcu work better with core scheduling Juergen Gross
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 1/5] xen: introduce smp_mb__[after|before]_atomic() barriers Juergen Gross
@ 2020-03-25 10:55 ` Juergen Gross
  2020-03-25 13:19   ` Jan Beulich
  2020-03-25 16:13   ` Julien Grall
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 3/5] xen: don't process rcu callbacks when holding a rcu_read_lock() Juergen Gross
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Juergen Gross @ 2020-03-25 10:55 UTC
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Julien Grall, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Jan Beulich

Currently rcu_barrier() calls stop_machine_run() to synchronize all
physical cpus, in order to ensure all pending rcu calls have finished
when it returns.

As stop_machine_run() uses tasklets, this requires scheduling idle
vcpus on all cpus. With core scheduling active, rcu_barrier() may
therefore be called on idle cpus only, as otherwise a scheduling
deadlock would occur.

There is no need at all to synchronize the cpus in tasklets, as rcu
activity is started in __do_softirq(), which is called whenever
softirq activity is allowed. So rcu_barrier() can easily be modified
to use softirqs for synchronizing the cpus, no longer requiring any
scheduling activity.

As there already is an rcu softirq, reuse it for the synchronization.

Remove the barrier element from struct rcu_data, as it isn't used.

Finally, switch rcu_barrier() to return void, as it can no longer
fail.
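
The two-counter handshake can be illustrated with a single-threaded
simulation. This is a sketch, not the patch's code: it assumes C11
atomics, steps the cpus round-robin instead of running real softirqs,
runs each barrier callback immediately (as if a grace period had
already elapsed), and returns false instead of spinning when another
barrier is active. The state machine and the functions
barrier_start(), cpu_step() and barrier_run() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_CPUS 8

/* Progress of one simulated cpu through a barrier round. */
enum cpu_state { CPU_IDLE, CPU_SOFTIRQ_RAISED, CPU_WAITING, CPU_DONE };

static atomic_int cpu_count;      /* cpus whose barrier callback has not run */
static atomic_int pending_count;  /* 0 means no rcu_barrier() is running */
static enum cpu_state state[MAX_CPUS];

/* Initiator, step 1: claim the barrier; fails if one is already running. */
static bool barrier_start(int n_cpus)
{
    int expected = 0;

    if (!atomic_compare_exchange_strong(&pending_count, &expected, n_cpus + 1))
        return false;
    atomic_store(&cpu_count, n_cpus);
    for (int cpu = 0; cpu < n_cpus; cpu++)
        state[cpu] = CPU_SOFTIRQ_RAISED;   /* models cpumask_raise_softirq() */
    return true;
}

/* One softirq pass on a simulated cpu (models rcu_barrier_action()). */
static void cpu_step(int cpu)
{
    switch (state[cpu]) {
    case CPU_SOFTIRQ_RAISED:
        /* Simplification: the queued callback runs at once. */
        atomic_fetch_sub(&cpu_count, 1);
        state[cpu] = CPU_WAITING;
        break;
    case CPU_WAITING:
        if (atomic_load(&cpu_count) == 0) {
            atomic_fetch_sub(&pending_count, 1);   /* saw cpu_count reach 0 */
            state[cpu] = CPU_DONE;
        }
        break;
    default:
        break;
    }
}

/* Initiator, step 2: drive the cpus until pending_count drops to 1. */
static int barrier_run(int n_cpus)
{
    int rounds = 0;

    while (atomic_load(&pending_count) != 1) {
        for (int cpu = 0; cpu < n_cpus; cpu++)
            cpu_step(cpu);
        rounds++;
    }
    atomic_store(&pending_count, 0);   /* no rcu_barrier() running any more */
    return rounds;
}
```

With four cpus, the first round drains cpu_count to 0 and the second
round lets every cpu decrement pending_count, so barrier_run(4)
returns 2. The extra + 1 in pending_count means it is the initiating
cpu, not the last cpu to finish, that finally marks the barrier as
idle again.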

Partially-based-on-patch-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
---
V2:
- add recursion detection

V3:
- fix races (Igor Druzhinin)

V5:
- rename done_count to pending_count (Jan Beulich)
- fix race (Jan Beulich)

V6:
- add barrier (Julien Grall)
- add ASSERT() (Julien Grall)
- hold cpu_map lock until end of rcu_barrier() (Julien Grall)

V7:
- update comment (Jan Beulich)
- add barriers (Jan Beulich)
---
 xen/common/rcupdate.c      | 100 +++++++++++++++++++++++++++++++++------------
 xen/include/xen/rcupdate.h |   2 +-
 2 files changed, 74 insertions(+), 28 deletions(-)

diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index 03d84764d2..12b89565d0 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -83,7 +83,6 @@ struct rcu_data {
     struct rcu_head **donetail;
     long            blimit;           /* Upper limit on a processed batch */
     int cpu;
-    struct rcu_head barrier;
     long            last_rs_qlen;     /* qlen during the last resched */
 
     /* 3) idle CPUs handling */
@@ -91,6 +90,7 @@ struct rcu_data {
     bool idle_timer_active;
 
     bool            process_callbacks;
+    bool            barrier_active;
 };
 
 /*
@@ -143,51 +143,90 @@ static int qhimark = 10000;
 static int qlowmark = 100;
 static int rsinterval = 1000;
 
-struct rcu_barrier_data {
-    struct rcu_head head;
-    atomic_t *cpu_count;
-};
+/*
+ * rcu_barrier() handling:
+ * Two counters are used to synchronize rcu_barrier() work:
+ * - cpu_count holds the number of cpus required to finish barrier handling.
+ *   It is decremented by each cpu when it has performed all pending rcu calls.
+ * - pending_count shows whether any rcu_barrier() activity is running and
+ *   it is used to synchronize leaving rcu_barrier() only after all cpus
+ *   have finished their processing. pending_count is initialized to nr_cpus + 1
+ *   and it is decremented by each cpu when it has seen that cpu_count has
+ *   reached 0. The cpu where rcu_barrier() has been called will wait until
+ *   pending_count has been decremented to 1 (so all cpus have seen cpu_count
+ *   reaching 0) and will then set pending_count to 0 indicating there is no
+ *   rcu_barrier() running.
+ * Cpus are synchronized via softirq mechanism. rcu_barrier() is regarded to
+ * be active if pending_count is not zero. In case rcu_barrier() is called on
+ * multiple cpus it is enough to check for pending_count being not zero on entry
+ * and to call process_pending_softirqs() in a loop until pending_count drops to
+ * zero, before starting the new rcu_barrier() processing.
+ */
+static atomic_t cpu_count = ATOMIC_INIT(0);
+static atomic_t pending_count = ATOMIC_INIT(0);
 
 static void rcu_barrier_callback(struct rcu_head *head)
 {
-    struct rcu_barrier_data *data = container_of(
-        head, struct rcu_barrier_data, head);
-    atomic_inc(data->cpu_count);
+    smp_mb__before_atomic();     /* Make all writes visible to other cpus. */
+    atomic_dec(&cpu_count);
 }
 
-static int rcu_barrier_action(void *_cpu_count)
+static void rcu_barrier_action(void)
 {
-    struct rcu_barrier_data data = { .cpu_count = _cpu_count };
-
-    ASSERT(!local_irq_is_enabled());
-    local_irq_enable();
+    struct rcu_head head;
 
     /*
      * When callback is executed, all previously-queued RCU work on this CPU
-     * is completed. When all CPUs have executed their callback, data.cpu_count
-     * will have been incremented to include every online CPU.
+     * is completed. When all CPUs have executed their callback, cpu_count
+     * will have been decremented to 0.
      */
-    call_rcu(&data.head, rcu_barrier_callback);
+    call_rcu(&head, rcu_barrier_callback);
 
-    while ( atomic_read(data.cpu_count) != num_online_cpus() )
+    while ( atomic_read(&cpu_count) )
     {
         process_pending_softirqs();
         cpu_relax();
     }
 
-    local_irq_disable();
-
-    return 0;
+    smp_mb__before_atomic();
+    atomic_dec(&pending_count);
 }
 
-/*
- * As rcu_barrier() is using stop_machine_run() it is allowed to be used in
- * idle context only (see comment for stop_machine_run()).
- */
-int rcu_barrier(void)
+void rcu_barrier(void)
 {
-    atomic_t cpu_count = ATOMIC_INIT(0);
-    return stop_machine_run(rcu_barrier_action, &cpu_count, NR_CPUS);
+    unsigned int n_cpus;
+
+    ASSERT(!in_irq() && local_irq_is_enabled());
+
+    for ( ; ; )
+    {
+        if ( !atomic_read(&pending_count) && get_cpu_maps() )
+        {
+            n_cpus = num_online_cpus();
+
+            if ( atomic_cmpxchg(&pending_count, 0, n_cpus + 1) == 0 )
+                break;
+
+            put_cpu_maps();
+        }
+
+        process_pending_softirqs();
+        cpu_relax();
+    }
+
+    smp_mb__before_atomic();
+    atomic_set(&cpu_count, n_cpus);
+    cpumask_raise_softirq(&cpu_online_map, RCU_SOFTIRQ);
+
+    while ( atomic_read(&pending_count) != 1 )
+    {
+        process_pending_softirqs();
+        cpu_relax();
+    }
+
+    atomic_set(&pending_count, 0);
+
+    put_cpu_maps();
 }
 
 /* Is batch a before batch b ? */
@@ -426,6 +465,13 @@ static void rcu_process_callbacks(void)
         rdp->process_callbacks = false;
         __rcu_process_callbacks(&rcu_ctrlblk, rdp);
     }
+
+    if ( atomic_read(&cpu_count) && !rdp->barrier_active )
+    {
+        rdp->barrier_active = true;
+        rcu_barrier_action();
+        rdp->barrier_active = false;
+    }
 }
 
 static int __rcu_pending(struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index eb9b60df07..31c8b86d13 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -144,7 +144,7 @@ void rcu_check_callbacks(int cpu);
 void call_rcu(struct rcu_head *head, 
               void (*func)(struct rcu_head *head));
 
-int rcu_barrier(void);
+void rcu_barrier(void);
 
 void rcu_idle_enter(unsigned int cpu);
 void rcu_idle_exit(unsigned int cpu);
-- 
2.16.4




* [Xen-devel] [PATCH v7 3/5] xen: don't process rcu callbacks when holding a rcu_read_lock()
  2020-03-25 10:55 [Xen-devel] [PATCH v7 0/5] xen/rcu: let rcu work better with core scheduling Juergen Gross
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 1/5] xen: introduce smp_mb__[after|before]_atomic() barriers Juergen Gross
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 2/5] xen/rcu: don't use stop_machine_run() for rcu_barrier() Juergen Gross
@ 2020-03-25 10:55 ` Juergen Gross
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 4/5] xen/rcu: add assertions to debug build Juergen Gross
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 5/5] xen/rcu: add per-lock counter in debug builds Juergen Gross
  4 siblings, 0 replies; 15+ messages in thread
From: Juergen Gross @ 2020-03-25 10:55 UTC
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Julien Grall, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Jan Beulich

Some keyhandlers call process_pending_softirqs() while holding an
rcu_read_lock(). This is wrong, as process_pending_softirqs() might
activate rcu callbacks, which must not happen inside an
rcu_read_lock() critical section.

For that purpose, modify process_pending_softirqs() so that it does
not allow rcu callback processing while an rcu_read_lock() is held.
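
The resulting mask logic can be sketched as follows. The softirq
numbering and preempt_depth (a stand-in for preempt_count()) are
assumptions for illustration only; the mask computation mirrors the
patched process_pending_softirqs(), and softirq_handled() mirrors the
rcu_allowed test in __do_softirq().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical softirq numbering, for this sketch only. */
enum { TIMER_SOFTIRQ, RCU_SOFTIRQ, SCHED_SLAVE_SOFTIRQ, SCHEDULE_SOFTIRQ };

/* Stand-in for preempt_count(): nonzero while an rcu_read_lock() is held. */
static unsigned int preempt_depth;

/* Mask of softirqs process_pending_softirqs() must not handle. */
static unsigned long softirq_ignore_mask(void)
{
    /* Never enter the scheduler: it could preempt the calling context. */
    unsigned long ignore_mask = (1ul << SCHEDULE_SOFTIRQ) |
                                (1ul << SCHED_SLAVE_SOFTIRQ);

    /* Block RCU processing while an rcu_read_lock() is held. */
    if (preempt_depth)
        ignore_mask |= 1ul << RCU_SOFTIRQ;
    return ignore_mask;
}

/* True if softirq nr would be handled with the given ignore mask. */
static bool softirq_handled(unsigned int nr, unsigned long ignore_mask)
{
    return !(ignore_mask & (1ul << nr));
}
```

Outside a read-side critical section only the scheduling softirqs are
skipped; inside one, RCU_SOFTIRQ is skipped as well, while other
softirqs such as timers are still processed.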

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
V3:
- add RCU_SOFTIRQ to ignore in process_pending_softirqs_norcu()
  (Roger Pau Monné)

V5:
- block rcu processing depending on rcu_read_lock() being held or not
  (Jan Beulich)
---
 xen/common/softirq.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index b83ad96d6c..00d676b62c 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -29,6 +29,7 @@ static void __do_softirq(unsigned long ignore_mask)
 {
     unsigned int i, cpu;
     unsigned long pending;
+    bool rcu_allowed = !(ignore_mask & (1ul << RCU_SOFTIRQ));
 
     for ( ; ; )
     {
@@ -38,7 +39,7 @@ static void __do_softirq(unsigned long ignore_mask)
          */
         cpu = smp_processor_id();
 
-        if ( rcu_pending(cpu) )
+        if ( rcu_allowed && rcu_pending(cpu) )
             rcu_check_callbacks(cpu);
 
         if ( ((pending = (softirq_pending(cpu) & ~ignore_mask)) == 0)
@@ -53,9 +54,16 @@ static void __do_softirq(unsigned long ignore_mask)
 
 void process_pending_softirqs(void)
 {
+    unsigned long ignore_mask = (1ul << SCHEDULE_SOFTIRQ) |
+                                (1ul << SCHED_SLAVE_SOFTIRQ);
+
+    /* Block RCU processing in case of rcu_read_lock() held. */
+    if ( preempt_count() )
+        ignore_mask |= 1ul << RCU_SOFTIRQ;
+
     ASSERT(!in_irq() && local_irq_is_enabled());
     /* Do not enter scheduler as it can preempt the calling context. */
-    __do_softirq((1ul << SCHEDULE_SOFTIRQ) | (1ul << SCHED_SLAVE_SOFTIRQ));
+    __do_softirq(ignore_mask);
 }
 
 void do_softirq(void)
-- 
2.16.4




* [Xen-devel] [PATCH v7 4/5] xen/rcu: add assertions to debug build
  2020-03-25 10:55 [Xen-devel] [PATCH v7 0/5] xen/rcu: let rcu work better with core scheduling Juergen Gross
                   ` (2 preceding siblings ...)
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 3/5] xen: don't process rcu callbacks when holding a rcu_read_lock() Juergen Gross
@ 2020-03-25 10:55 ` Juergen Gross
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 5/5] xen/rcu: add per-lock counter in debug builds Juergen Gross
  4 siblings, 0 replies; 15+ messages in thread
From: Juergen Gross @ 2020-03-25 10:55 UTC
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Julien Grall, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Jan Beulich

Xen's RCU implementation relies on no softirq handling taking place
while in an RCU critical section. Add ASSERT()s in debug builds in
order to catch any violations.

For that purpose, modify rcu_read_[un]lock() to use a dedicated percpu
counter in addition to preempt_[en|dis]able(), as this allows testing
that condition in __do_softirq() (ASSERT_NOT_IN_ATOMIC() is not usable
there, because __cpu_up() calls process_pending_softirqs() while
holding the cpu hotplug lock).

While at it, switch the rcu_read_[un]lock() implementation to static
inline functions instead of macros.
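
A single-cpu sketch of the counting scheme, assuming plain globals in
place of Xen's per-cpu variables and modelling the cpu hotplug lock as
something that only raises preempt_count(), as the explanation above
implies. All sketch_* names are hypothetical. Holding that lock leaves
quiescing allowed, while any rcu read lock, nested or not, forbids it.

```c
#include <assert.h>
#include <stdbool.h>

/* One simulated cpu; in Xen both counters are per-cpu. */
static unsigned int preempt_cnt;    /* models preempt_count() */
static unsigned int rcu_lock_cnt;   /* models this_cpu(rcu_lock_cnt) */

/* Dummy lock type; functions, unlike the old macros, type-check it. */
typedef struct { char unused; } sketch_rcu_lock_t;

/* Models a lock such as the cpu hotplug lock: raises preempt_count() only. */
static inline void sketch_spin_lock(void)   { preempt_cnt++; }
static inline void sketch_spin_unlock(void) { preempt_cnt--; }

static inline bool sketch_rcu_quiesce_allowed(void)
{
    return !rcu_lock_cnt;
}

static inline void sketch_rcu_read_lock(sketch_rcu_lock_t *lock)
{
    (void)lock;
    preempt_cnt++;     /* preempt_disable() is kept ... */
    rcu_lock_cnt++;    /* ... plus the dedicated rcu nesting counter */
}

static inline void sketch_rcu_read_unlock(sketch_rcu_lock_t *lock)
{
    (void)lock;
    assert(!sketch_rcu_quiesce_allowed());   /* catches unlock without lock */
    rcu_lock_cnt--;
    preempt_cnt--;
}
```

Because these are functions rather than the old ((void)(x)) macros,
passing anything other than a pointer to the lock type now produces a
compiler diagnostic.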

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
V3:
- add barriers to rcu_[en|dis]able() (Roger Pau Monné)
- add rcu_quiesce_allowed() to ASSERT_NOT_IN_ATOMIC (Roger Pau Monné)
- convert macros to static inline functions
- add sanity check in rcu_read_unlock()

V4:
- use barrier() in rcu_[en|dis]able() (Julien Grall)

V5:
- use rcu counter even if not using a debug build

V6:
- keep preempt_[dis|en]able() calls
---
 xen/common/rcupdate.c      |  2 ++
 xen/common/softirq.c       |  4 +++-
 xen/include/xen/rcupdate.h | 37 ++++++++++++++++++++++++++++++++++---
 3 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index 12b89565d0..01b21951e0 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -46,6 +46,8 @@
 #include <xen/cpu.h>
 #include <xen/stop_machine.h>
 
+DEFINE_PER_CPU(unsigned int, rcu_lock_cnt);
+
 /* Global control variables for rcupdate callback mechanism. */
 static struct rcu_ctrlblk {
     long cur;           /* Current batch number.                      */
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index 00d676b62c..eba65c5fc0 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -31,6 +31,8 @@ static void __do_softirq(unsigned long ignore_mask)
     unsigned long pending;
     bool rcu_allowed = !(ignore_mask & (1ul << RCU_SOFTIRQ));
 
+    ASSERT(!rcu_allowed || rcu_quiesce_allowed());
+
     for ( ; ; )
     {
         /*
@@ -58,7 +60,7 @@ void process_pending_softirqs(void)
                                 (1ul << SCHED_SLAVE_SOFTIRQ);
 
     /* Block RCU processing in case of rcu_read_lock() held. */
-    if ( preempt_count() )
+    if ( !rcu_quiesce_allowed() )
         ignore_mask |= 1ul << RCU_SOFTIRQ;
 
     ASSERT(!in_irq() && local_irq_is_enabled());
diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index 31c8b86d13..6f2587058e 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -32,12 +32,35 @@
 #define __XEN_RCUPDATE_H
 
 #include <xen/cache.h>
+#include <xen/compiler.h>
 #include <xen/spinlock.h>
 #include <xen/cpumask.h>
+#include <xen/percpu.h>
 #include <xen/preempt.h>
 
 #define __rcu
 
+DECLARE_PER_CPU(unsigned int, rcu_lock_cnt);
+
+static inline void rcu_quiesce_disable(void)
+{
+    preempt_disable();
+    this_cpu(rcu_lock_cnt)++;
+    barrier();
+}
+
+static inline void rcu_quiesce_enable(void)
+{
+    barrier();
+    this_cpu(rcu_lock_cnt)--;
+    preempt_enable();
+}
+
+static inline bool rcu_quiesce_allowed(void)
+{
+    return !this_cpu(rcu_lock_cnt);
+}
+
 /**
  * struct rcu_head - callback structure for use with RCU
  * @next: next update requests in a list
@@ -91,16 +114,24 @@ typedef struct _rcu_read_lock rcu_read_lock_t;
  * will be deferred until the outermost RCU read-side critical section
  * completes.
  *
- * It is illegal to block while in an RCU read-side critical section.
+ * It is illegal to process softirqs or block while in an RCU read-side
+ * critical section.
  */
-#define rcu_read_lock(x)       ({ ((void)(x)); preempt_disable(); })
+static inline void rcu_read_lock(rcu_read_lock_t *lock)
+{
+    rcu_quiesce_disable();
+}
 
 /**
  * rcu_read_unlock - marks the end of an RCU read-side critical section.
  *
  * See rcu_read_lock() for more information.
  */
-#define rcu_read_unlock(x)     ({ ((void)(x)); preempt_enable(); })
+static inline void rcu_read_unlock(rcu_read_lock_t *lock)
+{
+    ASSERT(!rcu_quiesce_allowed());
+    rcu_quiesce_enable();
+}
 
 /*
  * So where is rcu_write_lock()?  It does not exist, as there is no
-- 
2.16.4




* [Xen-devel] [PATCH v7 5/5] xen/rcu: add per-lock counter in debug builds
  2020-03-25 10:55 [Xen-devel] [PATCH v7 0/5] xen/rcu: let rcu work better with core scheduling Juergen Gross
                   ` (3 preceding siblings ...)
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 4/5] xen/rcu: add assertions to debug build Juergen Gross
@ 2020-03-25 10:55 ` Juergen Gross
  4 siblings, 0 replies; 15+ messages in thread
From: Juergen Gross @ 2020-03-25 10:55 UTC
  To: xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Julien Grall, Wei Liu,
	Andrew Cooper, Ian Jackson, George Dunlap, Jan Beulich

Add a lock specific counter to rcu read locks in debug builds. This
allows testing for matching lock/unlock calls.

This will help to avoid cases like the one fixed by commit
98ed1f43cc2c89, where different rcu read locks were referenced in the
lock and unlock calls.
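
A sketch of the per-lock check, assuming C11 atomics and a single
modelled cpu. dbg_rcu_read_lock_t, lock_a, lock_b and the -1 return
value are hypothetical; the debug build ASSERT()s instead of returning
an error. The counter is atomic because one lock object is shared by
all cpus.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical debug lock type carrying a per-lock counter. */
typedef struct { atomic_int cnt; } dbg_rcu_read_lock_t;

static dbg_rcu_read_lock_t lock_a, lock_b;   /* zero-initialized */
static unsigned int rcu_lock_cnt;            /* per-cpu depth, one cpu modelled */

static void dbg_rcu_read_lock(dbg_rcu_read_lock_t *lock)
{
    rcu_lock_cnt++;
    atomic_fetch_add(&lock->cnt, 1);
}

/* Returns 0 for a matched unlock, -1 if no cpu holds this lock. */
static int dbg_rcu_read_unlock(dbg_rcu_read_lock_t *lock)
{
    if (atomic_load(&lock->cnt) == 0)
        return -1;          /* the debug build would ASSERT() here */
    atomic_fetch_sub(&lock->cnt, 1);
    rcu_lock_cnt--;
    return 0;
}
```

The check can only catch an unlock of a lock that no cpu currently
holds: if another cpu happens to hold the wrongly referenced lock, its
counter is nonzero and the mismatch goes unnoticed.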

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
V5:
- updated commit message (Jan Beulich)
---
 xen/include/xen/rcupdate.h | 46 +++++++++++++++++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 13 deletions(-)

diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index 6f2587058e..cda1be9c88 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -37,21 +37,50 @@
 #include <xen/cpumask.h>
 #include <xen/percpu.h>
 #include <xen/preempt.h>
+#include <asm/atomic.h>
 
 #define __rcu
 
+#ifndef NDEBUG
+/* Lock type for passing to rcu_read_{lock,unlock}. */
+struct _rcu_read_lock {
+    atomic_t cnt;
+};
+typedef struct _rcu_read_lock rcu_read_lock_t;
+#define DEFINE_RCU_READ_LOCK(x) rcu_read_lock_t x = { .cnt = ATOMIC_INIT(0) }
+#define RCU_READ_LOCK_INIT(x)   atomic_set(&(x)->cnt, 0)
+
+#else
+/*
+ * Dummy lock type for passing to rcu_read_{lock,unlock}. Currently exists
+ * only to document the reason for rcu_read_lock() critical sections.
+ */
+struct _rcu_read_lock {};
+typedef struct _rcu_read_lock rcu_read_lock_t;
+#define DEFINE_RCU_READ_LOCK(x) rcu_read_lock_t x
+#define RCU_READ_LOCK_INIT(x)
+
+#endif
+
 DECLARE_PER_CPU(unsigned int, rcu_lock_cnt);
 
-static inline void rcu_quiesce_disable(void)
+static inline void rcu_quiesce_disable(rcu_read_lock_t *lock)
 {
     preempt_disable();
     this_cpu(rcu_lock_cnt)++;
+#ifndef NDEBUG
+    atomic_inc(&lock->cnt);
+#endif
     barrier();
 }
 
-static inline void rcu_quiesce_enable(void)
+static inline void rcu_quiesce_enable(rcu_read_lock_t *lock)
 {
     barrier();
+#ifndef NDEBUG
+    ASSERT(atomic_read(&lock->cnt));
+    atomic_dec(&lock->cnt);
+#endif
     this_cpu(rcu_lock_cnt)--;
     preempt_enable();
 }
@@ -81,15 +110,6 @@ struct rcu_head {
 int rcu_pending(int cpu);
 int rcu_needs_cpu(int cpu);
 
-/*
- * Dummy lock type for passing to rcu_read_{lock,unlock}. Currently exists
- * only to document the reason for rcu_read_lock() critical sections.
- */
-struct _rcu_read_lock {};
-typedef struct _rcu_read_lock rcu_read_lock_t;
-#define DEFINE_RCU_READ_LOCK(x) rcu_read_lock_t x
-#define RCU_READ_LOCK_INIT(x)
-
 /**
  * rcu_read_lock - mark the beginning of an RCU read-side critical section.
  *
@@ -119,7 +139,7 @@ typedef struct _rcu_read_lock rcu_read_lock_t;
  */
 static inline void rcu_read_lock(rcu_read_lock_t *lock)
 {
-    rcu_quiesce_disable();
+    rcu_quiesce_disable(lock);
 }
 
 /**
@@ -130,7 +150,7 @@ static inline void rcu_read_lock(rcu_read_lock_t *lock)
 static inline void rcu_read_unlock(rcu_read_lock_t *lock)
 {
     ASSERT(!rcu_quiesce_allowed());
-    rcu_quiesce_enable();
+    rcu_quiesce_enable(lock);
 }
 
 /*
-- 
2.16.4




* Re: [Xen-devel] [PATCH v7 1/5] xen: introduce smp_mb__[after|before]_atomic() barriers
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 1/5] xen: introduce smp_mb__[after|before]_atomic() barriers Juergen Gross
@ 2020-03-25 13:17   ` Jan Beulich
  2020-03-25 16:20   ` Julien Grall
  1 sibling, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2020-03-25 13:17 UTC
  To: Juergen Gross
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Andrew Cooper,
	xen-devel, Volodymyr Babchuk, Roger Pau Monné

On 25.03.2020 11:55, Juergen Gross wrote:
> When using atomic variables for synchronization barriers are needed
> to ensure proper data serialization. Introduce smp_mb__before_atomic()
> and smp_mb__after_atomic() as in the Linux kernel for that purpose.
> 
> Use the same definitions as in the Linux kernel.
> 
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Juergen Gross <jgross@suse.com>

Acked-by: Jan Beulich <jbeulich@suse.com>




* Re: [Xen-devel] [PATCH v7 2/5] xen/rcu: don't use stop_machine_run() for rcu_barrier()
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 2/5] xen/rcu: don't use stop_machine_run() for rcu_barrier() Juergen Gross
@ 2020-03-25 13:19   ` Jan Beulich
  2020-03-25 16:13   ` Julien Grall
  1 sibling, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2020-03-25 13:19 UTC
  To: Juergen Gross
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, xen-devel

On 25.03.2020 11:55, Juergen Gross wrote:
> Today rcu_barrier() is calling stop_machine_run() to synchronize all
> physical cpus in order to ensure all pending rcu calls have finished
> when returning.
> 
> As stop_machine_run() is using tasklets this requires scheduling of
> idle vcpus on all cpus imposing the need to call rcu_barrier() on idle
> cpus only in case of core scheduling being active, as otherwise a
> scheduling deadlock would occur.
> 
> There is no need at all to do the syncing of the cpus in tasklets, as
> rcu activity is started in __do_softirq() called whenever softirq
> activity is allowed. So rcu_barrier() can easily be modified to use
> softirq for synchronization of the cpus no longer requiring any
> scheduling activity.
> 
> As there already is a rcu softirq reuse that for the synchronization.
> 
> Remove the barrier element from struct rcu_data as it isn't used.
> 
> Finally switch rcu_barrier() to return void as it now can never fail.
> 
> Partially-based-on-patch-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> Signed-off-by: Juergen Gross <jgross@suse.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



* Re: [Xen-devel] [PATCH v7 2/5] xen/rcu: don't use stop_machine_run() for rcu_barrier()
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 2/5] xen/rcu: don't use stop_machine_run() for rcu_barrier() Juergen Gross
  2020-03-25 13:19   ` Jan Beulich
@ 2020-03-25 16:13   ` Julien Grall
  2020-03-26  6:58     ` Jan Beulich
  1 sibling, 1 reply; 15+ messages in thread
From: Julien Grall @ 2020-03-25 16:13 UTC
  To: Juergen Gross, xen-devel
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	George Dunlap, Jan Beulich

Hi Juergen,

On 25/03/2020 10:55, Juergen Gross wrote:
> Today rcu_barrier() is calling stop_machine_run() to synchronize all
> physical cpus in order to ensure all pending rcu calls have finished
> when returning.
> 
> As stop_machine_run() is using tasklets this requires scheduling of
> idle vcpus on all cpus imposing the need to call rcu_barrier() on idle
> cpus only in case of core scheduling being active, as otherwise a
> scheduling deadlock would occur.
> 
> There is no need at all to do the syncing of the cpus in tasklets, as
> rcu activity is started in __do_softirq() called whenever softirq
> activity is allowed. So rcu_barrier() can easily be modified to use
> softirq for synchronization of the cpus no longer requiring any
> scheduling activity.
> 
> As there already is a rcu softirq reuse that for the synchronization.
> 
> Remove the barrier element from struct rcu_data as it isn't used.
> 
> Finally switch rcu_barrier() to return void as it now can never fail.
> 
> Partially-based-on-patch-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> V2:
> - add recursion detection
> 
> V3:
> - fix races (Igor Druzhinin)
> 
> V5:
> - rename done_count to pending_count (Jan Beulich)
> - fix race (Jan Beulich)
> 
> V6:
> - add barrier (Julien Grall)
> - add ASSERT() (Julien Grall)
> - hold cpu_map lock until end of rcu_barrier() (Julien Grall)
> 
> V7:
> - update comment (Jan Beulich)
> - add barriers (Jan Beulich)
> ---
>   xen/common/rcupdate.c      | 100 +++++++++++++++++++++++++++++++++------------
>   xen/include/xen/rcupdate.h |   2 +-
>   2 files changed, 74 insertions(+), 28 deletions(-)
> 
> diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
> index 03d84764d2..12b89565d0 100644
> --- a/xen/common/rcupdate.c
> +++ b/xen/common/rcupdate.c
> @@ -83,7 +83,6 @@ struct rcu_data {
>       struct rcu_head **donetail;
>       long            blimit;           /* Upper limit on a processed batch */
>       int cpu;
> -    struct rcu_head barrier;
>       long            last_rs_qlen;     /* qlen during the last resched */
>   
>       /* 3) idle CPUs handling */
> @@ -91,6 +90,7 @@ struct rcu_data {
>       bool idle_timer_active;
>   
>       bool            process_callbacks;
> +    bool            barrier_active;
>   };
>   
>   /*
> @@ -143,51 +143,90 @@ static int qhimark = 10000;
>   static int qlowmark = 100;
>   static int rsinterval = 1000;
>   
> -struct rcu_barrier_data {
> -    struct rcu_head head;
> -    atomic_t *cpu_count;
> -};
> +/*
> + * rcu_barrier() handling:
> + * Two counters are used to synchronize rcu_barrier() work:
> + * - cpu_count holds the number of cpus required to finish barrier handling.
> + *   It is decremented by each cpu when it has performed all pending rcu calls.
> + * - pending_count shows whether any rcu_barrier() activity is running and
> + *   it is used to synchronize leaving rcu_barrier() only after all cpus
> + *   have finished their processing. pending_count is initialized to nr_cpus + 1
> + *   and it is decremented by each cpu when it has seen that cpu_count has
> + *   reached 0. The cpu where rcu_barrier() has been called will wait until
> + *   pending_count has been decremented to 1 (so all cpus have seen cpu_count
> + *   reaching 0) and will then set pending_count to 0 indicating there is no
> + *   rcu_barrier() running.
> + * Cpus are synchronized via softirq mechanism. rcu_barrier() is regarded to
> + * be active if pending_count is not zero. In case rcu_barrier() is called on
> + * multiple cpus it is enough to check for pending_count being not zero on entry
> + * and to call process_pending_softirqs() in a loop until pending_count drops to
> + * zero, before starting the new rcu_barrier() processing.
> + */
> +static atomic_t cpu_count = ATOMIC_INIT(0);
> +static atomic_t pending_count = ATOMIC_INIT(0);
>   
>   static void rcu_barrier_callback(struct rcu_head *head)
>   {
> -    struct rcu_barrier_data *data = container_of(
> -        head, struct rcu_barrier_data, head);
> -    atomic_inc(data->cpu_count);
> +    smp_mb__before_atomic();     /* Make all writes visible to other cpus. */

smp_mb__before_atomic() will order both reads and writes. However, the
comment suggests only the writes are required to be ordered.

So either the barrier is too strong or the comment is incorrect. Can you 
clarify it?

> +    atomic_dec(&cpu_count);
>   }
>   
> -static int rcu_barrier_action(void *_cpu_count)
> +static void rcu_barrier_action(void)
>   {
> -    struct rcu_barrier_data data = { .cpu_count = _cpu_count };
> -
> -    ASSERT(!local_irq_is_enabled());
> -    local_irq_enable();
> +    struct rcu_head head;
>   
>       /*
>        * When callback is executed, all previously-queued RCU work on this CPU
> -     * is completed. When all CPUs have executed their callback, data.cpu_count
> -     * will have been incremented to include every online CPU.
> +     * is completed. When all CPUs have executed their callback, cpu_count
> +     * will have been decremented to 0.
>        */
> -    call_rcu(&data.head, rcu_barrier_callback);
> +    call_rcu(&head, rcu_barrier_callback);
>   
> -    while ( atomic_read(data.cpu_count) != num_online_cpus() )
> +    while ( atomic_read(&cpu_count) )
>       {
>           process_pending_softirqs();
>           cpu_relax();
>       }
>   
> -    local_irq_disable();
> -
> -    return 0;
> +    smp_mb__before_atomic();
> +    atomic_dec(&pending_count);
>   }
>   
> -/*
> - * As rcu_barrier() is using stop_machine_run() it is allowed to be used in
> - * idle context only (see comment for stop_machine_run()).
> - */
> -int rcu_barrier(void)
> +void rcu_barrier(void)
>   {
> -    atomic_t cpu_count = ATOMIC_INIT(0);
> -    return stop_machine_run(rcu_barrier_action, &cpu_count, NR_CPUS);
> +    unsigned int n_cpus;
> +
> +    ASSERT(!in_irq() && local_irq_is_enabled());
> +
> +    for ( ; ; )
> +    {
> +        if ( !atomic_read(&pending_count) && get_cpu_maps() )
> +        {
> +            n_cpus = num_online_cpus();
> +
> +            if ( atomic_cmpxchg(&pending_count, 0, n_cpus + 1) == 0 )
> +                break;
> +
> +            put_cpu_maps();
> +        }
> +
> +        process_pending_softirqs();
> +        cpu_relax();
> +    }
> +
> +    smp_mb__before_atomic();

Our semantics of atomic_cmpxchg() are exactly the same as Linux's, i.e. it 
will contain a full barrier when the cmpxchg succeeds. So why do you need 
this barrier?
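To illustrate the point, here is a minimal sketch in portable C11. It is an analogy only: Xen's atomic_cmpxchg() is not built on <stdatomic.h>, and claim_barrier() is a hypothetical helper mirroring the claiming step in rcu_barrier(). atomic_compare_exchange_strong() defaults to memory_order_seq_cst, so a successful exchange has both acquire and release semantics and no separate fence is needed before the following store to cpu_count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two counters in the patch. */
static atomic_int pending_count = 0;
static atomic_int cpu_count = 0;

/*
 * Try to start a new barrier round for n_cpus CPUs (hypothetical helper).
 * The seq_cst compare-exchange is fully ordered on success, so the
 * relaxed store below cannot become visible before the claim.
 */
static bool claim_barrier(int n_cpus)
{
    int expected = 0;

    if ( !atomic_compare_exchange_strong(&pending_count, &expected,
                                         n_cpus + 1) )
        return false;           /* another round is still in progress */

    /* No extra fence between the exchange and this store. */
    atomic_store_explicit(&cpu_count, n_cpus, memory_order_relaxed);
    return true;
}
```

A second claim while pending_count is non-zero fails and leaves both counters untouched.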

> +    atomic_set(&cpu_count, n_cpus);
> +    cpumask_raise_softirq(&cpu_online_map, RCU_SOFTIRQ);
> +
> +    while ( atomic_read(&pending_count) != 1 )
> +    {
> +        process_pending_softirqs();
> +        cpu_relax();
> +    }
> +
> +    atomic_set(&pending_count, 0);
> +
> +    put_cpu_maps();
>   }
>   
>   /* Is batch a before batch b ? */
> @@ -426,6 +465,13 @@ static void rcu_process_callbacks(void)
>           rdp->process_callbacks = false;
>           __rcu_process_callbacks(&rcu_ctrlblk, rdp);
>       }
> +
> +    if ( atomic_read(&cpu_count) && !rdp->barrier_active )
> +    {
> +        rdp->barrier_active = true;
> +        rcu_barrier_action();
> +        rdp->barrier_active = false;
> +    }
>   }
>   
>   static int __rcu_pending(struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
> diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
> index eb9b60df07..31c8b86d13 100644
> --- a/xen/include/xen/rcupdate.h
> +++ b/xen/include/xen/rcupdate.h
> @@ -144,7 +144,7 @@ void rcu_check_callbacks(int cpu);
>   void call_rcu(struct rcu_head *head,
>                 void (*func)(struct rcu_head *head));
>   
> -int rcu_barrier(void);
> +void rcu_barrier(void);
>   
>   void rcu_idle_enter(unsigned int cpu);
>   void rcu_idle_exit(unsigned int cpu);
> 
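The two-counter scheme described in the comment block above can be exercised with a single-threaded simulation. This is a hypothetical sketch, not Xen code: sim_step() stands in for one pass of the RCU softirq on a CPU, the polling loop in sim_rcu_barrier() stands in for process_pending_softirqs(), every CPU's queued RCU work is assumed to complete on its first softirq pass, and C11 <stdatomic.h> replaces Xen's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_SIM_CPUS 4                   /* hypothetical CPU count */

static atomic_int cpu_count = 0;        /* CPUs still running their callback */
static atomic_int pending_count = 0;    /* 0: no barrier; else CPUs + 1 */

/* Per-CPU state of the simulation (hypothetical). */
struct sim_cpu {
    bool softirq_pending;   /* RCU softirq raised for this CPU */
    bool callback_done;     /* its barrier callback has run */
    bool finished;          /* it has decremented pending_count */
};

static struct sim_cpu cpus[NR_SIM_CPUS];

/* One softirq pass on a CPU; returns true if it made progress. */
static bool sim_step(struct sim_cpu *c)
{
    if ( !c->softirq_pending || c->finished )
        return false;

    if ( !c->callback_done )
    {
        /* Queued RCU work done: the barrier callback drops cpu_count. */
        c->callback_done = true;
        atomic_fetch_sub(&cpu_count, 1);
        return true;
    }

    if ( atomic_load(&cpu_count) == 0 )
    {
        /* This CPU has seen cpu_count reach 0. */
        c->finished = true;
        c->softirq_pending = false;
        atomic_fetch_sub(&pending_count, 1);
        return true;
    }

    return false;           /* still waiting for other CPUs */
}

/* Initiator: returns the number of polling rounds, or -1 if busy. */
static int sim_rcu_barrier(void)
{
    int expected = 0, rounds = 0;

    if ( !atomic_compare_exchange_strong(&pending_count, &expected,
                                         NR_SIM_CPUS + 1) )
        return -1;          /* a barrier is already running */

    atomic_store(&cpu_count, NR_SIM_CPUS);
    for ( int i = 0; i < NR_SIM_CPUS; i++ )     /* "raise softirq" */
        cpus[i] = (struct sim_cpu){ .softirq_pending = true };

    /* Wait until every CPU has seen cpu_count reach 0. */
    while ( atomic_load(&pending_count) != 1 )
    {
        for ( int i = 0; i < NR_SIM_CPUS; i++ )
            sim_step(&cpus[i]);
        rounds++;
    }

    atomic_store(&pending_count, 0);    /* no barrier running any more */
    return rounds;
}
```

With four simulated CPUs, the first pass drives cpu_count from 4 to 0 and the second drives pending_count from 5 to 1, so sim_rcu_barrier() returns 2.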

Cheers,

-- 
Julien Grall



* Re: [Xen-devel] [PATCH v7 1/5] xen: introduce smp_mb__[after|before]_atomic() barriers
  2020-03-25 10:55 ` [Xen-devel] [PATCH v7 1/5] xen: introduce smp_mb__[after|before]_atomic() barriers Juergen Gross
  2020-03-25 13:17   ` Jan Beulich
@ 2020-03-25 16:20   ` Julien Grall
  1 sibling, 0 replies; 15+ messages in thread
From: Julien Grall @ 2020-03-25 16:20 UTC (permalink / raw)
  To: Juergen Gross, xen-devel
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Jan Beulich,
	Volodymyr Babchuk, Roger Pau Monné

Hi Juergen,

On 25/03/2020 10:55, Juergen Gross wrote:
> When using atomic variables for synchronization, barriers are needed
> to ensure proper data serialization. Introduce smp_mb__before_atomic()
> and smp_mb__after_atomic() as in the Linux kernel for that purpose.
> 
> Use the same definitions as in the Linux kernel.
> 
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Juergen Gross <jgross@suse.com>

Acked-by: Julien Grall <jgrall@amazon.com>

Cheers,

> ---
> V7:
> - new patch
> ---
>   xen/include/asm-arm/system.h | 3 +++
>   xen/include/asm-x86/system.h | 3 +++
>   2 files changed, 6 insertions(+)
> 
> diff --git a/xen/include/asm-arm/system.h b/xen/include/asm-arm/system.h
> index e5d062667d..65d5c8e423 100644
> --- a/xen/include/asm-arm/system.h
> +++ b/xen/include/asm-arm/system.h
> @@ -30,6 +30,9 @@
>   
>   #define smp_wmb()       dmb(ishst)
>   
> +#define smp_mb__before_atomic()    smp_mb()
> +#define smp_mb__after_atomic()     smp_mb()
> +
>   /*
>    * This is used to ensure the compiler did actually allocate the register we
>    * asked it for some inline assembly sequences.  Apparently we can't trust
> diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
> index 069f422f0d..7e5891f3df 100644
> --- a/xen/include/asm-x86/system.h
> +++ b/xen/include/asm-x86/system.h
> @@ -233,6 +233,9 @@ static always_inline unsigned long __xadd(
>   #define set_mb(var, value) do { xchg(&var, value); } while (0)
>   #define set_wmb(var, value) do { var = value; smp_wmb(); } while (0)
>   
> +#define smp_mb__before_atomic()    do { } while (0)
> +#define smp_mb__after_atomic()     do { } while (0)
> +
>   /**
>    * array_index_mask_nospec() - generate a mask that is ~0UL when the
>    *      bounds check succeeds and 0 otherwise
> 
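The Arm hunk maps both macros to a full smp_mb(), while x86 makes them no-ops, presumably because x86 locked read-modify-write instructions are already fully ordered. As a rough portable analogy (hypothetical sketch_* names, C11 <stdatomic.h> rather than Xen's atomic_t), the full-barrier variant corresponds to seq_cst fences placed around a relaxed read-modify-write, with the waiter pairing via an acquire fence:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical C11 analogues of the full-barrier (Arm) definitions. */
#define sketch_mb__before_atomic() atomic_thread_fence(memory_order_seq_cst)
#define sketch_mb__after_atomic()  atomic_thread_fence(memory_order_seq_cst)

static int payload[2];                  /* one slot per reporting CPU */
static atomic_int reports_left = 2;

/* Publish a value, then count down with a relaxed RMW. */
static void report(int cpu, int value)
{
    payload[cpu] = value;
    sketch_mb__before_atomic();         /* store ordered before the decrement */
    atomic_fetch_sub_explicit(&reports_left, 1, memory_order_relaxed);
    sketch_mb__after_atomic();          /* decrement ordered before later accesses */
}

/* Waiter: once the counter is 0, the acquire fence makes payload visible. */
static int collect(void)
{
    while ( atomic_load_explicit(&reports_left, memory_order_relaxed) != 0 )
        ;                               /* spin */
    atomic_thread_fence(memory_order_acquire);
    return payload[0] + payload[1];
}
```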

-- 
Julien Grall



* Re: [Xen-devel] [PATCH v7 2/5] xen/rcu: don't use stop_machine_run() for rcu_barrier()
  2020-03-25 16:13   ` Julien Grall
@ 2020-03-26  6:58     ` Jan Beulich
  2020-03-26  7:24       ` Jürgen Groß
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2020-03-26  6:58 UTC (permalink / raw)
  To: Julien Grall
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu, Andrew Cooper,
	Ian Jackson, George Dunlap, xen-devel

On 25.03.2020 17:13, Julien Grall wrote:
> On 25/03/2020 10:55, Juergen Gross wrote:
>> @@ -143,51 +143,90 @@ static int qhimark = 10000;
>>   static int qlowmark = 100;
>>   static int rsinterval = 1000;
>>   -struct rcu_barrier_data {
>> -    struct rcu_head head;
>> -    atomic_t *cpu_count;
>> -};
>> +/*
>> + * rcu_barrier() handling:
>> + * Two counters are used to synchronize rcu_barrier() work:
>> + * - cpu_count holds the number of cpus required to finish barrier handling.
>> + *   It is decremented by each cpu when it has performed all pending rcu calls.
>> + * - pending_count shows whether any rcu_barrier() activity is running and
>> + *   it is used to synchronize leaving rcu_barrier() only after all cpus
>> + *   have finished their processing. pending_count is initialized to nr_cpus + 1
>> + *   and it is decremented by each cpu when it has seen that cpu_count has
>> + *   reached 0. The cpu where rcu_barrier() has been called will wait until
>> + *   pending_count has been decremented to 1 (so all cpus have seen cpu_count
>> + *   reaching 0) and will then set pending_count to 0 indicating there is no
>> + *   rcu_barrier() running.
>> + * Cpus are synchronized via softirq mechanism. rcu_barrier() is regarded to
>> + * be active if pending_count is not zero. In case rcu_barrier() is called on
>> + * multiple cpus it is enough to check for pending_count being not zero on entry
>> + * and to call process_pending_softirqs() in a loop until pending_count drops to
>> + * zero, before starting the new rcu_barrier() processing.
>> + */
>> +static atomic_t cpu_count = ATOMIC_INIT(0);
>> +static atomic_t pending_count = ATOMIC_INIT(0);
>>     static void rcu_barrier_callback(struct rcu_head *head)
>>   {
>> -    struct rcu_barrier_data *data = container_of(
>> -        head, struct rcu_barrier_data, head);
>> -    atomic_inc(data->cpu_count);
>> +    smp_mb__before_atomic();     /* Make all writes visible to other cpus. */
> 
> smp_mb__before_atomic() will order both reads and writes. However, the
> comment suggests only the writes are required to be ordered.
> 
> So either the barrier is too strong or the comment is incorrect. Can
> you clarify it?

Neither is the case, I guess: There simply is no smp_wmb__before_atomic()
in Linux, and if we want to follow their model we shouldn't have one
either. I'd rather take the comment to indicate that if one appeared, it
could be used here.

>> +    atomic_dec(&cpu_count);
>>   }
>>   -static int rcu_barrier_action(void *_cpu_count)
>> +static void rcu_barrier_action(void)
>>   {
>> -    struct rcu_barrier_data data = { .cpu_count = _cpu_count };
>> -
>> -    ASSERT(!local_irq_is_enabled());
>> -    local_irq_enable();
>> +    struct rcu_head head;
>>         /*
>>        * When callback is executed, all previously-queued RCU work on this CPU
>> -     * is completed. When all CPUs have executed their callback, data.cpu_count
>> -     * will have been incremented to include every online CPU.
>> +     * is completed. When all CPUs have executed their callback, cpu_count
>> +     * will have been decremented to 0.
>>        */
>> -    call_rcu(&data.head, rcu_barrier_callback);
>> +    call_rcu(&head, rcu_barrier_callback);
>>   -    while ( atomic_read(data.cpu_count) != num_online_cpus() )
>> +    while ( atomic_read(&cpu_count) )
>>       {
>>           process_pending_softirqs();
>>           cpu_relax();
>>       }
>>   -    local_irq_disable();
>> -
>> -    return 0;
>> +    smp_mb__before_atomic();
>> +    atomic_dec(&pending_count);
>>   }
>>   -/*
>> - * As rcu_barrier() is using stop_machine_run() it is allowed to be used in
>> - * idle context only (see comment for stop_machine_run()).
>> - */
>> -int rcu_barrier(void)
>> +void rcu_barrier(void)
>>   {
>> -    atomic_t cpu_count = ATOMIC_INIT(0);
>> -    return stop_machine_run(rcu_barrier_action, &cpu_count, NR_CPUS);
>> +    unsigned int n_cpus;
>> +
>> +    ASSERT(!in_irq() && local_irq_is_enabled());
>> +
>> +    for ( ; ; )
>> +    {
>> +        if ( !atomic_read(&pending_count) && get_cpu_maps() )
>> +        {
>> +            n_cpus = num_online_cpus();
>> +
>> +            if ( atomic_cmpxchg(&pending_count, 0, n_cpus + 1) == 0 )
>> +                break;
>> +
>> +            put_cpu_maps();
>> +        }
>> +
>> +        process_pending_softirqs();
>> +        cpu_relax();
>> +    }
>> +
>> +    smp_mb__before_atomic();
> 
> Our semantics of atomic_cmpxchg() are exactly the same as Linux's, i.e.
> it will contain a full barrier when the cmpxchg succeeds. So why do you need this barrier?

It was me, I think, who (wrongly) suggested a barrier was missing
here, sorry.

Jan



* Re: [Xen-devel] [PATCH v7 2/5] xen/rcu: don't use stop_machine_run() for rcu_barrier()
  2020-03-26  6:58     ` Jan Beulich
@ 2020-03-26  7:24       ` Jürgen Groß
  2020-03-26  8:49         ` Jan Beulich
  0 siblings, 1 reply; 15+ messages in thread
From: Jürgen Groß @ 2020-03-26  7:24 UTC (permalink / raw)
  To: Jan Beulich, Julien Grall
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	George Dunlap, xen-devel

On 26.03.20 07:58, Jan Beulich wrote:
> On 25.03.2020 17:13, Julien Grall wrote:
>> On 25/03/2020 10:55, Juergen Gross wrote:
>>> @@ -143,51 +143,90 @@ static int qhimark = 10000;
>>>    static int qlowmark = 100;
>>>    static int rsinterval = 1000;
>>>    -struct rcu_barrier_data {
>>> -    struct rcu_head head;
>>> -    atomic_t *cpu_count;
>>> -};
>>> +/*
>>> + * rcu_barrier() handling:
>>> + * Two counters are used to synchronize rcu_barrier() work:
>>> + * - cpu_count holds the number of cpus required to finish barrier handling.
>>> + *   It is decremented by each cpu when it has performed all pending rcu calls.
>>> + * - pending_count shows whether any rcu_barrier() activity is running and
>>> + *   it is used to synchronize leaving rcu_barrier() only after all cpus
>>> + *   have finished their processing. pending_count is initialized to nr_cpus + 1
>>> + *   and it is decremented by each cpu when it has seen that cpu_count has
>>> + *   reached 0. The cpu where rcu_barrier() has been called will wait until
>>> + *   pending_count has been decremented to 1 (so all cpus have seen cpu_count
>>> + *   reaching 0) and will then set pending_count to 0 indicating there is no
>>> + *   rcu_barrier() running.
>>> + * Cpus are synchronized via softirq mechanism. rcu_barrier() is regarded to
>>> + * be active if pending_count is not zero. In case rcu_barrier() is called on
>>> + * multiple cpus it is enough to check for pending_count being not zero on entry
>>> + * and to call process_pending_softirqs() in a loop until pending_count drops to
>>> + * zero, before starting the new rcu_barrier() processing.
>>> + */
>>> +static atomic_t cpu_count = ATOMIC_INIT(0);
>>> +static atomic_t pending_count = ATOMIC_INIT(0);
>>>      static void rcu_barrier_callback(struct rcu_head *head)
>>>    {
>>> -    struct rcu_barrier_data *data = container_of(
>>> -        head, struct rcu_barrier_data, head);
>>> -    atomic_inc(data->cpu_count);
>>> +    smp_mb__before_atomic();     /* Make all writes visible to other cpus. */
>>
>> smp_mb__before_atomic() will order both reads and writes. However, the
>> comment suggests only the writes are required to be ordered.
>>
>> So either the barrier is too strong or the comment is incorrect. Can
>> you clarify it?
> 
> Neither is the case, I guess: There simply is no smp_wmb__before_atomic()
> in Linux, and if we want to follow their model we shouldn't have one
> either. I'd rather take the comment to indicate that if one appeared, it
> could be used here.

Right. Currently we have the choice of either using
smp_mb__before_atomic() which is too strong for Arm, or smp_wmb() which
is too strong for x86.

> 
>>> +    atomic_dec(&cpu_count);
>>>    }
>>>    -static int rcu_barrier_action(void *_cpu_count)
>>> +static void rcu_barrier_action(void)
>>>    {
>>> -    struct rcu_barrier_data data = { .cpu_count = _cpu_count };
>>> -
>>> -    ASSERT(!local_irq_is_enabled());
>>> -    local_irq_enable();
>>> +    struct rcu_head head;
>>>          /*
>>>         * When callback is executed, all previously-queued RCU work on this CPU
>>> -     * is completed. When all CPUs have executed their callback, data.cpu_count
>>> -     * will have been incremented to include every online CPU.
>>> +     * is completed. When all CPUs have executed their callback, cpu_count
>>> +     * will have been decremented to 0.
>>>         */
>>> -    call_rcu(&data.head, rcu_barrier_callback);
>>> +    call_rcu(&head, rcu_barrier_callback);
>>>    -    while ( atomic_read(data.cpu_count) != num_online_cpus() )
>>> +    while ( atomic_read(&cpu_count) )
>>>        {
>>>            process_pending_softirqs();
>>>            cpu_relax();
>>>        }
>>>    -    local_irq_disable();
>>> -
>>> -    return 0;
>>> +    smp_mb__before_atomic();
>>> +    atomic_dec(&pending_count);
>>>    }
>>>    -/*
>>> - * As rcu_barrier() is using stop_machine_run() it is allowed to be used in
>>> - * idle context only (see comment for stop_machine_run()).
>>> - */
>>> -int rcu_barrier(void)
>>> +void rcu_barrier(void)
>>>    {
>>> -    atomic_t cpu_count = ATOMIC_INIT(0);
>>> -    return stop_machine_run(rcu_barrier_action, &cpu_count, NR_CPUS);
>>> +    unsigned int n_cpus;
>>> +
>>> +    ASSERT(!in_irq() && local_irq_is_enabled());
>>> +
>>> +    for ( ; ; )
>>> +    {
>>> +        if ( !atomic_read(&pending_count) && get_cpu_maps() )
>>> +        {
>>> +            n_cpus = num_online_cpus();
>>> +
>>> +            if ( atomic_cmpxchg(&pending_count, 0, n_cpus + 1) == 0 )
>>> +                break;
>>> +
>>> +            put_cpu_maps();
>>> +        }
>>> +
>>> +        process_pending_softirqs();
>>> +        cpu_relax();
>>> +    }
>>> +
>>> +    smp_mb__before_atomic();
>>
>> Our semantics of atomic_cmpxchg() are exactly the same as Linux's, i.e.
>> it will contain a full barrier when the cmpxchg succeeds. So why do you need this barrier?
> 
> It was me, I think, who (wrongly) suggested a barrier was missing
> here, sorry.

I'll update the patch dropping the barrier.


Juergen



* Re: [Xen-devel] [PATCH v7 2/5] xen/rcu: don't use stop_machine_run() for rcu_barrier()
  2020-03-26  7:24       ` Jürgen Groß
@ 2020-03-26  8:49         ` Jan Beulich
  2020-03-26  8:50           ` Jürgen Groß
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2020-03-26  8:49 UTC (permalink / raw)
  To: Jürgen Groß, Julien Grall
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	George Dunlap, xen-devel

On 26.03.2020 08:24, Jürgen Groß wrote:
> On 26.03.20 07:58, Jan Beulich wrote:
>> On 25.03.2020 17:13, Julien Grall wrote:
>>> On 25/03/2020 10:55, Juergen Gross wrote:
>>>> @@ -143,51 +143,90 @@ static int qhimark = 10000;
>>>>    static int qlowmark = 100;
>>>>    static int rsinterval = 1000;
>>>>    -struct rcu_barrier_data {
>>>> -    struct rcu_head head;
>>>> -    atomic_t *cpu_count;
>>>> -};
>>>> +/*
>>>> + * rcu_barrier() handling:
>>>> + * Two counters are used to synchronize rcu_barrier() work:
>>>> + * - cpu_count holds the number of cpus required to finish barrier handling.
>>>> + *   It is decremented by each cpu when it has performed all pending rcu calls.
>>>> + * - pending_count shows whether any rcu_barrier() activity is running and
>>>> + *   it is used to synchronize leaving rcu_barrier() only after all cpus
>>>> + *   have finished their processing. pending_count is initialized to nr_cpus + 1
>>>> + *   and it is decremented by each cpu when it has seen that cpu_count has
>>>> + *   reached 0. The cpu where rcu_barrier() has been called will wait until
>>>> + *   pending_count has been decremented to 1 (so all cpus have seen cpu_count
>>>> + *   reaching 0) and will then set pending_count to 0 indicating there is no
>>>> + *   rcu_barrier() running.
>>>> + * Cpus are synchronized via softirq mechanism. rcu_barrier() is regarded to
>>>> + * be active if pending_count is not zero. In case rcu_barrier() is called on
>>>> + * multiple cpus it is enough to check for pending_count being not zero on entry
>>>> + * and to call process_pending_softirqs() in a loop until pending_count drops to
>>>> + * zero, before starting the new rcu_barrier() processing.
>>>> + */
>>>> +static atomic_t cpu_count = ATOMIC_INIT(0);
>>>> +static atomic_t pending_count = ATOMIC_INIT(0);
>>>>      static void rcu_barrier_callback(struct rcu_head *head)
>>>>    {
>>>> -    struct rcu_barrier_data *data = container_of(
>>>> -        head, struct rcu_barrier_data, head);
>>>> -    atomic_inc(data->cpu_count);
>>>> +    smp_mb__before_atomic();     /* Make all writes visible to other cpus. */
>>>
>>> smp_mb__before_atomic() will order both reads and writes. However, the
>>> comment suggests only the writes are required to be ordered.
>>>
>>> So either the barrier is too strong or the comment is incorrect. Can
>>> you clarify it?
>>
>> Neither is the case, I guess: There simply is no smp_wmb__before_atomic()
>> in Linux, and if we want to follow their model we shouldn't have one
>> either. I'd rather take the comment to indicate that if one appeared, it
>> could be used here.
> 
> Right. Currently we have the choice of either using
> smp_mb__before_atomic() which is too strong for Arm, or smp_wmb() which
> is too strong for x86.

For x86 smp_wmb() is actually only very slightly too strong - it expands
to just barrier(), after all. So overall perhaps that's the better
choice here (with a suitable comment)?
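Portable C11 has no store-only fence either. As a hedged sketch (C11 stand-ins and hypothetical names, not the eventual Xen change), the nearest analogue of the proposal is a release fence before a relaxed decrement, paired with an acquire fence on the waiting side. A release fence also orders earlier loads, so it is slightly stronger than a pure write barrier; on x86, GCC and Clang emit no instruction for it, which mirrors smp_wmb() expanding to barrier() there.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int cpu_count;            /* CPUs still to run the callback */
static int rcu_results[2];              /* hypothetical per-CPU RCU results */

/* Sketch of rcu_barrier_callback() with a release fence for smp_wmb(). */
static void barrier_callback_sketch(int cpu, int result)
{
    rcu_results[cpu] = result;          /* writes done by earlier RCU work */
    atomic_thread_fence(memory_order_release);  /* ...ordered before the decrement */
    atomic_fetch_sub_explicit(&cpu_count, 1, memory_order_relaxed);
}

/* Waiter: pairs with the release fences via an acquire fence. */
static int wait_for_all(void)
{
    while ( atomic_load_explicit(&cpu_count, memory_order_relaxed) != 0 )
        ;                               /* spin */
    atomic_thread_fence(memory_order_acquire);
    return rcu_results[0] + rcu_results[1];
}
```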

Jan



* Re: [Xen-devel] [PATCH v7 2/5] xen/rcu: don't use stop_machine_run() for rcu_barrier()
  2020-03-26  8:49         ` Jan Beulich
@ 2020-03-26  8:50           ` Jürgen Groß
  2020-03-26  9:14             ` Julien Grall
  0 siblings, 1 reply; 15+ messages in thread
From: Jürgen Groß @ 2020-03-26  8:50 UTC (permalink / raw)
  To: Jan Beulich, Julien Grall
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	George Dunlap, xen-devel

On 26.03.20 09:49, Jan Beulich wrote:
> On 26.03.2020 08:24, Jürgen Groß wrote:
>> On 26.03.20 07:58, Jan Beulich wrote:
>>> On 25.03.2020 17:13, Julien Grall wrote:
>>>> On 25/03/2020 10:55, Juergen Gross wrote:
>>>>> @@ -143,51 +143,90 @@ static int qhimark = 10000;
>>>>>     static int qlowmark = 100;
>>>>>     static int rsinterval = 1000;
>>>>>     -struct rcu_barrier_data {
>>>>> -    struct rcu_head head;
>>>>> -    atomic_t *cpu_count;
>>>>> -};
>>>>> +/*
>>>>> + * rcu_barrier() handling:
>>>>> + * Two counters are used to synchronize rcu_barrier() work:
>>>>> + * - cpu_count holds the number of cpus required to finish barrier handling.
>>>>> + *   It is decremented by each cpu when it has performed all pending rcu calls.
>>>>> + * - pending_count shows whether any rcu_barrier() activity is running and
>>>>> + *   it is used to synchronize leaving rcu_barrier() only after all cpus
>>>>> + *   have finished their processing. pending_count is initialized to nr_cpus + 1
>>>>> + *   and it is decremented by each cpu when it has seen that cpu_count has
>>>>> + *   reached 0. The cpu where rcu_barrier() has been called will wait until
>>>>> + *   pending_count has been decremented to 1 (so all cpus have seen cpu_count
>>>>> + *   reaching 0) and will then set pending_count to 0 indicating there is no
>>>>> + *   rcu_barrier() running.
>>>>> + * Cpus are synchronized via softirq mechanism. rcu_barrier() is regarded to
>>>>> + * be active if pending_count is not zero. In case rcu_barrier() is called on
>>>>> + * multiple cpus it is enough to check for pending_count being not zero on entry
>>>>> + * and to call process_pending_softirqs() in a loop until pending_count drops to
>>>>> + * zero, before starting the new rcu_barrier() processing.
>>>>> + */
>>>>> +static atomic_t cpu_count = ATOMIC_INIT(0);
>>>>> +static atomic_t pending_count = ATOMIC_INIT(0);
>>>>>       static void rcu_barrier_callback(struct rcu_head *head)
>>>>>     {
>>>>> -    struct rcu_barrier_data *data = container_of(
>>>>> -        head, struct rcu_barrier_data, head);
>>>>> -    atomic_inc(data->cpu_count);
>>>>> +    smp_mb__before_atomic();     /* Make all writes visible to other cpus. */
>>>>
>>>> smp_mb__before_atomic() will order both read and write. However, the
>>>> comment suggest only the write are required to be ordered.
>>>>
>>>> So either the barrier is too strong or the comment is incorrect. Can
>>>> you clarify it?
>>>
>>> Neither is the case, I guess: There simply is no smp_wmb__before_atomic()
>>> in Linux, and if we want to follow their model we shouldn't have one
>>> either. I'd rather take the comment to indicate that if one appeared, it
>>> could be used here.
>>
>> Right. Currently we have the choice of either using
>> smp_mb__before_atomic() which is too strong for Arm, or smp_wmb() which
>> is too strong for x86.
> 
> For x86 smp_wmb() is actually only very slightly too strong - it expands
> to just barrier(), after all. So overall perhaps that's the better
> choice here (with a suitable comment)?

Fine with me.


Juergen




* Re: [Xen-devel] [PATCH v7 2/5] xen/rcu: don't use stop_machine_run() for rcu_barrier()
  2020-03-26  8:50           ` Jürgen Groß
@ 2020-03-26  9:14             ` Julien Grall
  0 siblings, 0 replies; 15+ messages in thread
From: Julien Grall @ 2020-03-26  9:14 UTC (permalink / raw)
  To: Jürgen Groß, Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	George Dunlap, xen-devel



On 26/03/2020 08:50, Jürgen Groß wrote:
> On 26.03.20 09:49, Jan Beulich wrote:
>> On 26.03.2020 08:24, Jürgen Groß wrote:
>>> On 26.03.20 07:58, Jan Beulich wrote:
>>>> On 25.03.2020 17:13, Julien Grall wrote:
>>>>> On 25/03/2020 10:55, Juergen Gross wrote:
>>>>>> @@ -143,51 +143,90 @@ static int qhimark = 10000;
>>>>>>     static int qlowmark = 100;
>>>>>>     static int rsinterval = 1000;
>>>>>>     -struct rcu_barrier_data {
>>>>>> -    struct rcu_head head;
>>>>>> -    atomic_t *cpu_count;
>>>>>> -};
>>>>>> +/*
>>>>>> + * rcu_barrier() handling:
>>>>>> + * Two counters are used to synchronize rcu_barrier() work:
>>>>>> + * - cpu_count holds the number of cpus required to finish barrier handling.
>>>>>> + *   It is decremented by each cpu when it has performed all pending rcu calls.
>>>>>> + * - pending_count shows whether any rcu_barrier() activity is running and
>>>>>> + *   it is used to synchronize leaving rcu_barrier() only after all cpus
>>>>>> + *   have finished their processing. pending_count is initialized to nr_cpus + 1
>>>>>> + *   and it is decremented by each cpu when it has seen that cpu_count has
>>>>>> + *   reached 0. The cpu where rcu_barrier() has been called will wait until
>>>>>> + *   pending_count has been decremented to 1 (so all cpus have seen cpu_count
>>>>>> + *   reaching 0) and will then set pending_count to 0 indicating there is no
>>>>>> + *   rcu_barrier() running.
>>>>>> + * Cpus are synchronized via softirq mechanism. rcu_barrier() is regarded to
>>>>>> + * be active if pending_count is not zero. In case rcu_barrier() is called on
>>>>>> + * multiple cpus it is enough to check for pending_count being not zero on entry
>>>>>> + * and to call process_pending_softirqs() in a loop until pending_count drops to
>>>>>> + * zero, before starting the new rcu_barrier() processing.
>>>>>> + */
>>>>>> +static atomic_t cpu_count = ATOMIC_INIT(0);
>>>>>> +static atomic_t pending_count = ATOMIC_INIT(0);
>>>>>>       static void rcu_barrier_callback(struct rcu_head *head)
>>>>>>     {
>>>>>> -    struct rcu_barrier_data *data = container_of(
>>>>>> -        head, struct rcu_barrier_data, head);
>>>>>> -    atomic_inc(data->cpu_count);
>>>>>> +    smp_mb__before_atomic();     /* Make all writes visible to other cpus. */
>>>>>
>>>>> smp_mb__before_atomic() will order both reads and writes. However, the
>>>>> comment suggests only the writes are required to be ordered.
>>>>>
>>>>> So either the barrier is too strong or the comment is incorrect. Can
>>>>> you clarify it?
>>>>
>>>> Neither is the case, I guess: There simply is no smp_wmb__before_atomic()
>>>> in Linux, and if we want to follow their model we shouldn't have one
>>>> either. I'd rather take the comment to indicate that if one appeared, it
>>>> could be used here.
>>>
>>> Right. Currently we have the choice of either using
>>> smp_mb__before_atomic() which is too strong for Arm, or smp_wmb() which
>>> is too strong for x86.
>>
>> For x86 smp_wmb() is actually only very slightly too strong - it expands
>> to just barrier(), after all. So overall perhaps that's the better
>> choice here (with a suitable comment)?
> 
> Fine with me.

I am happy with that.

Cheers,

-- 
Julien Grall



Thread overview: 15+ messages
2020-03-25 10:55 [Xen-devel] [PATCH v7 0/5] xen/rcu: let rcu work better with core scheduling Juergen Gross
2020-03-25 10:55 ` [Xen-devel] [PATCH v7 1/5] xen: introduce smp_mb__[after|before]_atomic() barriers Juergen Gross
2020-03-25 13:17   ` Jan Beulich
2020-03-25 16:20   ` Julien Grall
2020-03-25 10:55 ` [Xen-devel] [PATCH v7 2/5] xen/rcu: don't use stop_machine_run() for rcu_barrier() Juergen Gross
2020-03-25 13:19   ` Jan Beulich
2020-03-25 16:13   ` Julien Grall
2020-03-26  6:58     ` Jan Beulich
2020-03-26  7:24       ` Jürgen Groß
2020-03-26  8:49         ` Jan Beulich
2020-03-26  8:50           ` Jürgen Groß
2020-03-26  9:14             ` Julien Grall
2020-03-25 10:55 ` [Xen-devel] [PATCH v7 3/5] xen: don't process rcu callbacks when holding a rcu_read_lock() Juergen Gross
2020-03-25 10:55 ` [Xen-devel] [PATCH v7 4/5] xen/rcu: add assertions to debug build Juergen Gross
2020-03-25 10:55 ` [Xen-devel] [PATCH v7 5/5] xen/rcu: add per-lock counter in debug builds Juergen Gross
