* [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
@ 2012-02-13  2:09 Paul E. McKenney
  2012-02-15 12:59 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-13  2:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, patches

The current implementation of synchronize_srcu_expedited() can cause
severe OS jitter due to its use of synchronize_sched(), which in turn
invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
This can result in severe performance degradation for real-time workloads
and especially for short-iteration-length HPC workloads.  Furthermore,
because only one instance of try_stop_cpus() can be making forward progress
at a given time, only one instance of synchronize_srcu_expedited() can
make forward progress at a time, even if they are all operating on
distinct srcu_struct structures.

This commit, inspired by an earlier implementation by Peter Zijlstra
(https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
takes a strictly algorithmic bits-in-memory approach.  This has the
disadvantage of requiring one explicit memory-barrier instruction in
each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
completely dispenses with OS jitter and furthermore allows SRCU to be
used freely by CPUs that RCU believes to be idle or offline.

The update-side implementation handles the single read-side memory
barrier by rechecking the per-CPU counters after summing them and
by running through the update-side state machine twice.
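
Editorial illustration, not part of the patch: the recheck relies on the
counter encoding added to srcu.h below.  Each __srcu_read_lock() adds
SRCU_USAGE_COUNT+1 and each __srcu_read_unlock() adds SRCU_USAGE_COUNT-1,
so a lock/unlock pair leaves the masked reader count unchanged but
changes the raw counter value relative to a snapshot taken in between,
which is what the ->snap[] comparison detects.  A standalone sketch of
the arithmetic:

	#include <assert.h>
	#include <limits.h>
	#include <stdio.h>

	#define SRCU_USAGE_BITS		2
	#define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
	#define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)

	int main(void)
	{
		unsigned long c = 0, snap;

		c += SRCU_USAGE_COUNT + 1;	/* as in __srcu_read_lock() */
		snap = c;			/* update side's ->snap[cpu] */
		c += SRCU_USAGE_COUNT - 1;	/* as in __srcu_read_unlock() */

		/* The masked reader count is back to zero... */
		assert((c & SRCU_REF_MASK) == 0);
		/* ...but the raw counter differs from the snapshot. */
		assert(c != snap);
		printf("masked sum %lu, raw value changed: %d\n",
		       c & SRCU_REF_MASK, (int)(c != snap));
		return 0;
	}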

This implementation has passed moderate rcutorture testing on both 32-bit
x86 and 64-bit Power.  A call_srcu() function will be present in a later
version of this patch.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index d3d5fa5..a478c8e 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -31,13 +31,19 @@
 #include <linux/rcupdate.h>
 
 struct srcu_struct_array {
-	int c[2];
+	unsigned long c[2];
 };
 
+/* Bit definitions for field ->c above and ->snap below. */
+#define SRCU_USAGE_BITS		2
+#define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
+#define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
+
 struct srcu_struct {
-	int completed;
+	unsigned completed;
 	struct srcu_struct_array __percpu *per_cpu_ref;
 	struct mutex mutex;
+	unsigned long snap[NR_CPUS];
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map dep_map;
 #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index e0fe148..3d99162 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -620,7 +620,7 @@ static int srcu_torture_stats(char *page)
 	cnt += sprintf(&page[cnt], "%s%s per-CPU(idx=%d):",
 		       torture_type, TORTURE_FLAG, idx);
 	for_each_possible_cpu(cpu) {
-		cnt += sprintf(&page[cnt], " %d(%d,%d)", cpu,
+		cnt += sprintf(&page[cnt], " %d(%lu,%lu)", cpu,
 			       per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[!idx],
 			       per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[idx]);
 	}
diff --git a/kernel/srcu.c b/kernel/srcu.c
index ba35f3a..540671e 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -73,19 +73,102 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
 #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 
 /*
- * srcu_readers_active_idx -- returns approximate number of readers
- *	active on the specified rank of per-CPU counters.
+ * Returns approximate number of readers active on the specified rank
+ * of per-CPU counters.  Also snapshots each counter's value in the
+ * corresponding element of sp->snap[] for later use validating
+ * the sum.
  */
+static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
+{
+	int cpu;
+	unsigned long sum = 0;
+	unsigned long t;
 
-static int srcu_readers_active_idx(struct srcu_struct *sp, int idx)
+	for_each_possible_cpu(cpu) {
+		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
+		sum += t;
+		sp->snap[cpu] = t;
+	}
+	return sum & SRCU_REF_MASK;
+}
+
+/*
+ * To be called from the update side after an index flip.  Returns true
+ * if the modulo sum of the counters is stably zero, false if there is
+ * some possibility of non-zero.
+ */
+static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
 {
 	int cpu;
-	int sum;
 
-	sum = 0;
+	/*
+	 * Note that srcu_readers_active_idx() can incorrectly return
+	 * zero even though there is a pre-existing reader throughout.
+	 * To see this, suppose that task A is in a very long SRCU
+	 * read-side critical section that started on CPU 0, and that
+	 * no other reader exists, so that the modulo sum of the counters
+	 * is equal to one.  Then suppose that task B starts executing
+	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
+	 * task C starts reading on CPU 0, so that its increment is not
+	 * summed, but finishes reading on CPU 2, so that its decrement
+	 * -is- summed.  Then when task B completes its sum, it will
+	 * incorrectly get zero, despite the fact that task A has been
+	 * in its SRCU read-side critical section the whole time.
+	 *
+	 * We therefore do a validation step should srcu_readers_active_idx()
+	 * return zero.
+	 */
+	if (srcu_readers_active_idx(sp, idx) != 0)
+		return false;
+
+	/*
+	 * Since the caller recently flipped ->completed, we can see at
+	 * most one increment of each CPU's counter from this point
+	 * forward.  The reason for this is that the reader CPU must have
+	 * fetched the index before srcu_readers_active_idx checked
+	 * that CPU's counter, but not yet incremented its counter.
+	 * Its eventual counter increment will follow the read in
+	 * srcu_readers_active_idx(), and that increment is immediately
+	 * followed by smp_mb() B.  Because smp_mb() D is between
+	 * the ->completed flip and srcu_readers_active_idx()'s read,
+	 * that CPU's subsequent load of ->completed must see the new
+	 * value, and therefore increment the counter in the other rank.
+	 */
+	smp_mb(); /* A */
+
+	/*
+	 * Now, we check the ->snap array that srcu_readers_active_idx()
+	 * filled in from the per-CPU counter values.  Since both
+	 * __srcu_read_lock() and __srcu_read_unlock() increment the
+	 * upper bits of the per-CPU counter, an increment/decrement
+	 * pair will change the value of the counter.  Since there is
+	 * only one possible increment, the only way to wrap the counter
+	 * is to have a huge number of counter decrements, which requires
+	 * a huge number of tasks and huge SRCU read-side critical-section
+	 * nesting levels, even on 32-bit systems.
+	 *
+	 * All of the ways of confusing the readings require that the scan
+	 * in srcu_readers_active_idx() see the read-side task's decrement,
+	 * but not its increment.  However, between that decrement and
+	 * increment are smp_mb() B and C.  Either or both of these pair
+	 * with smp_mb() A above to ensure that the scan below will see
+	 * the read-side task's increment, thus noting a difference in
+	 * the counter values between the two passes.
+	 *
+	 * Therefore, if srcu_readers_active_idx() returned zero, and
+	 * none of the counters changed, we know that the zero was the
+	 * correct sum.
+	 *
+	 * Of course, it is possible that a task might be delayed
+	 * for a very long time in __srcu_read_lock() after fetching
+	 * the index but before incrementing its counter.  This
+	 * possibility will be dealt with in __synchronize_srcu().
+	 */
 	for_each_possible_cpu(cpu)
-		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
-	return sum;
+		if (sp->snap[cpu] !=
+		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
+			return false;  /* False zero reading! */
+	return true;
 }
 
 /**
@@ -131,10 +214,11 @@ int __srcu_read_lock(struct srcu_struct *sp)
 	int idx;
 
 	preempt_disable();
-	idx = sp->completed & 0x1;
-	barrier();  /* ensure compiler looks -once- at sp->completed. */
-	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
-	srcu_barrier();  /* ensure compiler won't misorder critical section. */
+	idx = rcu_dereference_index_check(sp->completed,
+					  rcu_read_lock_sched_held()) & 0x1;
+	ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
+		SRCU_USAGE_COUNT + 1;
+	smp_mb(); /* B */  /* Avoid leaking the critical section. */
 	preempt_enable();
 	return idx;
 }
@@ -149,8 +233,9 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
 void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
 	preempt_disable();
-	srcu_barrier();  /* ensure compiler won't misorder critical section. */
-	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
+	smp_mb(); /* C */  /* Avoid leaking the critical section. */
+	ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
+		SRCU_USAGE_COUNT - 1;
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(__srcu_read_unlock);
@@ -163,12 +248,65 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
  * we repeatedly block for 1-millisecond time periods.  This approach
  * has done well in testing, so there is no need for a config parameter.
  */
-#define SYNCHRONIZE_SRCU_READER_DELAY 10
+#define SYNCHRONIZE_SRCU_READER_DELAY 5
+
+/*
+ * Flip the readers' index by incrementing ->completed, then wait
+ * until there are no more readers using the counters referenced by
+ * the old index value.  (Recall that the index is the bottom bit
+ * of ->completed.)
+ *
+ * Of course, it is possible that a reader might be delayed for the
+ * full duration of flip_idx_and_wait() between fetching the
+ * index and incrementing its counter.  This possibility is handled
+ * by __synchronize_srcu() invoking flip_idx_and_wait() twice.
+ */
+static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
+{
+	int idx;
+	int trycount = 0;
+
+	idx = sp->completed++ & 0x1;
+
+	/*
+	 * If a reader fetches the index before the above increment,
+	 * but increments its counter after srcu_readers_active_idx_check()
+	 * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
+	 * smp_mb() B to ensure that the SRCU read-side critical section
+	 * will see any updates that the current task performed before its
+	 * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
+	 * as the case may be.
+	 */
+	smp_mb(); /* D */
+
+	/*
+	 * SRCU read-side critical sections are normally short, so wait
+	 * a small amount of time before possibly blocking.
+	 */
+	if (!srcu_readers_active_idx_check(sp, idx)) {
+		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
+		while (!srcu_readers_active_idx_check(sp, idx)) {
+			if (expedited && ++trycount < 10)
+				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
+			else
+				schedule_timeout_interruptible(1);
+		}
+	}
+
+	/*
+	 * The following smp_mb() E pairs with srcu_read_unlock()'s
+	 * smp_mb() C to ensure that if srcu_readers_active_idx_check()
+	 * sees srcu_read_unlock()'s counter decrement, then any
+	 * of the current task's subsequent code will happen after
+	 * that SRCU read-side critical section.
+	 */
+	smp_mb(); /* E */
+}
 
 /*
  * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
  */
-static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
+static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
 {
 	int idx;
 
@@ -178,90 +316,51 @@ static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
 			   !lock_is_held(&rcu_sched_lock_map),
 			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
 
-	idx = sp->completed;
+	idx = ACCESS_ONCE(sp->completed);
 	mutex_lock(&sp->mutex);
 
 	/*
 	 * Check to see if someone else did the work for us while we were
-	 * waiting to acquire the lock.  We need -two- advances of
+	 * waiting to acquire the lock.  We need -three- advances of
 	 * the counter, not just one.  If there was but one, we might have
 	 * shown up -after- our helper's first synchronize_sched(), thus
 	 * having failed to prevent CPU-reordering races with concurrent
-	 * srcu_read_unlock()s on other CPUs (see comment below).  So we
-	 * either (1) wait for two or (2) supply the second ourselves.
+	 * srcu_read_unlock()s on other CPUs (see comment below).  If there
+	 * were only two, we are guaranteed to have waited through only one
+	 * full index-flip phase.  So we either (1) wait for three or
+	 * (2) supply the additional ones we need.
 	 */
 
-	if ((sp->completed - idx) >= 2) {
+	if (sp->completed == idx + 2)
+		idx = 1;
+	else if (sp->completed == idx + 3) {
 		mutex_unlock(&sp->mutex);
 		return;
-	}
-
-	sync_func();  /* Force memory barrier on all CPUs. */
+	} else
+		idx = 0;
 
 	/*
-	 * The preceding synchronize_sched() ensures that any CPU that
-	 * sees the new value of sp->completed will also see any preceding
-	 * changes to data structures made by this CPU.  This prevents
-	 * some other CPU from reordering the accesses in its SRCU
-	 * read-side critical section to precede the corresponding
-	 * srcu_read_lock() -- ensuring that such references will in
-	 * fact be protected.
+	 * If there were no helpers, then we need to do two flips of
+	 * the index.  The first flip is required if there are any
+	 * outstanding SRCU readers even if there are no new readers
+	 * running concurrently with the first counter flip.
 	 *
-	 * So it is now safe to do the flip.
-	 */
-
-	idx = sp->completed & 0x1;
-	sp->completed++;
-
-	sync_func();  /* Force memory barrier on all CPUs. */
-
-	/*
-	 * At this point, because of the preceding synchronize_sched(),
-	 * all srcu_read_lock() calls using the old counters have completed.
-	 * Their corresponding critical sections might well be still
-	 * executing, but the srcu_read_lock() primitives themselves
-	 * will have finished executing.  We initially give readers
-	 * an arbitrarily chosen 10 microseconds to get out of their
-	 * SRCU read-side critical sections, then loop waiting 1/HZ
-	 * seconds per iteration.  The 10-microsecond value has done
-	 * very well in testing.
-	 */
-
-	if (srcu_readers_active_idx(sp, idx))
-		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
-	while (srcu_readers_active_idx(sp, idx))
-		schedule_timeout_interruptible(1);
-
-	sync_func();  /* Force memory barrier on all CPUs. */
-
-	/*
-	 * The preceding synchronize_sched() forces all srcu_read_unlock()
-	 * primitives that were executing concurrently with the preceding
-	 * for_each_possible_cpu() loop to have completed by this point.
-	 * More importantly, it also forces the corresponding SRCU read-side
-	 * critical sections to have also completed, and the corresponding
-	 * references to SRCU-protected data items to be dropped.
+	 * The second flip is required when a new reader picks up
+	 * the old value of the index, but does not increment its
+	 * counter until after its counter is summed/rechecked by
+	 * srcu_readers_active_idx_check().  In this case, the current SRCU
+	 * grace period would be OK because the SRCU read-side critical
+	 * section started after this SRCU grace period started, so the
+	 * grace period is not required to wait for the reader.
 	 *
-	 * Note:
-	 *
-	 *	Despite what you might think at first glance, the
-	 *	preceding synchronize_sched() -must- be within the
-	 *	critical section ended by the following mutex_unlock().
-	 *	Otherwise, a task taking the early exit can race
-	 *	with a srcu_read_unlock(), which might have executed
-	 *	just before the preceding srcu_readers_active() check,
-	 *	and whose CPU might have reordered the srcu_read_unlock()
-	 *	with the preceding critical section.  In this case, there
-	 *	is nothing preventing the synchronize_sched() task that is
-	 *	taking the early exit from freeing a data structure that
-	 *	is still being referenced (out of order) by the task
-	 *	doing the srcu_read_unlock().
-	 *
-	 *	Alternatively, the comparison with "2" on the early exit
-	 *	could be changed to "3", but this increases synchronize_srcu()
-	 *	latency for bulk loads.  So the current code is preferred.
+	 * However, the next SRCU grace period would be waiting for the
+	 * other set of counters to go to zero, and therefore would not
+	 * wait for the reader, which would be very bad.  To avoid this
+	 * bad scenario, we flip and wait twice, clearing out both sets
+	 * of counters.
 	 */
-
+	for (; idx < 2; idx++)
+		flip_idx_and_wait(sp, expedited);
 	mutex_unlock(&sp->mutex);
 }
 
@@ -281,7 +380,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
  */
 void synchronize_srcu(struct srcu_struct *sp)
 {
-	__synchronize_srcu(sp, synchronize_sched);
+	__synchronize_srcu(sp, 0);
 }
 EXPORT_SYMBOL_GPL(synchronize_srcu);
 
@@ -289,18 +388,11 @@ EXPORT_SYMBOL_GPL(synchronize_srcu);
  * synchronize_srcu_expedited - Brute-force SRCU grace period
  * @sp: srcu_struct with which to synchronize.
  *
- * Wait for an SRCU grace period to elapse, but use a "big hammer"
- * approach to force the grace period to end quickly.  This consumes
- * significant time on all CPUs and is unfriendly to real-time workloads,
- * so is thus not recommended for any sort of common-case code.  In fact,
- * if you are using synchronize_srcu_expedited() in a loop, please
- * restructure your code to batch your updates, and then use a single
- * synchronize_srcu() instead.
+ * Wait for an SRCU grace period to elapse, but be more aggressive about
+ * spinning rather than blocking when waiting.
  *
  * Note that it is illegal to call this function while holding any lock
- * that is acquired by a CPU-hotplug notifier.  And yes, it is also illegal
- * to call this function from a CPU-hotplug notifier.  Failing to observe
- * these restriction will result in deadlock.  It is also illegal to call
+ * that is acquired by a CPU-hotplug notifier.  It is also illegal to call
  * synchronize_srcu_expedited() from the corresponding SRCU read-side
  * critical section; doing so will result in deadlock.  However, it is
  * perfectly legal to call synchronize_srcu_expedited() on one srcu_struct
@@ -309,7 +401,7 @@ EXPORT_SYMBOL_GPL(synchronize_srcu);
  */
 void synchronize_srcu_expedited(struct srcu_struct *sp)
 {
-	__synchronize_srcu(sp, synchronize_sched_expedited);
+	__synchronize_srcu(sp, 1);
 }
 EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);
 


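For context, the reader/updater API that the above reimplements is
unchanged by this patch.  A minimal usage sketch (editorial illustration
only; my_srcu, my_data, my_reader, and my_updater are made-up names, and
init_srcu_struct(&my_srcu) is assumed to have been called at init time):

	#include <linux/srcu.h>
	#include <linux/slab.h>

	struct my_data {
		int val;
	};

	static struct srcu_struct my_srcu;
	static struct my_data __rcu *my_ptr;

	static int my_reader(void)
	{
		int idx, val = -1;
		struct my_data *p;

		idx = srcu_read_lock(&my_srcu);	/* returns index for unlock */
		p = srcu_dereference(my_ptr, &my_srcu);
		if (p)
			val = p->val;	/* may sleep here, unlike plain RCU */
		srcu_read_unlock(&my_srcu, idx);
		return val;
	}

	static void my_updater(struct my_data *newp)
	{
		struct my_data *oldp;

		oldp = rcu_dereference_protected(my_ptr, 1);
		rcu_assign_pointer(my_ptr, newp);
		synchronize_srcu(&my_srcu);  /* wait for pre-existing readers */
		kfree(oldp);		     /* old version now unreferenced */
	}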

* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-13  2:09 [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation Paul E. McKenney
@ 2012-02-15 12:59 ` Peter Zijlstra
  2012-02-16  6:35   ` Paul E. McKenney
  2012-02-15 14:31 ` Mathieu Desnoyers
  2012-02-20  7:15 ` Lai Jiangshan
  2 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-02-15 12:59 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches, Avi Kivity, Chris Mason,
	Eric Paris

On Sun, 2012-02-12 at 18:09 -0800, Paul E. McKenney wrote:
> The current implementation of synchronize_srcu_expedited() can cause
> severe OS jitter due to its use of synchronize_sched(), which in turn
> invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
> This can result in severe performance degradation for real-time workloads
> and especially for short-iteration-length HPC workloads.  Furthermore,
> because only one instance of try_stop_cpus() can be making forward progress
> at a given time, only one instance of synchronize_srcu_expedited() can
> make forward progress at a time, even if they are all operating on
> distinct srcu_struct structures.
> 
> This commit, inspired by an earlier implementation by Peter Zijlstra
> (https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
> takes a strictly algorithmic bits-in-memory approach.  This has the
> disadvantage of requiring one explicit memory-barrier instruction in
> each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
> completely dispenses with OS jitter and furthermore allows SRCU to be
> used freely by CPUs that RCU believes to be idle or offline.
> 
> The update-side implementation handles the single read-side memory
> barrier by rechecking the per-CPU counters after summing them and
> by running through the update-side state machine twice.

Yeah, getting rid of that second memory barrier in srcu_read_lock() is
pure magic :-)

> This implementation has passed moderate rcutorture testing on both 32-bit
> x86 and 64-bit Power.  A call_srcu() function will be present in a later
> version of this patch.

Goodness ;-)

> @@ -131,10 +214,11 @@ int __srcu_read_lock(struct srcu_struct *sp)
>  	int idx;
>  
>  	preempt_disable();
> -	idx = sp->completed & 0x1;
> -	barrier();  /* ensure compiler looks -once- at sp->completed. */
> -	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> -	srcu_barrier();  /* ensure compiler won't misorder critical section. */
> +	idx = rcu_dereference_index_check(sp->completed,
> +					  rcu_read_lock_sched_held()) & 0x1;
> +	ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
> +		SRCU_USAGE_COUNT + 1;
> +	smp_mb(); /* B */  /* Avoid leaking the critical section. */
>  	preempt_enable();
>  	return idx;
>  }

You could use __this_cpu_* muck to shorten some of that.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>


* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-13  2:09 [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation Paul E. McKenney
  2012-02-15 12:59 ` Peter Zijlstra
@ 2012-02-15 14:31 ` Mathieu Desnoyers
  2012-02-15 14:51   ` Mathieu Desnoyers
  2012-02-20  7:15 ` Lai Jiangshan
  2 siblings, 1 reply; 100+ messages in thread
From: Mathieu Desnoyers @ 2012-02-15 14:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, patches

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
[...]
> +/*
> + * To be called from the update side after an index flip.  Returns true
> + * if the modulo sum of the counters is stably zero, false if there is
> + * some possibility of non-zero.
> + */
> +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>  {
>  	int cpu;
> -	int sum;
>  
> -	sum = 0;
> +	/*
> +	 * Note that srcu_readers_active_idx() can incorrectly return
> +	 * zero even though there is a pre-existing reader throughout.
> +	 * To see this, suppose that task A is in a very long SRCU
> +	 * read-side critical section that started on CPU 0, and that
> +	 * no other reader exists, so that the modulo sum of the counters
> +	 * is equal to one.  Then suppose that task B starts executing
> +	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
> +	 * task C starts reading on CPU 0, so that its increment is not
> +	 * summed, but finishes reading on CPU 2, so that its decrement
> +	 * -is- summed.  Then when task B completes its sum, it will
> +	 * incorrectly get zero, despite the fact that task A has been
> +	 * in its SRCU read-side critical section the whole time.
> +	 *
> +	 * We therefore do a validation step should srcu_readers_active_idx()
> +	 * return zero.
> +	 */
> +	if (srcu_readers_active_idx(sp, idx) != 0)
> +		return false;
> +
> +	/*
> +	 * Since the caller recently flipped ->completed, we can see at
> +	 * most one increment of each CPU's counter from this point
> +	 * forward.  The reason for this is that the reader CPU must have
> +	 * fetched the index before srcu_readers_active_idx checked
> +	 * that CPU's counter, but not yet incremented its counter.
> +	 * Its eventual counter increment will follow the read in
> +	 * srcu_readers_active_idx(), and that increment is immediately
> +	 * followed by smp_mb() B.  Because smp_mb() D is between
> +	 * the ->completed flip and srcu_readers_active_idx()'s read,
> +	 * that CPU's subsequent load of ->completed must see the new
> +	 * value, and therefore increment the counter in the other rank.
> +	 */
> +	smp_mb(); /* A */
> +
> +	/*
> +	 * Now, we check the ->snap array that srcu_readers_active_idx()
> +	 * filled in from the per-CPU counter values.  Since both
> +	 * __srcu_read_lock() and __srcu_read_unlock() increment the
> +	 * upper bits of the per-CPU counter, an increment/decrement
> +	 * pair will change the value of the counter.  Since there is
> +	 * only one possible increment, the only way to wrap the counter
> +	 * is to have a huge number of counter decrements, which requires
> +	 * a huge number of tasks and huge SRCU read-side critical-section
> +	 * nesting levels, even on 32-bit systems.
> +	 *
> +	 * All of the ways of confusing the readings require that the scan
> +	 * in srcu_readers_active_idx() see the read-side task's decrement,
> +	 * but not its increment.  However, between that decrement and
> +	 * increment are smp_mb() B and C.  Either or both of these pair
> +	 * with smp_mb() A above to ensure that the scan below will see
> +	 * the read-side task's increment, thus noting a difference in
> +	 * the counter values between the two passes.

Hi Paul,

I think the implementation is correct, but the explanation above might
be improved. Let's consider the following a scenario, where a reader is
migrated between increment of the counter and issuing the memory barrier
in the read lock:

A,B,C are readers
D is synchronize_rcu (one flip'n'wait)

CPU A               CPU B           CPU C         CPU D
                                    c[1]++
                                    smp_mb(1)
                                                  read c[0] -> 0
c[0]++
(implicit smp_mb (2))
      -> migrated ->
                    (implicit smp_mb (3))
                    smp_mb (4)
                    smp_mb (5)
                    c[1]--
                                                  read c[1] -> -1
                                                  read c[2] -> 1
                                                  (false 0 sum)
                                                  smp_mb (6)
                                                  re-check each.
                                    c[1]--

re-check: because we observed c[1] == -1, thanks to the implicit memory
barriers within thread migration (2 and 3), we can assume that we _will_
observe the updated value of c[0] after smp_mb (6).

The current explanation states that memory barriers 4 and 5, along with  
6, are responsible for ensuring that the increment will be observed by 
the re-check. However, I doubt they have anything to do with it: it's 
rather the implicit memory barriers in thread migration, along with
program order guarantees on writes to the same address, that seems to be
the reason why we can do this ordering assumption.

Does it make sense, or shall I get another coffee to wake myself up ?
;)

Thanks,

Mathieu

> +	 *
> +	 * Therefore, if srcu_readers_active_idx() returned zero, and
> +	 * none of the counters changed, we know that the zero was the
> +	 * correct sum.
> +	 *
> +	 * Of course, it is possible that a task might be delayed
> +	 * for a very long time in __srcu_read_lock() after fetching
> +	 * the index but before incrementing its counter.  This
> +	 * possibility will be dealt with in __synchronize_srcu().
> +	 */
>  	for_each_possible_cpu(cpu)
> -		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> -	return sum;
> +		if (sp->snap[cpu] !=
> +		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> +			return false;  /* False zero reading! */
> +	return true;
>  }


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-15 14:31 ` Mathieu Desnoyers
@ 2012-02-15 14:51   ` Mathieu Desnoyers
  2012-02-16  6:38     ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Mathieu Desnoyers @ 2012-02-15 14:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, patches

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> [...]
> > +/*
> > + * To be called from the update side after an index flip.  Returns true
> > + * if the modulo sum of the counters is stably zero, false if there is
> > + * some possibility of non-zero.
> > + */
> > +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> >  {
> >  	int cpu;
> > -	int sum;
> >  
> > -	sum = 0;
> > +	/*
> > +	 * Note that srcu_readers_active_idx() can incorrectly return
> > +	 * zero even though there is a pre-existing reader throughout.
> > +	 * To see this, suppose that task A is in a very long SRCU
> > +	 * read-side critical section that started on CPU 0, and that
> > +	 * no other reader exists, so that the modulo sum of the counters
> > +	 * is equal to one.  Then suppose that task B starts executing
> > +	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
> > +	 * task C starts reading on CPU 0, so that its increment is not
> > +	 * summed, but finishes reading on CPU 2, so that its decrement
> > +	 * -is- summed.  Then when task B completes its sum, it will
> > +	 * incorrectly get zero, despite the fact that task A has been
> > +	 * in its SRCU read-side critical section the whole time.
> > +	 *
> > +	 * We therefore do a validation step should srcu_readers_active_idx()
> > +	 * return zero.
> > +	 */
> > +	if (srcu_readers_active_idx(sp, idx) != 0)
> > +		return false;
> > +
> > +	/*
> > +	 * Since the caller recently flipped ->completed, we can see at
> > +	 * most one increment of each CPU's counter from this point
> > +	 * forward.  The reason for this is that the reader CPU must have
> > +	 * fetched the index before srcu_readers_active_idx checked
> > +	 * that CPU's counter, but not yet incremented its counter.
> > +	 * Its eventual counter increment will follow the read in
> > +	 * srcu_readers_active_idx(), and that increment is immediately
> > +	 * followed by smp_mb() B.  Because smp_mb() D is between
> > +	 * the ->completed flip and srcu_readers_active_idx()'s read,
> > +	 * that CPU's subsequent load of ->completed must see the new
> > +	 * value, and therefore increment the counter in the other rank.
> > +	 */
> > +	smp_mb(); /* A */
> > +
> > +	/*
> > +	 * Now, we check the ->snap array that srcu_readers_active_idx()
> > +	 * filled in from the per-CPU counter values.  Since both
> > +	 * __srcu_read_lock() and __srcu_read_unlock() increment the
> > +	 * upper bits of the per-CPU counter, an increment/decrement
> > +	 * pair will change the value of the counter.  Since there is
> > +	 * only one possible increment, the only way to wrap the counter
> > +	 * is to have a huge number of counter decrements, which requires
> > +	 * a huge number of tasks and huge SRCU read-side critical-section
> > +	 * nesting levels, even on 32-bit systems.
> > +	 *
> > +	 * All of the ways of confusing the readings require that the scan
> > +	 * in srcu_readers_active_idx() see the read-side task's decrement,
> > +	 * but not its increment.  However, between that decrement and
> > +	 * increment are smp_mb() B and C.  Either or both of these pair
> > +	 * with smp_mb() A above to ensure that the scan below will see
> > +	 * the read-side task's increment, thus noting a difference in
> > +	 * the counter values between the two passes.
> 
> Hi Paul,
> 
> I think the implementation is correct, but the explanation above might
> be improved. Let's consider the following a scenario, where a reader is
> migrated between increment of the counter and issuing the memory barrier
> in the read lock:
> 
> A,B,C are readers
> D is synchronize_rcu (one flip'n'wait)
> 
> CPU A               CPU B           CPU C         CPU D
>                                     c[1]++
>                                     smp_mb(1)
>                                                   read c[0] -> 0
> c[0]++
> (implicit smp_mb (2))
>       -> migrated ->
>                     (implicit smp_mb (3))
>                     smp_mb (4)
>                     smp_mb (5)
>                     c[1]--
>                                                   read c[1] -> -1
>                                                   read c[2] -> 1
>                                                   (false 0 sum)
>                                                   smp_mb (6)
>                                                   re-check each.
>                                     c[1]--
> 
> re-check: because we observed c[1] == -1, thanks to the implicit memory
> barriers within thread migration (2 and 3), we can assume that we _will_
> observe the updated value of c[0] after smp_mb (6).
> 
> The current explanation states that memory barriers 4 and 5, along with  
> 6, are responsible for ensuring that the increment will be observed by 
> the re-check. However, I doubt they have anything to do with it: it's 
> rather the implicit memory barriers in thread migration, along with
> program order guarantees on writes to the same address, that seems to be
> the reason why we can do this ordering assumption.

Please disregard the part about program order: CPU A writes to c[0], and
CPU B writes to c[1], which are two different memory locations. The rest
of my discussion stands though.

Simply reasoning about write to c[0], memory barriers 2-3, write to
c[1], along with c[1] read, memory barrier 6, and then c[0] read is
enough to explain the ordering guarantees you need, without invoking
program order.

Thanks,

Mathieu

> 
> Does it make sense, or shall I get another coffee to wake myself up ?
> ;)
> 
> Thanks,
> 
> Mathieu
> 
> > +	 *
> > +	 * Therefore, if srcu_readers_active_idx() returned zero, and
> > +	 * none of the counters changed, we know that the zero was the
> > +	 * correct sum.
> > +	 *
> > +	 * Of course, it is possible that a task might be delayed
> > +	 * for a very long time in __srcu_read_lock() after fetching
> > +	 * the index but before incrementing its counter.  This
> > +	 * possibility will be dealt with in __synchronize_srcu().
> > +	 */
> >  	for_each_possible_cpu(cpu)
> > -		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> > -	return sum;
> > +		if (sp->snap[cpu] !=
> > +		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> > +			return false;  /* False zero reading! */
> > +	return true;
> >  }
> 
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-15 12:59 ` Peter Zijlstra
@ 2012-02-16  6:35   ` Paul E. McKenney
  2012-02-16 10:50     ` Mathieu Desnoyers
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-16  6:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches, Avi Kivity, Chris Mason,
	Eric Paris

On Wed, Feb 15, 2012 at 01:59:23PM +0100, Peter Zijlstra wrote:
> On Sun, 2012-02-12 at 18:09 -0800, Paul E. McKenney wrote:
> > The current implementation of synchronize_srcu_expedited() can cause
> > severe OS jitter due to its use of synchronize_sched(), which in turn
> > invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
> > This can result in severe performance degradation for real-time workloads
> > and especially for short-iteration-length HPC workloads.  Furthermore,
> > because only one instance of try_stop_cpus() can be making forward progress
> > at a given time, only one instance of synchronize_srcu_expedited() can
> > make forward progress at a time, even if they are all operating on
> > distinct srcu_struct structures.
> > 
> > This commit, inspired by an earlier implementation by Peter Zijlstra
> > (https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
> > takes a strictly algorithmic bits-in-memory approach.  This has the
> > disadvantage of requiring one explicit memory-barrier instruction in
> > each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
> > completely dispenses with OS jitter and furthermore allows SRCU to be
> > used freely by CPUs that RCU believes to be idle or offline.
> > 
> > The update-side implementation handles the single read-side memory
> > barrier by rechecking the per-CPU counters after summing them and
> > by running through the update-side state machine twice.
> 
> Yeah, getting rid of that second memory barrier in srcu_read_lock() is
> pure magic :-)
> 
> > This implementation has passed moderate rcutorture testing on both 32-bit
> > x86 and 64-bit Power.  A call_srcu() function will be present in a later
> > version of this patch.
> 
> Goodness ;-)

Glad you like the magic and the prospect of call_srcu().  ;-)

> > @@ -131,10 +214,11 @@ int __srcu_read_lock(struct srcu_struct *sp)
> >  	int idx;
> >  
> >  	preempt_disable();
> > -	idx = sp->completed & 0x1;
> > -	barrier();  /* ensure compiler looks -once- at sp->completed. */
> > -	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> > -	srcu_barrier();  /* ensure compiler won't misorder critical section. */
> > +	idx = rcu_dereference_index_check(sp->completed,
> > +					  rcu_read_lock_sched_held()) & 0x1;
> > +	ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
> > +		SRCU_USAGE_COUNT + 1;
> > +	smp_mb(); /* B */  /* Avoid leaking the critical section. */
> >  	preempt_enable();
> >  	return idx;
> >  }
> 
> You could use __this_cpu_* muck to shorten some of that.

Ah, so something like this?

	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 
		SRCU_USAGE_COUNT + 1;

Now that you mention it, this does look nicer, applied here and to
srcu_read_unlock().

> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>


							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-15 14:51   ` Mathieu Desnoyers
@ 2012-02-16  6:38     ` Paul E. McKenney
  2012-02-16 11:00       ` Mathieu Desnoyers
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-16  6:38 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, patches

On Wed, Feb 15, 2012 at 09:51:44AM -0500, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > [...]
> > > +/*
> > > + * To be called from the update side after an index flip.  Returns true
> > > + * if the modulo sum of the counters is stably zero, false if there is
> > > + * some possibility of non-zero.
> > > + */
> > > +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> > >  {
> > >  	int cpu;
> > > -	int sum;
> > >  
> > > -	sum = 0;
> > > +	/*
> > > +	 * Note that srcu_readers_active_idx() can incorrectly return
> > > +	 * zero even though there is a pre-existing reader throughout.
> > > +	 * To see this, suppose that task A is in a very long SRCU
> > > +	 * read-side critical section that started on CPU 0, and that
> > > +	 * no other reader exists, so that the modulo sum of the counters
> > > +	 * is equal to one.  Then suppose that task B starts executing
> > > +	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
> > > +	 * task C starts reading on CPU 0, so that its increment is not
> > > +	 * summed, but finishes reading on CPU 2, so that its decrement
> > > +	 * -is- summed.  Then when task B completes its sum, it will
> > > +	 * incorrectly get zero, despite the fact that task A has been
> > > +	 * in its SRCU read-side critical section the whole time.
> > > +	 *
> > > +	 * We therefore do a validation step should srcu_readers_active_idx()
> > > +	 * return zero.
> > > +	 */
> > > +	if (srcu_readers_active_idx(sp, idx) != 0)
> > > +		return false;
> > > +
> > > +	/*
> > > +	 * Since the caller recently flipped ->completed, we can see at
> > > +	 * most one increment of each CPU's counter from this point
> > > +	 * forward.  The reason for this is that the reader CPU must have
> > > +	 * fetched the index before srcu_readers_active_idx checked
> > > +	 * that CPU's counter, but not yet incremented its counter.
> > > +	 * Its eventual counter increment will follow the read in
> > > +	 * srcu_readers_active_idx(), and that increment is immediately
> > > +	 * followed by smp_mb() B.  Because smp_mb() D is between
> > > +	 * the ->completed flip and srcu_readers_active_idx()'s read,
> > > +	 * that CPU's subsequent load of ->completed must see the new
> > > +	 * value, and therefore increment the counter in the other rank.
> > > +	 */
> > > +	smp_mb(); /* A */
> > > +
> > > +	/*
> > > +	 * Now, we check the ->snap array that srcu_readers_active_idx()
> > > +	 * filled in from the per-CPU counter values.  Since both
> > > +	 * __srcu_read_lock() and __srcu_read_unlock() increment the
> > > +	 * upper bits of the per-CPU counter, an increment/decrement
> > > +	 * pair will change the value of the counter.  Since there is
> > > +	 * only one possible increment, the only way to wrap the counter
> > > +	 * is to have a huge number of counter decrements, which requires
> > > +	 * a huge number of tasks and huge SRCU read-side critical-section
> > > +	 * nesting levels, even on 32-bit systems.
> > > +	 *
> > > +	 * All of the ways of confusing the readings require that the scan
> > > +	 * in srcu_readers_active_idx() see the read-side task's decrement,
> > > +	 * but not its increment.  However, between that decrement and
> > > +	 * increment are smp_mb() B and C.  Either or both of these pair
> > > +	 * with smp_mb() A above to ensure that the scan below will see
> > > +	 * the read-side task's increment, thus noting a difference in
> > > +	 * the counter values between the two passes.
> > 
> > Hi Paul,
> > 
> > I think the implementation is correct, but the explanation above might
> > be improved. Let's consider the following a scenario, where a reader is
> > migrated between increment of the counter and issuing the memory barrier
> > in the read lock:
> > 
> > A,B,C are readers
> > D is synchronize_rcu (one flip'n'wait)
> > 
> > CPU A               CPU B           CPU C         CPU D
> >                                     c[1]++
> >                                     smp_mb(1)
> >                                                   read c[0] -> 0
> > c[0]++
> > (implicit smp_mb (2))
> >       -> migrated ->
> >                     (implicit smp_mb (3))
> >                     smp_mb (4)
> >                     smp_mb (5)
> >                     c[1]--
> >                                                   read c[1] -> -1
> >                                                   read c[2] -> 1
> >                                                   (false 0 sum)
> >                                                   smp_mb (6)
> >                                                   re-check each.
> >                                     c[1]--
> > 
> > re-check: because we observed c[1] == -1, thanks to the implicit memory
> > barriers within thread migration (2 and 3), we can assume that we _will_
> > observe the updated value of c[0] after smp_mb (6).
> > 
> > The current explanation states that memory barriers 4 and 5, along with  
> > 6, are responsible for ensuring that the increment will be observed by 
> > the re-check. However, I doubt they have anything to do with it: it's 
> > rather the implicit memory barriers in thread migration, along with
> > program order guarantees on writes to the same address, that seems to be
> > the reason why we can do this ordering assumption.
> 
> Please disregard the part about program order: CPU A writes to c[0], and
> CPU B writes to c[1], which are two different memory locations. The rest
> of my discussion stands though.
> 
> Simply reasoning about write to c[0], memory barriers 2-3, write to
> c[1], along with c[1] read, memory barrier 6, and then c[0] read is
> enough to explain the ordering guarantees you need, without invoking
> program order.

I am assuming that if the scheduler migrates a process, it applies enough
memory ordering to allow the proof to operate as if it had stayed on a
single CPU throughout.  The reasoning for this would consider the
scheduler access and memory barriers -- but there would be an arbitrarily
large number of migration patterns, so I am not convinced that it would
help...

							Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > Does it make sense, or shall I get another coffee to wake myself up ?
> > ;)
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> > > +	 *
> > > +	 * Therefore, if srcu_readers_active_idx() returned zero, and
> > > +	 * none of the counters changed, we know that the zero was the
> > > +	 * correct sum.
> > > +	 *
> > > +	 * Of course, it is possible that a task might be delayed
> > > +	 * for a very long time in __srcu_read_lock() after fetching
> > > +	 * the index but before incrementing its counter.  This
> > > +	 * possibility will be dealt with in __synchronize_srcu().
> > > +	 */
> > >  	for_each_possible_cpu(cpu)
> > > -		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> > > -	return sum;
> > > +		if (sp->snap[cpu] !=
> > > +		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> > > +			return false;  /* False zero reading! */
> > > +	return true;
> > >  }
> > 
> > 
> > -- 
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
> 



* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-16  6:35   ` Paul E. McKenney
@ 2012-02-16 10:50     ` Mathieu Desnoyers
  2012-02-16 10:52       ` Peter Zijlstra
  0 siblings, 1 reply; 100+ messages in thread
From: Mathieu Desnoyers @ 2012-02-16 10:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm, josh,
	niv, tglx, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, patches, Avi Kivity, Chris Mason, Eric Paris

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Feb 15, 2012 at 01:59:23PM +0100, Peter Zijlstra wrote:
> > On Sun, 2012-02-12 at 18:09 -0800, Paul E. McKenney wrote:
> > > The current implementation of synchronize_srcu_expedited() can cause
> > > severe OS jitter due to its use of synchronize_sched(), which in turn
> > > invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
> > > This can result in severe performance degradation for real-time workloads
> > > and especially for short-iteration-length HPC workloads.  Furthermore,
> > > because only one instance of try_stop_cpus() can be making forward progress
> > > at a given time, only one instance of synchronize_srcu_expedited() can
> > > make forward progress at a time, even if they are all operating on
> > > distinct srcu_struct structures.
> > > 
> > > This commit, inspired by an earlier implementation by Peter Zijlstra
> > > (https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
> > > takes a strictly algorithmic bits-in-memory approach.  This has the
> > > disadvantage of requiring one explicit memory-barrier instruction in
> > > each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
> > > completely dispenses with OS jitter and furthermore allows SRCU to be
> > > used freely by CPUs that RCU believes to be idle or offline.
> > > 
> > > The update-side implementation handles the single read-side memory
> > > barrier by rechecking the per-CPU counters after summing them and
> > > by running through the update-side state machine twice.
> > 
> > Yeah, getting rid of that second memory barrier in srcu_read_lock() is
> > pure magic :-)
> > 
> > > This implementation has passed moderate rcutorture testing on both 32-bit
> > > x86 and 64-bit Power.  A call_srcu() function will be present in a later
> > > version of this patch.
> > 
> > Goodness ;-)
> 
> Glad you like the magic and the prospect of call_srcu().  ;-)
> 
> > > @@ -131,10 +214,11 @@ int __srcu_read_lock(struct srcu_struct *sp)
> > >  	int idx;
> > >  
> > >  	preempt_disable();
> > > -	idx = sp->completed & 0x1;
> > > -	barrier();  /* ensure compiler looks -once- at sp->completed. */
> > > -	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
> > > -	srcu_barrier();  /* ensure compiler won't misorder critical section. */
> > > +	idx = rcu_dereference_index_check(sp->completed,
> > > +					  rcu_read_lock_sched_held()) & 0x1;
> > > +	ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]) +=
> > > +		SRCU_USAGE_COUNT + 1;
> > > +	smp_mb(); /* B */  /* Avoid leaking the critical section. */
> > >  	preempt_enable();
> > >  	return idx;
> > >  }
> > 
> > You could use __this_cpu_* muck to shorten some of that.
> 
> Ah, so something like this?
> 
> 	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 
> 		SRCU_USAGE_COUNT + 1;
> 
> Now that you mention it, this does look nicer, applied here and to
> srcu_read_unlock().

I think Peter refers to __this_cpu_add().

Thanks,

Mathieu

> 
> > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> 
> 							Thanx, Paul
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-16 10:50     ` Mathieu Desnoyers
@ 2012-02-16 10:52       ` Peter Zijlstra
  2012-02-16 11:14         ` Mathieu Desnoyers
  0 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-02-16 10:52 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, linux-kernel, mingo, laijs, dipankar, akpm,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches, Avi Kivity, Chris Mason,
	Eric Paris

On Thu, 2012-02-16 at 05:50 -0500, Mathieu Desnoyers wrote:
> > Ah, so something like this?
> > 
> >       ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 
> >               SRCU_USAGE_COUNT + 1;
> > 
> > Now that you mention it, this does look nicer, applied here and to
> > srcu_read_unlock().
> 
> I think Peter refers to __this_cpu_add(). 

I'm not sure that implies the ACCESS_ONCE() thing
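
For reference, the two forms under discussion would look roughly as
follows (editorial sketch only; whether dropping the explicit
ACCESS_ONCE() is acceptable here is exactly the open question):

	/* Form in the patch, after the this_cpu_ptr() cleanup above: */
	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
		SRCU_USAGE_COUNT + 1;

	/* Shorter __this_cpu_*() form, without an ACCESS_ONCE() wrapper: */
	__this_cpu_add(sp->per_cpu_ref->c[idx], SRCU_USAGE_COUNT + 1);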


* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-16  6:38     ` Paul E. McKenney
@ 2012-02-16 11:00       ` Mathieu Desnoyers
  2012-02-16 11:51         ` Peter Zijlstra
  0 siblings, 1 reply; 100+ messages in thread
From: Mathieu Desnoyers @ 2012-02-16 11:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mathieu Desnoyers, linux-kernel, mingo, laijs, dipankar, akpm,
	josh, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Feb 15, 2012 at 09:51:44AM -0500, Mathieu Desnoyers wrote:
> > * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > [...]
> > > > +/*
> > > > + * To be called from the update side after an index flip.  Returns true
> > > > + * if the modulo sum of the counters is stably zero, false if there is
> > > > + * some possibility of non-zero.
> > > > + */
> > > > +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> > > >  {
> > > >  	int cpu;
> > > > -	int sum;
> > > >  
> > > > -	sum = 0;
> > > > +	/*
> > > > +	 * Note that srcu_readers_active_idx() can incorrectly return
> > > > +	 * zero even though there is a pre-existing reader throughout.
> > > > +	 * To see this, suppose that task A is in a very long SRCU
> > > > +	 * read-side critical section that started on CPU 0, and that
> > > > +	 * no other reader exists, so that the modulo sum of the counters
> > > > +	 * is equal to one.  Then suppose that task B starts executing
> > > > +	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
> > > > +	 * task C starts reading on CPU 0, so that its increment is not
> > > > +	 * summed, but finishes reading on CPU 2, so that its decrement
> > > > +	 * -is- summed.  Then when task B completes its sum, it will
> > > > +	 * incorrectly get zero, despite the fact that task A has been
> > > > +	 * in its SRCU read-side critical section the whole time.
> > > > +	 *
> > > > +	 * We therefore do a validation step should srcu_readers_active_idx()
> > > > +	 * return zero.
> > > > +	 */
> > > > +	if (srcu_readers_active_idx(sp, idx) != 0)
> > > > +		return false;
> > > > +
> > > > +	/*
> > > > +	 * Since the caller recently flipped ->completed, we can see at
> > > > +	 * most one increment of each CPU's counter from this point
> > > > +	 * forward.  The reason for this is that the reader CPU must have
> > > > +	 * fetched the index before srcu_readers_active_idx checked
> > > > +	 * that CPU's counter, but not yet incremented its counter.
> > > > +	 * Its eventual counter increment will follow the read in
> > > > +	 * srcu_readers_active_idx(), and that increment is immediately
> > > > +	 * followed by smp_mb() B.  Because smp_mb() D is between
> > > > +	 * the ->completed flip and srcu_readers_active_idx()'s read,
> > > > +	 * that CPU's subsequent load of ->completed must see the new
> > > > +	 * value, and therefore increment the counter in the other rank.
> > > > +	 */
> > > > +	smp_mb(); /* A */
> > > > +
> > > > +	/*
> > > > +	 * Now, we check the ->snap array that srcu_readers_active_idx()
> > > > +	 * filled in from the per-CPU counter values.  Since both
> > > > +	 * __srcu_read_lock() and __srcu_read_unlock() increment the
> > > > +	 * upper bits of the per-CPU counter, an increment/decrement
> > > > +	 * pair will change the value of the counter.  Since there is
> > > > +	 * only one possible increment, the only way to wrap the counter
> > > > +	 * is to have a huge number of counter decrements, which requires
> > > > +	 * a huge number of tasks and huge SRCU read-side critical-section
> > > > +	 * nesting levels, even on 32-bit systems.
> > > > +	 *
> > > > +	 * All of the ways of confusing the readings require that the scan
> > > > +	 * in srcu_readers_active_idx() see the read-side task's decrement,
> > > > +	 * but not its increment.  However, between that decrement and
> > > > +	 * increment are smp_mb() B and C.  Either or both of these pair
> > > > +	 * with smp_mb() A above to ensure that the scan below will see
> > > > +	 * the read-side task's increment, thus noting a difference in
> > > > +	 * the counter values between the two passes.
> > > 
> > > Hi Paul,
> > > 
> > > I think the implementation is correct, but the explanation above might
> > > be improved. Let's consider the following a scenario, where a reader is
> > > migrated between increment of the counter and issuing the memory barrier
> > > in the read lock:
> > > 
> > > A,B,C are readers
> > > D is synchronize_rcu (one flip'n'wait)
> > > 
> > > CPU A               CPU B           CPU C         CPU D
> > >                                     c[1]++
> > >                                     smp_mb(1)
> > >                                                   read c[0] -> 0
> > > c[0]++
> > > (implicit smp_mb (2))
> > >       -> migrated ->
> > >                     (implicit smp_mb (3))
> > >                     smp_mb (4)
> > >                     smp_mb (5)
> > >                     c[1]--
> > >                                                   read c[1] -> -1
> > >                                                   read c[2] -> 1
> > >                                                   (false 0 sum)
> > >                                                   smp_mb (6)
> > >                                                   re-check each.
> > >                                     c[1]--
> > > 
> > > re-check: because we observed c[1] == -1, thanks to the implicit memory
> > > barriers within thread migration (2 and 3), we can assume that we _will_
> > > observe the updated value of c[0] after smp_mb (6).
> > > 
> > > The current explanation states that memory barriers 4 and 5, along with  
> > > 6, are responsible for ensuring that the increment will be observed by 
> > > the re-check. However, I doubt they have anything to do with it: it's 
> > > rather the implicit memory barriers in thread migration, along with
> > > program order guarantees on writes to the same address, that seem to be
> > > the reason why we can make this ordering assumption.
> > 
> > Please disregard the part about program order: CPU A writes to c[0], and
> > CPU B writes to c[1], which are two different memory locations. The rest
> > of my discussion stands though.
> > 
> > Simply reasoning about the write to c[0], memory barriers 2-3, the write to
> > c[1], along with the c[1] read, memory barrier 6, and then the c[0] read is
> > enough to explain the ordering guarantees you need, without invoking
> > program order.
> 
> I am assuming that if the scheduler migrates a process, it applies enough
> memory ordering to allow the proof to operate as if it had stayed on a
> single CPU throughout.

When applied to per-cpu variables, the reasoning cannot invoke program
order though, since we're touching two different memory locations.
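
Written out as a chain, reusing the labels from the diagram above (this is
only an illustration of the reasoning, not new code):

	migrating reader                        updater
	c[0]++;            /* on CPU A */
	smp_mb();          /* (2)+(3): migration */
	c[1]--;            /* on CPU B */
	                                        read c[1];   /* sees the -1 */
	                                        smp_mb();    /* (6) */
	                                        read c[0];   /* must see the ++ */

If the updater's read of c[1] observes the decrement, the barriers guarantee
that its re-read of c[0] observes the increment, which is all the re-check
needs.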

> The reasoning for this would consider the
> scheduler access and memory barriers

indeed,

>-- but there would be an arbitrarily
> large number of migration patterns, so I am not convinced that it would
> help...

This brings up the following question, then: which memory barriers in the
scheduler activity act as full memory barriers for migrated threads? I
see that the rq_lock is taken, but this lock is permeable in one
direction (operations can spill into the critical section). I'm probably
missing something else, but that something else needs to be
documented somewhere, since we are making a lot of assumptions based on
it.

Thanks,

Mathieu

> 
> 							Thanx, Paul
> 
> > Thanks,
> > 
> > Mathieu
> > 
> > > 
> > > Does it make sense, or shall I get another coffee to wake myself up ?
> > > ;)
> > > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > > > +	 *
> > > > +	 * Therefore, if srcu_readers_active_idx() returned zero, and
> > > > +	 * none of the counters changed, we know that the zero was the
> > > > +	 * correct sum.
> > > > +	 *
> > > > +	 * Of course, it is possible that a task might be delayed
> > > > +	 * for a very long time in __srcu_read_lock() after fetching
> > > > +	 * the index but before incrementing its counter.  This
> > > > +	 * possibility will be dealt with in __synchronize_srcu().
> > > > +	 */
> > > >  	for_each_possible_cpu(cpu)
> > > > -		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
> > > > -	return sum;
> > > > +		if (sp->snap[cpu] !=
> > > > +		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> > > > +			return false;  /* False zero reading! */
> > > > +	return true;
> > > >  }
> > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > Operating System Efficiency R&D Consultant
> > > EfficiOS Inc.
> > > http://www.efficios.com
> > 
> > -- 
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com
> > 
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-16 10:52       ` Peter Zijlstra
@ 2012-02-16 11:14         ` Mathieu Desnoyers
  0 siblings, 0 replies; 100+ messages in thread
From: Mathieu Desnoyers @ 2012-02-16 11:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-kernel, mingo, laijs, dipankar, akpm,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches, Avi Kiviti, Chris Mason,
	Eric Paris

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2012-02-16 at 05:50 -0500, Mathieu Desnoyers wrote:
> > > Ah, so something like this?
> > > 
> > >       ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 
> > >               SRCU_USAGE_COUNT + 1;
> > > 
> > > Now that you mention it, this does look nicer, applied here and to
> > > srcu_read_unlock().
> > 
> > I think Peter refers to __this_cpu_add(). 
> 
> I'm not sure that implies the ACCESS_ONCE() thing
> 

Fair point. The "generic" fallback for this_cpu_add does not imply the
ACCESS_ONCE() semantic AFAIK. I don't know if there would be a clean way
to get both without duplicating these operations needlessly.
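
For reference, the two forms being compared look roughly like this; the first
line is the one quoted above, the second is only a sketch of the suggested
alternative, not a quote of the actual macro expansion:

	/* applied version: the volatile access keeps the compiler from
	 * tearing or re-fetching the per-CPU counter update */
	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += SRCU_USAGE_COUNT + 1;

	/* __this_cpu_add() form; whether its generic fallback gives the
	 * same guarantee is exactly the open question here */
	__this_cpu_add(sp->per_cpu_ref->c[idx], SRCU_USAGE_COUNT + 1);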

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-16 11:00       ` Mathieu Desnoyers
@ 2012-02-16 11:51         ` Peter Zijlstra
  2012-02-16 12:18           ` Mathieu Desnoyers
  0 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-02-16 11:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Mathieu Desnoyers, linux-kernel, mingo, laijs,
	dipankar, akpm, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Thu, 2012-02-16 at 06:00 -0500, Mathieu Desnoyers wrote:
> This brings up the following question, then: which memory barriers in the
> scheduler activity act as full memory barriers for migrated threads? I
> see that the rq_lock is taken, but this lock is permeable in one
> direction (operations can spill into the critical section). I'm probably
> missing something else, but that something else needs to be
> documented somewhere, since we are making a lot of assumptions based on
> it.

A migration consists of two context switches, one switching out the task
on the old cpu, and one switching in the task on the new cpu.

Now on x86 all the rq->lock grabbery is plenty implied memory barriers
to make anybody happy.

But I think, since there's guaranteed order (can't switch to before
switching from) you can match the UNLOCK from the switch-from to the
LOCK from the switch-to to make your complete MB.

Does that work or do we need more?
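
Schematically, the suggested pairing for a task T that migrates (rq labels
are only illustrative):

	CPU 0 (old)                          CPU 1 (new)
	switch-out of T
	UNLOCK(rq0->lock)
	                  ... migration ...
	                                     LOCK(rq1->lock)
	                                     switch-in of T

T's accesses before the UNLOCK cannot move past it, T's accesses after the
LOCK cannot move before it, and the switch-in is ordered after the
switch-out, so for T the pair behaves like a full memory barrier.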


* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-16 11:51         ` Peter Zijlstra
@ 2012-02-16 12:18           ` Mathieu Desnoyers
  2012-02-16 12:44             ` Peter Zijlstra
  0 siblings, 1 reply; 100+ messages in thread
From: Mathieu Desnoyers @ 2012-02-16 12:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Mathieu Desnoyers, linux-kernel, mingo, laijs,
	dipankar, akpm, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2012-02-16 at 06:00 -0500, Mathieu Desnoyers wrote:
> > This brings up the following question, then: which memory barriers in the
> > scheduler activity act as full memory barriers for migrated threads? I
> > see that the rq_lock is taken, but this lock is permeable in one
> > direction (operations can spill into the critical section). I'm probably
> > missing something else, but that something else needs to be
> > documented somewhere, since we are making a lot of assumptions based on
> > it.
> 
> A migration consists of two context switches, one switching out the task
> on the old cpu, and one switching in the task on the new cpu.

If we have memory barriers on both context switches, then we should be
good. I just fail to see them.

> Now on x86 all the rq->lock grabbery is plenty implied memory barriers
> to make anybody happy.

Indeed. Outside of x86 it is far less certain, though.

> But I think, since there's guaranteed order (can't switch to before
> switching from) you can match the UNLOCK from the switch-from to the
> LOCK from the switch-to to make your complete MB.
> 
> Does that work or do we need more?

Hrm, I think we'd need a little more than just lock/unlock ordering
guarantees. Let's consider the following, where the stores would be
expected to be seen as "store A before store B" by CPU 2

CPU 0             CPU 1               CPU 2

                                      load B, smp_rmb, load A in loop,
                                      expecting that when updated A is
                                      observed, B is always observed as
                                      updated too.
store A
(lock is permeable:
outside can leak
inside)
lock(rq->lock)

      -> migration ->

                  unlock(rq->lock)
                  (lock is permeable:
                  outside can leak inside)
                  store B

As we can see, the "store A" could theoretically still be pending in
CPU 0's write buffers when store B occurs, because the memory barrier
associated with "lock" only has acquire semantics (so memory operations
prior to the lock can leak into the critical section).

Given that the unlock(rq->lock) on CPU 0 is not guaranteed to happen within
a bounded time frame, no memory barrier with release semantics can be
assumed to have happened. This could happen if we have a long critical
section holding the rq->lock on CPU 0, and a much shorter critical
section on CPU 1.

Does that make sense, or should I get my first morning coffee ? :)

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-16 12:18           ` Mathieu Desnoyers
@ 2012-02-16 12:44             ` Peter Zijlstra
  2012-02-16 14:52               ` Mathieu Desnoyers
  2012-02-16 15:13               ` Paul E. McKenney
  0 siblings, 2 replies; 100+ messages in thread
From: Peter Zijlstra @ 2012-02-16 12:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Mathieu Desnoyers, linux-kernel, mingo, laijs,
	dipankar, akpm, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Thu, 2012-02-16 at 07:18 -0500, Mathieu Desnoyers wrote:
> 
> Hrm, I think we'd need a little more than just lock/unlock ordering
> guarantees. Let's consider the following, where the stores would be
> expected to be seen as "store A before store B" by CPU 2
> 
> CPU 0             CPU 1               CPU 2
> 
>                                       load B, smp_rmb, load A in loop,
>                                       expecting that when updated A is
>                                       observed, B is always observed as
>                                       updated too.
> store A
> (lock is permeable:
> outside can leak
> inside)
> lock(rq->lock)
> 
>       -> migration ->
> 
>                   unlock(rq->lock)
>                   (lock is permeable:
>                   outside can leak inside)
>                   store B

You got the pairing the wrong way around, I suggested:

  store A

  switch-out
    UNLOCK

  	-> migration ->

			switch-in
			  LOCK

			store B

While both LOCK and UNLOCK are semi-permeable, A won't pass the UNLOCK
and B won't pass the LOCK.

Yes, A can pass switch-out LOCK, but that doesn't matter much since the
switch-in cannot happen until we've passed UNLOCK.

And yes B can pass switch-in UNLOCK, but again, I can't see that being a
problem since the LOCK will avoid it being visible before A.

> Does that make sense, or should I get my first morning coffee ? :) 

Probably.. but that's not saying I'm not wrong ;-)


* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-16 12:44             ` Peter Zijlstra
@ 2012-02-16 14:52               ` Mathieu Desnoyers
  2012-02-16 14:58                 ` Peter Zijlstra
  2012-02-16 15:13               ` Paul E. McKenney
  1 sibling, 1 reply; 100+ messages in thread
From: Mathieu Desnoyers @ 2012-02-16 14:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Mathieu Desnoyers, linux-kernel, mingo, laijs,
	dipankar, akpm, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2012-02-16 at 07:18 -0500, Mathieu Desnoyers wrote:
> > 
> > Hrm, I think we'd need a little more than just lock/unlock ordering
> > guarantees. Let's consider the following, where the stores would be
> > expected to be seen as "store A before store B" by CPU 2
> > 
> > CPU 0             CPU 1               CPU 2
> > 
> >                                       load B, smp_rmb, load A in loop,
> >                                       expecting that when updated A is
> >                                       observed, B is always observed as
> >                                       updated too.
> > store A
> > (lock is permeable:
> > outside can leak
> > inside)
> > lock(rq->lock)
> > 
> >       -> migration ->
> > 
> >                   unlock(rq->lock)
> >                   (lock is permeable:
> >                   outside can leak inside)
> >                   store B
> 
> You got the pairing the wrong way around, I suggested:
> 
>   store A
> 
>   switch-out
>     UNLOCK
> 
>   	-> migration ->
> 
> 			switch-in
> 			  LOCK
> 
> 			store B
> 
> While both LOCK and UNLOCK are semi-permeable, A won't pass the UNLOCK
> and B won't pass the LOCK.
> 
> Yes, A can pass switch-out LOCK, but that doesn't matter much since the
> switch-in cannot happen until we've passed UNLOCK.
> 
> And yes B can pass switch-in UNLOCK, but again, I can't see that being a
> problem since the LOCK will avoid it being visible before A.

Ah, so this is what I missed: the context switch has its lock/unlock
pair, the following migration is performed under its own lock/unlock
pair, and the following context switch also has its lock/unlock pair. So
yes, this should be sufficient to act as a full memory barrier.

> 
> > Does that make sense, or should I get my first morning coffee ? :) 
> 
> Probably.. but that's not saying I'm not wrong ;-)

It does pass my 1st morning coffee test still, so it looks good, at
least to me. :-)

Back to the initial subject: I think it would be important for general
code understanding that when RCU plays tricks on per-cpu variables
based on scheduler-migration memory-ordering assumptions, it says
so explicitly, rather than claiming that the memory barriers match
those at RCU read lock/unlock sites, which is not quite right.
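
One way that assumption could be spelled out near the counter checks (the
wording below is only a suggestion, not text from the patch):

	/*
	 * If a reader migrates between updating a per-CPU counter and
	 * executing its smp_mb(), the scheduler's rq->lock UNLOCK on the
	 * outgoing CPU pairs with the rq->lock LOCK on the incoming CPU,
	 * ordering everything the reader did before the migration before
	 * everything it does afterwards.  The reasoning here may therefore
	 * treat each reader as if it had run on a single CPU throughout.
	 */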

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-16 14:52               ` Mathieu Desnoyers
@ 2012-02-16 14:58                 ` Peter Zijlstra
  0 siblings, 0 replies; 100+ messages in thread
From: Peter Zijlstra @ 2012-02-16 14:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Mathieu Desnoyers, linux-kernel, mingo, laijs,
	dipankar, akpm, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Thu, 2012-02-16 at 09:52 -0500, Mathieu Desnoyers wrote:
> Ah, so this is what I missed: the context switch has its lock/unlock
> pair, the following migration is performed under its own lock/unlock
> pair, and the following context switch also has its lock/unlock pair. So
> yes, this should be sufficient to act as a full memory barrier. 

In fact it typically looks like:

 LOCK(rq0->lock)
  switch-from-A
 UNLOCK(rq0->lock);

	LOCK(rq0->lock);
	LOCK(rq1->lock);
	  migrate-A
	UNLOCK(rq1->lock);
	UNLOCK(rq0->lock);

		LOCK(rq1->lock);
		  switch-to-A
		UNLOCK(rq1->lock);

the migrate taking both locks involved guarantees that the switch-to
always comes _after_ the switch-from. The migrate might be performed on
a completely unrelated cpu, although typically it would be either the old
(push) or the new (pull).


* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-16 12:44             ` Peter Zijlstra
  2012-02-16 14:52               ` Mathieu Desnoyers
@ 2012-02-16 15:13               ` Paul E. McKenney
  1 sibling, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-16 15:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Mathieu Desnoyers, linux-kernel, mingo, laijs,
	dipankar, akpm, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Thu, Feb 16, 2012 at 01:44:41PM +0100, Peter Zijlstra wrote:
> On Thu, 2012-02-16 at 07:18 -0500, Mathieu Desnoyers wrote:
> > 
> > Hrm, I think we'd need a little more than just lock/unlock ordering
> > guarantees. Let's consider the following, where the stores would be
> > expected to be seen as "store A before store B" by CPU 2
> > 
> > CPU 0             CPU 1               CPU 2
> > 
> >                                       load B, smp_rmb, load A in loop,
> >                                       expecting that when updated A is
> >                                       observed, B is always observed as
> >                                       updated too.
> > store A
> > (lock is permeable:
> > outside can leak
> > inside)
> > lock(rq->lock)
> > 
> >       -> migration ->
> > 
> >                   unlock(rq->lock)
> >                   (lock is permeable:
> >                   outside can leak inside)
> >                   store B
> 
> You got the pairing the wrong way around, I suggested:
> 
>   store A
> 
>   switch-out
>     UNLOCK
> 
>   	-> migration ->
> 
> 			switch-in
> 			  LOCK
> 
> 			store B
> 
> While both LOCK and UNLOCK are semi-permeable, A won't pass the UNLOCK
> and B won't pass the LOCK.
> 
> Yes, A can pass switch-out LOCK, but that doesn't matter much since the
> switch-in cannot happen until we've passed UNLOCK.
> 
> And yes B can pass switch-in UNLOCK, but again, I can't see that being a
> problem since the LOCK will avoid it being visible before A.
> 
> > Does that make sense, or should I get my first morning coffee ? :) 
> 
> Probably.. but that's not saying I'm not wrong ;-)

It does look good to me, but given that I don't drink coffee, you should
take that with a large grain of salt.

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-13  2:09 [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation Paul E. McKenney
  2012-02-15 12:59 ` Peter Zijlstra
  2012-02-15 14:31 ` Mathieu Desnoyers
@ 2012-02-20  7:15 ` Lai Jiangshan
  2012-02-20 17:44   ` Paul E. McKenney
  2 siblings, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-02-20  7:15 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On 02/13/2012 10:09 AM, Paul E. McKenney wrote:

>  /*
>   * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
>   */
> -static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
> +static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
>  {
>  	int idx;
>  
> @@ -178,90 +316,51 @@ static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
>  			   !lock_is_held(&rcu_sched_lock_map),
>  			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
>  
> -	idx = sp->completed;
> +	idx = ACCESS_ONCE(sp->completed);
>  	mutex_lock(&sp->mutex);
>  
>  	/*
>  	 * Check to see if someone else did the work for us while we were
> -	 * waiting to acquire the lock.  We need -two- advances of
> +	 * waiting to acquire the lock.  We need -three- advances of
>  	 * the counter, not just one.  If there was but one, we might have
>  	 * shown up -after- our helper's first synchronize_sched(), thus
>  	 * having failed to prevent CPU-reordering races with concurrent
> -	 * srcu_read_unlock()s on other CPUs (see comment below).  So we
> -	 * either (1) wait for two or (2) supply the second ourselves.
> +	 * srcu_read_unlock()s on other CPUs (see comment below).  If there
> +	 * was only two, we are guaranteed to have waited through only one
> +	 * full index-flip phase.  So we either (1) wait for three or
> +	 * (2) supply the additional ones we need.
>  	 */
>  
> -	if ((sp->completed - idx) >= 2) {
> +	if (sp->completed == idx + 2)
> +		idx = 1;
> +	else if (sp->completed == idx + 3) {
>  		mutex_unlock(&sp->mutex);
>  		return;
> -	}
> -
> -	sync_func();  /* Force memory barrier on all CPUs. */
> +	} else
> +		idx = 0;


Hi, Paul

I don't think this check-and-return path is needed since we will introduce call_srcu().
We just need correct code to show how it works and to be used for a while,
and the new call_srcu() will be implemented based on this correct code, which will then be removed.

And I think this unneeded check-and-return path is incorrect. See the following:

Reader			Updater						Helper thread
			old_ptr = rcu_ptr;
			/* rcu_ptr = NULL; but be reordered to (1) */
			start synchronize_srcu()
			idx = ACCESS_ONCE(sp->completed);(2)
									synchronize_srcu()
									synchronize_srcu()
srcu_read_lock();
old_ptr = rcu_ptr;
			rcu_ptr = NULL;(1)
			mutex_lock() and read sp->completed
			and return from synchronize_srcu()
			free(old_ptr);
use freed old_ptr
srcu_read_unlock();


So, we need an smp_mb() between (1) and (2) to force the order.

__synchronize_srcu() {
	smp_mb(); /* F */
	idx = ACCESS_ONCE(sp->completed); /* (2) */
	....
}

And this smp_mb() F is paired with the helper's smp_mb() D. So if the Updater sees X advances of
->completed, the Updater must have seen X *full* flip_and_wait() cycles. So we need to see only -two-
advances of ->completed from the Helper, not -three-.

        if (sp->completed == idx + 1)
                idx = 1;
        else if (sp->completed == idx + 2) {
                mutex_unlock(&sp->mutex);
                return;
        } else
                idx = 0;


Or simpler:

__synchronize_srcu() {
	unsigned int idx;   /* <-------- unsigned */

	/* comments for smp_mb() F */
	smp_mb(); /* F */
	idx = ACCESS_ONCE(sp->completed);

	mutex_lock(&sp->mutex);
	idx = sp->completed - idx;

	/* original comments */
	for (; idx < 2; idx++)
                flip_idx_and_wait(sp, expedited);
        mutex_unlock(&sp->mutex);
}

Finally, I can't understand the comments of this check-and-return path,
so maybe the above reply and I are totally wrong.
But the comments of this check-and-return path do not describe the code
well (especially the ordering), and they contain the old "synchronize_sched()",
which confuses me.

My conclusion, we can just remove the check-and-return path to reduce
the complexity since we will introduce call_srcu().

This new srcu is great, especially the SRCU_USAGE_COUNT for every
lock/unlock, which forces any increment/decrement pair to change the counter.
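
To make that property concrete, here is a tiny stand-alone sketch using the
constants from the patch (illustration-only user-space code, not kernel code):

	#include <stdio.h>
	#include <limits.h>

	#define SRCU_USAGE_BITS		2
	#define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
	#define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)

	int main(void)
	{
		unsigned long snap = 42;	/* what ->snap[cpu] recorded */
		unsigned long c = snap;

		c += SRCU_USAGE_COUNT + 1;	/* __srcu_read_lock()   */
		c += SRCU_USAGE_COUNT - 1;	/* __srcu_read_unlock() */

		/* The reference part is back to 42, but the usage bits have
		 * moved, so the ->snap comparison still notices the pair. */
		printf("ref=%lu changed=%s\n", c & SRCU_REF_MASK,
		       c == snap ? "no" : "yes");
		return 0;
	}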


Thanks,
Lai



* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-20  7:15 ` Lai Jiangshan
@ 2012-02-20 17:44   ` Paul E. McKenney
  2012-02-21  1:11     ` Lai Jiangshan
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-20 17:44 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Mon, Feb 20, 2012 at 03:15:33PM +0800, Lai Jiangshan wrote:
> On 02/13/2012 10:09 AM, Paul E. McKenney wrote:
> 
> >  /*
> >   * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
> >   */
> > -static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
> > +static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
> >  {
> >  	int idx;
> >  
> > @@ -178,90 +316,51 @@ static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void))
> >  			   !lock_is_held(&rcu_sched_lock_map),
> >  			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
> >  
> > -	idx = sp->completed;
> > +	idx = ACCESS_ONCE(sp->completed);
> >  	mutex_lock(&sp->mutex);
> >  
> >  	/*
> >  	 * Check to see if someone else did the work for us while we were
> > -	 * waiting to acquire the lock.  We need -two- advances of
> > +	 * waiting to acquire the lock.  We need -three- advances of
> >  	 * the counter, not just one.  If there was but one, we might have
> >  	 * shown up -after- our helper's first synchronize_sched(), thus
> >  	 * having failed to prevent CPU-reordering races with concurrent
> > -	 * srcu_read_unlock()s on other CPUs (see comment below).  So we
> > -	 * either (1) wait for two or (2) supply the second ourselves.
> > +	 * srcu_read_unlock()s on other CPUs (see comment below).  If there
> > +	 * was only two, we are guaranteed to have waited through only one
> > +	 * full index-flip phase.  So we either (1) wait for three or
> > +	 * (2) supply the additional ones we need.
> >  	 */
> >  
> > -	if ((sp->completed - idx) >= 2) {
> > +	if (sp->completed == idx + 2)
> > +		idx = 1;
> > +	else if (sp->completed == idx + 3) {
> >  		mutex_unlock(&sp->mutex);
> >  		return;
> > -	}
> > -
> > -	sync_func();  /* Force memory barrier on all CPUs. */
> > +	} else
> > +		idx = 0;
> 
> 
> Hi, Paul
> 
> I don't think this check-and-return path is needed since we will introduce call_srcu().
> 
> We just need correct code to show how it works and to be used for a while,
> and the new call_srcu() will be implemented based on this correct code, which will then be removed.

Hello, Lai!

Yep, this code will be replaced with a state machine driven by callbacks.

> And I think this unneeded check-and-return path is incorrect. See the following:
> 
> Reader			Updater						Helper thread
> 			old_ptr = rcu_ptr;
> 			/* rcu_ptr = NULL; but be reordered to (1) */
> 			start synchronize_srcu()
> 			idx = ACCESS_ONCE(sp->completed);(2)
> 									synchronize_srcu()
> 									synchronize_srcu()
> srcu_read_lock();
> old_ptr = rcu_ptr;
> 			rcu_ptr = NULL;(1)
> 			mutex_lock() and read sp->completed
> 			and return from synchronize_srcu()
> 			free(old_ptr);
> use freed old_ptr
> srcu_read_unlock();
> 
> 
> So, we need an smp_mb() between (1) and (2) to force the order.
> 
> __synchronize_srcu() {
> 	smp_mb(); /* F */
> 	idx = ACCESS_ONCE(sp->completed); /* (2) */

And one here as well because mutex_lock() allows code to bleed in from
outside the critical section.
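
Concretely, the agreed fix amounts to something like the following ordering
at the top of __synchronize_srcu(); the same lines can be seen (being removed
again) in Lai's first patch later in the thread:

	smp_mb();  /* Ensure prior action happens before grace period. */
	idx = ACCESS_ONCE(sp->completed);
	smp_mb();  /* Access to ->completed before lock acquisition. */
	mutex_lock(&sp->mutex);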

> 	....
> }

Good catch!  And shows the limitations of testing -- I hit this pretty
hard and didn't get a failure.  I was focused so much on the complex
part of the patch that I failed to get the simple stuff right!!!

Shows the value of the Linux community's review processes, I guess.  ;-)

> And this smp_mb() F is paired with the helper's smp_mb() D. So if the Updater sees X advances of
> ->completed, the Updater must have seen X *full* flip_and_wait() cycles. So we need to see only -two-
> advances of ->completed from the Helper, not -three-.

Hmmm...  Let's see...  The case I was worried about is where the updater
samples ->completed just before it is incremented, then samples it again
just after it is incremented a second time.  So you are arguing that this
cannot happen because the second sample occurs after acquiring the lock,
so that the second flip-and-wait cycle has to have already completed,
correct?

So waiting for three is appropriate for mutex_trylock(), but is overly
conservative for mutex_lock().

>         if (sp->completed == idx + 1)
>                 idx = 1;
>         else if (sp->completed == idx + 2) {
>                 mutex_unlock(&sp->mutex);
>                 return;
>         } else
>                 idx = 0;
> 
> 
> Or simpler:
> 
> __synchronize_srcu() {
> 	unsigned int idx;   /* <-------- unsigned */
> 
> 	/* comments for smp_mb() F */
> 	smp_mb(); /* F */
> 	idx = ACCESS_ONCE(sp->completed);
> 
> 	mutex_lock(&sp->mutex);
> 	idx = sp->completed - idx;
> 
> 	/* original comments */
> 	for (; idx < 2; idx++)
>                 flip_idx_and_wait(sp, expedited);
>         mutex_unlock(&sp->mutex);
> }
> 
> Finally, I can't understand the comments of this check-and-return path,
> so maybe the above reply and I are totally wrong.

I -think- you might be correct, but my approach is going to be to implement
call_srcu() which will eliminate this anyway.

> But the comments of this check-and-return path do not describe the code
> well (especially the ordering), and they contain the old "synchronize_sched()",
> which confuses me.

The diffs are confusing -- I have to look at the actual code in this case.

> My conclusion, we can just remove the check-and-return path to reduce
> the complexity since we will introduce call_srcu().

If I actually submit the above upstream, that would be quite reasonable.
My thought is that patch remains RFC and the upstream version has
call_srcu().

> This new srcu is great, especially the SRCU_USAGE_COUNT for every
> lock/unlock, which forces any increment/decrement pair to change the counter.

Glad you like it!  ;-)

And thank you for your review and feedback!

							Thanx, Paul

> Thanks,
> Lai
> 



* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-20 17:44   ` Paul E. McKenney
@ 2012-02-21  1:11     ` Lai Jiangshan
  2012-02-21  1:50       ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-02-21  1:11 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On 02/21/2012 01:44 AM, Paul E. McKenney wrote:

> 
>> My conclusion, we can just remove the check-and-return path to reduce
>> the complexity since we will introduce call_srcu().
> 
> If I actually submit the above upstream, that would be quite reasonable.
> My thought is that patch remains RFC and the upstream version has
> call_srcu().

Has the work on call_srcu() been started or drafted?

> 
>> This new srcu is great, especially the SRCU_USAGE_COUNT for every
>> lock/unlock, which forces any increment/decrement pair to change the counter.
> 
> Glad you like it!  ;-)
> 
> And thank you for your review and feedback!

Could you add my Reviewed-by when this patch is last submitted?


Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>

Thanks
Lai


* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-21  1:11     ` Lai Jiangshan
@ 2012-02-21  1:50       ` Paul E. McKenney
  2012-02-21  8:44         ` Lai Jiangshan
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-21  1:50 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Tue, Feb 21, 2012 at 09:11:47AM +0800, Lai Jiangshan wrote:
> On 02/21/2012 01:44 AM, Paul E. McKenney wrote:
> 
> > 
> >> My conclusion, we can just remove the check-and-return path to reduce
> >> the complexity since we will introduce call_srcu().
> > 
> > If I actually submit the above upstream, that would be quite reasonable.
> > My thought is that patch remains RFC and the upstream version has
> > call_srcu().
> 
> Has the work on call_srcu() been started or drafted?

I do have a draft design, and am currently beating it into shape.
No actual code yet, though.  The general idea at the moment is as follows:

o	The state machine must be preemptible.	I recently received
	a bug report about 200-microsecond latency spikes on a system
	with more than a thousand CPUs, so the summation of the per-CPU
	counters and subsequent recheck cannot be in a preempt-disable
	region.  I am therefore currently thinking in terms of a kthread.

o	At the moment, having a per-srcu_struct kthread seems excessive.
	I am planning on a single kthread to do the counter summation
	and checking.  Further parallelism might be useful in the future,
	but I would want to see someone run into problems before adding
	more complexity.

o	There needs to be a linked list of srcu_struct structures so
	that they can be traversed by the state-machine kthread.

o	If there are expedited SRCU callbacks anywhere, the kthread
	would scan through the list of srcu_struct structures quickly
	(perhaps pausing a few microseconds between).  If there are no
	expedited SRCU callbacks, the kthread would wait a jiffy or so
	between scans.

o	If a given srcu_struct structure has been scanned too many times
	(say, more than ten times) while waiting for the counters to go
	to zero, it loses expeditedness.  It makes no sense for the kthread
	to go CPU-bound just because some SRCU reader somewhere is blocked
	in its SRCU read-side critical section.

o	Expedited SRCU callbacks cannot be delayed by normal SRCU
	callbacks, but neither can expedited callbacks be allowed to
	starve normal callbacks.  I am thinking in terms of invoking these
	from softirq context, with a pair of multi-tailed callback queues
	per CPU, stored in the same structure as the per-CPU counters.

o	There are enough srcu_struct structures in the Linux kernel that
	it does not make sense to force softirq to dig through them all
	any time any one of them has callbacks ready to invoke.  One way
	to deal with this is to have a per-CPU set of linked lists
	of srcu_struct_array structures, so that the kthread enqueues
	a given structure when it transitions to having callbacks ready
	to invoke, and softirq dequeues it.  This can be done locklessly
	given that there is only one producer and one consumer.

o	We can no longer use the trick of pushing callbacks to another
	CPU from the CPU_DYING notifier because it is likely that CPU
	hotplug will stop using stop_cpus().  I am therefore thinking
	in terms of a set of orphanages (two for normal, two more for
	expedited -- one set of each for callbacks ready to invoke,
	the other for still-waiting callbacks).

o	There will need to be an srcu_barrier() that can be called
	before cleanup_srcu_struct().  Otherwise, someone will end up
	freeing up an srcu_struct that still has callbacks outstanding.

But what did you have in mind?

> >> This new srcu is great, especially the SRCU_USAGE_COUNT for every
> >> lock/unlock, which forces any increment/decrement pair to change the counter.
> > 
> > Glad you like it!  ;-)
> > 
> > And thank you for your review and feedback!
> 
> Could you add my Reviewed-by when this patch is last submitted?
> 
> 
> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>

Will do, thank you!

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-21  1:50       ` Paul E. McKenney
@ 2012-02-21  8:44         ` Lai Jiangshan
  2012-02-21 17:24           ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-02-21  8:44 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On 02/21/2012 09:50 AM, Paul E. McKenney wrote:
> On Tue, Feb 21, 2012 at 09:11:47AM +0800, Lai Jiangshan wrote:
>> On 02/21/2012 01:44 AM, Paul E. McKenney wrote:
>>
>>>
>>>> My conclusion, we can just remove the check-and-return path to reduce
>>>> the complexity since we will introduce call_srcu().
>>>
>>> If I actually submit the above upstream, that would be quite reasonable.
>>> My thought is that patch remains RFC and the upstream version has
>>> call_srcu().
>>
>> Has the work on call_srcu() been started or drafted?
> 
> I do have a draft design, and am currently beating it into shape.
> No actual code yet, though.  The general idea at the moment is as follows:

If you don't mind, I will implement it. (It requires your new version of the SRCU implementation.)

> 
> o	The state machine must be preemptible.	I recently received
> 	a bug report about 200-microsecond latency spikes on a system
> 	with more than a thousand CPUs, so the summation of the per-CPU
> 	counters and subsequent recheck cannot be in a preempt-disable
> 	region.  I am therefore currently thinking in terms of a kthread.

sure.

Addition:
	SRCU callbacks must run in process context and be allowed to sleep
	(and therefore they finish in no particular order). We can't change
	the current rcu-callback type, so we must design a new one for srcu.
	This does not introduce any complexity: we reuse the workqueue.
	All ready srcu-callbacks will be delivered to a workqueue,
	so the state-machine thread only does counter summation,
	checking, and delivering.

	The workqueue does all the complex work for us; it will automatically
	allocate threads when needed (when a callback sleeps, or when
	there are too many ready callbacks).

	struct srcu_head is a little bigger than struct rcu_head;
	it contains a union for a struct work_struct.

	(synchronize_srcu()'s callbacks are handled specially.)
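
	A rough sketch of that shape (field layout and names are only
	illustrative, nothing is settled here):

	struct srcu_head {
		void (*func)(struct srcu_head *head);
		union {
			struct srcu_head *next;		/* while waiting for a GP   */
			struct work_struct work;	/* once ready to be invoked */
		};
	};

	The union works because the ->next linkage is no longer needed by the
	time the callback has been handed to the workqueue, so srcu_head only
	grows to the size of a work_struct plus one function pointer.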

> 
> o	At the moment, having a per-srcu_struct kthread seems excessive.
> 	I am planning on a single kthread to do the counter summation
> 	and checking.  Further parallelism might be useful in the future,
> 	but I would want to see someone run into problems before adding
> 	more complexity.

Simple is important; I vote for a single kthread to do the counter summation
and checking, while leaving it convenient to introduce parallel threads
easily later.

> 
> o	There needs to be a linked list of srcu_struct structures so
> 	that they can be traversed by the state-machine kthread.
> 
> o	If there are expedited SRCU callbacks anywhere, the kthread
> 	would scan through the list of srcu_struct structures quickly
> 	(perhaps pausing a few microseconds between).  If there are no
> 	expedited SRCU callbacks, the kthread would wait a jiffy or so
> 	between scans.

Sure.
But I think generic expedited SRCU callbacks make no sense;
we could just allow expedited SRCU callbacks for synchronize_srcu_expedited()
only.

> 
> o	If a given srcu_struct structure has been scanned too many times
> 	(say, more than ten times) while waiting for the counters to go
> 	to zero, it loses expeditedness.  It makes no sense for the kthread
> 	to go CPU-bound just because some SRCU reader somewhere is blocked
> 	in its SRCU read-side critical section.
> 
> o	Expedited SRCU callbacks cannot be delayed by normal SRCU
> 	callbacks, but neither can expedited callbacks be allowed to
> 	starve normal callbacks.  I am thinking in terms of invoking these
> 	from softirq context, with a pair of multi-tailed callback queues
> 	per CPU, stored in the same structure as the per-CPU counters.

In my additional design, callbacks do not run in softirq context,
and the state-machine thread is not delayed by any normal callbacks.

But all of synchronize_srcu()'s callbacks are done in the state-machine thread
(each is just a wakeup), not in the workqueue.

Since we allow expedited SRCU callbacks for synchronize_srcu_expedited() only,
Expedited SRCU callbacks will not be delayed by normal srcu callbacks.

I don't think we need to use per CPU callback queues, since SRCU callbacks
are rare(compared to rcu callbacks)

> 
> > o	There are enough srcu_struct structures in the Linux kernel that
> 	it does not make sense to force softirq to dig through them all
> 	any time any one of them has callbacks ready to invoke.  One way
> 	to deal with this is to have a per-CPU set of linked lists of
> 	of srcu_struct_array structures, so that the kthread enqueues
> 	a given structure when it transitions to having callbacks ready
> 	to invoke, and softirq dequeues it.  This can be done locklessly
> 	given that there is only one producer and one consumer.
> 
> o	We can no longer use the trick of pushing callbacks to another
> 	CPU from the CPU_DYING notifier because it is likely that CPU
> 	hotplug will stop using stop_cpus().  I am therefore thinking
> 	in terms of a set of orphanages (two for normal, two more for
> 	expedited -- one set of each for callbacks ready to invoke,
> 	the other for still-waiting callbacks).

no such issues in srcu.c if we use workqueue.

> 
> o	There will need to be an srcu_barrier() that can be called
> 	before cleanup_srcu_struct().  Otherwise, someone will end up
> 	freeing up an srcu_struct that still has callbacks outstanding.

Sure.

When we use a workqueue, the delivery is ordered.

srcu_barrier()
{
	synchronize_srcu();
	flush_workqueue();
}


> 
> But what did you have in mind?

I agree with most of your design, but my ideas of relaxing the restrictions
on callbacks and of using workqueues make things simpler.

Thanks,
Lai

> 
>>>> This new srcu is great, especially the SRCU_USAGE_COUNT for every
>>>> lock/unlock, which forces any increment/decrement pair to change the counter.
>>>
>>> Glad you like it!  ;-)
>>>
>>> And thank you for your review and feedback!
>>
>> Could you add my Reviewed-by when this patch is last submitted?
>>
>>
>> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> 
> Will do, thank you!
> 
> 							Thanx, Paul
> 
> 



* Re: [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation
  2012-02-21  8:44         ` Lai Jiangshan
@ 2012-02-21 17:24           ` Paul E. McKenney
  2012-02-22  9:29             ` [PATCH 1/3 RFC paul/rcu/srcu] srcu: Remove fast check path Lai Jiangshan
                               ` (3 more replies)
  0 siblings, 4 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-21 17:24 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Tue, Feb 21, 2012 at 04:44:22PM +0800, Lai Jiangshan wrote:
> On 02/21/2012 09:50 AM, Paul E. McKenney wrote:
> > On Tue, Feb 21, 2012 at 09:11:47AM +0800, Lai Jiangshan wrote:
> >> On 02/21/2012 01:44 AM, Paul E. McKenney wrote:
> >>
> >>>
> >>>> My conclusion, we can just remove the check-and-return path to reduce
> >>>> the complexity since we will introduce call_srcu().
> >>>
> >>> If I actually submit the above upstream, that would be quite reasonable.
> >>> My thought is that patch remains RFC and the upstream version has
> >>> call_srcu().
> >>
> >> Has the work on call_srcu() been started or drafted?
> > 
> > I do have a draft design, and am currently beating it into shape.
> > No actual code yet, though.  The general idea at the moment is as follows:
> 
> If you don't mind, I will implement it. (It requires your new version of the SRCU implementation.)

I would very much welcome a patch from you for call_srcu()!

I have an rcu/srcu branch for this purpose at:

	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

> > o	The state machine must be preemptible.	I recently received
> > 	a bug report about 200-microsecond latency spikes on a system
> > 	with more than a thousand CPUs, so the summation of the per-CPU
> > 	counters and subsequent recheck cannot be in a preempt-disable
> > 	region.  I am therefore currently thinking in terms of a kthread.
> 
> sure.
> 
> Addition:
> 	SRCU callbacks must run in process context and be allowed to sleep
> 	(and therefore they finish in no particular order). We can't change
> 	the current rcu-callback type, so we must design a new one for srcu.
> 	This does not introduce any complexity: we reuse the workqueue.
> 	All ready srcu-callbacks will be delivered to a workqueue,
> 	so the state-machine thread only does counter summation,
> 	checking, and delivering.
> 
> 	The workqueue does all the complex work for us; it will automatically
> 	allocate threads when needed (when a callback sleeps, or when
> 	there are too many ready callbacks).
> 
> 	struct srcu_head is a little bigger than struct rcu_head;
> 	it contains a union for a struct work_struct.
> 
> 	(synchronize_srcu()'s callbacks are handled specially.)

Full separation of counter summation etc. from callback invocation sounds
very good.  There does need to be something like an srcu_barrier(), so the
SRCU callbacks posted by a given CPU do need to be invoked in the order
that they were posted (unless you have some other trick in mind, in which
case please let us all know what it is).  Given the need to keep things
in order, I do not believe that we can allow the SRCU callbacks to sleep.
Note that there is no need for srcu_barrier() to wait on the expedited
SRCU callbacks -- there should not be a call_srcu_expedited(), so the
only way for the callbacks to appear is synchronize_srcu_expedited(),
which does not return until after its callback has been invoked.
This means that expedited SRCU callbacks need not be ordered -- in fact,
the expedited path need not use callbacks at all, if that works better.
(I was thinking in terms of expedited callbacks for SRCU because that
makes the state machine simpler.)

That said, the idea of invoking them from a workqueue seems reasonable;
for example, you can do local_bh_disable() around each invocation.  Doing
this also gives us the flexibility to move SRCU callback invocation into
softirq if that proves necessary for whatever reason.  (Yes, I do still
well remember the 3.0 fun and excitement with RCU and kthreads!)

But I have to ask...  Why does SRCU need more space than RCU?  Can't
the selection of workqueue be implicit based on which CPU the callback
is queued on?

> > o	At the moment, having a per-srcu_struct kthread seems excessive.
> > 	I am planning on a single kthread to do the counter summation
> > 	and checking.  Further parallelism might be useful in the future,
> > 	but I would want to see someone run into problems before adding
> > 	more complexity.
> 
> Simple is important; I vote for a single kthread to do the counter summation
> and checking, while leaving it convenient to introduce parallel threads
> easily later.

Very good!

> > o	There needs to be a linked list of srcu_struct structures so
> > 	that they can be traversed by the state-machine kthread.
> > 
> > o	If there are expedited SRCU callbacks anywhere, the kthread
> > 	would scan through the list of srcu_struct structures quickly
> > 	(perhaps pausing a few microseconds between).  If there are no
> > 	expedited SRCU callbacks, the kthread would wait a jiffy or so
> > 	between scans.
> 
> Sure.
> But I think generic expedited SRCU callbacks make no sense;
> we could just allow expedited SRCU callbacks for synchronize_srcu_expedited()
> only.

Agreed -- We don't have call_rcu_expedited(), so we should definitely
not provide call_srcu_expedited().

> > o	If a given srcu_struct structure has been scanned too many times
> > 	(say, more than ten times) while waiting for the counters to go
> > 	to zero, it loses expeditedness.  It makes no sense for the kthread
> > 	to go CPU-bound just because some SRCU reader somewhere is blocked
> > 	in its SRCU read-side critical section.
> > 
> > o	Expedited SRCU callbacks cannot be delayed by normal SRCU
> > 	callbacks, but neither can expedited callbacks be allowed to
> > 	starve normal callbacks.  I am thinking in terms of invoking these
> > 	from softirq context, with a pair of multi-tailed callback queues
> > 	per CPU, stored in the same structure as the per-CPU counters.
> 
> In my additional design, callbacks do not run in softirq context,
> and the state-machine thread is not delayed by any normal callbacks.
> 
> But all of synchronize_srcu()'s callbacks are done in the state-machine thread
> (each is just a wakeup), not in the workqueue.

So the idea is that the function pointer is a task_struct pointer in this
case, and that you check for a text address to decide which?  In any case,
a per-callback wake up should be OK.
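
If it helps, the conventional way to get a per-callback wakeup would mirror
the existing rcu_synchronize/wakeme_after_rcu() pattern; a sketch, reusing
the hypothetical srcu_head type discussed above:

	struct srcu_synchronize {
		struct srcu_head head;
		struct completion completion;
	};

	static void wakeme_after_srcu(struct srcu_head *head)
	{
		struct srcu_synchronize *s =
			container_of(head, struct srcu_synchronize, head);

		complete(&s->completion);
	}

synchronize_srcu() would post this callback and then wait_for_completion(),
which matches the "just a wakeup" case described above.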

> Since we allow expedited SRCU callbacks for synchronize_srcu_expedited() only,
> Expedited SRCU callbacks will not be delayed by normal srcu callbacks.

OK, good.

> I don't think we need to use per CPU callback queues, since SRCU callbacks
> are rare(compared to rcu callbacks)

Indeed, everything currently in the kernel would turn into a wakeup.  ;-)

> > > o	There are enough srcu_struct structures in the Linux kernel that
> > 	it does not make sense to force softirq to dig through them all
> > 	any time any one of them has callbacks ready to invoke.  One way
> > 	to deal with this is to have a per-CPU set of linked lists of
> > 	of srcu_struct_array structures, so that the kthread enqueues
> > 	a given structure when it transitions to having callbacks ready
> > 	to invoke, and softirq dequeues it.  This can be done locklessly
> > 	given that there is only one producer and one consumer.
> > 
> > o	We can no longer use the trick of pushing callbacks to another
> > 	CPU from the CPU_DYING notifier because it is likely that CPU
> > 	hotplug will stop using stop_cpus().  I am therefore thinking
> > 	in terms of a set of orphanages (two for normal, two more for
> > 	expedited -- one set of each for callbacks ready to invoke,
> > 	the other for still-waiting callbacks).
> 
> no such issues in srcu.c if we use workqueue.

OK, sounds good.

> > o	There will need to be an srcu_barrier() that can be called
> > 	before cleanup_srcu_struct().  Otherwise, someone will end up
> > 	freeing up an srcu_struct that still has callbacks outstanding.
> 
> Sure.
> 
> When we use a workqueue, the delivery is ordered.
> 
> srcu_barrier()
> {
> 	synchronize_srcu();
> 	flush_workqueue();
> }

Would this handle the case where the SRCU kthread migrated from one CPU
to another?  My guess is that you would need to flush all the workqueues
that might have ever had SRCU-related work pending on them.
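
One hedged way around that would be to give SRCU callbacks a single dedicated
workqueue, so there is exactly one thing to flush no matter where the kthread
ran (srcu_wq is hypothetical, not an existing symbol):

	static struct workqueue_struct *srcu_wq;  /* all SRCU callbacks land here */

	void srcu_barrier(struct srcu_struct *sp)
	{
		/* After a full grace period, every callback posted before this
		 * call has been handed to srcu_wq; then drain the workqueue. */
		synchronize_srcu(sp);
		flush_workqueue(srcu_wq);
	}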

> > But what did you have in mind?
> 
> I agree with most of your design, but my ideas of relaxing the restrictions
> on callbacks and of using workqueues make things simpler.

Your approach looks good in general.  My main concerns are:

1.	We should prohibit sleeping in SRCU callback functions, at least
	for the near future -- just in case something forces us to move
	to softirq.

2.	If SRCU callbacks can use rcu_head, that would be good -- I don't
	yet see why any additional space is needed.

							Thanx, Paul

> Thanks,
> Lai
> 
> > 
> >>>> This new srcu is great, especially the SRCU_USAGE_COUNT for every
> >>>> lock/unlock, which forces any increment/decrement pair to change the counter.
> >>>
> >>> Glad you like it!  ;-)
> >>>
> >>> And thank you for your review and feedback!
> >>
> >> Could you add my Reviewed-by when this patch is last submitted?
> >>
> >>
> >> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > 
> > Will do, thank you!
> > 
> > 							Thanx, Paul
> > 
> > 
> 



* [PATCH 1/3 RFC paul/rcu/srcu] srcu: Remove fast check path
  2012-02-21 17:24           ` Paul E. McKenney
@ 2012-02-22  9:29             ` Lai Jiangshan
  2012-02-22  9:29             ` [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock() Lai Jiangshan
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-02-22  9:29 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

Hi, All

These three patches reduce the states of srcu.
call_srcu() will be implemented based on this new algorithm if it has no problems.

It is an aggressive algorithm and it needs more review, so please examine it critically
and leave your comments.

Thanks,
Lai.

From abe3fd64d08f74f13e8111e333a9790e9e6d782c Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <laijs@cn.fujitsu.com>
Date: Wed, 22 Feb 2012 10:15:48 +0800
Subject: [PATCH 1/3 RFC paul/rcu/srcu] srcu: Remove fast check path

This fast-check path is used for optimizing the situation
where there are many concurrent update sites.

But we have no such situation in the current kernel,
and it introduces complexity, so we just remove it.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 kernel/srcu.c |   25 +------------------------
 1 files changed, 1 insertions(+), 24 deletions(-)

diff --git a/kernel/srcu.c b/kernel/srcu.c
index 84c9b97..17e95bc 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -308,7 +308,7 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
  */
 static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
 {
-	int idx;
+	int idx = 0;
 
 	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
 			   !lock_is_held(&rcu_bh_lock_map) &&
@@ -316,32 +316,9 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
 			   !lock_is_held(&rcu_sched_lock_map),
 			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
 
-	smp_mb();  /* Ensure prior action happens before grace period. */
-	idx = ACCESS_ONCE(sp->completed);
-	smp_mb();  /* Access to ->completed before lock acquisition. */
 	mutex_lock(&sp->mutex);
 
 	/*
-	 * Check to see if someone else did the work for us while we were
-	 * waiting to acquire the lock.  We need -three- advances of
-	 * the counter, not just one.  If there was but one, we might have
-	 * shown up -after- our helper's first synchronize_sched(), thus
-	 * having failed to prevent CPU-reordering races with concurrent
-	 * srcu_read_unlock()s on other CPUs (see comment below).  If there
-	 * was only two, we are guaranteed to have waited through only one
-	 * full index-flip phase.  So we either (1) wait for three or
-	 * (2) supply the additional ones we need.
-	 */
-
-	if (sp->completed == idx + 2)
-		idx = 1;
-	else if (sp->completed == idx + 3) {
-		mutex_unlock(&sp->mutex);
-		return;
-	} else
-		idx = 0;
-
-	/*
 	 * If there were no helpers, then we need to do two flips of
 	 * the index.  The first flip is required if there are any
 	 * outstanding SRCU readers even if there are no new readers
-- 
1.7.4.4


* [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()
  2012-02-21 17:24           ` Paul E. McKenney
  2012-02-22  9:29             ` [PATCH 1/3 RFC paul/rcu/srcu] srcu: Remove fast check path Lai Jiangshan
@ 2012-02-22  9:29             ` Lai Jiangshan
  2012-02-22  9:50               ` Peter Zijlstra
  2012-02-22 21:20               ` Paul E. McKenney
  2012-02-22  9:29             ` [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period Lai Jiangshan
  2012-03-06  8:42             ` [RFC PATCH 0/6 paul/rcu/srcu] srcu: implement call_srcu() Lai Jiangshan
  3 siblings, 2 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-02-22  9:29 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

From de49bb517e6367776e2226b931346ab6c798b122 Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <laijs@cn.fujitsu.com>
Date: Wed, 22 Feb 2012 10:41:59 +0800
Subject: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()

The algorithm and its smp_mb()s ensure that 'there is only one srcu_read_lock()
between the flip and the recheck for a cpu'.
Therefore, incrementing the upper bit in srcu_read_lock() only is enough to
ensure that a single pair of lock/unlock changes the cpu counter.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 include/linux/srcu.h |    2 +-
 kernel/srcu.c        |   11 +++++------
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index a478c8e..5b49d41 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -35,7 +35,7 @@ struct srcu_struct_array {
 };
 
 /* Bit definitions for field ->c above and ->snap below. */
-#define SRCU_USAGE_BITS		2
+#define SRCU_USAGE_BITS		1
 #define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
 #define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
 
diff --git a/kernel/srcu.c b/kernel/srcu.c
index 17e95bc..a51ac48 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -138,10 +138,10 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
 
 	/*
 	 * Now, we check the ->snap array that srcu_readers_active_idx()
-	 * filled in from the per-CPU counter values.  Since both
-	 * __srcu_read_lock() and __srcu_read_unlock() increment the
-	 * upper bits of the per-CPU counter, an increment/decrement
-	 * pair will change the value of the counter.  Since there is
+	 * filled in from the per-CPU counter values. Since
+	 * __srcu_read_lock() increments the upper bits of
+	 * the per-CPU counter, an increment/decrement pair will
+	 * change the value of the counter.  Since there is
 	 * only one possible increment, the only way to wrap the counter
 	 * is to have a huge number of counter decrements, which requires
 	 * a huge number of tasks and huge SRCU read-side critical-section
@@ -234,8 +234,7 @@ void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
 	preempt_disable();
 	smp_mb(); /* C */  /* Avoid leaking the critical section. */
-	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
-		SRCU_USAGE_COUNT - 1;
+	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += -1;
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(__srcu_read_unlock);
-- 
1.7.4.4

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period
  2012-02-21 17:24           ` Paul E. McKenney
  2012-02-22  9:29             ` [PATCH 1/3 RFC paul/rcu/srcu] srcu: Remove fast check path Lai Jiangshan
  2012-02-22  9:29             ` [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock() Lai Jiangshan
@ 2012-02-22  9:29             ` Lai Jiangshan
  2012-02-23  1:01               ` Paul E. McKenney
  2012-02-24  8:06               ` Lai Jiangshan
  2012-03-06  8:42             ` [RFC PATCH 0/6 paul/rcu/srcu] srcu: implement call_srcu() Lai Jiangshan
  3 siblings, 2 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-02-22  9:29 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

>From 4ddf62aaf2c4ebe6b9d4a1c596e8b43a678f1f0d Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <laijs@cn.fujitsu.com>
Date: Wed, 22 Feb 2012 14:12:02 +0800
Subject: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period

flip_idx_and_wait() is not changed; it is just split into two functions,
and only a short comment is added for smp_mb() E.

__synchronize_srcu() uses a different algorithm to handle "leaked" readers;
the details are in the comments of the patch.
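
In rough form, the resulting update-side flow is (a simplified sketch of the
code in the diff below, with lockdep details omitted):

	static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
	{
		mutex_lock(&sp->mutex);

		/*
		 * Wait out "leaked" readers left over from the previous grace
		 * period; they hold the currently inactive index.
		 */
		wait_idx(sp, (sp->completed - 1) & 0x1, expedited);

		/* Then do a single flip and wait for the now-old index. */
		flip_idx_and_wait(sp, expedited);

		mutex_unlock(&sp->mutex);
	}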

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 kernel/srcu.c |  105 ++++++++++++++++++++++++++++++++++----------------------
 1 files changed, 64 insertions(+), 41 deletions(-)

diff --git a/kernel/srcu.c b/kernel/srcu.c
index a51ac48..346f9d7 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -249,6 +249,37 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
  */
 #define SYNCHRONIZE_SRCU_READER_DELAY 5
 
+static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
+{
+	int trycount = 0;
+
+	/*
+	 * SRCU read-side critical sections are normally short, so wait
+	 * a small amount of time before possibly blocking.
+	 */
+	if (!srcu_readers_active_idx_check(sp, idx)) {
+		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
+		while (!srcu_readers_active_idx_check(sp, idx)) {
+			if (expedited && ++ trycount < 10)
+				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
+			else
+				schedule_timeout_interruptible(1);
+		}
+	}
+
+	/*
+	 * The following smp_mb() E pairs with srcu_read_unlock()'s
+	 * smp_mb() C to ensure that if srcu_readers_active_idx_check()
+	 * sees srcu_read_unlock()'s counter decrement, then any
+	 * of the current task's subsequent code will happen after
+	 * that SRCU read-side critical section.
+	 *
+	 * It also ensures the order between the above waiting and
+	 * the next flipping.
+	 */
+	smp_mb(); /* E */
+}
+
 /*
  * Flip the readers' index by incrementing ->completed, then wait
  * until there are no more readers using the counters referenced by
@@ -258,12 +289,12 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
  * Of course, it is possible that a reader might be delayed for the
  * full duration of flip_idx_and_wait() between fetching the
  * index and incrementing its counter.  This possibility is handled
- * by __synchronize_srcu() invoking flip_idx_and_wait() twice.
+ * by the next __synchronize_srcu() invoking wait_idx() for such readers
+ * before starting a new grace period.
  */
 static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
 {
 	int idx;
-	int trycount = 0;
 
 	idx = sp->completed++ & 0x1;
 
@@ -278,28 +309,7 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
 	 */
 	smp_mb(); /* D */
 
-	/*
-	 * SRCU read-side critical sections are normally short, so wait
-	 * a small amount of time before possibly blocking.
-	 */
-	if (!srcu_readers_active_idx_check(sp, idx)) {
-		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
-		while (!srcu_readers_active_idx_check(sp, idx)) {
-			if (expedited && ++ trycount < 10)
-				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
-			else
-				schedule_timeout_interruptible(1);
-		}
-	}
-
-	/*
-	 * The following smp_mb() E pairs with srcu_read_unlock()'s
-	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
-	 * sees srcu_read_unlock()'s counter decrement, then any
-	 * of the current task's subsequent code will happen after
-	 * that SRCU read-side critical section.
-	 */
-	smp_mb(); /* E */
+	wait_idx(sp, idx, expedited);
 }
 
 /*
@@ -307,8 +317,6 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
  */
 static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
 {
-	int idx = 0;
-
 	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
 			   !lock_is_held(&rcu_bh_lock_map) &&
 			   !lock_is_held(&rcu_lock_map) &&
@@ -318,27 +326,42 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
 	mutex_lock(&sp->mutex);
 
 	/*
-	 * If there were no helpers, then we need to do two flips of
-	 * the index.  The first flip is required if there are any
-	 * outstanding SRCU readers even if there are no new readers
-	 * running concurrently with the first counter flip.
-	 *
-	 * The second flip is required when a new reader picks up
+	 * Suppose that in the previous grace period a reader picks up
 	 * the old value of the index, but does not increment its
 	 * counter until after its counters is summed/rechecked by
-	 * srcu_readers_active_idx_check().  In this case, the current SRCU
+	 * srcu_readers_active_idx_check(). In this case, the previous SRCU
 	 * grace period would be OK because the SRCU read-side critical
-	 * section started after this SRCU grace period started, so the
+	 * section started after the SRCU grace period started, so the
 	 * grace period is not required to wait for the reader.
 	 *
-	 * However, the next SRCU grace period would be waiting for the
-	 * other set of counters to go to zero, and therefore would not
-	 * wait for the reader, which would be very bad.  To avoid this
-	 * bad scenario, we flip and wait twice, clearing out both sets
-	 * of counters.
+	 * However, such leftover readers affect this new SRCU grace period,
+	 * so we have to wait for them.  This wait_idx() should be
+	 * considered as the wait_idx() in the flip_idx_and_wait() of
+	 * the previous grace period, except that it is for leftover readers
+	 * started before this synchronize_srcu().  So when it returns,
+	 * there are no leftover readers that started before this grace period.
+	 *
+	 * If some leftover readers do not increment their
+	 * counters until after those counters are summed/rechecked by
+	 * srcu_readers_active_idx_check(), then this SRCU
+	 * grace period is still OK, as the comments above say.  We define
+	 * such readers as leftover-leftover readers and consider them
+	 * to have fetched an index of (sp->completed + 1), which means they
+	 * are treated exactly the same as readers that start after this
+	 * grace period.
+	 *
+	 * wait_idx() is expected to be very fast, because leftover readers
+	 * are unlikely to occur.
 	 */
-	for (; idx < 2; idx++)
-		flip_idx_and_wait(sp, expedited);
+	wait_idx(sp, (sp->completed - 1) & 0x1, expedited);
+
+	/*
+	 * Start a new grace period; this flip is required if there are
+	 * any outstanding SRCU readers, even if there are no new readers
+	 * running concurrently with the counter flip.
+	 */
+	flip_idx_and_wait(sp, expedited);
+
 	mutex_unlock(&sp->mutex);
 }
 
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()
  2012-02-22  9:29             ` [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock() Lai Jiangshan
@ 2012-02-22  9:50               ` Peter Zijlstra
  2012-02-22 21:20               ` Paul E. McKenney
  1 sibling, 0 replies; 100+ messages in thread
From: Peter Zijlstra @ 2012-02-22  9:50 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: paulmck, linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Wed, 2012-02-22 at 17:29 +0800, Lai Jiangshan wrote:
> +       ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += -1;

That just looks funny.. 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()
  2012-02-22  9:29             ` [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock() Lai Jiangshan
  2012-02-22  9:50               ` Peter Zijlstra
@ 2012-02-22 21:20               ` Paul E. McKenney
  2012-02-22 21:26                 ` Paul E. McKenney
  1 sibling, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-22 21:20 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Wed, Feb 22, 2012 at 05:29:32PM +0800, Lai Jiangshan wrote:
> >From de49bb517e6367776e2226b931346ab6c798b122 Mon Sep 17 00:00:00 2001
> From: Lai Jiangshan <laijs@cn.fujitsu.com>
> Date: Wed, 22 Feb 2012 10:41:59 +0800
> Subject: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()
> 
> The algorithm/smp_mb()s ensure 'there is only one srcu_read_lock()
> between flip and recheck for a cpu'.
> Increment of the upper bit for srcu_read_lock() only can
> ensure a single pair of lock/unlock change the cpu counter.

Very nice!  Also makes it more clear in that no combination of operations
including exactly one increment can possibly wrap back to the same value,
because the upper bit would be different.

In deference to Peter Zijlstra's sensibilities, I changed the:

	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += -1;

to:

	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) -= 1;

I did manage to resist the temptation to instead say:

	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) -= +1;

;-)

Queued, thank you!

							Thanx, Paul

> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> ---
>  include/linux/srcu.h |    2 +-
>  kernel/srcu.c        |   11 +++++------
>  2 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index a478c8e..5b49d41 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -35,7 +35,7 @@ struct srcu_struct_array {
>  };
> 
>  /* Bit definitions for field ->c above and ->snap below. */
> -#define SRCU_USAGE_BITS		2
> +#define SRCU_USAGE_BITS		1
>  #define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
>  #define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
> 
> diff --git a/kernel/srcu.c b/kernel/srcu.c
> index 17e95bc..a51ac48 100644
> --- a/kernel/srcu.c
> +++ b/kernel/srcu.c
> @@ -138,10 +138,10 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> 
>  	/*
>  	 * Now, we check the ->snap array that srcu_readers_active_idx()
> -	 * filled in from the per-CPU counter values.  Since both
> -	 * __srcu_read_lock() and __srcu_read_unlock() increment the
> -	 * upper bits of the per-CPU counter, an increment/decrement
> -	 * pair will change the value of the counter.  Since there is
> +	 * filled in from the per-CPU counter values. Since
> +	 * __srcu_read_lock() increments the upper bits of
> +	 * the per-CPU counter, an increment/decrement pair will
> +	 * change the value of the counter.  Since there is
>  	 * only one possible increment, the only way to wrap the counter
>  	 * is to have a huge number of counter decrements, which requires
>  	 * a huge number of tasks and huge SRCU read-side critical-section
> @@ -234,8 +234,7 @@ void __srcu_read_unlock(struct srcu_struct *sp, int idx)
>  {
>  	preempt_disable();
>  	smp_mb(); /* C */  /* Avoid leaking the critical section. */
> -	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> -		SRCU_USAGE_COUNT - 1;
> +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += -1;
>  	preempt_enable();
>  }
>  EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> -- 
> 1.7.4.4
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()
  2012-02-22 21:20               ` Paul E. McKenney
@ 2012-02-22 21:26                 ` Paul E. McKenney
  2012-02-22 21:39                   ` Steven Rostedt
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-22 21:26 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Wed, Feb 22, 2012 at 01:20:56PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 22, 2012 at 05:29:32PM +0800, Lai Jiangshan wrote:
> > >From de49bb517e6367776e2226b931346ab6c798b122 Mon Sep 17 00:00:00 2001
> > From: Lai Jiangshan <laijs@cn.fujitsu.com>
> > Date: Wed, 22 Feb 2012 10:41:59 +0800
> > Subject: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()
> > 
> > The algorithm/smp_mb()s ensure 'there is only one srcu_read_lock()
> > between flip and recheck for a cpu'.
> > Increment of the upper bit for srcu_read_lock() only can
> > ensure a single pair of lock/unlock change the cpu counter.
> 
> Very nice!  Also makes is more clear in that no combination of operations
> including exactly one increment can possibly wrap back to the same value,
> because the upper bit would be different.

Make that without underflow -- one increment and 2^31+1 decrements would
in fact return the counter to its original value, but that would require
cramming more than two billion tasks into a 32-bit address space, which
I believe to be sufficiently unlikely.  (Famous last words...)
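
For concreteness, a quick user-space check of that arithmetic with a 32-bit
counter (illustration only, not kernel code):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint32_t c = 0;

		c += (1u << 31) + 1;			/* one __srcu_read_lock() */
		c -= (uint32_t)((1ull << 31) + 1);	/* 2^31+1 unlock decrements, folded */
		printf("%u\n", c);			/* prints 0: wrapped via underflow */
		return 0;
	}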

							Thanx, Paul

> In deference to Peter Zijlstra's sensibilities, I changed the:
> 
> 	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += -1;
> 
> to:
> 
> 	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) -= 1;
> 
> I did manage to resist the temptation to instead say:
> 
> 	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) -= +1;
> 
> ;-)
> 
> Queued, thank you!
> 
> 							Thanx, Paul
> 
> > Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > ---
> >  include/linux/srcu.h |    2 +-
> >  kernel/srcu.c        |   11 +++++------
> >  2 files changed, 6 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> > index a478c8e..5b49d41 100644
> > --- a/include/linux/srcu.h
> > +++ b/include/linux/srcu.h
> > @@ -35,7 +35,7 @@ struct srcu_struct_array {
> >  };
> > 
> >  /* Bit definitions for field ->c above and ->snap below. */
> > -#define SRCU_USAGE_BITS		2
> > +#define SRCU_USAGE_BITS		1
> >  #define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
> >  #define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
> > 
> > diff --git a/kernel/srcu.c b/kernel/srcu.c
> > index 17e95bc..a51ac48 100644
> > --- a/kernel/srcu.c
> > +++ b/kernel/srcu.c
> > @@ -138,10 +138,10 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> > 
> >  	/*
> >  	 * Now, we check the ->snap array that srcu_readers_active_idx()
> > -	 * filled in from the per-CPU counter values.  Since both
> > -	 * __srcu_read_lock() and __srcu_read_unlock() increment the
> > -	 * upper bits of the per-CPU counter, an increment/decrement
> > -	 * pair will change the value of the counter.  Since there is
> > +	 * filled in from the per-CPU counter values. Since
> > +	 * __srcu_read_lock() increments the upper bits of
> > +	 * the per-CPU counter, an increment/decrement pair will
> > +	 * change the value of the counter.  Since there is
> >  	 * only one possible increment, the only way to wrap the counter
> >  	 * is to have a huge number of counter decrements, which requires
> >  	 * a huge number of tasks and huge SRCU read-side critical-section
> > @@ -234,8 +234,7 @@ void __srcu_read_unlock(struct srcu_struct *sp, int idx)
> >  {
> >  	preempt_disable();
> >  	smp_mb(); /* C */  /* Avoid leaking the critical section. */
> > -	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> > -		SRCU_USAGE_COUNT - 1;
> > +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += -1;
> >  	preempt_enable();
> >  }
> >  EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> > -- 
> > 1.7.4.4
> > 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()
  2012-02-22 21:26                 ` Paul E. McKenney
@ 2012-02-22 21:39                   ` Steven Rostedt
  2012-02-23  1:01                     ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Steven Rostedt @ 2012-02-22 21:39 UTC (permalink / raw)
  To: paulmck
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Wed, 2012-02-22 at 13:26 -0800, Paul E. McKenney wrote:

> Make that without underflow -- one increment and 2^31+1 decrements would
> in fact return the counter to its original value, but that would require
> cramming more than two billion tasks into a 32-bit address space, which
> I believe to be sufficiently unlikely.  (Famous last words...)

I'll just expect to see you as President of the United States, counting
your money you won in the lottery, and being awarded a Nobel Prize for
curing cancer.

-- Steve




^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period
  2012-02-22  9:29             ` [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period Lai Jiangshan
@ 2012-02-23  1:01               ` Paul E. McKenney
  2012-02-24  8:06               ` Lai Jiangshan
  1 sibling, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-23  1:01 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Wed, Feb 22, 2012 at 05:29:36PM +0800, Lai Jiangshan wrote:
> >From 4ddf62aaf2c4ebe6b9d4a1c596e8b43a678f1f0d Mon Sep 17 00:00:00 2001
> From: Lai Jiangshan <laijs@cn.fujitsu.com>
> Date: Wed, 22 Feb 2012 14:12:02 +0800
> Subject: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period
> 
> flip_idx_and_wait() is not changed, and is split as two functions
> and only a short comments is added for smp_mb() E.
> 
> __synchronize_srcu() use a different algorithm for "leak" readers.
> detail is in the comments of the patch.
> 
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>

And I queued this one as well, with some adjustment to the comments.

These are now available at:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/srcu

Assuming testing goes well, these might go into 3.4.

							Thanx, Paul

> ---
>  kernel/srcu.c |  105 ++++++++++++++++++++++++++++++++++----------------------
>  1 files changed, 64 insertions(+), 41 deletions(-)
> 
> diff --git a/kernel/srcu.c b/kernel/srcu.c
> index a51ac48..346f9d7 100644
> --- a/kernel/srcu.c
> +++ b/kernel/srcu.c
> @@ -249,6 +249,37 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>   */
>  #define SYNCHRONIZE_SRCU_READER_DELAY 5
> 
> +static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> +{
> +	int trycount = 0;
> +
> +	/*
> +	 * SRCU read-side critical sections are normally short, so wait
> +	 * a small amount of time before possibly blocking.
> +	 */
> +	if (!srcu_readers_active_idx_check(sp, idx)) {
> +		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> +		while (!srcu_readers_active_idx_check(sp, idx)) {
> +			if (expedited && ++ trycount < 10)
> +				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> +			else
> +				schedule_timeout_interruptible(1);
> +		}
> +	}
> +
> +	/*
> +	 * The following smp_mb() E pairs with srcu_read_unlock()'s
> +	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
> +	 * sees srcu_read_unlock()'s counter decrement, then any
> +	 * of the current task's subsequent code will happen after
> +	 * that SRCU read-side critical section.
> +	 *
> +	 * It also ensures the order between the above waiting and
> +	 * the next flipping.
> +	 */
> +	smp_mb(); /* E */
> +}
> +
>  /*
>   * Flip the readers' index by incrementing ->completed, then wait
>   * until there are no more readers using the counters referenced by
> @@ -258,12 +289,12 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>   * Of course, it is possible that a reader might be delayed for the
>   * full duration of flip_idx_and_wait() between fetching the
>   * index and incrementing its counter.  This possibility is handled
> - * by __synchronize_srcu() invoking flip_idx_and_wait() twice.
> + * by the next __synchronize_srcu() invoking wait_idx() for such readers
> + * before start a new grace perioad.
>   */
>  static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
>  {
>  	int idx;
> -	int trycount = 0;
> 
>  	idx = sp->completed++ & 0x1;
> 
> @@ -278,28 +309,7 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
>  	 */
>  	smp_mb(); /* D */
> 
> -	/*
> -	 * SRCU read-side critical sections are normally short, so wait
> -	 * a small amount of time before possibly blocking.
> -	 */
> -	if (!srcu_readers_active_idx_check(sp, idx)) {
> -		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> -		while (!srcu_readers_active_idx_check(sp, idx)) {
> -			if (expedited && ++ trycount < 10)
> -				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> -			else
> -				schedule_timeout_interruptible(1);
> -		}
> -	}
> -
> -	/*
> -	 * The following smp_mb() E pairs with srcu_read_unlock()'s
> -	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
> -	 * sees srcu_read_unlock()'s counter decrement, then any
> -	 * of the current task's subsequent code will happen after
> -	 * that SRCU read-side critical section.
> -	 */
> -	smp_mb(); /* E */
> +	wait_idx(sp, idx, expedited);
>  }
> 
>  /*
> @@ -307,8 +317,6 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
>   */
>  static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
>  {
> -	int idx = 0;
> -
>  	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
>  			   !lock_is_held(&rcu_bh_lock_map) &&
>  			   !lock_is_held(&rcu_lock_map) &&
> @@ -318,27 +326,42 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
>  	mutex_lock(&sp->mutex);
> 
>  	/*
> -	 * If there were no helpers, then we need to do two flips of
> -	 * the index.  The first flip is required if there are any
> -	 * outstanding SRCU readers even if there are no new readers
> -	 * running concurrently with the first counter flip.
> -	 *
> -	 * The second flip is required when a new reader picks up
> +	 * When in the previous grace perioad, if a reader picks up
>  	 * the old value of the index, but does not increment its
>  	 * counter until after its counters is summed/rechecked by
> -	 * srcu_readers_active_idx_check().  In this case, the current SRCU
> +	 * srcu_readers_active_idx_check(). In this case, the previous SRCU
>  	 * grace period would be OK because the SRCU read-side critical
> -	 * section started after this SRCU grace period started, so the
> +	 * section started after the SRCU grace period started, so the
>  	 * grace period is not required to wait for the reader.
>  	 *
> -	 * However, the next SRCU grace period would be waiting for the
> -	 * other set of counters to go to zero, and therefore would not
> -	 * wait for the reader, which would be very bad.  To avoid this
> -	 * bad scenario, we flip and wait twice, clearing out both sets
> -	 * of counters.
> +	 * However, such leftover readers affect this new SRCU grace period.
> +	 * So we have to wait for such readers. This wait_idx() should be
> +	 * considerred as the wait_idx() in the flip_idx_and_wait() of
> +	 * the previous grace perioad except that it is for leftover readers
> +	 * started before this synchronize_srcu(). So when it returns,
> +	 * there is no leftover readers that starts before this grace period.
> +	 *
> +	 * If there are some leftover readers that do not increment its
> +	 * counter until after its counters is summed/rechecked by
> +	 * srcu_readers_active_idx_check(), In this case, this SRCU
> +	 * grace period would be OK as above comments says. We defines
> +	 * such readers as leftover-leftover readers, we consider these
> +	 * readers fteched index of (sp->completed + 1), it means they
> +	 * are considerred as exactly the same as the readers after this
> +	 * grace period.
> +	 *
> +	 * wait_idx() is expected very fast, because leftover readers
> +	 * are unlikely produced.
>  	 */
> -	for (; idx < 2; idx++)
> -		flip_idx_and_wait(sp, expedited);
> +	wait_idx(sp, (sp->completed - 1) & 0x1, expedited);
> +
> +	/*
> +	 * Starts a new grace period, this flip is required if there are
> +	 * any outstanding SRCU readers even if there are no new readers
> +	 * running concurrently with the counter flip.
> +	 */
> +	flip_idx_and_wait(sp, expedited);
> +
>  	mutex_unlock(&sp->mutex);
>  }
> 
> -- 
> 1.7.4.4
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock()
  2012-02-22 21:39                   ` Steven Rostedt
@ 2012-02-23  1:01                     ` Paul E. McKenney
  0 siblings, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-23  1:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Wed, Feb 22, 2012 at 04:39:15PM -0500, Steven Rostedt wrote:
> On Wed, 2012-02-22 at 13:26 -0800, Paul E. McKenney wrote:
> 
> > Make that without underflow -- one increment and 2^31+1 decrements would
> > in fact return the counter to its original value, but that would require
> > cramming more than two billion tasks into a 32-bit address space, which
> > I believe to be sufficiently unlikely.  (Famous last words...)
> 
> I'll just expect to see you as President of the United States, counting
> your money you won in the lottery, and being awarded a Nobel Prize for
> curing cancer.

Those possibilities also seem to me to be sufficiently unlikely.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period
  2012-02-22  9:29             ` [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period Lai Jiangshan
  2012-02-23  1:01               ` Paul E. McKenney
@ 2012-02-24  8:06               ` Lai Jiangshan
  2012-02-24 20:01                 ` Paul E. McKenney
  1 sibling, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-02-24  8:06 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On 02/22/2012 05:29 PM, Lai Jiangshan wrote:
>>From 4ddf62aaf2c4ebe6b9d4a1c596e8b43a678f1f0d Mon Sep 17 00:00:00 2001
> From: Lai Jiangshan <laijs@cn.fujitsu.com>
> Date: Wed, 22 Feb 2012 14:12:02 +0800
> Subject: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period
> 
> flip_idx_and_wait() is not changed, and is split as two functions
> and only a short comments is added for smp_mb() E.
> 
> __synchronize_srcu() use a different algorithm for "leak" readers.
> detail is in the comments of the patch.
> 
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> ---
>  kernel/srcu.c |  105 ++++++++++++++++++++++++++++++++++----------------------
>  1 files changed, 64 insertions(+), 41 deletions(-)
> 
> diff --git a/kernel/srcu.c b/kernel/srcu.c
> index a51ac48..346f9d7 100644
> --- a/kernel/srcu.c
> +++ b/kernel/srcu.c
> @@ -249,6 +249,37 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>   */
>  #define SYNCHRONIZE_SRCU_READER_DELAY 5
>  
> +static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> +{
> +	int trycount = 0;

Hi, Paul

	smp_mb() D also needs to be moved here; could you fix that before pushing it?
I had always thought of the smp_mb() as being here already; that was a wrong assumption.
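
To be concrete, this is the placement I mean (just a sketch on top of the
queued wait_idx(), not a tested patch):

	static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
	{
		int trycount = 0;

		/*
		 * smp_mb() D, moved here from flip_idx_and_wait(), orders the
		 * caller's prior accesses (including the ->completed flip)
		 * before the counter checks below.
		 */
		smp_mb(); /* D */

		if (!srcu_readers_active_idx_check(sp, idx)) {
			udelay(SYNCHRONIZE_SRCU_READER_DELAY);
			while (!srcu_readers_active_idx_check(sp, idx)) {
				if (expedited && ++trycount < 10)
					udelay(SYNCHRONIZE_SRCU_READER_DELAY);
				else
					schedule_timeout_interruptible(1);
			}
		}

		smp_mb(); /* E */  /* Pairs with C, as described in the comment above. */
	}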

Sorry.

Thanks,
Lai

> +
> +	/*
> +	 * SRCU read-side critical sections are normally short, so wait
> +	 * a small amount of time before possibly blocking.
> +	 */
> +	if (!srcu_readers_active_idx_check(sp, idx)) {
> +		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> +		while (!srcu_readers_active_idx_check(sp, idx)) {
> +			if (expedited && ++ trycount < 10)
> +				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> +			else
> +				schedule_timeout_interruptible(1);
> +		}
> +	}
> +
> +	/*
> +	 * The following smp_mb() E pairs with srcu_read_unlock()'s
> +	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
> +	 * sees srcu_read_unlock()'s counter decrement, then any
> +	 * of the current task's subsequent code will happen after
> +	 * that SRCU read-side critical section.
> +	 *
> +	 * It also ensures the order between the above waiting and
> +	 * the next flipping.
> +	 */
> +	smp_mb(); /* E */
> +}
> +
>  /*
>   * Flip the readers' index by incrementing ->completed, then wait
>   * until there are no more readers using the counters referenced by
> @@ -258,12 +289,12 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>   * Of course, it is possible that a reader might be delayed for the
>   * full duration of flip_idx_and_wait() between fetching the
>   * index and incrementing its counter.  This possibility is handled
> - * by __synchronize_srcu() invoking flip_idx_and_wait() twice.
> + * by the next __synchronize_srcu() invoking wait_idx() for such readers
> + * before start a new grace perioad.
>   */
>  static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
>  {
>  	int idx;
> -	int trycount = 0;
>  
>  	idx = sp->completed++ & 0x1;
>  
> @@ -278,28 +309,7 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
>  	 */
>  	smp_mb(); /* D */
>  
> -	/*
> -	 * SRCU read-side critical sections are normally short, so wait
> -	 * a small amount of time before possibly blocking.
> -	 */
> -	if (!srcu_readers_active_idx_check(sp, idx)) {
> -		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> -		while (!srcu_readers_active_idx_check(sp, idx)) {
> -			if (expedited && ++ trycount < 10)
> -				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> -			else
> -				schedule_timeout_interruptible(1);
> -		}
> -	}
> -
> -	/*
> -	 * The following smp_mb() E pairs with srcu_read_unlock()'s
> -	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
> -	 * sees srcu_read_unlock()'s counter decrement, then any
> -	 * of the current task's subsequent code will happen after
> -	 * that SRCU read-side critical section.
> -	 */
> -	smp_mb(); /* E */
> +	wait_idx(sp, idx, expedited);
>  }
>  
>  /*
> @@ -307,8 +317,6 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
>   */
>  static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
>  {
> -	int idx = 0;
> -
>  	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
>  			   !lock_is_held(&rcu_bh_lock_map) &&
>  			   !lock_is_held(&rcu_lock_map) &&
> @@ -318,27 +326,42 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
>  	mutex_lock(&sp->mutex);
>  
>  	/*
> -	 * If there were no helpers, then we need to do two flips of
> -	 * the index.  The first flip is required if there are any
> -	 * outstanding SRCU readers even if there are no new readers
> -	 * running concurrently with the first counter flip.
> -	 *
> -	 * The second flip is required when a new reader picks up
> +	 * When in the previous grace perioad, if a reader picks up
>  	 * the old value of the index, but does not increment its
>  	 * counter until after its counters is summed/rechecked by
> -	 * srcu_readers_active_idx_check().  In this case, the current SRCU
> +	 * srcu_readers_active_idx_check(). In this case, the previous SRCU
>  	 * grace period would be OK because the SRCU read-side critical
> -	 * section started after this SRCU grace period started, so the
> +	 * section started after the SRCU grace period started, so the
>  	 * grace period is not required to wait for the reader.
>  	 *
> -	 * However, the next SRCU grace period would be waiting for the
> -	 * other set of counters to go to zero, and therefore would not
> -	 * wait for the reader, which would be very bad.  To avoid this
> -	 * bad scenario, we flip and wait twice, clearing out both sets
> -	 * of counters.
> +	 * However, such leftover readers affect this new SRCU grace period.
> +	 * So we have to wait for such readers. This wait_idx() should be
> +	 * considerred as the wait_idx() in the flip_idx_and_wait() of
> +	 * the previous grace perioad except that it is for leftover readers
> +	 * started before this synchronize_srcu(). So when it returns,
> +	 * there is no leftover readers that starts before this grace period.
> +	 *
> +	 * If there are some leftover readers that do not increment its
> +	 * counter until after its counters is summed/rechecked by
> +	 * srcu_readers_active_idx_check(), In this case, this SRCU
> +	 * grace period would be OK as above comments says. We defines
> +	 * such readers as leftover-leftover readers, we consider these
> +	 * readers fteched index of (sp->completed + 1), it means they
> +	 * are considerred as exactly the same as the readers after this
> +	 * grace period.
> +	 *
> +	 * wait_idx() is expected very fast, because leftover readers
> +	 * are unlikely produced.
>  	 */
> -	for (; idx < 2; idx++)
> -		flip_idx_and_wait(sp, expedited);
> +	wait_idx(sp, (sp->completed - 1) & 0x1, expedited);
> +
> +	/*
> +	 * Starts a new grace period, this flip is required if there are
> +	 * any outstanding SRCU readers even if there are no new readers
> +	 * running concurrently with the counter flip.
> +	 */
> +	flip_idx_and_wait(sp, expedited);
> +
>  	mutex_unlock(&sp->mutex);
>  }
>  


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period
  2012-02-24  8:06               ` Lai Jiangshan
@ 2012-02-24 20:01                 ` Paul E. McKenney
  2012-02-27  8:01                   ` [PATCH 1/2 RFC] srcu: change the comments of the wait algorithm Lai Jiangshan
  2012-02-27  8:01                   ` [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm Lai Jiangshan
  0 siblings, 2 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-24 20:01 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Fri, Feb 24, 2012 at 04:06:01PM +0800, Lai Jiangshan wrote:
> On 02/22/2012 05:29 PM, Lai Jiangshan wrote:
> >>From 4ddf62aaf2c4ebe6b9d4a1c596e8b43a678f1f0d Mon Sep 17 00:00:00 2001
> > From: Lai Jiangshan <laijs@cn.fujitsu.com>
> > Date: Wed, 22 Feb 2012 14:12:02 +0800
> > Subject: [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period
> > 
> > flip_idx_and_wait() is not changed, and is split as two functions
> > and only a short comments is added for smp_mb() E.
> > 
> > __synchronize_srcu() use a different algorithm for "leak" readers.
> > detail is in the comments of the patch.
> > 
> > Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > ---
> >  kernel/srcu.c |  105 ++++++++++++++++++++++++++++++++++----------------------
> >  1 files changed, 64 insertions(+), 41 deletions(-)
> > 
> > diff --git a/kernel/srcu.c b/kernel/srcu.c
> > index a51ac48..346f9d7 100644
> > --- a/kernel/srcu.c
> > +++ b/kernel/srcu.c
> > @@ -249,6 +249,37 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> >   */
> >  #define SYNCHRONIZE_SRCU_READER_DELAY 5
> >  
> > +static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> > +{
> > +	int trycount = 0;
> 
> Hi, Paul
> 
> 	smp_mb() D also needs to be moved here, could you fix it before push it.
> I thought it(smp_mb()) always here in my mind, wrong assumption.

Good catch -- I should have seen this myself.  I committed this and pushed
it to:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/srcu

> Sorry.

Not a problem, though it does point out the need for review and testing.
So I am feeling a bit nervous about pushing this into 3.4, and am
beginning to think that it needs mechanical proof as well as more testing.

Thoughts?

							Thanx, Paul

> Thanks,
> Lai
> 
> > +
> > +	/*
> > +	 * SRCU read-side critical sections are normally short, so wait
> > +	 * a small amount of time before possibly blocking.
> > +	 */
> > +	if (!srcu_readers_active_idx_check(sp, idx)) {
> > +		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> > +		while (!srcu_readers_active_idx_check(sp, idx)) {
> > +			if (expedited && ++ trycount < 10)
> > +				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> > +			else
> > +				schedule_timeout_interruptible(1);
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * The following smp_mb() E pairs with srcu_read_unlock()'s
> > +	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
> > +	 * sees srcu_read_unlock()'s counter decrement, then any
> > +	 * of the current task's subsequent code will happen after
> > +	 * that SRCU read-side critical section.
> > +	 *
> > +	 * It also ensures the order between the above waiting and
> > +	 * the next flipping.
> > +	 */
> > +	smp_mb(); /* E */
> > +}
> > +
> >  /*
> >   * Flip the readers' index by incrementing ->completed, then wait
> >   * until there are no more readers using the counters referenced by
> > @@ -258,12 +289,12 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> >   * Of course, it is possible that a reader might be delayed for the
> >   * full duration of flip_idx_and_wait() between fetching the
> >   * index and incrementing its counter.  This possibility is handled
> > - * by __synchronize_srcu() invoking flip_idx_and_wait() twice.
> > + * by the next __synchronize_srcu() invoking wait_idx() for such readers
> > + * before start a new grace perioad.
> >   */
> >  static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
> >  {
> >  	int idx;
> > -	int trycount = 0;
> >  
> >  	idx = sp->completed++ & 0x1;
> >  
> > @@ -278,28 +309,7 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
> >  	 */
> >  	smp_mb(); /* D */
> >  
> > -	/*
> > -	 * SRCU read-side critical sections are normally short, so wait
> > -	 * a small amount of time before possibly blocking.
> > -	 */
> > -	if (!srcu_readers_active_idx_check(sp, idx)) {
> > -		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> > -		while (!srcu_readers_active_idx_check(sp, idx)) {
> > -			if (expedited && ++ trycount < 10)
> > -				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> > -			else
> > -				schedule_timeout_interruptible(1);
> > -		}
> > -	}
> > -
> > -	/*
> > -	 * The following smp_mb() E pairs with srcu_read_unlock()'s
> > -	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
> > -	 * sees srcu_read_unlock()'s counter decrement, then any
> > -	 * of the current task's subsequent code will happen after
> > -	 * that SRCU read-side critical section.
> > -	 */
> > -	smp_mb(); /* E */
> > +	wait_idx(sp, idx, expedited);
> >  }
> >  
> >  /*
> > @@ -307,8 +317,6 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
> >   */
> >  static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
> >  {
> > -	int idx = 0;
> > -
> >  	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
> >  			   !lock_is_held(&rcu_bh_lock_map) &&
> >  			   !lock_is_held(&rcu_lock_map) &&
> > @@ -318,27 +326,42 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
> >  	mutex_lock(&sp->mutex);
> >  
> >  	/*
> > -	 * If there were no helpers, then we need to do two flips of
> > -	 * the index.  The first flip is required if there are any
> > -	 * outstanding SRCU readers even if there are no new readers
> > -	 * running concurrently with the first counter flip.
> > -	 *
> > -	 * The second flip is required when a new reader picks up
> > +	 * When in the previous grace perioad, if a reader picks up
> >  	 * the old value of the index, but does not increment its
> >  	 * counter until after its counters is summed/rechecked by
> > -	 * srcu_readers_active_idx_check().  In this case, the current SRCU
> > +	 * srcu_readers_active_idx_check(). In this case, the previous SRCU
> >  	 * grace period would be OK because the SRCU read-side critical
> > -	 * section started after this SRCU grace period started, so the
> > +	 * section started after the SRCU grace period started, so the
> >  	 * grace period is not required to wait for the reader.
> >  	 *
> > -	 * However, the next SRCU grace period would be waiting for the
> > -	 * other set of counters to go to zero, and therefore would not
> > -	 * wait for the reader, which would be very bad.  To avoid this
> > -	 * bad scenario, we flip and wait twice, clearing out both sets
> > -	 * of counters.
> > +	 * However, such leftover readers affect this new SRCU grace period.
> > +	 * So we have to wait for such readers. This wait_idx() should be
> > +	 * considerred as the wait_idx() in the flip_idx_and_wait() of
> > +	 * the previous grace perioad except that it is for leftover readers
> > +	 * started before this synchronize_srcu(). So when it returns,
> > +	 * there is no leftover readers that starts before this grace period.
> > +	 *
> > +	 * If there are some leftover readers that do not increment its
> > +	 * counter until after its counters is summed/rechecked by
> > +	 * srcu_readers_active_idx_check(), In this case, this SRCU
> > +	 * grace period would be OK as above comments says. We defines
> > +	 * such readers as leftover-leftover readers, we consider these
> > +	 * readers fteched index of (sp->completed + 1), it means they
> > +	 * are considerred as exactly the same as the readers after this
> > +	 * grace period.
> > +	 *
> > +	 * wait_idx() is expected very fast, because leftover readers
> > +	 * are unlikely produced.
> >  	 */
> > -	for (; idx < 2; idx++)
> > -		flip_idx_and_wait(sp, expedited);
> > +	wait_idx(sp, (sp->completed - 1) & 0x1, expedited);
> > +
> > +	/*
> > +	 * Starts a new grace period, this flip is required if there are
> > +	 * any outstanding SRCU readers even if there are no new readers
> > +	 * running concurrently with the counter flip.
> > +	 */
> > +	flip_idx_and_wait(sp, expedited);
> > +
> >  	mutex_unlock(&sp->mutex);
> >  }
> >  
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 1/2 RFC] srcu: change the comments of the wait algorithm
  2012-02-24 20:01                 ` Paul E. McKenney
@ 2012-02-27  8:01                   ` Lai Jiangshan
  2012-02-27  8:01                   ` [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm Lai Jiangshan
  1 sibling, 0 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-02-27  8:01 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

Hi, ALL

The call_srcu() series will be sent soon (maybe within two days).  I found some
things that are not good in the current SRCU while implementing it, so I am
doing some more preparation for it.

The second patch is inspired by Peter.  I had decided to use a per-CPU state
machine, but the snap array made me unhappy: if one machine is sleeping or
preempted while checking, another machine cannot check the same srcu_struct.
That is nothing big by itself, but it also blocks synchronize_srcu_expedited().
I would like synchronize_srcu_expedited() not to be blocked when it tries to do
its fast checking, so I looked for a non-blocking checking algorithm and found
Peter's.

Most of what is in these two patches is comments, so I bring a lot of
trouble to Paul because of my poor English.

Thanks,
Lai

>From 77af819872ddab065d3a46758471b80f31b30e5e Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <laijs@cn.fujitsu.com>
Date: Mon, 27 Feb 2012 10:52:00 +0800
Subject: [PATCH 1/2] srcu: change the comments of the wait algorithm

The original comments do not describe the essentials of the wait algorithm
well.

The safety of srcu-protected data and SRCU critical sections is provided by
wait_idx(), not by the flipping.

The two indexes of the active counter array and the flipping are just used to
keep wait_idx() from starvation.
(The flip also provides "at most one srcu_read_lock() after the flip
for every CPU"; this coupling will be removed later (next patch).)

The code will be split into pieces between the machine states for call_srcu().
It is very hard to provide exactly the semantics of the original comments,
so I had to think through exactly what the algorithm is, and I changed the
comments accordingly.

The code is not changed, but it is refactored a little.
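
As a compact summary of the description above, the update side reduces to
roughly the following (a sketch of the refactored flow; it matches the diff
below except that lockdep details are omitted):

	static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
	{
		int busy_idx;

		mutex_lock(&sp->mutex);
		busy_idx = sp->completed & 0x1;

		/* Safety comes from waiting out readers on both indexes... */
		wait_idx(sp, 1 - busy_idx, expedited);	/* leftover readers, normally none */

		/* ...and the flip only keeps the second wait from starving. */
		srcu_flip(sp);
		wait_idx(sp, busy_idx, expedited);	/* readers started before the flip */

		mutex_unlock(&sp->mutex);
	}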

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 kernel/srcu.c |   75 +++++++++++++++++++++++++++++----------------------------
 1 files changed, 38 insertions(+), 37 deletions(-)

diff --git a/kernel/srcu.c b/kernel/srcu.c
index b6b9ea2..47ee35d 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -249,6 +249,10 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
  */
 #define SYNCHRONIZE_SRCU_READER_DELAY 5
 
+/*
+ * Wait until all the readers (which started before this wait_idx()
+ * with the specified idx) have completed.
+ */
 static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
 {
 	int trycount = 0;
@@ -291,24 +295,9 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
 	smp_mb(); /* E */
 }
 
-/*
- * Flip the readers' index by incrementing ->completed, then wait
- * until there are no more readers using the counters referenced by
- * the old index value.  (Recall that the index is the bottom bit
- * of ->completed.)
- *
- * Of course, it is possible that a reader might be delayed for the
- * full duration of flip_idx_and_wait() between fetching the
- * index and incrementing its counter.  This possibility is handled
- * by the next __synchronize_srcu() invoking wait_idx() for such readers
- * before starting a new grace period.
- */
-static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
+static void srcu_flip(struct srcu_struct *sp)
 {
-	int idx;
-
-	idx = sp->completed++ & 0x1;
-	wait_idx(sp, idx, expedited);
+	sp->completed++;
 }
 
 /*
@@ -316,6 +305,8 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
  */
 static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
 {
+	int busy_idx;
+
 	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
 			   !lock_is_held(&rcu_bh_lock_map) &&
 			   !lock_is_held(&rcu_lock_map) &&
@@ -323,8 +314,31 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
 			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
 
 	mutex_lock(&sp->mutex);
+	busy_idx = sp->completed & 0X1UL;
 
 	/*
+	 * Some readers start with idx=0, and some others start
+	 * with idx=1.  So two wait_idx()s are enough to synchronize:
+	 * __synchronize_srcu() {
+	 * 	wait_idx(sp, 0, expedited);
+	 * 	wait_idx(sp, 1, expedited);
+	 * }
+	 * When it returns, all readers that had started have completed.
+	 *
+	 * But the synchronizer may be starved by the readers: for example,
+	 * if sp->completed & 0x1 == 1, wait_idx(sp, 1, expedited)
+	 * may never return if there is a continuous stream of readers
+	 * starting with idx=1.
+	 *
+	 * So we need to flip the busy index to keep the synchronizer
+	 * from starvation.
+	 */
+
+	/*
+	 * The comments above assume that there are readers using both
+	 * indexes.  That can indeed happen: some readers may
+	 * hold the read lock with idx = 1 - busy_idx:
+	 *
 	 * Suppose that during the previous grace period, a reader
 	 * picked up the old value of the index, but did not increment
 	 * its counter until after the previous instance of
@@ -333,31 +347,18 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
 	 * not start until after the grace period started, so the grace
 	 * period was not obligated to wait for that reader.
 	 *
-	 * However, the current SRCU grace period does have to wait for
-	 * that reader.  This is handled by invoking wait_idx() on the
-	 * non-active set of counters (hence sp->completed - 1).  Once
-	 * wait_idx() returns, we know that all readers that picked up
-	 * the old value of ->completed and that already incremented their
-	 * counter will have completed.
-	 *
-	 * But what about readers that picked up the old value of
-	 * ->completed, but -still- have not managed to increment their
-	 * counter?  We do not need to wait for those readers, because
-	 * they will have started their SRCU read-side critical section
-	 * after the current grace period starts.
-	 *
-	 * Because it is unlikely that readers will be preempted between
-	 * fetching ->completed and incrementing their counter, wait_idx()
+	 * Because this probability is not high, wait_idx()
 	 * will normally not need to wait.
 	 */
-	wait_idx(sp, (sp->completed - 1) & 0x1, expedited);
+	wait_idx(sp, 1 - busy_idx, expedited);
+
+	/* Flip the index to ensure that the next wait_idx() can return. */
+	srcu_flip(sp);
 
 	/*
-	 * Now that wait_idx() has waited for the really old readers,
-	 * invoke flip_idx_and_wait() to flip the counter and wait
-	 * for current SRCU readers.
+	 * Now that the really old readers are done, wait for those using the pre-flip index.
 	 */
-	flip_idx_and_wait(sp, expedited);
+	wait_idx(sp, busy_idx, expedited);
 
 	mutex_unlock(&sp->mutex);
 }
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm
  2012-02-24 20:01                 ` Paul E. McKenney
  2012-02-27  8:01                   ` [PATCH 1/2 RFC] srcu: change the comments of the wait algorithm Lai Jiangshan
@ 2012-02-27  8:01                   ` Lai Jiangshan
  2012-02-27 18:30                     ` Paul E. McKenney
  1 sibling, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-02-27  8:01 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

>From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <laijs@cn.fujitsu.com>
Date: Mon, 27 Feb 2012 14:22:47 +0800
Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm

This patch implement the algorithm as Peter's:
https://lkml.org/lkml/2012/2/1/119

o	Make the checking lock-free so that we can perform the checking in
	parallel (sketched below).  Parallel checking by itself rarely makes
	sense, but we need it when 1) the original checking task is preempted
	for a long time, 2) synchronize_srcu_expedited() is used, 3) we want
	to avoid a lock (see the next point).

o	Since it is lock-free, we save a mutex in the state machine for
	call_srcu().

o	Remove SRCU_REF_MASK and remove the coupling with the flipping.
	(So we can remove the preempt_disable() in the future and use
	 __this_cpu_inc() instead.)

o	Reduce the number of smp_mb()s by one, simplify the comments, and make
	the smp_mb() pairs more intuitive.
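
The heart of the checking scheme, as a user-space sketch (simplified from the
patch below; the two variables stand in for the summed per-CPU ->c[idx] and
->seq[idx] counters, and the smp_mb() letters refer to the barriers in the
patch):

	/* Illustration only: a compilable model, not kernel code. */
	static unsigned long active;	/* sum of per-CPU ->c[idx]   */
	static unsigned long seq;	/* sum of per-CPU ->seq[idx] */

	static void read_lock_model(void)	/* __srcu_read_lock() */
	{
		active += 1;
		/* smp_mb() B goes here in the kernel. */
		seq += 1;
	}

	static void read_unlock_model(void)	/* __srcu_read_unlock() */
	{
		/* smp_mb() C goes here in the kernel. */
		active -= 1;
	}

	static int readers_gone_model(void)	/* srcu_readers_active_idx_check() */
	{
		unsigned long s = seq;		/* srcu_readers_seq_idx() */

		/* smp_mb() A goes here in the kernel. */
		if (active != 0)		/* srcu_readers_active_idx() */
			return 0;
		/* smp_mb() D goes here in the kernel. */
		return seq == s;		/* no lock slipped past the scan */
	}
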

Inspired-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 include/linux/srcu.h |    7 +--
 kernel/srcu.c        |  137 ++++++++++++++++++++-----------------------------
 2 files changed, 57 insertions(+), 87 deletions(-)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 5b49d41..15354db 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -32,18 +32,13 @@
 
 struct srcu_struct_array {
 	unsigned long c[2];
+	unsigned long seq[2];
 };
 
-/* Bit definitions for field ->c above and ->snap below. */
-#define SRCU_USAGE_BITS		1
-#define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
-#define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
-
 struct srcu_struct {
 	unsigned completed;
 	struct srcu_struct_array __percpu *per_cpu_ref;
 	struct mutex mutex;
-	unsigned long snap[NR_CPUS];
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map dep_map;
 #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
diff --git a/kernel/srcu.c b/kernel/srcu.c
index 47ee35d..376b583 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
 #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 
 /*
+ * Returns approximate total sequence of readers on the specified rank
+ * of per-CPU counters.
+ */
+static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
+{
+	int cpu;
+	unsigned long sum = 0;
+	unsigned long t;
+
+	for_each_possible_cpu(cpu) {
+		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
+		sum += t;
+	}
+	return sum;
+}
+
+/*
  * Returns approximate number of readers active on the specified rank
- * of per-CPU counters.  Also snapshots each counter's value in the
- * corresponding element of sp->snap[] for later use validating
- * the sum.
+ * of per-CPU counters.
  */
 static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
 {
@@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
 	for_each_possible_cpu(cpu) {
 		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
 		sum += t;
-		sp->snap[cpu] = t;
 	}
-	return sum & SRCU_REF_MASK;
+	return sum;
 }
 
-/*
- * To be called from the update side after an index flip.  Returns true
- * if the modulo sum of the counters is stably zero, false if there is
- * some possibility of non-zero.
- */
 static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
 {
 	int cpu;
+	unsigned long seq;
+
+	seq = srcu_readers_seq_idx(sp, idx);
+
+	/*
+	 * smp_mb() A pairs with smp_mb() B for critical section.
+	 * It ensures that the SRCU read-side critical section whose
+	 * read-lock is not seen by the following srcu_readers_active_idx()
+	 * will see any updates that the current task performed before it.
+	 * (So we do not need to worry about those readers this time.)
+	 *
+	 * Also, if we see the increment of the seq, we must see the
+	 * increment of the active counter in the following
+	 * srcu_readers_active_idx().
+	 */
+	smp_mb(); /* A */
 
 	/*
 	 * Note that srcu_readers_active_idx() can incorrectly return
 	 * zero even though there is a pre-existing reader throughout.
 	 * To see this, suppose that task A is in a very long SRCU
 	 * read-side critical section that started on CPU 0, and that
-	 * no other reader exists, so that the modulo sum of the counters
+	 * no other reader exists, so that the sum of the counters
 	 * is equal to one.  Then suppose that task B starts executing
 	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
 	 * task C starts reading on CPU 0, so that its increment is not
@@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
 		return false;
 
 	/*
-	 * Since the caller recently flipped ->completed, we can see at
-	 * most one increment of each CPU's counter from this point
-	 * forward.  The reason for this is that the reader CPU must have
-	 * fetched the index before srcu_readers_active_idx checked
-	 * that CPU's counter, but not yet incremented its counter.
-	 * Its eventual counter increment will follow the read in
-	 * srcu_readers_active_idx(), and that increment is immediately
-	 * followed by smp_mb() B.  Because smp_mb() D is between
-	 * the ->completed flip and srcu_readers_active_idx()'s read,
-	 * that CPU's subsequent load of ->completed must see the new
-	 * value, and therefore increment the counter in the other rank.
-	 */
-	smp_mb(); /* A */
-
-	/*
-	 * Now, we check the ->snap array that srcu_readers_active_idx()
-	 * filled in from the per-CPU counter values. Since
-	 * __srcu_read_lock() increments the upper bits of the per-CPU
-	 * counter, an increment/decrement pair will change the value
-	 * of the counter.  Since there is only one possible increment,
-	 * the only way to wrap the counter is to have a huge number of
-	 * counter decrements, which requires a huge number of tasks and
-	 * huge SRCU read-side critical-section nesting levels, even on
-	 * 32-bit systems.
-	 *
-	 * All of the ways of confusing the readings require that the scan
-	 * in srcu_readers_active_idx() see the read-side task's decrement,
-	 * but not its increment.  However, between that decrement and
-	 * increment are smb_mb() B and C.  Either or both of these pair
-	 * with smp_mb() A above to ensure that the scan below will see
-	 * the read-side tasks's increment, thus noting a difference in
-	 * the counter values between the two passes.
+	 * Validation step: smp_mb() D pairs with smp_mb() C.  If the above
+	 * srcu_readers_active_idx() sees a decrement of the active counter
+	 * made by srcu_read_unlock(), then for the corresponding
+	 * srcu_read_lock() it either:
+	 * 	saw the increment of the active counter, or
+	 * 	failed to see the increment of the active counter.
+	 * The second case can cause srcu_readers_active_idx() to incorrectly
+	 * return zero, but it also means that the above srcu_readers_seq_idx()
+	 * did not see the increment of the seq (see the comments for smp_mb()
+	 * A), whereas the following srcu_readers_seq_idx() does see it, so
+	 * the two seq sums differ and the check below fails.
 	 *
-	 * Therefore, if srcu_readers_active_idx() returned zero, and
-	 * none of the counters changed, we know that the zero was the
-	 * correct sum.
-	 *
-	 * Of course, it is possible that a task might be delayed
-	 * for a very long time in __srcu_read_lock() after fetching
-	 * the index but before incrementing its counter.  This
-	 * possibility will be dealt with in __synchronize_srcu().
+	 * This smp_mb() D also pairs with smp_mb() C for the critical section,
+	 * so that any of the current task's subsequent code will happen after
+	 * any SRCU read-side critical section whose read-unlock is seen by
+	 * srcu_readers_active_idx().
 	 */
-	for_each_possible_cpu(cpu)
-		if (sp->snap[cpu] !=
-		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
-			return false;  /* False zero reading! */
-	return true;
+	smp_mb(); /* D */
+
+	return srcu_readers_seq_idx(sp, idx) == seq;
 }
 
 /**
@@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
 	preempt_disable();
 	idx = rcu_dereference_index_check(sp->completed,
 					  rcu_read_lock_sched_held()) & 0x1;
-	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
-		SRCU_USAGE_COUNT + 1;
+	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
 	smp_mb(); /* B */  /* Avoid leaking the critical section. */
+	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
 	preempt_enable();
 	return idx;
 }
@@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
 	int trycount = 0;
 
 	/*
-	 * If a reader fetches the index before the ->completed increment,
-	 * but increments its counter after srcu_readers_active_idx_check()
-	 * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
-	 * smp_mb() B to ensure that the SRCU read-side critical section
-	 * will see any updates that the current task performed before its
-	 * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
-	 * as the case may be.
-	 */
-	smp_mb(); /* D */
-
-	/*
 	 * SRCU read-side critical sections are normally short, so wait
 	 * a small amount of time before possibly blocking.
 	 */
@@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
 				schedule_timeout_interruptible(1);
 		}
 	}
-
-	/*
-	 * The following smp_mb() E pairs with srcu_read_unlock()'s
-	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
-	 * sees srcu_read_unlock()'s counter decrement, then any
-	 * of the current task's subsequent code will happen after
-	 * that SRCU read-side critical section.
-	 *
-	 * It also ensures the order between the above waiting and
-	 * the next flipping.
-	 */
-	smp_mb(); /* E */
 }
 
 static void srcu_flip(struct srcu_struct *sp)
-- 
1.7.4.4
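
For illustration, here is a minimal user-space sketch of the checking
scheme above, with C11 atomics and fences standing in for
ACCESS_ONCE()/smp_mb() and a plain shared array standing in for the
per-CPU data (the kernel relies on preempt_disable() and this_cpu_ptr()
instead); the model_* names and MODEL_NR_CPUS are made up for the
sketch and are not part of the patch:

#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4			/* illustrative stand-in for NR_CPUS */

struct model_percpu {
	atomic_ulong c[2];		/* active-reader counts, one per rank */
	atomic_ulong seq[2];		/* lock-sequence counts, one per rank */
};

static struct model_percpu model_cpu[MODEL_NR_CPUS];
static atomic_uint model_completed;	/* low bit selects the current rank */

static unsigned long model_sum_seq(int idx)
{
	unsigned long sum = 0;
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += atomic_load_explicit(&model_cpu[cpu].seq[idx],
					    memory_order_relaxed);
	return sum;
}

static unsigned long model_sum_active(int idx)
{
	unsigned long sum = 0;
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += atomic_load_explicit(&model_cpu[cpu].c[idx],
					    memory_order_relaxed);
	return sum;
}

/* Update side: sample seq, check that the active sum is zero, resample seq. */
static bool model_readers_gone(int idx)
{
	unsigned long seq = model_sum_seq(idx);

	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() A */
	if (model_sum_active(idx) != 0)
		return false;
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() D */
	/* A reader whose lock was missed above must have bumped ->seq[]. */
	return model_sum_seq(idx) == seq;
}

/* Read side, as run by a reader on "cpu". */
static int model_read_lock(int cpu)
{
	int idx = atomic_load_explicit(&model_completed,
				       memory_order_relaxed) & 0x1;

	atomic_fetch_add_explicit(&model_cpu[cpu].c[idx], 1,
				  memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() B */
	atomic_fetch_add_explicit(&model_cpu[cpu].seq[idx], 1,
				  memory_order_relaxed);
	return idx;
}

static void model_read_unlock(int cpu, int idx)
{
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() C */
	atomic_fetch_sub_explicit(&model_cpu[cpu].c[idx], 1,
				  memory_order_relaxed);
}

A reader that the active sum misses either started after the check (so
the grace period need not wait for it) or has already bumped its
->seq[] counter, in which case the final comparison fails and the
update side simply retries.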



* Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm
  2012-02-27  8:01                   ` [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm Lai Jiangshan
@ 2012-02-27 18:30                     ` Paul E. McKenney
  2012-02-28  1:51                       ` Lai Jiangshan
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-27 18:30 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
> From: Lai Jiangshan <laijs@cn.fujitsu.com>
> Date: Mon, 27 Feb 2012 14:22:47 +0800
> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
> 
> This patch implement the algorithm as Peter's:
> https://lkml.org/lkml/2012/2/1/119
> 
> o	Make the checking lock-free and we can perform parallel checking,
> 	Although almost parallel checking makes no sense, but we need it
> 	when 1) the original checking task is preempted for long, 2)
> 	sychronize_srcu_expedited(), 3) avoid lock(see next)
> 
> o	Since it is lock-free, we save a mutex in state machine for
> 	call_srcu().
> 
> o	Remove the SRCU_REF_MASK and remove the coupling with the flipping.
> 	(so we can remove the preempt_disable() in future, but use
> 	 __this_cpu_inc() instead.)
> 
> o	reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
> 	more intuitive.

Hello, Lai,

Interesting approach!

What happens given the following sequence of events?

o	CPU 0 in srcu_readers_active_idx_check() invokes
	srcu_readers_seq_idx(), getting some number back.

o	CPU 0 invokes srcu_readers_active_idx(), summing the
	->c[] array up through CPU 3.

o	CPU 1 invokes __srcu_read_lock(), and increments its counter
	but not yet its ->seq[] element.

o	CPU 0 completes its summing of the ->c[] array, incorrectly
	obtaining zero.

o	CPU 0 invokes srcu_readers_seq_idx(), getting the same
	number back that it got last time.

o	In parallel with the previous step, CPU 1 executes out of order
	(as permitted by the lack of a second memory barrier in
	__srcu_read_lock()), starting up the critical section before
	incrementing its ->seq[] element.

o	Because CPU 0 is not aware that CPU 1 is an SRCU reader, it
	completes the SRCU grace period before CPU 1 completes its
	SRCU read-side critical section.

This actually might be safe, but I need to think more about it.  In the
meantime, I figured I should ask your thoughts.
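
To make that window concrete, here is the read-lock path from the hunk
above with two annotations added; only the /* <-- ... */ comments are
new, everything else is as in the patch:

	preempt_disable();
	idx = rcu_dereference_index_check(sp->completed,
					  rcu_read_lock_sched_held()) & 0x1;
	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
	/* <-- an update-side sum of ->c[] taken before this point misses us */
	smp_mb(); /* B */  /* Avoid leaking the critical section. */
	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
	/* <-- no barrier after this increment, so reads in the critical
	 *     section may be satisfied before ->seq[idx] is visibly
	 *     incremented, which is the reordering in the last two steps */
	preempt_enable();
	return idx;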

							Thanx, Paul

> Inspired-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> ---
>  include/linux/srcu.h |    7 +--
>  kernel/srcu.c        |  137 ++++++++++++++++++++-----------------------------
>  2 files changed, 57 insertions(+), 87 deletions(-)
> 
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index 5b49d41..15354db 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -32,18 +32,13 @@
> 
>  struct srcu_struct_array {
>  	unsigned long c[2];
> +	unsigned long seq[2];
>  };
> 
> -/* Bit definitions for field ->c above and ->snap below. */
> -#define SRCU_USAGE_BITS		1
> -#define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
> -#define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
> -
>  struct srcu_struct {
>  	unsigned completed;
>  	struct srcu_struct_array __percpu *per_cpu_ref;
>  	struct mutex mutex;
> -	unsigned long snap[NR_CPUS];
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>  	struct lockdep_map dep_map;
>  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> diff --git a/kernel/srcu.c b/kernel/srcu.c
> index 47ee35d..376b583 100644
> --- a/kernel/srcu.c
> +++ b/kernel/srcu.c
> @@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
>  #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> 
>  /*
> + * Returns approximate total sequence of readers on the specified rank
> + * of per-CPU counters.
> + */
> +static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
> +{
> +	int cpu;
> +	unsigned long sum = 0;
> +	unsigned long t;
> +
> +	for_each_possible_cpu(cpu) {
> +		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
> +		sum += t;
> +	}
> +	return sum;
> +}
> +
> +/*
>   * Returns approximate number of readers active on the specified rank
> - * of per-CPU counters.  Also snapshots each counter's value in the
> - * corresponding element of sp->snap[] for later use validating
> - * the sum.
> + * of per-CPU counters.
>   */
>  static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>  {
> @@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>  	for_each_possible_cpu(cpu) {
>  		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
>  		sum += t;
> -		sp->snap[cpu] = t;
>  	}
> -	return sum & SRCU_REF_MASK;
> +	return sum;
>  }
> 
> -/*
> - * To be called from the update side after an index flip.  Returns true
> - * if the modulo sum of the counters is stably zero, false if there is
> - * some possibility of non-zero.
> - */
>  static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>  {
>  	int cpu;
> +	unsigned long seq;
> +
> +	seq = srcu_readers_seq_idx(sp, idx);
> +
> +	/*
> +	 * smp_mb() A pairs with smp_mb() B for critical section.
> +	 * It ensures that the SRCU read-side critical section whose
> +	 * read-lock is not seen by the following srcu_readers_active_idx()
> +	 * will see any updates that before the current task performed before.
> +	 * (So we don't need to care these readers this time)
> +	 *
> +	 * Also, if we see the increment of the seq, we must see the
> +	 * increment of the active counter in the following
> +	 * srcu_readers_active_idx().
> +	 */
> +	smp_mb(); /* A */
> 
>  	/*
>  	 * Note that srcu_readers_active_idx() can incorrectly return
>  	 * zero even though there is a pre-existing reader throughout.
>  	 * To see this, suppose that task A is in a very long SRCU
>  	 * read-side critical section that started on CPU 0, and that
> -	 * no other reader exists, so that the modulo sum of the counters
> +	 * no other reader exists, so that the sum of the counters
>  	 * is equal to one.  Then suppose that task B starts executing
>  	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
>  	 * task C starts reading on CPU 0, so that its increment is not
> @@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>  		return false;
> 
>  	/*
> -	 * Since the caller recently flipped ->completed, we can see at
> -	 * most one increment of each CPU's counter from this point
> -	 * forward.  The reason for this is that the reader CPU must have
> -	 * fetched the index before srcu_readers_active_idx checked
> -	 * that CPU's counter, but not yet incremented its counter.
> -	 * Its eventual counter increment will follow the read in
> -	 * srcu_readers_active_idx(), and that increment is immediately
> -	 * followed by smp_mb() B.  Because smp_mb() D is between
> -	 * the ->completed flip and srcu_readers_active_idx()'s read,
> -	 * that CPU's subsequent load of ->completed must see the new
> -	 * value, and therefore increment the counter in the other rank.
> -	 */
> -	smp_mb(); /* A */
> -
> -	/*
> -	 * Now, we check the ->snap array that srcu_readers_active_idx()
> -	 * filled in from the per-CPU counter values. Since
> -	 * __srcu_read_lock() increments the upper bits of the per-CPU
> -	 * counter, an increment/decrement pair will change the value
> -	 * of the counter.  Since there is only one possible increment,
> -	 * the only way to wrap the counter is to have a huge number of
> -	 * counter decrements, which requires a huge number of tasks and
> -	 * huge SRCU read-side critical-section nesting levels, even on
> -	 * 32-bit systems.
> -	 *
> -	 * All of the ways of confusing the readings require that the scan
> -	 * in srcu_readers_active_idx() see the read-side task's decrement,
> -	 * but not its increment.  However, between that decrement and
> -	 * increment are smb_mb() B and C.  Either or both of these pair
> -	 * with smp_mb() A above to ensure that the scan below will see
> -	 * the read-side tasks's increment, thus noting a difference in
> -	 * the counter values between the two passes.
> +	 * Validation step, smp_mb() D pairs with smp_mb() C. If the above
> +	 * srcu_readers_active_idx() see a decrement of the active counter
> +	 * in srcu_read_unlock(), it should see one of these for corresponding
> +	 * srcu_read_lock():
> +	 * 	See the increment of the active counter,
> +	 * 	Failed to see the increment of the active counter.
> +	 * The second one can cause srcu_readers_active_idx() incorrectly
> +	 * return zero, but it means the above srcu_readers_seq_idx() does not
> +	 * see the increment of the seq(ref: comments of smp_mb() A),
> +	 * and the following srcu_readers_seq_idx() sees the increment of
> +	 * the seq. The seq is changed.
>  	 *
> -	 * Therefore, if srcu_readers_active_idx() returned zero, and
> -	 * none of the counters changed, we know that the zero was the
> -	 * correct sum.
> -	 *
> -	 * Of course, it is possible that a task might be delayed
> -	 * for a very long time in __srcu_read_lock() after fetching
> -	 * the index but before incrementing its counter.  This
> -	 * possibility will be dealt with in __synchronize_srcu().
> +	 * This smp_mb() D pairs with smp_mb() C for critical section.
> +	 * then any of the current task's subsequent code will happen after
> +	 * that SRCU read-side critical section whose read-unlock is seen in
> +	 * srcu_readers_active_idx().
>  	 */
> -	for_each_possible_cpu(cpu)
> -		if (sp->snap[cpu] !=
> -		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> -			return false;  /* False zero reading! */
> -	return true;
> +	smp_mb(); /* D */
> +
> +	return srcu_readers_seq_idx(sp, idx) == seq;
>  }
> 
>  /**
> @@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
>  	preempt_disable();
>  	idx = rcu_dereference_index_check(sp->completed,
>  					  rcu_read_lock_sched_held()) & 0x1;
> -	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> -		SRCU_USAGE_COUNT + 1;
> +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
>  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
> +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
>  	preempt_enable();
>  	return idx;
>  }
> @@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>  	int trycount = 0;
> 
>  	/*
> -	 * If a reader fetches the index before the ->completed increment,
> -	 * but increments its counter after srcu_readers_active_idx_check()
> -	 * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
> -	 * smp_mb() B to ensure that the SRCU read-side critical section
> -	 * will see any updates that the current task performed before its
> -	 * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
> -	 * as the case may be.
> -	 */
> -	smp_mb(); /* D */
> -
> -	/*
>  	 * SRCU read-side critical sections are normally short, so wait
>  	 * a small amount of time before possibly blocking.
>  	 */
> @@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>  				schedule_timeout_interruptible(1);
>  		}
>  	}
> -
> -	/*
> -	 * The following smp_mb() E pairs with srcu_read_unlock()'s
> -	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
> -	 * sees srcu_read_unlock()'s counter decrement, then any
> -	 * of the current task's subsequent code will happen after
> -	 * that SRCU read-side critical section.
> -	 *
> -	 * It also ensures the order between the above waiting and
> -	 * the next flipping.
> -	 */
> -	smp_mb(); /* E */
>  }
> 
>  static void srcu_flip(struct srcu_struct *sp)
> -- 
> 1.7.4.4
> 



* Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm
  2012-02-27 18:30                     ` Paul E. McKenney
@ 2012-02-28  1:51                       ` Lai Jiangshan
  2012-02-28 13:47                         ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-02-28  1:51 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
> On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
>> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
>> From: Lai Jiangshan <laijs@cn.fujitsu.com>
>> Date: Mon, 27 Feb 2012 14:22:47 +0800
>> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
>>
>> This patch implement the algorithm as Peter's:
>> https://lkml.org/lkml/2012/2/1/119
>>
>> o	Make the checking lock-free and we can perform parallel checking,
>> 	Although almost parallel checking makes no sense, but we need it
>> 	when 1) the original checking task is preempted for long, 2)
>> 	sychronize_srcu_expedited(), 3) avoid lock(see next)
>>
>> o	Since it is lock-free, we save a mutex in state machine for
>> 	call_srcu().
>>
>> o	Remove the SRCU_REF_MASK and remove the coupling with the flipping.
>> 	(so we can remove the preempt_disable() in future, but use
>> 	 __this_cpu_inc() instead.)
>>
>> o	reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
>> 	more intuitive.
> 
> Hello, Lai,
> 
> Interesting approach!
> 
> What happens given the following sequence of events?
> 
> o	CPU 0 in srcu_readers_active_idx_check() invokes
> 	srcu_readers_seq_idx(), getting some number back.
> 
> o	CPU 0 invokes srcu_readers_active_idx(), summing the
> 	->c[] array up through CPU 3.
> 
> o	CPU 1 invokes __srcu_read_lock(), and increments its counter
> 	but not yet its ->seq[] element.


Any __srcu_read_lock() whose increment of the active counter is not seen
by srcu_readers_active_idx() is considered a
"reader-started-after-this-srcu_readers_active_idx_check()",
so we do not need to wait for it.

As you said, that SRCU critical section's increment of seq is not seen
by the above srcu_readers_seq_idx() either.

> 
> o	CPU 0 completes its summing of the ->c[] array, incorrectly
> 	obtaining zero.
> 
> o	CPU 0 invokes srcu_readers_seq_idx(), getting the same
> 	number back that it got last time.

If it incorrectly gets zero, it means that a __srcu_read_unlock() was seen
by srcu_readers_active_idx(), which in turn means that the increment of
seq is seen by this srcu_readers_seq_idx(), so it differs
from the seq value obtained the first time.

The increment of seq is not seen by the earlier srcu_readers_seq_idx()
but is seen by the later one, so the two returned seq values differ.
This is the core of Peter's algorithm, and it is what the comments
describe (sorry for my bad English).  Or maybe I am missing
your point in this mail.

Thanks,
Lai

> 
> o	In parallel with the previous step, CPU 1 executes out of order
> 	(as permitted by the lack of a second memory barrier in
> 	__srcu_read_lock()), starting up the critical section before
> 	incrementing its ->seq[] element.
> 
> o	Because CPU 0 is not aware that CPU 1 is an SRCU reader, it
> 	completes the SRCU grace period before CPU 1 completes its
> 	SRCU read-side critical section.
> 
> This actually might be safe, but I need to think more about it.  In the
> meantime, I figured I should ask your thoughts.
> 
> 							Thanx, Paul
> 
>> Inspired-by: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
>> ---
>>  include/linux/srcu.h |    7 +--
>>  kernel/srcu.c        |  137 ++++++++++++++++++++-----------------------------
>>  2 files changed, 57 insertions(+), 87 deletions(-)
>>
>> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
>> index 5b49d41..15354db 100644
>> --- a/include/linux/srcu.h
>> +++ b/include/linux/srcu.h
>> @@ -32,18 +32,13 @@
>>
>>  struct srcu_struct_array {
>>  	unsigned long c[2];
>> +	unsigned long seq[2];
>>  };
>>
>> -/* Bit definitions for field ->c above and ->snap below. */
>> -#define SRCU_USAGE_BITS		1
>> -#define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
>> -#define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
>> -
>>  struct srcu_struct {
>>  	unsigned completed;
>>  	struct srcu_struct_array __percpu *per_cpu_ref;
>>  	struct mutex mutex;
>> -	unsigned long snap[NR_CPUS];
>>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>  	struct lockdep_map dep_map;
>>  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>> diff --git a/kernel/srcu.c b/kernel/srcu.c
>> index 47ee35d..376b583 100644
>> --- a/kernel/srcu.c
>> +++ b/kernel/srcu.c
>> @@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
>>  #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>>
>>  /*
>> + * Returns approximate total sequence of readers on the specified rank
>> + * of per-CPU counters.
>> + */
>> +static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
>> +{
>> +	int cpu;
>> +	unsigned long sum = 0;
>> +	unsigned long t;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
>> +		sum += t;
>> +	}
>> +	return sum;
>> +}
>> +
>> +/*
>>   * Returns approximate number of readers active on the specified rank
>> - * of per-CPU counters.  Also snapshots each counter's value in the
>> - * corresponding element of sp->snap[] for later use validating
>> - * the sum.
>> + * of per-CPU counters.
>>   */
>>  static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>>  {
>> @@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>>  	for_each_possible_cpu(cpu) {
>>  		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
>>  		sum += t;
>> -		sp->snap[cpu] = t;
>>  	}
>> -	return sum & SRCU_REF_MASK;
>> +	return sum;
>>  }
>>
>> -/*
>> - * To be called from the update side after an index flip.  Returns true
>> - * if the modulo sum of the counters is stably zero, false if there is
>> - * some possibility of non-zero.
>> - */
>>  static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>>  {
>>  	int cpu;
>> +	unsigned long seq;
>> +
>> +	seq = srcu_readers_seq_idx(sp, idx);
>> +
>> +	/*
>> +	 * smp_mb() A pairs with smp_mb() B for critical section.
>> +	 * It ensures that the SRCU read-side critical section whose
>> +	 * read-lock is not seen by the following srcu_readers_active_idx()
>> +	 * will see any updates that before the current task performed before.
>> +	 * (So we don't need to care these readers this time)
>> +	 *
>> +	 * Also, if we see the increment of the seq, we must see the
>> +	 * increment of the active counter in the following
>> +	 * srcu_readers_active_idx().
>> +	 */
>> +	smp_mb(); /* A */
>>
>>  	/*
>>  	 * Note that srcu_readers_active_idx() can incorrectly return
>>  	 * zero even though there is a pre-existing reader throughout.
>>  	 * To see this, suppose that task A is in a very long SRCU
>>  	 * read-side critical section that started on CPU 0, and that
>> -	 * no other reader exists, so that the modulo sum of the counters
>> +	 * no other reader exists, so that the sum of the counters
>>  	 * is equal to one.  Then suppose that task B starts executing
>>  	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
>>  	 * task C starts reading on CPU 0, so that its increment is not
>> @@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>>  		return false;
>>
>>  	/*
>> -	 * Since the caller recently flipped ->completed, we can see at
>> -	 * most one increment of each CPU's counter from this point
>> -	 * forward.  The reason for this is that the reader CPU must have
>> -	 * fetched the index before srcu_readers_active_idx checked
>> -	 * that CPU's counter, but not yet incremented its counter.
>> -	 * Its eventual counter increment will follow the read in
>> -	 * srcu_readers_active_idx(), and that increment is immediately
>> -	 * followed by smp_mb() B.  Because smp_mb() D is between
>> -	 * the ->completed flip and srcu_readers_active_idx()'s read,
>> -	 * that CPU's subsequent load of ->completed must see the new
>> -	 * value, and therefore increment the counter in the other rank.
>> -	 */
>> -	smp_mb(); /* A */
>> -
>> -	/*
>> -	 * Now, we check the ->snap array that srcu_readers_active_idx()
>> -	 * filled in from the per-CPU counter values. Since
>> -	 * __srcu_read_lock() increments the upper bits of the per-CPU
>> -	 * counter, an increment/decrement pair will change the value
>> -	 * of the counter.  Since there is only one possible increment,
>> -	 * the only way to wrap the counter is to have a huge number of
>> -	 * counter decrements, which requires a huge number of tasks and
>> -	 * huge SRCU read-side critical-section nesting levels, even on
>> -	 * 32-bit systems.
>> -	 *
>> -	 * All of the ways of confusing the readings require that the scan
>> -	 * in srcu_readers_active_idx() see the read-side task's decrement,
>> -	 * but not its increment.  However, between that decrement and
>> -	 * increment are smb_mb() B and C.  Either or both of these pair
>> -	 * with smp_mb() A above to ensure that the scan below will see
>> -	 * the read-side tasks's increment, thus noting a difference in
>> -	 * the counter values between the two passes.
>> +	 * Validation step, smp_mb() D pairs with smp_mb() C. If the above
>> +	 * srcu_readers_active_idx() see a decrement of the active counter
>> +	 * in srcu_read_unlock(), it should see one of these for corresponding
>> +	 * srcu_read_lock():
>> +	 * 	See the increment of the active counter,
>> +	 * 	Failed to see the increment of the active counter.
>> +	 * The second one can cause srcu_readers_active_idx() incorrectly
>> +	 * return zero, but it means the above srcu_readers_seq_idx() does not
>> +	 * see the increment of the seq(ref: comments of smp_mb() A),
>> +	 * and the following srcu_readers_seq_idx() sees the increment of
>> +	 * the seq. The seq is changed.
>>  	 *
>> -	 * Therefore, if srcu_readers_active_idx() returned zero, and
>> -	 * none of the counters changed, we know that the zero was the
>> -	 * correct sum.
>> -	 *
>> -	 * Of course, it is possible that a task might be delayed
>> -	 * for a very long time in __srcu_read_lock() after fetching
>> -	 * the index but before incrementing its counter.  This
>> -	 * possibility will be dealt with in __synchronize_srcu().
>> +	 * This smp_mb() D pairs with smp_mb() C for critical section.
>> +	 * then any of the current task's subsequent code will happen after
>> +	 * that SRCU read-side critical section whose read-unlock is seen in
>> +	 * srcu_readers_active_idx().
>>  	 */
>> -	for_each_possible_cpu(cpu)
>> -		if (sp->snap[cpu] !=
>> -		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
>> -			return false;  /* False zero reading! */
>> -	return true;
>> +	smp_mb(); /* D */
>> +
>> +	return srcu_readers_seq_idx(sp, idx) == seq;
>>  }
>>
>>  /**
>> @@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
>>  	preempt_disable();
>>  	idx = rcu_dereference_index_check(sp->completed,
>>  					  rcu_read_lock_sched_held()) & 0x1;
>> -	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
>> -		SRCU_USAGE_COUNT + 1;
>> +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
>>  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
>> +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
>>  	preempt_enable();
>>  	return idx;
>>  }
>> @@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>>  	int trycount = 0;
>>
>>  	/*
>> -	 * If a reader fetches the index before the ->completed increment,
>> -	 * but increments its counter after srcu_readers_active_idx_check()
>> -	 * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
>> -	 * smp_mb() B to ensure that the SRCU read-side critical section
>> -	 * will see any updates that the current task performed before its
>> -	 * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
>> -	 * as the case may be.
>> -	 */
>> -	smp_mb(); /* D */
>> -
>> -	/*
>>  	 * SRCU read-side critical sections are normally short, so wait
>>  	 * a small amount of time before possibly blocking.
>>  	 */
>> @@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>>  				schedule_timeout_interruptible(1);
>>  		}
>>  	}
>> -
>> -	/*
>> -	 * The following smp_mb() E pairs with srcu_read_unlock()'s
>> -	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
>> -	 * sees srcu_read_unlock()'s counter decrement, then any
>> -	 * of the current task's subsequent code will happen after
>> -	 * that SRCU read-side critical section.
>> -	 *
>> -	 * It also ensures the order between the above waiting and
>> -	 * the next flipping.
>> -	 */
>> -	smp_mb(); /* E */
>>  }
>>
>>  static void srcu_flip(struct srcu_struct *sp)
>> -- 
>> 1.7.4.4
>>
> 
> 



* Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm
  2012-02-28  1:51                       ` Lai Jiangshan
@ 2012-02-28 13:47                         ` Paul E. McKenney
  2012-02-29 10:07                           ` Lai Jiangshan
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-28 13:47 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Tue, Feb 28, 2012 at 09:51:22AM +0800, Lai Jiangshan wrote:
> On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
> > On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
> >> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
> >> From: Lai Jiangshan <laijs@cn.fujitsu.com>
> >> Date: Mon, 27 Feb 2012 14:22:47 +0800
> >> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
> >>
> >> This patch implement the algorithm as Peter's:
> >> https://lkml.org/lkml/2012/2/1/119
> >>
> >> o	Make the checking lock-free and we can perform parallel checking,
> >> 	Although almost parallel checking makes no sense, but we need it
> >> 	when 1) the original checking task is preempted for long, 2)
> >> 	sychronize_srcu_expedited(), 3) avoid lock(see next)
> >>
> >> o	Since it is lock-free, we save a mutex in state machine for
> >> 	call_srcu().
> >>
> >> o	Remove the SRCU_REF_MASK and remove the coupling with the flipping.
> >> 	(so we can remove the preempt_disable() in future, but use
> >> 	 __this_cpu_inc() instead.)
> >>
> >> o	reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
> >> 	more intuitive.
> > 
> > Hello, Lai,
> > 
> > Interesting approach!
> > 
> > What happens given the following sequence of events?
> > 
> > o	CPU 0 in srcu_readers_active_idx_check() invokes
> > 	srcu_readers_seq_idx(), getting some number back.
> > 
> > o	CPU 0 invokes srcu_readers_active_idx(), summing the
> > 	->c[] array up through CPU 3.
> > 
> > o	CPU 1 invokes __srcu_read_lock(), and increments its counter
> > 	but not yet its ->seq[] element.
> 
> 
> Any __srcu_read_lock() whose increment of active counter is not seen
> by srcu_readers_active_idx() is considerred as
> "reader-started-after-this-srcu_readers_active_idx_check()",
> We don't need to wait.
> 
> As you said, this srcu C.S 's increment seq is not seen by above
> srcu_readers_seq_idx().
> 
> > 
> > o	CPU 0 completes its summing of the ->c[] array, incorrectly
> > 	obtaining zero.
> > 
> > o	CPU 0 invokes srcu_readers_seq_idx(), getting the same
> > 	number back that it got last time.
> 
> If it incorrectly get zero, it means __srcu_read_unlock() is seen
> in srcu_readers_active_idx(), and it means the increment of
> seq is seen in this srcu_readers_seq_idx(), it is different
> from the above seq that it got last time.
> 
> increment of seq is not seen by above srcu_readers_seq_idx(),
> but is seen by later one, so the two returned seq is different,
> this is the core of Peter's algorithm, and this was written
> in the comments(Sorry for my bad English). Or maybe I miss
> your means in this mail.

OK, good, this analysis agrees with what I was thinking.

So my next question is about the lock freedom.  This lock freedom has to
be limited in nature and carefully implemented.  The reasons for this are:

1.	Readers can block in any case, which can of course block both
	synchronize_srcu_expedited() and synchronize_srcu().

2.	Because only one CPU at a time can be incrementing ->completed,
	some sort of lock with preemption disabling will of course be
	needed.  Alternatively, an rt_mutex could be used for its
	priority-inheritance properties.

3.	Once some CPU has incremented ->completed, all CPUs that might
	still be summing up the old indexes must stop.  If they don't,
	they might incorrectly call a too-short grace period in case of
	->seq[]-sum overflow on 32-bit systems.

Or did you have something else in mind?
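
For example, point 2 might be handled with the sp->mutex that already
exists in struct srcu_struct, so that only the mutex holder ever
advances ->completed.  The wrapper below is only a sketch of that
serialization, not of the full state machine:

static void model_srcu_flip_serialized(struct srcu_struct *sp)
{
	mutex_lock(&sp->mutex);		/* one flipper at a time (point 2) */
	srcu_flip(sp);			/* sole increment of ->completed */
	mutex_unlock(&sp->mutex);
}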

							Thanx, Paul

> Thanks,
> Lai
> 
> > 
> > o	In parallel with the previous step, CPU 1 executes out of order
> > 	(as permitted by the lack of a second memory barrier in
> > 	__srcu_read_lock()), starting up the critical section before
> > 	incrementing its ->seq[] element.
> > 
> > o	Because CPU 0 is not aware that CPU 1 is an SRCU reader, it
> > 	completes the SRCU grace period before CPU 1 completes its
> > 	SRCU read-side critical section.
> > 
> > This actually might be safe, but I need to think more about it.  In the
> > meantime, I figured I should ask your thoughts.
> > 
> > 							Thanx, Paul
> > 
> >> Inspired-by: Peter Zijlstra <peterz@infradead.org>
> >> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> >> ---
> >>  include/linux/srcu.h |    7 +--
> >>  kernel/srcu.c        |  137 ++++++++++++++++++++-----------------------------
> >>  2 files changed, 57 insertions(+), 87 deletions(-)
> >>
> >> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> >> index 5b49d41..15354db 100644
> >> --- a/include/linux/srcu.h
> >> +++ b/include/linux/srcu.h
> >> @@ -32,18 +32,13 @@
> >>
> >>  struct srcu_struct_array {
> >>  	unsigned long c[2];
> >> +	unsigned long seq[2];
> >>  };
> >>
> >> -/* Bit definitions for field ->c above and ->snap below. */
> >> -#define SRCU_USAGE_BITS		1
> >> -#define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
> >> -#define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
> >> -
> >>  struct srcu_struct {
> >>  	unsigned completed;
> >>  	struct srcu_struct_array __percpu *per_cpu_ref;
> >>  	struct mutex mutex;
> >> -	unsigned long snap[NR_CPUS];
> >>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> >>  	struct lockdep_map dep_map;
> >>  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> >> diff --git a/kernel/srcu.c b/kernel/srcu.c
> >> index 47ee35d..376b583 100644
> >> --- a/kernel/srcu.c
> >> +++ b/kernel/srcu.c
> >> @@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
> >>  #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> >>
> >>  /*
> >> + * Returns approximate total sequence of readers on the specified rank
> >> + * of per-CPU counters.
> >> + */
> >> +static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
> >> +{
> >> +	int cpu;
> >> +	unsigned long sum = 0;
> >> +	unsigned long t;
> >> +
> >> +	for_each_possible_cpu(cpu) {
> >> +		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
> >> +		sum += t;
> >> +	}
> >> +	return sum;
> >> +}
> >> +
> >> +/*
> >>   * Returns approximate number of readers active on the specified rank
> >> - * of per-CPU counters.  Also snapshots each counter's value in the
> >> - * corresponding element of sp->snap[] for later use validating
> >> - * the sum.
> >> + * of per-CPU counters.
> >>   */
> >>  static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> >>  {
> >> @@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> >>  	for_each_possible_cpu(cpu) {
> >>  		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
> >>  		sum += t;
> >> -		sp->snap[cpu] = t;
> >>  	}
> >> -	return sum & SRCU_REF_MASK;
> >> +	return sum;
> >>  }
> >>
> >> -/*
> >> - * To be called from the update side after an index flip.  Returns true
> >> - * if the modulo sum of the counters is stably zero, false if there is
> >> - * some possibility of non-zero.
> >> - */
> >>  static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> >>  {
> >>  	int cpu;
> >> +	unsigned long seq;
> >> +
> >> +	seq = srcu_readers_seq_idx(sp, idx);
> >> +
> >> +	/*
> >> +	 * smp_mb() A pairs with smp_mb() B for critical section.
> >> +	 * It ensures that the SRCU read-side critical section whose
> >> +	 * read-lock is not seen by the following srcu_readers_active_idx()
> >> +	 * will see any updates that before the current task performed before.
> >> +	 * (So we don't need to care these readers this time)
> >> +	 *
> >> +	 * Also, if we see the increment of the seq, we must see the
> >> +	 * increment of the active counter in the following
> >> +	 * srcu_readers_active_idx().
> >> +	 */
> >> +	smp_mb(); /* A */
> >>
> >>  	/*
> >>  	 * Note that srcu_readers_active_idx() can incorrectly return
> >>  	 * zero even though there is a pre-existing reader throughout.
> >>  	 * To see this, suppose that task A is in a very long SRCU
> >>  	 * read-side critical section that started on CPU 0, and that
> >> -	 * no other reader exists, so that the modulo sum of the counters
> >> +	 * no other reader exists, so that the sum of the counters
> >>  	 * is equal to one.  Then suppose that task B starts executing
> >>  	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
> >>  	 * task C starts reading on CPU 0, so that its increment is not
> >> @@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> >>  		return false;
> >>
> >>  	/*
> >> -	 * Since the caller recently flipped ->completed, we can see at
> >> -	 * most one increment of each CPU's counter from this point
> >> -	 * forward.  The reason for this is that the reader CPU must have
> >> -	 * fetched the index before srcu_readers_active_idx checked
> >> -	 * that CPU's counter, but not yet incremented its counter.
> >> -	 * Its eventual counter increment will follow the read in
> >> -	 * srcu_readers_active_idx(), and that increment is immediately
> >> -	 * followed by smp_mb() B.  Because smp_mb() D is between
> >> -	 * the ->completed flip and srcu_readers_active_idx()'s read,
> >> -	 * that CPU's subsequent load of ->completed must see the new
> >> -	 * value, and therefore increment the counter in the other rank.
> >> -	 */
> >> -	smp_mb(); /* A */
> >> -
> >> -	/*
> >> -	 * Now, we check the ->snap array that srcu_readers_active_idx()
> >> -	 * filled in from the per-CPU counter values. Since
> >> -	 * __srcu_read_lock() increments the upper bits of the per-CPU
> >> -	 * counter, an increment/decrement pair will change the value
> >> -	 * of the counter.  Since there is only one possible increment,
> >> -	 * the only way to wrap the counter is to have a huge number of
> >> -	 * counter decrements, which requires a huge number of tasks and
> >> -	 * huge SRCU read-side critical-section nesting levels, even on
> >> -	 * 32-bit systems.
> >> -	 *
> >> -	 * All of the ways of confusing the readings require that the scan
> >> -	 * in srcu_readers_active_idx() see the read-side task's decrement,
> >> -	 * but not its increment.  However, between that decrement and
> >> -	 * increment are smb_mb() B and C.  Either or both of these pair
> >> -	 * with smp_mb() A above to ensure that the scan below will see
> >> -	 * the read-side tasks's increment, thus noting a difference in
> >> -	 * the counter values between the two passes.
> >> +	 * Validation step, smp_mb() D pairs with smp_mb() C. If the above
> >> +	 * srcu_readers_active_idx() see a decrement of the active counter
> >> +	 * in srcu_read_unlock(), it should see one of these for corresponding
> >> +	 * srcu_read_lock():
> >> +	 * 	See the increment of the active counter,
> >> +	 * 	Failed to see the increment of the active counter.
> >> +	 * The second one can cause srcu_readers_active_idx() incorrectly
> >> +	 * return zero, but it means the above srcu_readers_seq_idx() does not
> >> +	 * see the increment of the seq(ref: comments of smp_mb() A),
> >> +	 * and the following srcu_readers_seq_idx() sees the increment of
> >> +	 * the seq. The seq is changed.
> >>  	 *
> >> -	 * Therefore, if srcu_readers_active_idx() returned zero, and
> >> -	 * none of the counters changed, we know that the zero was the
> >> -	 * correct sum.
> >> -	 *
> >> -	 * Of course, it is possible that a task might be delayed
> >> -	 * for a very long time in __srcu_read_lock() after fetching
> >> -	 * the index but before incrementing its counter.  This
> >> -	 * possibility will be dealt with in __synchronize_srcu().
> >> +	 * This smp_mb() D pairs with smp_mb() C for critical section.
> >> +	 * then any of the current task's subsequent code will happen after
> >> +	 * that SRCU read-side critical section whose read-unlock is seen in
> >> +	 * srcu_readers_active_idx().
> >>  	 */
> >> -	for_each_possible_cpu(cpu)
> >> -		if (sp->snap[cpu] !=
> >> -		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> >> -			return false;  /* False zero reading! */
> >> -	return true;
> >> +	smp_mb(); /* D */
> >> +
> >> +	return srcu_readers_seq_idx(sp, idx) == seq;
> >>  }
> >>
> >>  /**
> >> @@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
> >>  	preempt_disable();
> >>  	idx = rcu_dereference_index_check(sp->completed,
> >>  					  rcu_read_lock_sched_held()) & 0x1;
> >> -	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> >> -		SRCU_USAGE_COUNT + 1;
> >> +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
> >>  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
> >> +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
> >>  	preempt_enable();
> >>  	return idx;
> >>  }
> >> @@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> >>  	int trycount = 0;
> >>
> >>  	/*
> >> -	 * If a reader fetches the index before the ->completed increment,
> >> -	 * but increments its counter after srcu_readers_active_idx_check()
> >> -	 * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
> >> -	 * smp_mb() B to ensure that the SRCU read-side critical section
> >> -	 * will see any updates that the current task performed before its
> >> -	 * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
> >> -	 * as the case may be.
> >> -	 */
> >> -	smp_mb(); /* D */
> >> -
> >> -	/*
> >>  	 * SRCU read-side critical sections are normally short, so wait
> >>  	 * a small amount of time before possibly blocking.
> >>  	 */
> >> @@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> >>  				schedule_timeout_interruptible(1);
> >>  		}
> >>  	}
> >> -
> >> -	/*
> >> -	 * The following smp_mb() E pairs with srcu_read_unlock()'s
> >> -	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
> >> -	 * sees srcu_read_unlock()'s counter decrement, then any
> >> -	 * of the current task's subsequent code will happen after
> >> -	 * that SRCU read-side critical section.
> >> -	 *
> >> -	 * It also ensures the order between the above waiting and
> >> -	 * the next flipping.
> >> -	 */
> >> -	smp_mb(); /* E */
> >>  }
> >>
> >>  static void srcu_flip(struct srcu_struct *sp)
> >> -- 
> >> 1.7.4.4
> >>
> > 
> > 
> 



* Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm
  2012-02-28 13:47                         ` Paul E. McKenney
@ 2012-02-29 10:07                           ` Lai Jiangshan
  2012-02-29 13:55                             ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-02-29 10:07 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On 02/28/2012 09:47 PM, Paul E. McKenney wrote:
> On Tue, Feb 28, 2012 at 09:51:22AM +0800, Lai Jiangshan wrote:
>> On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
>>> On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
>>>> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
>>>> From: Lai Jiangshan <laijs@cn.fujitsu.com>
>>>> Date: Mon, 27 Feb 2012 14:22:47 +0800
>>>> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
>>>>
>>>> This patch implement the algorithm as Peter's:
>>>> https://lkml.org/lkml/2012/2/1/119
>>>>
>>>> o	Make the checking lock-free and we can perform parallel checking,
>>>> 	Although almost parallel checking makes no sense, but we need it
>>>> 	when 1) the original checking task is preempted for long, 2)
>>>> 	sychronize_srcu_expedited(), 3) avoid lock(see next)
>>>>
>>>> o	Since it is lock-free, we save a mutex in state machine for
>>>> 	call_srcu().
>>>>
>>>> o	Remove the SRCU_REF_MASK and remove the coupling with the flipping.
>>>> 	(so we can remove the preempt_disable() in future, but use
>>>> 	 __this_cpu_inc() instead.)
>>>>
>>>> o	reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
>>>> 	more intuitive.
>>>
>>> Hello, Lai,
>>>
>>> Interesting approach!
>>>
>>> What happens given the following sequence of events?
>>>
>>> o	CPU 0 in srcu_readers_active_idx_check() invokes
>>> 	srcu_readers_seq_idx(), getting some number back.
>>>
>>> o	CPU 0 invokes srcu_readers_active_idx(), summing the
>>> 	->c[] array up through CPU 3.
>>>
>>> o	CPU 1 invokes __srcu_read_lock(), and increments its counter
>>> 	but not yet its ->seq[] element.
>>
>>
>> Any __srcu_read_lock() whose increment of active counter is not seen
>> by srcu_readers_active_idx() is considerred as
>> "reader-started-after-this-srcu_readers_active_idx_check()",
>> We don't need to wait.
>>
>> As you said, this srcu C.S 's increment seq is not seen by above
>> srcu_readers_seq_idx().
>>
>>>
>>> o	CPU 0 completes its summing of the ->c[] array, incorrectly
>>> 	obtaining zero.
>>>
>>> o	CPU 0 invokes srcu_readers_seq_idx(), getting the same
>>> 	number back that it got last time.
>>
>> If it incorrectly get zero, it means __srcu_read_unlock() is seen
>> in srcu_readers_active_idx(), and it means the increment of
>> seq is seen in this srcu_readers_seq_idx(), it is different
>> from the above seq that it got last time.
>>
>> increment of seq is not seen by above srcu_readers_seq_idx(),
>> but is seen by later one, so the two returned seq is different,
>> this is the core of Peter's algorithm, and this was written
>> in the comments(Sorry for my bad English). Or maybe I miss
>> your means in this mail.
> 
> OK, good, this analysis agrees with what I was thinking.
> 
> So my next question is about the lock freedom.  This lock freedom has to
> be limited in nature and carefully implemented.  The reasons for this are:
> 
> 1.	Readers can block in any case, which can of course block both
> 	synchronize_srcu_expedited() and synchronize_srcu().
> 
> 2.	Because only one CPU at a time can be incrementing ->completed,
> 	some sort of lock with preemption disabling will of course be
> 	needed.  Alternatively, an rt_mutex could be used for its
> 	priority-inheritance properties.
> 
> 3.	Once some CPU has incremented ->completed, all CPUs that might
> 	still be summing up the old indexes must stop.  If they don't,
> 	they might incorrectly call a too-short grace period in case of
> 	->seq[]-sum overflow on 32-bit systems.
> 
> Or did you have something else in mind?

If a flip happens while check_zero is in progress, that check_zero will not
be committed even if it succeeds.
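
In other words, something along the lines of the sketch below wrapped
around the existing check; the wrapper and its name are illustrative,
not part of the patch:

static bool model_try_check_zero(struct srcu_struct *sp, int idx)
{
	unsigned completed = ACCESS_ONCE(sp->completed);
	bool zero = srcu_readers_active_idx_check(sp, idx);

	/*
	 * If another updater flipped ->completed while we were summing,
	 * do not commit the result even if the check succeeded: the sums
	 * may span ranks from different grace periods (your point 3).
	 */
	if (ACCESS_ONCE(sp->completed) != completed)
		return false;
	return zero;
}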

I played with lock-freedom for call_srcu() too much and the code became
complicated, so I am giving up on lock-freedom for call_srcu(); the main
aim of call_srcu() is to stay simple.

(But I still like Peter's approach; it has other good properties besides
lock-free checking.  If you don't like it, I will send another patch to
fix srcu_readers_active().)

Thanks,
Lai

> 
> 							Thanx, Paul
> 
>> Thanks,
>> Lai
>>
>>>
>>> o	In parallel with the previous step, CPU 1 executes out of order
>>> 	(as permitted by the lack of a second memory barrier in
>>> 	__srcu_read_lock()), starting up the critical section before
>>> 	incrementing its ->seq[] element.
>>>
>>> o	Because CPU 0 is not aware that CPU 1 is an SRCU reader, it
>>> 	completes the SRCU grace period before CPU 1 completes its
>>> 	SRCU read-side critical section.
>>>
>>> This actually might be safe, but I need to think more about it.  In the
>>> meantime, I figured I should ask your thoughts.
>>>
>>> 							Thanx, Paul
>>>
>>>> Inspired-by: Peter Zijlstra <peterz@infradead.org>
>>>> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
>>>> ---
>>>>  include/linux/srcu.h |    7 +--
>>>>  kernel/srcu.c        |  137 ++++++++++++++++++++-----------------------------
>>>>  2 files changed, 57 insertions(+), 87 deletions(-)
>>>>
>>>> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
>>>> index 5b49d41..15354db 100644
>>>> --- a/include/linux/srcu.h
>>>> +++ b/include/linux/srcu.h
>>>> @@ -32,18 +32,13 @@
>>>>
>>>>  struct srcu_struct_array {
>>>>  	unsigned long c[2];
>>>> +	unsigned long seq[2];
>>>>  };
>>>>
>>>> -/* Bit definitions for field ->c above and ->snap below. */
>>>> -#define SRCU_USAGE_BITS		1
>>>> -#define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
>>>> -#define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
>>>> -
>>>>  struct srcu_struct {
>>>>  	unsigned completed;
>>>>  	struct srcu_struct_array __percpu *per_cpu_ref;
>>>>  	struct mutex mutex;
>>>> -	unsigned long snap[NR_CPUS];
>>>>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>>>  	struct lockdep_map dep_map;
>>>>  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>>>> diff --git a/kernel/srcu.c b/kernel/srcu.c
>>>> index 47ee35d..376b583 100644
>>>> --- a/kernel/srcu.c
>>>> +++ b/kernel/srcu.c
>>>> @@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
>>>>  #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>>>>
>>>>  /*
>>>> + * Returns approximate total sequence of readers on the specified rank
>>>> + * of per-CPU counters.
>>>> + */
>>>> +static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
>>>> +{
>>>> +	int cpu;
>>>> +	unsigned long sum = 0;
>>>> +	unsigned long t;
>>>> +
>>>> +	for_each_possible_cpu(cpu) {
>>>> +		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
>>>> +		sum += t;
>>>> +	}
>>>> +	return sum;
>>>> +}
>>>> +
>>>> +/*
>>>>   * Returns approximate number of readers active on the specified rank
>>>> - * of per-CPU counters.  Also snapshots each counter's value in the
>>>> - * corresponding element of sp->snap[] for later use validating
>>>> - * the sum.
>>>> + * of per-CPU counters.
>>>>   */
>>>>  static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>>>>  {
>>>> @@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>>>>  	for_each_possible_cpu(cpu) {
>>>>  		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
>>>>  		sum += t;
>>>> -		sp->snap[cpu] = t;
>>>>  	}
>>>> -	return sum & SRCU_REF_MASK;
>>>> +	return sum;
>>>>  }
>>>>
>>>> -/*
>>>> - * To be called from the update side after an index flip.  Returns true
>>>> - * if the modulo sum of the counters is stably zero, false if there is
>>>> - * some possibility of non-zero.
>>>> - */
>>>>  static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>>>>  {
>>>>  	int cpu;
>>>> +	unsigned long seq;
>>>> +
>>>> +	seq = srcu_readers_seq_idx(sp, idx);
>>>> +
>>>> +	/*
>>>> +	 * smp_mb() A pairs with smp_mb() B for critical section.
>>>> +	 * It ensures that the SRCU read-side critical section whose
>>>> +	 * read-lock is not seen by the following srcu_readers_active_idx()
>>>> +	 * will see any updates that before the current task performed before.
>>>> +	 * (So we don't need to care these readers this time)
>>>> +	 *
>>>> +	 * Also, if we see the increment of the seq, we must see the
>>>> +	 * increment of the active counter in the following
>>>> +	 * srcu_readers_active_idx().
>>>> +	 */
>>>> +	smp_mb(); /* A */
>>>>
>>>>  	/*
>>>>  	 * Note that srcu_readers_active_idx() can incorrectly return
>>>>  	 * zero even though there is a pre-existing reader throughout.
>>>>  	 * To see this, suppose that task A is in a very long SRCU
>>>>  	 * read-side critical section that started on CPU 0, and that
>>>> -	 * no other reader exists, so that the modulo sum of the counters
>>>> +	 * no other reader exists, so that the sum of the counters
>>>>  	 * is equal to one.  Then suppose that task B starts executing
>>>>  	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
>>>>  	 * task C starts reading on CPU 0, so that its increment is not
>>>> @@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>>>>  		return false;
>>>>
>>>>  	/*
>>>> -	 * Since the caller recently flipped ->completed, we can see at
>>>> -	 * most one increment of each CPU's counter from this point
>>>> -	 * forward.  The reason for this is that the reader CPU must have
>>>> -	 * fetched the index before srcu_readers_active_idx checked
>>>> -	 * that CPU's counter, but not yet incremented its counter.
>>>> -	 * Its eventual counter increment will follow the read in
>>>> -	 * srcu_readers_active_idx(), and that increment is immediately
>>>> -	 * followed by smp_mb() B.  Because smp_mb() D is between
>>>> -	 * the ->completed flip and srcu_readers_active_idx()'s read,
>>>> -	 * that CPU's subsequent load of ->completed must see the new
>>>> -	 * value, and therefore increment the counter in the other rank.
>>>> -	 */
>>>> -	smp_mb(); /* A */
>>>> -
>>>> -	/*
>>>> -	 * Now, we check the ->snap array that srcu_readers_active_idx()
>>>> -	 * filled in from the per-CPU counter values. Since
>>>> -	 * __srcu_read_lock() increments the upper bits of the per-CPU
>>>> -	 * counter, an increment/decrement pair will change the value
>>>> -	 * of the counter.  Since there is only one possible increment,
>>>> -	 * the only way to wrap the counter is to have a huge number of
>>>> -	 * counter decrements, which requires a huge number of tasks and
>>>> -	 * huge SRCU read-side critical-section nesting levels, even on
>>>> -	 * 32-bit systems.
>>>> -	 *
>>>> -	 * All of the ways of confusing the readings require that the scan
>>>> -	 * in srcu_readers_active_idx() see the read-side task's decrement,
>>>> -	 * but not its increment.  However, between that decrement and
>>>> -	 * increment are smb_mb() B and C.  Either or both of these pair
>>>> -	 * with smp_mb() A above to ensure that the scan below will see
>>>> -	 * the read-side tasks's increment, thus noting a difference in
>>>> -	 * the counter values between the two passes.
>>>> +	 * Validation step, smp_mb() D pairs with smp_mb() C. If the above
>>>> +	 * srcu_readers_active_idx() see a decrement of the active counter
>>>> +	 * in srcu_read_unlock(), it should see one of these for corresponding
>>>> +	 * srcu_read_lock():
>>>> +	 * 	See the increment of the active counter,
>>>> +	 * 	Failed to see the increment of the active counter.
>>>> +	 * The second one can cause srcu_readers_active_idx() incorrectly
>>>> +	 * return zero, but it means the above srcu_readers_seq_idx() does not
>>>> +	 * see the increment of the seq(ref: comments of smp_mb() A),
>>>> +	 * and the following srcu_readers_seq_idx() sees the increment of
>>>> +	 * the seq. The seq is changed.
>>>>  	 *
>>>> -	 * Therefore, if srcu_readers_active_idx() returned zero, and
>>>> -	 * none of the counters changed, we know that the zero was the
>>>> -	 * correct sum.
>>>> -	 *
>>>> -	 * Of course, it is possible that a task might be delayed
>>>> -	 * for a very long time in __srcu_read_lock() after fetching
>>>> -	 * the index but before incrementing its counter.  This
>>>> -	 * possibility will be dealt with in __synchronize_srcu().
>>>> +	 * This smp_mb() D pairs with smp_mb() C for critical section.
>>>> +	 * then any of the current task's subsequent code will happen after
>>>> +	 * that SRCU read-side critical section whose read-unlock is seen in
>>>> +	 * srcu_readers_active_idx().
>>>>  	 */
>>>> -	for_each_possible_cpu(cpu)
>>>> -		if (sp->snap[cpu] !=
>>>> -		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
>>>> -			return false;  /* False zero reading! */
>>>> -	return true;
>>>> +	smp_mb(); /* D */
>>>> +
>>>> +	return srcu_readers_seq_idx(sp, idx) == seq;
>>>>  }
>>>>
>>>>  /**
>>>> @@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
>>>>  	preempt_disable();
>>>>  	idx = rcu_dereference_index_check(sp->completed,
>>>>  					  rcu_read_lock_sched_held()) & 0x1;
>>>> -	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
>>>> -		SRCU_USAGE_COUNT + 1;
>>>> +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
>>>>  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
>>>> +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
>>>>  	preempt_enable();
>>>>  	return idx;
>>>>  }
>>>> @@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>>>>  	int trycount = 0;
>>>>
>>>>  	/*
>>>> -	 * If a reader fetches the index before the ->completed increment,
>>>> -	 * but increments its counter after srcu_readers_active_idx_check()
>>>> -	 * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
>>>> -	 * smp_mb() B to ensure that the SRCU read-side critical section
>>>> -	 * will see any updates that the current task performed before its
>>>> -	 * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
>>>> -	 * as the case may be.
>>>> -	 */
>>>> -	smp_mb(); /* D */
>>>> -
>>>> -	/*
>>>>  	 * SRCU read-side critical sections are normally short, so wait
>>>>  	 * a small amount of time before possibly blocking.
>>>>  	 */
>>>> @@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>>>>  				schedule_timeout_interruptible(1);
>>>>  		}
>>>>  	}
>>>> -
>>>> -	/*
>>>> -	 * The following smp_mb() E pairs with srcu_read_unlock()'s
>>>> -	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
>>>> -	 * sees srcu_read_unlock()'s counter decrement, then any
>>>> -	 * of the current task's subsequent code will happen after
>>>> -	 * that SRCU read-side critical section.
>>>> -	 *
>>>> -	 * It also ensures the order between the above waiting and
>>>> -	 * the next flipping.
>>>> -	 */
>>>> -	smp_mb(); /* E */
>>>>  }
>>>>
>>>>  static void srcu_flip(struct srcu_struct *sp)
>>>> -- 
>>>> 1.7.4.4
>>>>
>>>
>>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm
  2012-02-29 10:07                           ` Lai Jiangshan
@ 2012-02-29 13:55                             ` Paul E. McKenney
  2012-03-01  2:31                               ` Lai Jiangshan
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-02-29 13:55 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Wed, Feb 29, 2012 at 06:07:32PM +0800, Lai Jiangshan wrote:
> On 02/28/2012 09:47 PM, Paul E. McKenney wrote:
> > On Tue, Feb 28, 2012 at 09:51:22AM +0800, Lai Jiangshan wrote:
> >> On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
> >>> On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
> >>>> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
> >>>> From: Lai Jiangshan <laijs@cn.fujitsu.com>
> >>>> Date: Mon, 27 Feb 2012 14:22:47 +0800
> >>>> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
> >>>>
> >>>> This patch implement the algorithm as Peter's:
> >>>> https://lkml.org/lkml/2012/2/1/119
> >>>>
> >>>> o	Make the checking lock-free and we can perform parallel checking,
> >>>> 	Although almost parallel checking makes no sense, but we need it
> >>>> 	when 1) the original checking task is preempted for long, 2)
> >>>> 	sychronize_srcu_expedited(), 3) avoid lock(see next)
> >>>>
> >>>> o	Since it is lock-free, we save a mutex in state machine for
> >>>> 	call_srcu().
> >>>>
> >>>> o	Remove the SRCU_REF_MASK and remove the coupling with the flipping.
> >>>> 	(so we can remove the preempt_disable() in future, but use
> >>>> 	 __this_cpu_inc() instead.)
> >>>>
> >>>> o	reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
> >>>> 	more intuitive.
> >>>
> >>> Hello, Lai,
> >>>
> >>> Interesting approach!
> >>>
> >>> What happens given the following sequence of events?
> >>>
> >>> o	CPU 0 in srcu_readers_active_idx_check() invokes
> >>> 	srcu_readers_seq_idx(), getting some number back.
> >>>
> >>> o	CPU 0 invokes srcu_readers_active_idx(), summing the
> >>> 	->c[] array up through CPU 3.
> >>>
> >>> o	CPU 1 invokes __srcu_read_lock(), and increments its counter
> >>> 	but not yet its ->seq[] element.
> >>
> >>
> >> Any __srcu_read_lock() whose increment of active counter is not seen
> >> by srcu_readers_active_idx() is considerred as
> >> "reader-started-after-this-srcu_readers_active_idx_check()",
> >> We don't need to wait.
> >>
> >> As you said, this srcu C.S 's increment seq is not seen by above
> >> srcu_readers_seq_idx().
> >>
> >>>
> >>> o	CPU 0 completes its summing of the ->c[] array, incorrectly
> >>> 	obtaining zero.
> >>>
> >>> o	CPU 0 invokes srcu_readers_seq_idx(), getting the same
> >>> 	number back that it got last time.
> >>
> >> If it incorrectly get zero, it means __srcu_read_unlock() is seen
> >> in srcu_readers_active_idx(), and it means the increment of
> >> seq is seen in this srcu_readers_seq_idx(), it is different
> >> from the above seq that it got last time.
> >>
> >> increment of seq is not seen by above srcu_readers_seq_idx(),
> >> but is seen by later one, so the two returned seq is different,
> >> this is the core of Peter's algorithm, and this was written
> >> in the comments(Sorry for my bad English). Or maybe I miss
> >> your means in this mail.
> > 
> > OK, good, this analysis agrees with what I was thinking.
> > 
> > So my next question is about the lock freedom.  This lock freedom has to
> > be limited in nature and carefully implemented.  The reasons for this are:
> > 
> > 1.	Readers can block in any case, which can of course block both
> > 	synchronize_srcu_expedited() and synchronize_srcu().
> > 
> > 2.	Because only one CPU at a time can be incrementing ->completed,
> > 	some sort of lock with preemption disabling will of course be
> > 	needed.  Alternatively, an rt_mutex could be used for its
> > 	priority-inheritance properties.
> > 
> > 3.	Once some CPU has incremented ->completed, all CPUs that might
> > 	still be summing up the old indexes must stop.  If they don't,
> > 	they might incorrectly call a too-short grace period in case of
> > 	->seq[]-sum overflow on 32-bit systems.
> > 
> > Or did you have something else in mind?
> 
> When flip happens when check_zero, this check_zero will no be
> committed even it is success.

But if the CPU in check_zero isn't blocking the grace period, then
->completed could overflow while that CPU was preempted.  Then how
would this CPU know that the flip had happened?

> I play too much with lock-free for call_srcu(), the code becomes complicated,
> I just give up lock-free for call_srcu(), the main aim of call_srcu() is simple.

Makes sense to me!

> (But I still like Peter's approach, it has some other good thing
> besides lock-free-checking, if you don't like it, I will send
> another patch to fix srcu_readers_active())

Try them both and check their performance &c.  If within epsilon of
each other, pick whichever one you prefer.

							Thanx, Paul

> Thanks,
> Lai
> 
> > 
> > 							Thanx, Paul
> > 
> >> Thanks,
> >> Lai
> >>
> >>>
> >>> o	In parallel with the previous step, CPU 1 executes out of order
> >>> 	(as permitted by the lack of a second memory barrier in
> >>> 	__srcu_read_lock()), starting up the critical section before
> >>> 	incrementing its ->seq[] element.
> >>>
> >>> o	Because CPU 0 is not aware that CPU 1 is an SRCU reader, it
> >>> 	completes the SRCU grace period before CPU 1 completes its
> >>> 	SRCU read-side critical section.
> >>>
> >>> This actually might be safe, but I need to think more about it.  In the
> >>> meantime, I figured I should ask your thoughts.
> >>>
> >>> 							Thanx, Paul
> >>>
> >>>> Inspired-by: Peter Zijlstra <peterz@infradead.org>
> >>>> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> >>>> ---
> >>>>  include/linux/srcu.h |    7 +--
> >>>>  kernel/srcu.c        |  137 ++++++++++++++++++++-----------------------------
> >>>>  2 files changed, 57 insertions(+), 87 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> >>>> index 5b49d41..15354db 100644
> >>>> --- a/include/linux/srcu.h
> >>>> +++ b/include/linux/srcu.h
> >>>> @@ -32,18 +32,13 @@
> >>>>
> >>>>  struct srcu_struct_array {
> >>>>  	unsigned long c[2];
> >>>> +	unsigned long seq[2];
> >>>>  };
> >>>>
> >>>> -/* Bit definitions for field ->c above and ->snap below. */
> >>>> -#define SRCU_USAGE_BITS		1
> >>>> -#define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
> >>>> -#define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
> >>>> -
> >>>>  struct srcu_struct {
> >>>>  	unsigned completed;
> >>>>  	struct srcu_struct_array __percpu *per_cpu_ref;
> >>>>  	struct mutex mutex;
> >>>> -	unsigned long snap[NR_CPUS];
> >>>>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> >>>>  	struct lockdep_map dep_map;
> >>>>  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> >>>> diff --git a/kernel/srcu.c b/kernel/srcu.c
> >>>> index 47ee35d..376b583 100644
> >>>> --- a/kernel/srcu.c
> >>>> +++ b/kernel/srcu.c
> >>>> @@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
> >>>>  #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> >>>>
> >>>>  /*
> >>>> + * Returns approximate total sequence of readers on the specified rank
> >>>> + * of per-CPU counters.
> >>>> + */
> >>>> +static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
> >>>> +{
> >>>> +	int cpu;
> >>>> +	unsigned long sum = 0;
> >>>> +	unsigned long t;
> >>>> +
> >>>> +	for_each_possible_cpu(cpu) {
> >>>> +		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
> >>>> +		sum += t;
> >>>> +	}
> >>>> +	return sum;
> >>>> +}
> >>>> +
> >>>> +/*
> >>>>   * Returns approximate number of readers active on the specified rank
> >>>> - * of per-CPU counters.  Also snapshots each counter's value in the
> >>>> - * corresponding element of sp->snap[] for later use validating
> >>>> - * the sum.
> >>>> + * of per-CPU counters.
> >>>>   */
> >>>>  static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> >>>>  {
> >>>> @@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> >>>>  	for_each_possible_cpu(cpu) {
> >>>>  		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
> >>>>  		sum += t;
> >>>> -		sp->snap[cpu] = t;
> >>>>  	}
> >>>> -	return sum & SRCU_REF_MASK;
> >>>> +	return sum;
> >>>>  }
> >>>>
> >>>> -/*
> >>>> - * To be called from the update side after an index flip.  Returns true
> >>>> - * if the modulo sum of the counters is stably zero, false if there is
> >>>> - * some possibility of non-zero.
> >>>> - */
> >>>>  static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> >>>>  {
> >>>>  	int cpu;
> >>>> +	unsigned long seq;
> >>>> +
> >>>> +	seq = srcu_readers_seq_idx(sp, idx);
> >>>> +
> >>>> +	/*
> >>>> +	 * smp_mb() A pairs with smp_mb() B for critical section.
> >>>> +	 * It ensures that the SRCU read-side critical section whose
> >>>> +	 * read-lock is not seen by the following srcu_readers_active_idx()
> >>>> +	 * will see any updates that before the current task performed before.
> >>>> +	 * (So we don't need to care these readers this time)
> >>>> +	 *
> >>>> +	 * Also, if we see the increment of the seq, we must see the
> >>>> +	 * increment of the active counter in the following
> >>>> +	 * srcu_readers_active_idx().
> >>>> +	 */
> >>>> +	smp_mb(); /* A */
> >>>>
> >>>>  	/*
> >>>>  	 * Note that srcu_readers_active_idx() can incorrectly return
> >>>>  	 * zero even though there is a pre-existing reader throughout.
> >>>>  	 * To see this, suppose that task A is in a very long SRCU
> >>>>  	 * read-side critical section that started on CPU 0, and that
> >>>> -	 * no other reader exists, so that the modulo sum of the counters
> >>>> +	 * no other reader exists, so that the sum of the counters
> >>>>  	 * is equal to one.  Then suppose that task B starts executing
> >>>>  	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
> >>>>  	 * task C starts reading on CPU 0, so that its increment is not
> >>>> @@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> >>>>  		return false;
> >>>>
> >>>>  	/*
> >>>> -	 * Since the caller recently flipped ->completed, we can see at
> >>>> -	 * most one increment of each CPU's counter from this point
> >>>> -	 * forward.  The reason for this is that the reader CPU must have
> >>>> -	 * fetched the index before srcu_readers_active_idx checked
> >>>> -	 * that CPU's counter, but not yet incremented its counter.
> >>>> -	 * Its eventual counter increment will follow the read in
> >>>> -	 * srcu_readers_active_idx(), and that increment is immediately
> >>>> -	 * followed by smp_mb() B.  Because smp_mb() D is between
> >>>> -	 * the ->completed flip and srcu_readers_active_idx()'s read,
> >>>> -	 * that CPU's subsequent load of ->completed must see the new
> >>>> -	 * value, and therefore increment the counter in the other rank.
> >>>> -	 */
> >>>> -	smp_mb(); /* A */
> >>>> -
> >>>> -	/*
> >>>> -	 * Now, we check the ->snap array that srcu_readers_active_idx()
> >>>> -	 * filled in from the per-CPU counter values. Since
> >>>> -	 * __srcu_read_lock() increments the upper bits of the per-CPU
> >>>> -	 * counter, an increment/decrement pair will change the value
> >>>> -	 * of the counter.  Since there is only one possible increment,
> >>>> -	 * the only way to wrap the counter is to have a huge number of
> >>>> -	 * counter decrements, which requires a huge number of tasks and
> >>>> -	 * huge SRCU read-side critical-section nesting levels, even on
> >>>> -	 * 32-bit systems.
> >>>> -	 *
> >>>> -	 * All of the ways of confusing the readings require that the scan
> >>>> -	 * in srcu_readers_active_idx() see the read-side task's decrement,
> >>>> -	 * but not its increment.  However, between that decrement and
> >>>> -	 * increment are smb_mb() B and C.  Either or both of these pair
> >>>> -	 * with smp_mb() A above to ensure that the scan below will see
> >>>> -	 * the read-side tasks's increment, thus noting a difference in
> >>>> -	 * the counter values between the two passes.
> >>>> +	 * Validation step, smp_mb() D pairs with smp_mb() C. If the above
> >>>> +	 * srcu_readers_active_idx() see a decrement of the active counter
> >>>> +	 * in srcu_read_unlock(), it should see one of these for corresponding
> >>>> +	 * srcu_read_lock():
> >>>> +	 * 	See the increment of the active counter,
> >>>> +	 * 	Failed to see the increment of the active counter.
> >>>> +	 * The second one can cause srcu_readers_active_idx() incorrectly
> >>>> +	 * return zero, but it means the above srcu_readers_seq_idx() does not
> >>>> +	 * see the increment of the seq(ref: comments of smp_mb() A),
> >>>> +	 * and the following srcu_readers_seq_idx() sees the increment of
> >>>> +	 * the seq. The seq is changed.
> >>>>  	 *
> >>>> -	 * Therefore, if srcu_readers_active_idx() returned zero, and
> >>>> -	 * none of the counters changed, we know that the zero was the
> >>>> -	 * correct sum.
> >>>> -	 *
> >>>> -	 * Of course, it is possible that a task might be delayed
> >>>> -	 * for a very long time in __srcu_read_lock() after fetching
> >>>> -	 * the index but before incrementing its counter.  This
> >>>> -	 * possibility will be dealt with in __synchronize_srcu().
> >>>> +	 * This smp_mb() D pairs with smp_mb() C for critical section.
> >>>> +	 * then any of the current task's subsequent code will happen after
> >>>> +	 * that SRCU read-side critical section whose read-unlock is seen in
> >>>> +	 * srcu_readers_active_idx().
> >>>>  	 */
> >>>> -	for_each_possible_cpu(cpu)
> >>>> -		if (sp->snap[cpu] !=
> >>>> -		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> >>>> -			return false;  /* False zero reading! */
> >>>> -	return true;
> >>>> +	smp_mb(); /* D */
> >>>> +
> >>>> +	return srcu_readers_seq_idx(sp, idx) == seq;
> >>>>  }
> >>>>
> >>>>  /**
> >>>> @@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
> >>>>  	preempt_disable();
> >>>>  	idx = rcu_dereference_index_check(sp->completed,
> >>>>  					  rcu_read_lock_sched_held()) & 0x1;
> >>>> -	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> >>>> -		SRCU_USAGE_COUNT + 1;
> >>>> +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
> >>>>  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
> >>>> +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
> >>>>  	preempt_enable();
> >>>>  	return idx;
> >>>>  }
> >>>> @@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> >>>>  	int trycount = 0;
> >>>>
> >>>>  	/*
> >>>> -	 * If a reader fetches the index before the ->completed increment,
> >>>> -	 * but increments its counter after srcu_readers_active_idx_check()
> >>>> -	 * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
> >>>> -	 * smp_mb() B to ensure that the SRCU read-side critical section
> >>>> -	 * will see any updates that the current task performed before its
> >>>> -	 * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
> >>>> -	 * as the case may be.
> >>>> -	 */
> >>>> -	smp_mb(); /* D */
> >>>> -
> >>>> -	/*
> >>>>  	 * SRCU read-side critical sections are normally short, so wait
> >>>>  	 * a small amount of time before possibly blocking.
> >>>>  	 */
> >>>> @@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> >>>>  				schedule_timeout_interruptible(1);
> >>>>  		}
> >>>>  	}
> >>>> -
> >>>> -	/*
> >>>> -	 * The following smp_mb() E pairs with srcu_read_unlock()'s
> >>>> -	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
> >>>> -	 * sees srcu_read_unlock()'s counter decrement, then any
> >>>> -	 * of the current task's subsequent code will happen after
> >>>> -	 * that SRCU read-side critical section.
> >>>> -	 *
> >>>> -	 * It also ensures the order between the above waiting and
> >>>> -	 * the next flipping.
> >>>> -	 */
> >>>> -	smp_mb(); /* E */
> >>>>  }
> >>>>
> >>>>  static void srcu_flip(struct srcu_struct *sp)
> >>>> -- 
> >>>> 1.7.4.4
> >>>>
> >>>
> >>>
> >>
> > 
> > 
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm
  2012-02-29 13:55                             ` Paul E. McKenney
@ 2012-03-01  2:31                               ` Lai Jiangshan
  2012-03-01 13:20                                 ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-01  2:31 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On 02/29/2012 09:55 PM, Paul E. McKenney wrote:
> On Wed, Feb 29, 2012 at 06:07:32PM +0800, Lai Jiangshan wrote:
>> On 02/28/2012 09:47 PM, Paul E. McKenney wrote:
>>> On Tue, Feb 28, 2012 at 09:51:22AM +0800, Lai Jiangshan wrote:
>>>> On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
>>>>> On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
>>>>>> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
>>>>>> From: Lai Jiangshan <laijs@cn.fujitsu.com>
>>>>>> Date: Mon, 27 Feb 2012 14:22:47 +0800
>>>>>> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
>>>>>>
>>>>>> This patch implement the algorithm as Peter's:
>>>>>> https://lkml.org/lkml/2012/2/1/119
>>>>>>
>>>>>> o	Make the checking lock-free and we can perform parallel checking,
>>>>>> 	Although almost parallel checking makes no sense, but we need it
>>>>>> 	when 1) the original checking task is preempted for long, 2)
>>>>>> 	sychronize_srcu_expedited(), 3) avoid lock(see next)
>>>>>>
>>>>>> o	Since it is lock-free, we save a mutex in state machine for
>>>>>> 	call_srcu().
>>>>>>
>>>>>> o	Remove the SRCU_REF_MASK and remove the coupling with the flipping.
>>>>>> 	(so we can remove the preempt_disable() in future, but use
>>>>>> 	 __this_cpu_inc() instead.)
>>>>>>
>>>>>> o	reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
>>>>>> 	more intuitive.
>>>>>
>>>>> Hello, Lai,
>>>>>
>>>>> Interesting approach!
>>>>>
>>>>> What happens given the following sequence of events?
>>>>>
>>>>> o	CPU 0 in srcu_readers_active_idx_check() invokes
>>>>> 	srcu_readers_seq_idx(), getting some number back.
>>>>>
>>>>> o	CPU 0 invokes srcu_readers_active_idx(), summing the
>>>>> 	->c[] array up through CPU 3.
>>>>>
>>>>> o	CPU 1 invokes __srcu_read_lock(), and increments its counter
>>>>> 	but not yet its ->seq[] element.
>>>>
>>>>
>>>> Any __srcu_read_lock() whose increment of active counter is not seen
>>>> by srcu_readers_active_idx() is considerred as
>>>> "reader-started-after-this-srcu_readers_active_idx_check()",
>>>> We don't need to wait.
>>>>
>>>> As you said, this srcu C.S 's increment seq is not seen by above
>>>> srcu_readers_seq_idx().
>>>>
>>>>>
>>>>> o	CPU 0 completes its summing of the ->c[] array, incorrectly
>>>>> 	obtaining zero.
>>>>>
>>>>> o	CPU 0 invokes srcu_readers_seq_idx(), getting the same
>>>>> 	number back that it got last time.
>>>>
>>>> If it incorrectly get zero, it means __srcu_read_unlock() is seen
>>>> in srcu_readers_active_idx(), and it means the increment of
>>>> seq is seen in this srcu_readers_seq_idx(), it is different
>>>> from the above seq that it got last time.
>>>>
>>>> increment of seq is not seen by above srcu_readers_seq_idx(),
>>>> but is seen by later one, so the two returned seq is different,
>>>> this is the core of Peter's algorithm, and this was written
>>>> in the comments(Sorry for my bad English). Or maybe I miss
>>>> your means in this mail.
>>>
>>> OK, good, this analysis agrees with what I was thinking.
>>>
>>> So my next question is about the lock freedom.  This lock freedom has to
>>> be limited in nature and carefully implemented.  The reasons for this are:
>>>
>>> 1.	Readers can block in any case, which can of course block both
>>> 	synchronize_srcu_expedited() and synchronize_srcu().
>>>
>>> 2.	Because only one CPU at a time can be incrementing ->completed,
>>> 	some sort of lock with preemption disabling will of course be
>>> 	needed.  Alternatively, an rt_mutex could be used for its
>>> 	priority-inheritance properties.
>>>
>>> 3.	Once some CPU has incremented ->completed, all CPUs that might
>>> 	still be summing up the old indexes must stop.  If they don't,
>>> 	they might incorrectly call a too-short grace period in case of
>>> 	->seq[]-sum overflow on 32-bit systems.
>>>
>>> Or did you have something else in mind?
>>
>> When flip happens when check_zero, this check_zero will no be
>> committed even it is success.
> 
> But if the CPU in check_zero isn't blocking the grace period, then
> ->completed could overflow while that CPU was preempted.  Then how
> would this CPU know that the flip had happened?

As you said, check ->completed,
but prevent the overflow of ->completed.

There is a spinlock for the srcu_struct (including the locking for flipping):

1) assume we need to wait on widx
2) use srcu_read_lock() to hold a reference of the 1-widx active counter
3) release the spinlock
4) do_check_zero
5) gain the spinlock
6) srcu_read_unlock()
7) if ->completed has not changed, and there is no other later check_zero that
   was committed before ours, we commit our check_zero if it succeeded.

too complicated.
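
A rough sketch of that sequence, purely for illustration -- the helpers
do_check_zero(), later_check_committed() and commit_check_zero(), and the
->gp_lock name, are assumptions here rather than code from the posted patches:

	static bool try_commit_check_zero(struct srcu_struct *sp, int widx)
	{
		unsigned completed_snap;
		int ridx;
		bool zero;

		spin_lock(&sp->gp_lock);
		completed_snap = sp->completed;		/* 1) we wait on widx */
		ridx = srcu_read_lock(sp);		/* 2) pin the 1-widx counter */
		spin_unlock(&sp->gp_lock);		/* 3) */

		zero = do_check_zero(sp, widx);		/* 4) possibly long */

		spin_lock(&sp->gp_lock);		/* 5) */
		srcu_read_unlock(sp, ridx);		/* 6) */

		/* 7) commit only if no flip and no later committed check raced */
		if (zero && completed_snap == sp->completed &&
		    !later_check_committed(sp, widx))
			commit_check_zero(sp, widx);
		spin_unlock(&sp->gp_lock);

		return zero;
	}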

Thanks,
Lai

> 
>> I play too much with lock-free for call_srcu(), the code becomes complicated,
>> I just give up lock-free for call_srcu(), the main aim of call_srcu() is simple.
> 
> Makes sense to me!
> 
>> (But I still like Peter's approach, it has some other good thing
>> besides lock-free-checking, if you don't like it, I will send
>> another patch to fix srcu_readers_active())
> 
> Try them both and check their performance &c.  If within epsilon of
> each other, pick whichever one you prefer.
> 
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm
  2012-03-01  2:31                               ` Lai Jiangshan
@ 2012-03-01 13:20                                 ` Paul E. McKenney
  2012-03-10  3:41                                   ` Lai Jiangshan
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-01 13:20 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Thu, Mar 01, 2012 at 10:31:22AM +0800, Lai Jiangshan wrote:
> On 02/29/2012 09:55 PM, Paul E. McKenney wrote:
> > On Wed, Feb 29, 2012 at 06:07:32PM +0800, Lai Jiangshan wrote:
> >> On 02/28/2012 09:47 PM, Paul E. McKenney wrote:
> >>> On Tue, Feb 28, 2012 at 09:51:22AM +0800, Lai Jiangshan wrote:
> >>>> On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
> >>>>> On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
> >>>>>> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
> >>>>>> From: Lai Jiangshan <laijs@cn.fujitsu.com>
> >>>>>> Date: Mon, 27 Feb 2012 14:22:47 +0800
> >>>>>> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
> >>>>>>
> >>>>>> This patch implement the algorithm as Peter's:
> >>>>>> https://lkml.org/lkml/2012/2/1/119
> >>>>>>
> >>>>>> o	Make the checking lock-free and we can perform parallel checking,
> >>>>>> 	Although almost parallel checking makes no sense, but we need it
> >>>>>> 	when 1) the original checking task is preempted for long, 2)
> >>>>>> 	sychronize_srcu_expedited(), 3) avoid lock(see next)
> >>>>>>
> >>>>>> o	Since it is lock-free, we save a mutex in state machine for
> >>>>>> 	call_srcu().
> >>>>>>
> >>>>>> o	Remove the SRCU_REF_MASK and remove the coupling with the flipping.
> >>>>>> 	(so we can remove the preempt_disable() in future, but use
> >>>>>> 	 __this_cpu_inc() instead.)
> >>>>>>
> >>>>>> o	reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
> >>>>>> 	more intuitive.
> >>>>>
> >>>>> Hello, Lai,
> >>>>>
> >>>>> Interesting approach!
> >>>>>
> >>>>> What happens given the following sequence of events?
> >>>>>
> >>>>> o	CPU 0 in srcu_readers_active_idx_check() invokes
> >>>>> 	srcu_readers_seq_idx(), getting some number back.
> >>>>>
> >>>>> o	CPU 0 invokes srcu_readers_active_idx(), summing the
> >>>>> 	->c[] array up through CPU 3.
> >>>>>
> >>>>> o	CPU 1 invokes __srcu_read_lock(), and increments its counter
> >>>>> 	but not yet its ->seq[] element.
> >>>>
> >>>>
> >>>> Any __srcu_read_lock() whose increment of active counter is not seen
> >>>> by srcu_readers_active_idx() is considerred as
> >>>> "reader-started-after-this-srcu_readers_active_idx_check()",
> >>>> We don't need to wait.
> >>>>
> >>>> As you said, this srcu C.S 's increment seq is not seen by above
> >>>> srcu_readers_seq_idx().
> >>>>
> >>>>>
> >>>>> o	CPU 0 completes its summing of the ->c[] array, incorrectly
> >>>>> 	obtaining zero.
> >>>>>
> >>>>> o	CPU 0 invokes srcu_readers_seq_idx(), getting the same
> >>>>> 	number back that it got last time.
> >>>>
> >>>> If it incorrectly get zero, it means __srcu_read_unlock() is seen
> >>>> in srcu_readers_active_idx(), and it means the increment of
> >>>> seq is seen in this srcu_readers_seq_idx(), it is different
> >>>> from the above seq that it got last time.
> >>>>
> >>>> increment of seq is not seen by above srcu_readers_seq_idx(),
> >>>> but is seen by later one, so the two returned seq is different,
> >>>> this is the core of Peter's algorithm, and this was written
> >>>> in the comments(Sorry for my bad English). Or maybe I miss
> >>>> your means in this mail.
> >>>
> >>> OK, good, this analysis agrees with what I was thinking.
> >>>
> >>> So my next question is about the lock freedom.  This lock freedom has to
> >>> be limited in nature and carefully implemented.  The reasons for this are:
> >>>
> >>> 1.	Readers can block in any case, which can of course block both
> >>> 	synchronize_srcu_expedited() and synchronize_srcu().
> >>>
> >>> 2.	Because only one CPU at a time can be incrementing ->completed,
> >>> 	some sort of lock with preemption disabling will of course be
> >>> 	needed.  Alternatively, an rt_mutex could be used for its
> >>> 	priority-inheritance properties.
> >>>
> >>> 3.	Once some CPU has incremented ->completed, all CPUs that might
> >>> 	still be summing up the old indexes must stop.  If they don't,
> >>> 	they might incorrectly call a too-short grace period in case of
> >>> 	->seq[]-sum overflow on 32-bit systems.
> >>>
> >>> Or did you have something else in mind?
> >>
> >> When flip happens when check_zero, this check_zero will no be
> >> committed even it is success.
> > 
> > But if the CPU in check_zero isn't blocking the grace period, then
> > ->completed could overflow while that CPU was preempted.  Then how
> > would this CPU know that the flip had happened?
> 
> As you said, check ->completed,
> but prevent the overflow of ->completed.
> 
> There is a spinlock for the srcu_struct (including the locking for flipping):
> 
> 1) assume we need to wait on widx
> 2) use srcu_read_lock() to hold a reference of the 1-widx active counter
> 3) release the spinlock
> 4) do_check_zero
> 5) gain the spinlock
> 6) srcu_read_unlock()
> 7) if ->completed has not changed, and there is no other later check_zero that
>    was committed before ours, we commit our check_zero if it succeeded.
> 
> too complicated.

Plus I don't see how it disables overflow for ->completed.

As you said earlier, abandoning the goal of lock freedom sounds like the
best approach.  Then you can indeed just hold the srcu_struct's mutex
across the whole thing.
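
For comparison, a minimal sketch of that mutex-only approach, following the
shape of the wait_idx()/srcu_flip() code from the earlier patches in this
series (lockdep assertions omitted, so a sketch rather than a drop-in
replacement):

	static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
	{
		mutex_lock(&sp->mutex);

		/* Wait for pre-existing readers on the old index. */
		wait_idx(sp, (sp->completed - 1) & 0x1, trycount);

		/* Flip, then wait out readers on the now-old index. */
		srcu_flip(sp);
		wait_idx(sp, (sp->completed - 1) & 0x1, trycount);

		mutex_unlock(&sp->mutex);
	}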

							Thanx, Paul

> Thanks,
> Lai
> 
> > 
> >> I play too much with lock-free for call_srcu(), the code becomes complicated,
> >> I just give up lock-free for call_srcu(), the main aim of call_srcu() is simple.
> > 
> > Makes sense to me!
> > 
> >> (But I still like Peter's approach, it has some other good thing
> >> besides lock-free-checking, if you don't like it, I will send
> >> another patch to fix srcu_readers_active())
> > 
> > Try them both and check their performance &c.  If within epsilon of
> > each other, pick whichever one you prefer.
> > 
> > 							Thanx, Paul
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* [RFC PATCH 0/6 paul/rcu/srcu] srcu: implement call_srcu()
  2012-02-21 17:24           ` Paul E. McKenney
                               ` (2 preceding siblings ...)
  2012-02-22  9:29             ` [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period Lai Jiangshan
@ 2012-03-06  8:42             ` Lai Jiangshan
  2012-03-06  9:57               ` [PATCH 1/6] remove unused srcu_barrier() Lai Jiangshan
  3 siblings, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-06  8:42 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

Patches 1-4 are preparatory.
Patches 1 and 2 can be merged now; maybe patch 3 can be merged now as well.

Patch 5 is the draft per-CPU state-machine call_srcu() implementation;
	it will be split up before being merged.

Patch 6 adds an SRCU torture test.

Lai Jiangshan (6):
  remove unused srcu_barrier()
  Don't touch the snap in srcu_readers_active()
  use "int trycount" instead of "bool expedited"
  remove flip_idx_and_wait()
  implement call_srcu()
  add srcu torture test

 include/linux/srcu.h |   49 +++++-
 kernel/rcutorture.c  |   68 ++++++++-
 kernel/srcu.c        |  421 +++++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 454 insertions(+), 84 deletions(-)

-- 
1.7.4.4


^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 1/6] remove unused srcu_barrier()
  2012-03-06  8:42             ` [RFC PATCH 0/6 paul/rcu/srcu] srcu: implement call_srcu() Lai Jiangshan
@ 2012-03-06  9:57               ` Lai Jiangshan
  2012-03-06  9:57                 ` [PATCH 2/6] Don't touch the snap in srcu_readers_active() Lai Jiangshan
                                   ` (5 more replies)
  0 siblings, 6 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-06  9:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

srcu_barrier() is unused now.
This identifier is needed for a later patch.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 include/linux/srcu.h |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 5b49d41..df8f5f7 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -49,12 +49,6 @@ struct srcu_struct {
 #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 };
 
-#ifndef CONFIG_PREEMPT
-#define srcu_barrier() barrier()
-#else /* #ifndef CONFIG_PREEMPT */
-#define srcu_barrier()
-#endif /* #else #ifndef CONFIG_PREEMPT */
-
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
 int __init_srcu_struct(struct srcu_struct *sp, const char *name,
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 2/6] Don't touch the snap in srcu_readers_active()
  2012-03-06  9:57               ` [PATCH 1/6] remove unused srcu_barrier() Lai Jiangshan
@ 2012-03-06  9:57                 ` Lai Jiangshan
  2012-03-08 19:14                   ` Paul E. McKenney
  2012-03-06  9:57                 ` [PATCH 3/6] use "int trycount" instead of "bool expedited" Lai Jiangshan
                                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-06  9:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

srcu_readers_active() is called without the mutex held, but it touches the snap.
This change also achieves slightly better cache locality.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 kernel/srcu.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/kernel/srcu.c b/kernel/srcu.c
index b6b9ea2..fbe2d5f 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -181,7 +181,14 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
  */
 static int srcu_readers_active(struct srcu_struct *sp)
 {
-	return srcu_readers_active_idx(sp, 0) + srcu_readers_active_idx(sp, 1);
+	int cpu;
+	unsigned long sum = 0;
+
+	for_each_possible_cpu(cpu) {
+		sum += ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[0]);
+		sum += ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[1]);
+	}
+	return sum & SRCU_REF_MASK;
 }
 
 /**
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 3/6] use "int trycount" instead of "bool expedited"
  2012-03-06  9:57               ` [PATCH 1/6] remove unused srcu_barrier() Lai Jiangshan
  2012-03-06  9:57                 ` [PATCH 2/6] Don't touch the snap in srcu_readers_active() Lai Jiangshan
@ 2012-03-06  9:57                 ` Lai Jiangshan
  2012-03-08 19:25                   ` Paul E. McKenney
  2012-03-06  9:57                 ` [PATCH 4/6] remove flip_idx_and_wait() Lai Jiangshan
                                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-06  9:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 kernel/srcu.c |   27 ++++++++++++++-------------
 1 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/kernel/srcu.c b/kernel/srcu.c
index fbe2d5f..d5f3450 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -254,12 +254,12 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
  * we repeatedly block for 1-millisecond time periods.  This approach
  * has done well in testing, so there is no need for a config parameter.
  */
-#define SYNCHRONIZE_SRCU_READER_DELAY 5
+#define SYNCHRONIZE_SRCU_READER_DELAY	5
+#define SYNCHRONIZE_SRCU_TRYCOUNT	2
+#define SYNCHRONIZE_SRCU_EXP_TRYCOUNT	12
 
-static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
+static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
 {
-	int trycount = 0;
-
 	/*
 	 * If a reader fetches the index before the ->completed increment,
 	 * but increments its counter after srcu_readers_active_idx_check()
@@ -278,9 +278,10 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
 	if (!srcu_readers_active_idx_check(sp, idx)) {
 		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
 		while (!srcu_readers_active_idx_check(sp, idx)) {
-			if (expedited && ++ trycount < 10)
+			if (trycount > 0) {
+				trycount--;
 				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
-			else
+			} else
 				schedule_timeout_interruptible(1);
 		}
 	}
@@ -310,18 +311,18 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
  * by the next __synchronize_srcu() invoking wait_idx() for such readers
  * before starting a new grace period.
  */
-static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
+static void flip_idx_and_wait(struct srcu_struct *sp, int trycount)
 {
 	int idx;
 
 	idx = sp->completed++ & 0x1;
-	wait_idx(sp, idx, expedited);
+	wait_idx(sp, idx, trycount);
 }
 
 /*
  * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
  */
-static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
+static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
 {
 	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
 			   !lock_is_held(&rcu_bh_lock_map) &&
@@ -357,14 +358,14 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
 	 * fetching ->completed and incrementing their counter, wait_idx()
 	 * will normally not need to wait.
 	 */
-	wait_idx(sp, (sp->completed - 1) & 0x1, expedited);
+	wait_idx(sp, (sp->completed - 1) & 0x1, trycount);
 
 	/*
 	 * Now that wait_idx() has waited for the really old readers,
 	 * invoke flip_idx_and_wait() to flip the counter and wait
 	 * for current SRCU readers.
 	 */
-	flip_idx_and_wait(sp, expedited);
+	flip_idx_and_wait(sp, trycount);
 
 	mutex_unlock(&sp->mutex);
 }
@@ -385,7 +386,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
  */
 void synchronize_srcu(struct srcu_struct *sp)
 {
-	__synchronize_srcu(sp, 0);
+	__synchronize_srcu(sp, SYNCHRONIZE_SRCU_TRYCOUNT);
 }
 EXPORT_SYMBOL_GPL(synchronize_srcu);
 
@@ -406,7 +407,7 @@ EXPORT_SYMBOL_GPL(synchronize_srcu);
  */
 void synchronize_srcu_expedited(struct srcu_struct *sp)
 {
-	__synchronize_srcu(sp, 1);
+	__synchronize_srcu(sp, SYNCHRONIZE_SRCU_EXP_TRYCOUNT);
 }
 EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);
 
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 4/6] remove flip_idx_and_wait()
  2012-03-06  9:57               ` [PATCH 1/6] remove unused srcu_barrier() Lai Jiangshan
  2012-03-06  9:57                 ` [PATCH 2/6] Don't touch the snap in srcu_readers_active() Lai Jiangshan
  2012-03-06  9:57                 ` [PATCH 3/6] use "int trycount" instead of "bool expedited" Lai Jiangshan
@ 2012-03-06  9:57                 ` Lai Jiangshan
  2012-03-06 10:41                   ` Peter Zijlstra
  2012-03-07  3:54                   ` [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu() Lai Jiangshan
  2012-03-06  9:57                 ` [RFC PATCH 5/6] implement per-cpu&per-domain " Lai Jiangshan
                                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-06  9:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

Flip and check are the basic steps of a grace period, so call them directly.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 kernel/srcu.c |   35 +++++++++++++++++------------------
 1 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/kernel/srcu.c b/kernel/srcu.c
index d5f3450..d101ed5 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -300,23 +300,12 @@ static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
 }
 
 /*
- * Flip the readers' index by incrementing ->completed, then wait
- * until there are no more readers using the counters referenced by
- * the old index value.  (Recall that the index is the bottom bit
- * of ->completed.)
- *
- * Of course, it is possible that a reader might be delayed for the
- * full duration of flip_idx_and_wait() between fetching the
- * index and incrementing its counter.  This possibility is handled
- * by the next __synchronize_srcu() invoking wait_idx() for such readers
- * before starting a new grace period.
+ * Flip the readers' index by incrementing ->completed, so that new
+ * readers will use the counters referenced by the new index value.
  */
-static void flip_idx_and_wait(struct srcu_struct *sp, int trycount)
+static void srcu_flip(struct srcu_struct *sp)
 {
-	int idx;
-
-	idx = sp->completed++ & 0x1;
-	wait_idx(sp, idx, trycount);
+	ACCESS_ONCE(sp->completed)++;
 }
 
 /*
@@ -362,10 +351,20 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
 
 	/*
 	 * Now that wait_idx() has waited for the really old readers,
-	 * invoke flip_idx_and_wait() to flip the counter and wait
-	 * for current SRCU readers.
+	 *
+	 * Flip the readers' index by incrementing ->completed, then wait
+	 * until there are no more readers using the counters referenced by
+	 * the old index value.  (Recall that the index is the bottom bit
+	 * of ->completed.)
+	 *
+	 * Of course, it is possible that a reader might be delayed for the
+	 * full duration of flip_idx_and_wait() between fetching the
+	 * index and incrementing its counter.  This possibility is handled
+	 * by the next __synchronize_srcu() invoking wait_idx() for such
+	 * readers before starting a new grace period.
 	 */
-	flip_idx_and_wait(sp, trycount);
+	srcu_flip(sp);
+	wait_idx(sp, (sp->completed - 1) & 0x1, trycount);
 
 	mutex_unlock(&sp->mutex);
 }
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06  9:57               ` [PATCH 1/6] remove unused srcu_barrier() Lai Jiangshan
                                   ` (2 preceding siblings ...)
  2012-03-06  9:57                 ` [PATCH 4/6] remove flip_idx_and_wait() Lai Jiangshan
@ 2012-03-06  9:57                 ` Lai Jiangshan
  2012-03-06 10:47                   ` Peter Zijlstra
                                     ` (8 more replies)
  2012-03-06  9:57                 ` [PATCH 6/6] add srcu torture test Lai Jiangshan
  2012-03-08 19:03                 ` [PATCH 1/6] remove unused srcu_barrier() Paul E. McKenney
  5 siblings, 9 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-06  9:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

Known issues:
	1) All call_srcu() invocations for a domain compete on that domain's
	   ->gp_lock.
	 = Not a real problem: update-side calls should be rare for any
	   *given* SRCU domain.

	2) Since it is per-domain and update-side calls are rare, per-CPU is
	   overkill and slows down the processing a little.
	 = The answer is unknown for now.  If there are fewer than 5 callbacks
	   per grace period, I suggest a single state machine per domain,
	   otherwise per-CPU per-domain.

	3) The processing of the per-CPU state machines is not fully parallel.
	 = Not a problem when callbacks are rare.  It can easily be fixed by
	   shrinking the ->gp_lock-covered region in the state machine; I will
	   add that code when needed.

	4) If a reader is preempted or sleeps for a long time while several
	   CPU state machines are active (have pending callbacks), these CPUs
	   may all do periodic check_zero, all of them fail, and the work is
	   wasted.
	 = We could choose a leader among these CPU state machines to do the
	   periodic check_zero instead of every CPU doing it.  This does not
	   need to be fixed now; it will be fixed later.

Design:
o	The SRCU callback is a new thing.  I want it to be completely
	preemptible, even sleepable, and it is in this implementation: I use
	a work_struct to represent every SRCU callback.
	I notice that some RCU callbacks also use a work_struct for sleepable
	work, but that indirect way pushes complexity to the caller and such
	callbacks are not rcu_barrier()-aware.

o	The state machine is lightweight.  It is preemptible while checking
	(and once issue 3 is fixed it will be almost fully preemptible), but
	it does not sleep (to avoid occupying a thread while sleeping).
	It is also a work_struct, so no thread is occupied by SRCU while the
	srcu_struct is inactive (no callbacks).
	It processes at most 10 callbacks and then returns (also returning
	the thread), so the whole of SRCU is efficient in its resource usage.

o	A callback is invoked on the same CPU on which it was queued.
	The state-machine work_struct and the callback work_struct may run
	in a worker thread without an extra reschedule in between.

o	Workqueues handle all the hotplug issues for us, so this
	implementation depends heavily on workqueues.  DRY.


Others:
	srcu_head is bigger, but it is worth it: it provides more capability
	and simplifies the SRCU code.

	srcu_barrier() is not tested.

	I think the *_expedited variants can be removed, because the mb()-based
	SRCU is far faster than before.

The trip of a callback:
	1) call_srcu() records (snapshots) the chck_seq of the SRCU domain.

	2) if callback->chck_seq < srcu_domain->zero_seq[0] &&
	      callback->chck_seq < srcu_domain->zero_seq[1]
	   deliver the callback.
	   If not, it stays queued and the state machine does a flip or a
	   check_zero.

	3) the callback is invoked from the workqueue.
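
For reference, a usage sketch of the proposed API; struct my_data,
my_data_free(), my_data_retire() and my_srcu are invented names, just for
illustration:

	static struct srcu_struct my_srcu;	/* assume init_srcu_struct() ran */

	struct my_data {
		struct srcu_head sh;
		/* ... payload protected by my_srcu ... */
	};

	static void my_data_free(struct srcu_head *head)
	{
		struct my_data *p = container_of(head, struct my_data, sh);

		kfree(p);	/* runs from a workqueue, so it may sleep */
	}

	static void my_data_retire(struct my_data *p)
	{
		/* snapshots ->chck_seq and queues the callback on this CPU */
		call_srcu(&my_srcu, &p->sh, my_data_free);
	}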

The process of check_zero:
	1) advance srcu_domain->chck_seq when needed
	   (optional, but I make it larger than that of every queued callback)

	2) do the check_zero

	3) on success, commit srcu_domain->chck_seq to ->zero_seq[idx].

	=) So if a callback sees callback->chck_seq < srcu_domain->zero_seq[idx],
	   it has seen the whole sequence above, which means that all readers
	   that started before the callback was queued (and incremented the
	   active counter of idx) have completed.
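
A worked example of that rule with invented numbers, matching the
safe_less_than()/complete_head() logic in the patch below:

	/* sp->chck_seq = 9, sp->zero_seq = { 8, 9 }, head->chck_seq = 7:     */
	/*   safe_less_than(7, 8, 9) && safe_less_than(7, 9, 9) -> completed. */
	/* sp->chck_seq = 9, sp->zero_seq = { 8, 6 }, head->chck_seq = 7:     */
	/*   safe_less_than(7, 6, 9) is false -> idx 1 has not been           */
	/*   successfully rechecked since the snapshot, so the callback waits.*/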


Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 include/linux/srcu.h |   43 ++++++-
 kernel/srcu.c        |  400 ++++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 379 insertions(+), 64 deletions(-)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index df8f5f7..c375985 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -29,6 +29,22 @@
 
 #include <linux/mutex.h>
 #include <linux/rcupdate.h>
+#include <linux/workqueue.h>
+
+struct srcu_head;
+typedef void (*srcu_callback_func_t)(struct srcu_head *head);
+
+struct srcu_head {
+	union {
+		struct {
+			struct srcu_head *next;
+			/* the snap of sp->chck_seq when queued */
+			unsigned long chck_seq;
+			srcu_callback_func_t func;
+		};
+		struct work_struct work;
+	};
+};
 
 struct srcu_struct_array {
 	unsigned long c[2];
@@ -39,10 +55,32 @@ struct srcu_struct_array {
 #define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
 #define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
 
+struct srcu_cpu_struct;
+
 struct srcu_struct {
 	unsigned completed;
+
+	/* the sequence of check zero */
+	unsigned long chck_seq;
+	/* the snap of chck_seq when the last callback is queued */
+	unsigned long callback_chck_seq;
+
+	/* the chck_seq value of the last successful zero check */
+	unsigned long zero_seq[2];
+
+	/* protects the above completed and sequence values */
+	spinlock_t gp_lock;
+
+	/* protects all the fields here except callback_chck_seq */
+	struct mutex flip_check_mutex;
+	/*
+	 * Fields in the intersection of the gp_lock-protected fields and
+	 * the flip_check_mutex-protected fields may be modified only with
+	 * both locks held, and may be read with either lock held.
+	 */
+
 	struct srcu_struct_array __percpu *per_cpu_ref;
-	struct mutex mutex;
+	struct srcu_cpu_struct __percpu *srcu_per_cpu;
 	unsigned long snap[NR_CPUS];
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map dep_map;
@@ -236,4 +274,7 @@ static inline void srcu_read_unlock_raw(struct srcu_struct *sp, int idx)
 	local_irq_restore(flags);
 }
 
+void call_srcu(struct srcu_struct *sp, struct srcu_head *head,
+		srcu_callback_func_t func);
+void srcu_barrier(struct srcu_struct *sp);
 #endif
diff --git a/kernel/srcu.c b/kernel/srcu.c
index d101ed5..8c71ae8 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -33,13 +33,63 @@
 #include <linux/smp.h>
 #include <linux/delay.h>
 #include <linux/srcu.h>
+#include <linux/completion.h>
+
+#define SRCU_CALLBACK_BATCH	10
+#define SRCU_INTERVAL		1
+
+/* protected by sp->gp_lock */
+struct srcu_cpu_struct {
+	/* callback queue for handling */
+	struct srcu_head *head, **tail;
+
+	/* the struct srcu_struct of this struct srcu_cpu_struct */
+	struct srcu_struct *sp;
+	struct delayed_work work;
+};
+
+/*
+ * State-machine processing for every CPU.  It may run on the wrong CPU
+ * during CPU hotplug, or when synchronize_srcu() schedules it after it
+ * has been migrated.
+ */
+static void process_srcu_cpu_struct(struct work_struct *work);
+
+static struct workqueue_struct *srcu_callback_wq;
 
 static int init_srcu_struct_fields(struct srcu_struct *sp)
 {
+	int cpu;
+	struct srcu_cpu_struct *scp;
+
+	mutex_init(&sp->flip_check_mutex);
+	spin_lock_init(&sp->gp_lock);
 	sp->completed = 0;
-	mutex_init(&sp->mutex);
+	sp->chck_seq = 0;
+	sp->callback_chck_seq = 0;
+	sp->zero_seq[0] = 0;
+	sp->zero_seq[1] = 0;
+
 	sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
-	return sp->per_cpu_ref ? 0 : -ENOMEM;
+	if (!sp->per_cpu_ref)
+		return -ENOMEM;
+
+	sp->srcu_per_cpu = alloc_percpu(struct srcu_cpu_struct);
+	if (!sp->srcu_per_cpu) {
+		free_percpu(sp->per_cpu_ref);
+		return -ENOMEM;
+	}
+
+	for_each_possible_cpu(cpu) {
+		scp = per_cpu_ptr(sp->srcu_per_cpu, cpu);
+
+		scp->sp = sp;
+		scp->head = NULL;
+		scp->tail = &scp->head;
+		INIT_DELAYED_WORK(&scp->work, process_srcu_cpu_struct);
+	}
+
+	return 0;
 }
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -208,6 +258,8 @@ void cleanup_srcu_struct(struct srcu_struct *sp)
 		return;
 	free_percpu(sp->per_cpu_ref);
 	sp->per_cpu_ref = NULL;
+	free_percpu(sp->srcu_per_cpu);
+	sp->srcu_per_cpu = NULL;
 }
 EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
 
@@ -247,6 +299,19 @@ void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 EXPORT_SYMBOL_GPL(__srcu_read_unlock);
 
 /*
+ * 'return left < right;' but handles the overflow issues.
+ * The same as 'return (long)(right - left) > 0;' but more careful about wrap.
+ */
+static inline
+bool safe_less_than(unsigned long left, unsigned long right, unsigned long max)
+{
+	unsigned long a = right - left;
+	unsigned long b = max - left;
+
+	return !!a && (a <= b);
+}
+
+/*
  * We use an adaptive strategy for synchronize_srcu() and especially for
  * synchronize_srcu_expedited().  We spin for a fixed time period
  * (defined below) to allow SRCU readers to exit their read-side critical
@@ -254,11 +319,11 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
  * we repeatedly block for 1-millisecond time periods.  This approach
  * has done well in testing, so there is no need for a config parameter.
  */
-#define SYNCHRONIZE_SRCU_READER_DELAY	5
+#define SRCU_RETRY_CHECK_DELAY		5
 #define SYNCHRONIZE_SRCU_TRYCOUNT	2
 #define SYNCHRONIZE_SRCU_EXP_TRYCOUNT	12
 
-static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
+static bool do_check_zero_idx(struct srcu_struct *sp, int idx, int trycount)
 {
 	/*
 	 * If a reader fetches the index before the ->completed increment,
@@ -271,19 +336,12 @@ static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
 	 */
 	smp_mb(); /* D */
 
-	/*
-	 * SRCU read-side critical sections are normally short, so wait
-	 * a small amount of time before possibly blocking.
-	 */
-	if (!srcu_readers_active_idx_check(sp, idx)) {
-		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
-		while (!srcu_readers_active_idx_check(sp, idx)) {
-			if (trycount > 0) {
-				trycount--;
-				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
-			} else
-				schedule_timeout_interruptible(1);
-		}
+	for (;;) {
+		if (srcu_readers_active_idx_check(sp, idx))
+			break;
+		if (--trycount <= 0)
+			return false;
+		udelay(SRCU_RETRY_CHECK_DELAY);
 	}
 
 	/*
@@ -297,6 +355,8 @@ static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
 	 * the next flipping.
 	 */
 	smp_mb(); /* E */
+
+	return true;
 }
 
 /*
@@ -308,65 +368,234 @@ static void srcu_flip(struct srcu_struct *sp)
 	ACCESS_ONCE(sp->completed)++;
 }
 
+/* Must be called with sp->flip_check_mutex and sp->gp_lock held. */
+static bool check_zero_idx(struct srcu_struct *sp, int idx,
+		struct srcu_head *head, int trycount)
+{
+	unsigned long chck_seq;
+	bool checked_zero;
+
+	/* find out the check sequence number for this check */
+	if (sp->chck_seq == sp->callback_chck_seq ||
+	    sp->chck_seq == head->chck_seq)
+		sp->chck_seq++;
+	chck_seq = sp->chck_seq;
+
+	spin_unlock_irq(&sp->gp_lock);
+	checked_zero = do_check_zero_idx(sp, idx, trycount);
+	spin_lock_irq(&sp->gp_lock);
+
+	if (!checked_zero)
+		return false;
+
+	/* commit the successful check */
+	sp->zero_seq[idx] = chck_seq;
+
+	return true;
+}
+
+/*
+ * Is the @head completed?  Will try to do a zero check when it is not.
+ *
+ * Must be called with sp->flip_check_mutex and sp->gp_lock held;
+ * must be called from process context, because the check may take a while.
+ * The sp->gp_lock may be released and regained.
+ */
+static bool complete_head_flip_check(struct srcu_struct *sp,
+		struct srcu_head *head, int trycount)
+{
+	int idxb = sp->completed & 0X1UL;
+	int idxa = 1 - idxb;
+	unsigned long h  = head->chck_seq;
+	unsigned long za = sp->zero_seq[idxa];
+	unsigned long zb = sp->zero_seq[idxb];
+	unsigned long s  = sp->chck_seq;
+
+	if (!safe_less_than(h, za, s)) {
+		if (!check_zero_idx(sp, idxa, head, trycount))
+			return false;
+	}
+
+	if (!safe_less_than(h, zb, s)) {
+		srcu_flip(sp);
+		trycount = trycount < 2 ? 2 : trycount;
+		return check_zero_idx(sp, idxb, head, trycount);
+	}
+
+	return true;
+}
+
+/*
+ * Is the @head completed?
+ *
+ * Must be called with sp->gp_lock held.
+ * srcu_queue_callback() and check_zero_idx() ensure that (s - z0) and (s - z1)
+ * are less than (ULONG_MAX / sizeof(struct srcu_head)). There is at least one
+ * callback queued for each seq in (z0, s) and (z1, s). The same for
+ * za, zb, s in complete_head_flip_check().
+ */
+static bool complete_head(struct srcu_struct *sp, struct srcu_head *head)
+{
+	unsigned long h  = head->chck_seq;
+	unsigned long z0 = sp->zero_seq[0];
+	unsigned long z1 = sp->zero_seq[1];
+	unsigned long s  = sp->chck_seq;
+
+	return safe_less_than(h, z0, s) && safe_less_than(h, z1, s);
+}
+
+static void process_srcu_cpu_struct(struct work_struct *work)
+{
+	int i;
+	int can_flip_check;
+	struct srcu_head *head;
+	struct srcu_cpu_struct *scp;
+	struct srcu_struct *sp;
+	work_func_t wfunc;
+
+	scp = container_of(work, struct srcu_cpu_struct, work.work);
+	sp = scp->sp;
+
+	can_flip_check = mutex_trylock(&sp->flip_check_mutex);
+	spin_lock_irq(&sp->gp_lock);
+
+	for (i = 0; i < SRCU_CALLBACK_BATCH; i++) {
+		head = scp->head;
+		if (!head)
+			break;
+
+		/* Test whether the head is completed or not. */
+		if (can_flip_check) {
+			if (!complete_head_flip_check(sp, head, 1))
+				break;
+		} else {
+			if (!complete_head(sp, head))
+				break;
+		}
+
+		/* dequeue the completed callback */
+		scp->head = head->next;
+		if (!scp->head)
+			scp->tail = &scp->head;
+
+		/* deliver the callback, will be invoked in workqueue */
+		BUILD_BUG_ON(offsetof(struct srcu_head, work) != 0);
+		wfunc = (work_func_t)head->func;
+		INIT_WORK(&head->work, wfunc);
+		queue_work(srcu_callback_wq, &head->work);
+	}
+	if (scp->head)
+		schedule_delayed_work(&scp->work, SRCU_INTERVAL);
+	spin_unlock_irq(&sp->gp_lock);
+	if (can_flip_check)
+		mutex_unlock(&sp->flip_check_mutex);
+}
+
+static
+void srcu_queue_callback(struct srcu_struct *sp, struct srcu_cpu_struct *scp,
+                struct srcu_head *head, srcu_callback_func_t func)
+{
+	head->next = NULL;
+	head->func = func;
+	head->chck_seq = sp->chck_seq;
+	sp->callback_chck_seq = sp->chck_seq;
+	*scp->tail = head;
+	scp->tail = &head->next;
+}
+
+void call_srcu(struct srcu_struct *sp, struct srcu_head *head,
+		srcu_callback_func_t func)
+{
+	unsigned long flags;
+	int cpu = get_cpu();
+	struct srcu_cpu_struct *scp = per_cpu_ptr(sp->srcu_per_cpu, cpu);
+
+	spin_lock_irqsave(&sp->gp_lock, flags);
+	srcu_queue_callback(sp, scp, head, func);
+	/* start state machine when this is the head */
+	if (scp->head == head)
+		schedule_delayed_work(&scp->work, 0);
+	spin_unlock_irqrestore(&sp->gp_lock, flags);
+	put_cpu();
+}
+EXPORT_SYMBOL_GPL(call_srcu);
+
+struct srcu_sync {
+	struct srcu_head head;
+	struct completion completion;
+};
+
+static void __synchronize_srcu_callback(struct srcu_head *head)
+{
+	struct srcu_sync *sync = container_of(head, struct srcu_sync, head);
+
+	complete(&sync->completion);
+}
+
 /*
  * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
  */
-static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
+static void __synchronize_srcu(struct srcu_struct *sp, int try_count)
 {
+	struct srcu_sync sync;
+	struct srcu_head *head = &sync.head;
+	struct srcu_head **orig_tail;
+	int cpu = raw_smp_processor_id();
+	struct srcu_cpu_struct *scp = per_cpu_ptr(sp->srcu_per_cpu, cpu);
+	bool started;
+
 	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
 			   !lock_is_held(&rcu_bh_lock_map) &&
 			   !lock_is_held(&rcu_lock_map) &&
 			   !lock_is_held(&rcu_sched_lock_map),
 			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
 
-	mutex_lock(&sp->mutex);
+	init_completion(&sync.completion);
+	if (!mutex_trylock(&sp->flip_check_mutex)) {
+		call_srcu(sp, &sync.head, __synchronize_srcu_callback);
+		goto wait;
+	}
+
+	spin_lock_irq(&sp->gp_lock);
+	started = !!scp->head;
+	orig_tail = scp->tail;
+	srcu_queue_callback(sp, scp, head, __synchronize_srcu_callback);
+
+	/* fast path */
+	if (complete_head_flip_check(sp, head, try_count)) {
+		/*
+		 * dequeue @head, we hold flip_check_mutex, the previous
+		 * node stays or all prevous node are all dequeued.
+		 */
+		if (scp->head == head)
+			orig_tail = &scp->head;
+		*orig_tail = head->next;
+		if (*orig_tail == NULL)
+			scp->tail = orig_tail;
+
+		/*
+		 * start state machine if this is the head and some callback(s)
+		 * are queued when we do check_zero(they have not started it).
+		 */
+		if (!started && scp->head)
+			schedule_delayed_work(&scp->work, 0);
+
+		/* we done! */
+		spin_unlock_irq(&sp->gp_lock);
+		mutex_unlock(&sp->flip_check_mutex);
+		return;
+	}
 
-	/*
-	 * Suppose that during the previous grace period, a reader
-	 * picked up the old value of the index, but did not increment
-	 * its counter until after the previous instance of
-	 * __synchronize_srcu() did the counter summation and recheck.
-	 * That previous grace period was OK because the reader did
-	 * not start until after the grace period started, so the grace
-	 * period was not obligated to wait for that reader.
-	 *
-	 * However, the current SRCU grace period does have to wait for
-	 * that reader.  This is handled by invoking wait_idx() on the
-	 * non-active set of counters (hence sp->completed - 1).  Once
-	 * wait_idx() returns, we know that all readers that picked up
-	 * the old value of ->completed and that already incremented their
-	 * counter will have completed.
-	 *
-	 * But what about readers that picked up the old value of
-	 * ->completed, but -still- have not managed to increment their
-	 * counter?  We do not need to wait for those readers, because
-	 * they will have started their SRCU read-side critical section
-	 * after the current grace period starts.
-	 *
-	 * Because it is unlikely that readers will be preempted between
-	 * fetching ->completed and incrementing their counter, wait_idx()
-	 * will normally not need to wait.
-	 */
-	wait_idx(sp, (sp->completed - 1) & 0x1, trycount);
+	/* slow path */
 
-	/*
-	 * Now that wait_idx() has waited for the really old readers,
-	 *
-	 * Flip the readers' index by incrementing ->completed, then wait
-	 * until there are no more readers using the counters referenced by
-	 * the old index value.  (Recall that the index is the bottom bit
-	 * of ->completed.)
-	 *
-	 * Of course, it is possible that a reader might be delayed for the
-	 * full duration of flip_idx_and_wait() between fetching the
-	 * index and incrementing its counter.  This possibility is handled
-	 * by the next __synchronize_srcu() invoking wait_idx() for such
-	 * readers before starting a new grace period.
-	 */
-	srcu_flip(sp);
-	wait_idx(sp, (sp->completed - 1) & 0x1, trycount);
+	/* start state machine when this is the head */
+	if (!started)
+		schedule_delayed_work(&scp->work, 0);
+	spin_unlock_irq(&sp->gp_lock);
+	mutex_unlock(&sp->flip_check_mutex);
 
-	mutex_unlock(&sp->mutex);
+wait:
+	wait_for_completion(&sync.completion);
 }
 
 /**
@@ -410,6 +639,44 @@ void synchronize_srcu_expedited(struct srcu_struct *sp)
 }
 EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);
 
+void srcu_barrier(struct srcu_struct *sp)
+{
+	struct srcu_sync sync;
+	struct srcu_head *head = &sync.head;
+	unsigned long chck_seq; /* snap */
+
+	int idle_loop = 0;
+	int cpu;
+	struct srcu_cpu_struct *scp;
+
+	spin_lock_irq(&sp->gp_lock);
+	chck_seq = sp->chck_seq;
+	for_each_possible_cpu(cpu) {
+		scp = per_cpu_ptr(sp->srcu_per_cpu, cpu);
+		if (scp->head && !safe_less_than(chck_seq, scp->head->chck_seq,
+				sp->chck_seq)) {
+			/* this path is likely enterred only once */
+			init_completion(&sync.completion);
+			srcu_queue_callback(sp, scp, head,
+					__synchronize_srcu_callback);
+			/* don't need to wakeup the woken state machine */
+			spin_unlock_irq(&sp->gp_lock);
+			wait_for_completion(&sync.completion);
+			spin_lock_irq(&sp->gp_lock);
+		} else {
+			if ((++idle_loop & 0xF) == 0) {
+				spin_unlock_irq(&sp->gp_lock);
+				udelay(1);
+				spin_lock_irq(&sp->gp_lock);
+			}
+		}
+	}
+	spin_unlock_irq(&sp->gp_lock);
+
+	flush_workqueue(srcu_callback_wq);
+}
+EXPORT_SYMBOL_GPL(srcu_barrier);
+
 /**
  * srcu_batches_completed - return batches completed.
  * @sp: srcu_struct on which to report batch completion.
@@ -423,3 +690,10 @@ long srcu_batches_completed(struct srcu_struct *sp)
 	return sp->completed;
 }
 EXPORT_SYMBOL_GPL(srcu_batches_completed);
+
+__init int srcu_init(void)
+{
+	srcu_callback_wq = alloc_workqueue("srcu", 0, 0);
+	return srcu_callback_wq ? 0 : -1;
+}
+subsys_initcall(srcu_init);
-- 
1.7.4.4
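A minimal usage sketch of the call_srcu()/srcu_head API added by the patch
above; struct foo, foo_srcu, foo_free_cb and foo_delete are hypothetical
names used only for illustration, not part of the patch:

	#include <linux/srcu.h>
	#include <linux/slab.h>

	struct foo {
		int data;
		struct srcu_head sh;	/* srcu_head as defined by this patch */
	};

	static struct srcu_struct foo_srcu;	/* init_srcu_struct(&foo_srcu) at init time */

	static void foo_free_cb(struct srcu_head *head)
	{
		struct foo *f = container_of(head, struct foo, sh);

		kfree(f);
	}

	static void foo_delete(struct foo *f)
	{
		/*
		 * Readers may still be inside srcu_read_lock(&foo_srcu)
		 * sections; defer the free until an SRCU grace period for
		 * foo_srcu has elapsed.
		 */
		call_srcu(&foo_srcu, &f->sh, foo_free_cb);
	}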


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 6/6] add srcu torture test
  2012-03-06  9:57               ` [PATCH 1/6] remove unused srcu_barrier() Lai Jiangshan
                                   ` (3 preceding siblings ...)
  2012-03-06  9:57                 ` [RFC PATCH 5/6] implement per-cpu&per-domain " Lai Jiangshan
@ 2012-03-06  9:57                 ` Lai Jiangshan
  2012-03-08 19:03                 ` [PATCH 1/6] remove unused srcu_barrier() Paul E. McKenney
  5 siblings, 0 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-06  9:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

Add srcu_torture_deferred_free() for srcu_ops; it uses the new call_srcu().
Rename the original srcu_ops to srcu_sync_ops.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 kernel/rcutorture.c |   68 +++++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index 54e5724..9197981 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -148,7 +148,10 @@ static struct task_struct *barrier_task;
 #define RCU_TORTURE_PIPE_LEN 10
 
 struct rcu_torture {
-	struct rcu_head rtort_rcu;
+	union {
+		struct rcu_head rtort_rcu;
+		struct srcu_head rtort_srcu;
+	};
 	int rtort_pipe_count;
 	struct list_head rtort_free;
 	int rtort_mbtest;
@@ -388,10 +391,9 @@ static int rcu_torture_completed(void)
 }
 
 static void
-rcu_torture_cb(struct rcu_head *p)
+__rcu_torture_cb(struct rcu_torture *rp)
 {
 	int i;
-	struct rcu_torture *rp = container_of(p, struct rcu_torture, rtort_rcu);
 
 	if (fullstop != FULLSTOP_DONTSTOP) {
 		/* Test is ending, just drop callbacks on the floor. */
@@ -409,6 +411,14 @@ rcu_torture_cb(struct rcu_head *p)
 		cur_ops->deferred_free(rp);
 }
 
+static void
+rcu_torture_cb(struct rcu_head *p)
+{
+	struct rcu_torture *rp = container_of(p, struct rcu_torture, rtort_rcu);
+
+	__rcu_torture_cb(rp);
+}
+
 static int rcu_no_completed(void)
 {
 	return 0;
@@ -623,6 +633,19 @@ static int srcu_torture_completed(void)
 	return srcu_batches_completed(&srcu_ctl);
 }
 
+static void
+srcu_torture_cb(struct srcu_head *p)
+{
+	struct rcu_torture *rp = container_of(p, struct rcu_torture, rtort_srcu);
+
+	__rcu_torture_cb(rp);
+}
+
+static void srcu_torture_deferred_free(struct rcu_torture *rp)
+{
+	call_srcu(&srcu_ctl, &rp->rtort_srcu, srcu_torture_cb);
+}
+
 static void srcu_torture_synchronize(void)
 {
 	synchronize_srcu(&srcu_ctl);
@@ -652,7 +675,7 @@ static struct rcu_torture_ops srcu_ops = {
 	.read_delay	= srcu_read_delay,
 	.readunlock	= srcu_torture_read_unlock,
 	.completed	= srcu_torture_completed,
-	.deferred_free	= rcu_sync_torture_deferred_free,
+	.deferred_free	= srcu_torture_deferred_free,
 	.sync		= srcu_torture_synchronize,
 	.call		= NULL,
 	.cb_barrier	= NULL,
@@ -660,6 +683,21 @@ static struct rcu_torture_ops srcu_ops = {
 	.name		= "srcu"
 };
 
+static struct rcu_torture_ops srcu_sync_ops = {
+	.init		= srcu_torture_init,
+	.cleanup	= srcu_torture_cleanup,
+	.readlock	= srcu_torture_read_lock,
+	.read_delay	= srcu_read_delay,
+	.readunlock	= srcu_torture_read_unlock,
+	.completed	= srcu_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= srcu_torture_synchronize,
+	.call		= NULL,
+	.cb_barrier	= NULL,
+	.stats		= srcu_torture_stats,
+	.name		= "srcu_sync"
+};
+
 static int srcu_torture_read_lock_raw(void) __acquires(&srcu_ctl)
 {
 	return srcu_read_lock_raw(&srcu_ctl);
@@ -677,7 +715,7 @@ static struct rcu_torture_ops srcu_raw_ops = {
 	.read_delay	= srcu_read_delay,
 	.readunlock	= srcu_torture_read_unlock_raw,
 	.completed	= srcu_torture_completed,
-	.deferred_free	= rcu_sync_torture_deferred_free,
+	.deferred_free	= srcu_torture_deferred_free,
 	.sync		= srcu_torture_synchronize,
 	.call		= NULL,
 	.cb_barrier	= NULL,
@@ -685,6 +723,21 @@ static struct rcu_torture_ops srcu_raw_ops = {
 	.name		= "srcu_raw"
 };
 
+static struct rcu_torture_ops srcu_raw_sync_ops = {
+	.init		= srcu_torture_init,
+	.cleanup	= srcu_torture_cleanup,
+	.readlock	= srcu_torture_read_lock_raw,
+	.read_delay	= srcu_read_delay,
+	.readunlock	= srcu_torture_read_unlock_raw,
+	.completed	= srcu_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= srcu_torture_synchronize,
+	.call		= NULL,
+	.cb_barrier	= NULL,
+	.stats		= srcu_torture_stats,
+	.name		= "srcu_raw_sync"
+};
+
 static void srcu_torture_synchronize_expedited(void)
 {
 	synchronize_srcu_expedited(&srcu_ctl);
@@ -1673,7 +1726,7 @@ static int rcu_torture_barrier_init(void)
 	for (i = 0; i < n_barrier_cbs; i++) {
 		init_waitqueue_head(&barrier_cbs_wq[i]);
 		barrier_cbs_tasks[i] = kthread_run(rcu_torture_barrier_cbs,
-						   (void *)i,
+						   (void *)(long)i,
 						   "rcu_torture_barrier_cbs");
 		if (IS_ERR(barrier_cbs_tasks[i])) {
 			ret = PTR_ERR(barrier_cbs_tasks[i]);
@@ -1857,7 +1910,8 @@ rcu_torture_init(void)
 	static struct rcu_torture_ops *torture_ops[] =
 		{ &rcu_ops, &rcu_sync_ops, &rcu_expedited_ops,
 		  &rcu_bh_ops, &rcu_bh_sync_ops, &rcu_bh_expedited_ops,
-		  &srcu_ops, &srcu_raw_ops, &srcu_expedited_ops,
+		  &srcu_ops, &srcu_sync_ops, &srcu_raw_ops,
+		  &srcu_raw_sync_ops, &srcu_expedited_ops,
 		  &sched_ops, &sched_sync_ops, &sched_expedited_ops, };
 
 	mutex_lock(&fullstop_mutex);
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH 4/6] remove flip_idx_and_wait()
  2012-03-06  9:57                 ` [PATCH 4/6] remove flip_idx_and_wait() Lai Jiangshan
@ 2012-03-06 10:41                   ` Peter Zijlstra
  2012-03-07  3:54                   ` [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu() Lai Jiangshan
  1 sibling, 0 replies; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 10:41 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> +        * full duration of flip_idx_and_wait() between fetching the

You just killed that function ;-)

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06  9:57                 ` [RFC PATCH 5/6] implement per-cpu&per-domain " Lai Jiangshan
@ 2012-03-06 10:47                   ` Peter Zijlstra
  2012-03-08 19:44                     ` Paul E. McKenney
  2012-03-06 10:58                   ` Peter Zijlstra
                                     ` (7 subsequent siblings)
  8 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 10:47 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> o       The srcu callback is new thing, I hope it is completely preemptible,
>         even sleepable. It does in this implemetation, I use work_struct
>         to stand for every srcu callback. 

I didn't need the callbacks to sleep too, I just needed the read-side
srcu bit.

There's an argument against making the callbacks able to sleep like that
in that you typically want to minimize the amount of work done in the
callbacks, allowing them to sleep invites to callbacks that do _way_ too
much work.

I haven't made my mind up if I care yet.. :-)

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06  9:57                 ` [RFC PATCH 5/6] implement per-cpu&per-domain " Lai Jiangshan
  2012-03-06 10:47                   ` Peter Zijlstra
@ 2012-03-06 10:58                   ` Peter Zijlstra
  2012-03-06 15:17                     ` Lai Jiangshan
  2012-03-06 11:16                   ` Peter Zijlstra
                                     ` (6 subsequent siblings)
  8 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 10:58 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
>  /*
> + * 'return left < right;' but handle the overflow issues.
> + * The same as 'return (long)(right - left) > 0;' but it cares more.

About what? And why? We do the (long)(a - b) thing all over the kernel,
why would you care more?

> + */
> +static inline
> +bool safe_less_than(unsigned long left, unsigned long right, unsigned long max)
> +{
> +       unsigned long a = right - left;
> +       unsigned long b = max - left;
> +
> +       return !!a && (a <= b);
> +} 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06  9:57                 ` [RFC PATCH 5/6] implement per-cpu&per-domain " Lai Jiangshan
  2012-03-06 10:47                   ` Peter Zijlstra
  2012-03-06 10:58                   ` Peter Zijlstra
@ 2012-03-06 11:16                   ` Peter Zijlstra
  2012-03-06 15:12                     ` Lai Jiangshan
  2012-03-06 15:26                     ` Lai Jiangshan
  2012-03-06 11:17                   ` Peter Zijlstra
                                     ` (5 subsequent siblings)
  8 siblings, 2 replies; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 11:16 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
>         srcu_head is bigger, it is worth, it provides more ability and simplify
>         the srcu code. 

Dubious claim.. memory footprint of various data structures is deemed
important. rcu_head is 16 bytes, srcu_head is 32 bytes. I think it would
be real nice not to have two different callback structures and not grow
them as large.



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06  9:57                 ` [RFC PATCH 5/6] implement per-cpu&per-domain " Lai Jiangshan
                                     ` (2 preceding siblings ...)
  2012-03-06 11:16                   ` Peter Zijlstra
@ 2012-03-06 11:17                   ` Peter Zijlstra
  2012-03-06 11:22                   ` Peter Zijlstra
                                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 11:17 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> +       /* the sequence of check zero */
> +       unsigned long chck_seq; 

So you're saving 1 character at the expense of readability, weird
trade-off.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06  9:57                 ` [RFC PATCH 5/6] implement per-cpu&per-domain " Lai Jiangshan
                                     ` (3 preceding siblings ...)
  2012-03-06 11:17                   ` Peter Zijlstra
@ 2012-03-06 11:22                   ` Peter Zijlstra
  2012-03-06 11:35                   ` Peter Zijlstra
                                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 11:22 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> +       int idxb = sp->completed & 0X1UL;

I didn't even know you could write a hex literal with an upper-case X.

How weird :-)

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06  9:57                 ` [RFC PATCH 5/6] implement per-cpu&per-domain " Lai Jiangshan
                                     ` (4 preceding siblings ...)
  2012-03-06 11:22                   ` Peter Zijlstra
@ 2012-03-06 11:35                   ` Peter Zijlstra
  2012-03-06 11:36                   ` Peter Zijlstra
                                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 11:35 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> +/* Must called with sp->flip_check_mutex and sp->gp_lock held; */
> +static bool check_zero_idx(struct srcu_struct *sp, int idx,
> +               struct srcu_head *head, int trycount)
> +{
> +       unsigned long chck_seq;
> +       bool checked_zero;
> + 

	lockdep_assert_held(&sp->flip_check_mutex);
	lockdep_assert_held(&sp->gp_lock);

and you can do away with your comment (also you forgot a verb).

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06  9:57                 ` [RFC PATCH 5/6] implement per-cpu&per-domain " Lai Jiangshan
                                     ` (5 preceding siblings ...)
  2012-03-06 11:35                   ` Peter Zijlstra
@ 2012-03-06 11:36                   ` Peter Zijlstra
  2012-03-06 11:39                   ` Peter Zijlstra
  2012-03-06 11:52                   ` Peter Zijlstra
  8 siblings, 0 replies; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 11:36 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> + * Must be called with sp->flip_check_mutex and sp->gp_lock held;
> + * Must be called from process contex, 

Holding a mutex pretty much guarantees you're in process context,
no? :-)

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06  9:57                 ` [RFC PATCH 5/6] implement per-cpu&per-domain " Lai Jiangshan
                                     ` (6 preceding siblings ...)
  2012-03-06 11:36                   ` Peter Zijlstra
@ 2012-03-06 11:39                   ` Peter Zijlstra
  2012-03-06 14:50                     ` Lai Jiangshan
  2012-03-06 11:52                   ` Peter Zijlstra
  8 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 11:39 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> +       int cpu = get_cpu();
> +       struct srcu_cpu_struct *scp = per_cpu_ptr(sp->srcu_per_cpu, cpu) 

They invented get_cpu_ptr() for that..

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06  9:57                 ` [RFC PATCH 5/6] implement per-cpu&per-domain " Lai Jiangshan
                                     ` (7 preceding siblings ...)
  2012-03-06 11:39                   ` Peter Zijlstra
@ 2012-03-06 11:52                   ` Peter Zijlstra
  2012-03-06 14:44                     ` Lai Jiangshan
  2012-03-06 14:47                     ` Lai Jiangshan
  8 siblings, 2 replies; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 11:52 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> +void srcu_barrier(struct srcu_struct *sp)
> +{
> +       struct srcu_sync sync;
> +       struct srcu_head *head = &sync.head;
> +       unsigned long chck_seq; /* snap */
> +
> +       int idle_loop = 0;
> +       int cpu;
> +       struct srcu_cpu_struct *scp;
> +
> +       spin_lock_irq(&sp->gp_lock);
> +       chck_seq = sp->chck_seq;
> +       for_each_possible_cpu(cpu) {

ARGH!! this is really not ok.. so we spend all this time killing
srcu_sync_expidited and co because they prod at all cpus for no good
reason, and what do you do?

Also, what happens if your cpu isn't actually online?


> +               scp = per_cpu_ptr(sp->srcu_per_cpu, cpu);
> +               if (scp->head && !safe_less_than(chck_seq, scp->head->chck_seq,
> +                               sp->chck_seq)) {
> +                       /* this path is likely enterred only once */
> +                       init_completion(&sync.completion);
> +                       srcu_queue_callback(sp, scp, head,
> +                                       __synchronize_srcu_callback);
> +                       /* don't need to wakeup the woken state machine */
> +                       spin_unlock_irq(&sp->gp_lock);
> +                       wait_for_completion(&sync.completion);
> +                       spin_lock_irq(&sp->gp_lock);
> +               } else {
> +                       if ((++idle_loop & 0xF) == 0) {
> +                               spin_unlock_irq(&sp->gp_lock);
> +                               udelay(1);
> +                               spin_lock_irq(&sp->gp_lock);
> +                       }

The purpose of this bit isn't quite clear to me, is this simply a lock
break?

> +               }
> +       }
> +       spin_unlock_irq(&sp->gp_lock);
> +
> +       flush_workqueue(srcu_callback_wq);

Since you already waited for the completions one by one, what's the
purpose of this?

> +}
> +EXPORT_SYMBOL_GPL(srcu_barrier); 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 11:52                   ` Peter Zijlstra
@ 2012-03-06 14:44                     ` Lai Jiangshan
  2012-03-06 15:31                       ` Peter Zijlstra
                                         ` (2 more replies)
  2012-03-06 14:47                     ` Lai Jiangshan
  1 sibling, 3 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-06 14:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Paul E. McKenney, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Tue, Mar 6, 2012 at 7:52 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
>> +void srcu_barrier(struct srcu_struct *sp)
>> +{
>> +       struct srcu_sync sync;
>> +       struct srcu_head *head = &sync.head;
>> +       unsigned long chck_seq; /* snap */
>> +
>> +       int idle_loop = 0;
>> +       int cpu;
>> +       struct srcu_cpu_struct *scp;
>> +
>> +       spin_lock_irq(&sp->gp_lock);
>> +       chck_seq = sp->chck_seq;
>> +       for_each_possible_cpu(cpu) {
>
> ARGH!! this is really not ok.. so we spend all this time killing
> srcu_sync_expidited and co because they prod at all cpus for no good
> reason, and what do you do?

It is srcu_barrier(); it has to wait for all callbacks to complete on all
CPUs, since this is a per-CPU implementation.

>
> Also, what happens if your cpu isn't actually online?

The workqueue handles it, not this code: if a CPU's state machine has
callbacks, the state machine has already been started; if it has no
callbacks, srcu_barrier() does nothing for that CPU.

>
>
>> +               scp = per_cpu_ptr(sp->srcu_per_cpu, cpu);
>> +               if (scp->head && !safe_less_than(chck_seq, scp->head->chck_seq,
>> +                               sp->chck_seq)) {
>> +                       /* this path is likely enterred only once */
>> +                       init_completion(&sync.completion);
>> +                       srcu_queue_callback(sp, scp, head,
>> +                                       __synchronize_srcu_callback);
>> +                       /* don't need to wakeup the woken state machine */
>> +                       spin_unlock_irq(&sp->gp_lock);
>> +                       wait_for_completion(&sync.completion);
>> +                       spin_lock_irq(&sp->gp_lock);
>> +               } else {
>> +                       if ((++idle_loop & 0xF) == 0) {
>> +                               spin_unlock_irq(&sp->gp_lock);
>> +                               udelay(1);
>> +                               spin_lock_irq(&sp->gp_lock);
>> +                       }
>
> The purpose of this bit isn't quite clear to me, is this simply a lock
> break?

Yes, the main purpose is to keep the sp->gp_lock hold time short and bounded.

>
>> +               }
>> +       }
>> +       spin_unlock_irq(&sp->gp_lock);
>> +
>> +       flush_workqueue(srcu_callback_wq);
>
> Since you already waited for the completions one by one, what's the
> purpose of this?
>
>> +}
>> +EXPORT_SYMBOL_GPL(srcu_barrier);

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 11:52                   ` Peter Zijlstra
  2012-03-06 14:44                     ` Lai Jiangshan
@ 2012-03-06 14:47                     ` Lai Jiangshan
  1 sibling, 0 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-06 14:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Paul E. McKenney, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

>
>> +               }
>> +       }
>> +       spin_unlock_irq(&sp->gp_lock);
>> +
>> +       flush_workqueue(srcu_callback_wq);
>
> Since you already waited for the completions one by one, what's the
> purpose of this?

SRCU callbacks are preemptible/sleepable here, so they may not complete in
order, and some may not have finished at this point; srcu_barrier() needs to
wait for them.


>
>> +}
>> +EXPORT_SYMBOL_GPL(srcu_barrier);

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 11:39                   ` Peter Zijlstra
@ 2012-03-06 14:50                     ` Lai Jiangshan
  0 siblings, 0 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-06 14:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Paul E. McKenney, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Tue, Mar 6, 2012 at 7:39 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
>> +       int cpu = get_cpu();
>> +       struct srcu_cpu_struct *scp = per_cpu_ptr(sp->srcu_per_cpu, cpu)
>
> They invented get_cpu_ptr() for that..

Thanks.
And I'm sorry, this code was merged from two functions; I forgot about that
when merging, and I need to check it more.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 11:16                   ` Peter Zijlstra
@ 2012-03-06 15:12                     ` Lai Jiangshan
  2012-03-06 15:34                       ` Peter Zijlstra
  2012-03-06 15:26                     ` Lai Jiangshan
  1 sibling, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-06 15:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Paul E. McKenney, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches, tj

On Tue, Mar 6, 2012 at 7:16 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
>>         srcu_head is bigger, it is worth, it provides more ability and simplify
>>         the srcu code.
>
> Dubious claim.. memory footprint of various data structures is deemed
> important. rcu_head is 16 bytes, srcu_head is 32 bytes. I think it would
> be real nice not to have two different callback structures and not grow
> them as large.

CC: tj@kernel.org
It could be better if workqueue also supports 2*sizeof(long) work callbacks.

I prefer ability/functionality a little more, it eases the caller's pain.
preemptible callbacks also eases the pressure of the whole system.
But I'm also ok if we limit the srcu-callbacks in softirq.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 10:58                   ` Peter Zijlstra
@ 2012-03-06 15:17                     ` Lai Jiangshan
  2012-03-06 15:38                       ` Peter Zijlstra
  0 siblings, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-06 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Paul E. McKenney, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Tue, Mar 6, 2012 at 6:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
>>  /*
>> + * 'return left < right;' but handle the overflow issues.
>> + * The same as 'return (long)(right - left) > 0;' but it cares more.
>
> About what? And why? We do the (long)(a - b) thing all over the kernel,
> why would you care more?

@left is a constant of the caller (the callback's snapshot), and @right
increases very slowly.
If (long)(right - left) were a big negative number, we would have to wait a
long time in that kind of overflow; that kind of overflow cannot happen with
safe_less_than().
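
As a concrete illustration (hypothetical values; this is a user-space copy
of the helper quoted below):

	#include <stdio.h>
	#include <limits.h>
	#include <stdbool.h>

	/* user-space copy of safe_less_than() for demonstration */
	static bool safe_less_than(unsigned long left, unsigned long right,
				   unsigned long max)
	{
		unsigned long a = right - left;	/* distance left -> right, mod 2^BITS */
		unsigned long b = max - left;	/* distance left -> max,   mod 2^BITS */

		return !!a && (a <= b);		/* true iff right is in (left, max] */
	}

	int main(void)
	{
		unsigned long left = ULONG_MAX - 1;	/* a callback's snapshot */
		unsigned long max  = 3;			/* current sequence, already wrapped */

		printf("%d\n", safe_less_than(left, 1, max));		/* 1: advanced past the snapshot */
		printf("%d\n", safe_less_than(left, left, max));	/* 0: not advanced yet */
		printf("%d\n", safe_less_than(left, ULONG_MAX / 2, max)); /* 0: far outside (left, max] */
		return 0;
	}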

>
>> + */
>> +static inline
>> +bool safe_less_than(unsigned long left, unsigned long right, unsigned long max)
>> +{
>> +       unsigned long a = right - left;
>> +       unsigned long b = max - left;
>> +
>> +       return !!a && (a <= b);
>> +}

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 11:16                   ` Peter Zijlstra
  2012-03-06 15:12                     ` Lai Jiangshan
@ 2012-03-06 15:26                     ` Lai Jiangshan
  2012-03-06 15:37                       ` Peter Zijlstra
  1 sibling, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-06 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Paul E. McKenney, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Tue, Mar 6, 2012 at 7:16 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
>>         srcu_head is bigger, it is worth, it provides more ability and simplify
>>         the srcu code.
>
> Dubious claim.. memory footprint of various data structures is deemed
> important. rcu_head is 16 bytes, srcu_head is 32 bytes. I think it would
> be real nice not to have two different callback structures and not grow
> them as large.

Also "Dubious" "simplify"?

The bigger struct means we can store ->check_seq in the callback, which makes
the processing simpler.

A bigger struct is not strictly required here; we could use a slightly more
complex way (batches) to do it.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 14:44                     ` Lai Jiangshan
@ 2012-03-06 15:31                       ` Peter Zijlstra
  2012-03-06 15:32                       ` Peter Zijlstra
  2012-03-07  8:10                       ` Gilad Ben-Yossef
  2 siblings, 0 replies; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 15:31 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, Paul E. McKenney, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Tue, 2012-03-06 at 22:44 +0800, Lai Jiangshan wrote:
> >> +                       if ((++idle_loop & 0xF) == 0) {
> >> +                               spin_unlock_irq(&sp->gp_lock);
> >> +                               udelay(1);
> >> +                               spin_lock_irq(&sp->gp_lock);
> >> +                       }
> >
> > The purpose of this bit isn't quite clear to me, is this simply a lock
> > break?
> 
> Yes, the main purpose is to keep the sp->gp_lock hold time short and bounded.

either introduce cond_resched_lock_irq() or write something like:

	if (need_resched() || spin_needbreak(&sp->gp_lock)) {
		spin_unlock_irq(&sp->gp_lock);
		cond_resched();
		spin_lock_irq(&sp->gp_lock);
	}

udelay(1) is complete nonsense.. 
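
For reference, a sketch of how such a helper might look (cond_resched_lock_irq()
is hypothetical; it just wraps the pattern above):

	static inline int cond_resched_lock_irq(spinlock_t *lock)
	{
		if (need_resched() || spin_needbreak(lock)) {
			spin_unlock_irq(lock);
			cond_resched();
			spin_lock_irq(lock);
			return 1;
		}
		return 0;
	}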

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 14:44                     ` Lai Jiangshan
  2012-03-06 15:31                       ` Peter Zijlstra
@ 2012-03-06 15:32                       ` Peter Zijlstra
  2012-03-07  6:44                         ` Lai Jiangshan
  2012-03-07  8:10                       ` Gilad Ben-Yossef
  2 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 15:32 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, Paul E. McKenney, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Tue, 2012-03-06 at 22:44 +0800, Lai Jiangshan wrote:
> It is srcu_barrier(); it has to wait for all callbacks to complete on all
> CPUs, since this is a per-CPU implementation.
> 
If you enqueue each callback as a work, flush_workqueue() alone should
do that. It'll wait for completion of all currently enqueued works but
ignores new works.



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 15:12                     ` Lai Jiangshan
@ 2012-03-06 15:34                       ` Peter Zijlstra
  2012-03-08 19:58                         ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 15:34 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, Paul E. McKenney, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches, tj

On Tue, 2012-03-06 at 23:12 +0800, Lai Jiangshan wrote:
> On Tue, Mar 6, 2012 at 7:16 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> >>         srcu_head is bigger, it is worth, it provides more ability and simplify
> >>         the srcu code.
> >
> > Dubious claim.. memory footprint of various data structures is deemed
> > important. rcu_head is 16 bytes, srcu_head is 32 bytes. I think it would
> > be real nice not to have two different callback structures and not grow
> > them as large.
> 
> CC: tj@kernel.org
> It could be better if workqueue also supports 2*sizeof(long) work callbacks.

That's going to be very painful if at all possible.

> I prefer ability/functionality a little more, it eases the caller's pain.
> preemptible callbacks also eases the pressure of the whole system.
> But I'm also ok if we limit the srcu-callbacks in softirq.

You don't have to use softirq, you could run a complete list from a
single worklet. Just keep the single linked rcu_head list and enqueue a
static (per-cpu) worker to process the entire list.
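
A rough sketch of that approach, keeping the 16-byte rcu_head and letting a
single work item walk the whole ready list (names and locking are illustrative
only; the single-thread version later in the thread does essentially this with
its rcu_batch lists):

	struct srcu_cb_list {
		spinlock_t lock;
		struct rcu_head *head, **tail;	/* singly linked callback list */
		struct work_struct work;
	};

	static void srcu_invoke_callbacks(struct work_struct *work)
	{
		struct srcu_cb_list *l = container_of(work, struct srcu_cb_list, work);
		struct rcu_head *cb, *next;

		/* detach the whole list, then run it outside the lock */
		spin_lock_irq(&l->lock);
		cb = l->head;
		l->head = NULL;
		l->tail = &l->head;
		spin_unlock_irq(&l->lock);

		while (cb) {
			next = cb->next;
			cb->func(cb);	/* invoked in process context, one worklet */
			cb = next;
		}
	}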

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 15:26                     ` Lai Jiangshan
@ 2012-03-06 15:37                       ` Peter Zijlstra
  0 siblings, 0 replies; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 15:37 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, Paul E. McKenney, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Tue, 2012-03-06 at 23:26 +0800, Lai Jiangshan wrote:
> On Tue, Mar 6, 2012 at 7:16 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> >>         srcu_head is bigger, it is worth, it provides more ability and simplify
> >>         the srcu code.
> >
> > Dubious claim.. memory footprint of various data structures is deemed
> > important. rcu_head is 16 bytes, srcu_head is 32 bytes. I think it would
> > be real nice not to have two different callback structures and not grow
> > them as large.
> 
> Also "Dubious" "simplify"?

Nah, I meant dubious. 

> The bigger struct means we can store ->check_seq in the callback, which makes
> the processing simpler.
> 
> A bigger struct is not strictly required here; we could use a slightly more
> complex way (batches) to do it.

Right, just keep a callback list for each state, and once you fall off
the end process it. No need to keep per callback state.
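
Sketched as data only (illustrative field names; this is roughly the shape the
single-thread patch below takes with its struct rcu_batch lists):

	/* One list per grace-period state; whole lists advance, callbacks carry no state. */
	struct srcu_cb_lists {
		struct rcu_head *queued, **queued_tail;	/* just arrived */
		struct rcu_head *check0, **check0_tail;	/* awaiting first check_zero */
		struct rcu_head *check1, **check1_tail;	/* awaiting second check_zero */
		struct rcu_head *done, **done_tail;	/* ready to be invoked */
	};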

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 15:17                     ` Lai Jiangshan
@ 2012-03-06 15:38                       ` Peter Zijlstra
  2012-03-08 19:49                         ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-06 15:38 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, Paul E. McKenney, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Tue, 2012-03-06 at 23:17 +0800, Lai Jiangshan wrote:
> On Tue, Mar 6, 2012 at 6:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> >>  /*
> >> + * 'return left < right;' but handle the overflow issues.
> >> + * The same as 'return (long)(right - left) > 0;' but it cares more.
> >
> > About what? And why? We do the (long)(a - b) thing all over the kernel,
> > why would you care more?
> 
> @left is a constant of the caller (the callback's snapshot), and @right
> increases very slowly.
> If (long)(right - left) were a big negative number, we would have to wait a
> long time in that kind of overflow; that kind of overflow cannot happen with
> safe_less_than().

I'm afraid I'm being particularly dense, but what?!

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu()
  2012-03-06  9:57                 ` [PATCH 4/6] remove flip_idx_and_wait() Lai Jiangshan
  2012-03-06 10:41                   ` Peter Zijlstra
@ 2012-03-07  3:54                   ` Lai Jiangshan
  2012-03-08 13:04                     ` Peter Zijlstra
                                       ` (2 more replies)
  1 sibling, 3 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-07  3:54 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

This patch is on top of the 4 previous patches (1/6, 2/6, 3/6, 4/6).

o	The state machine is lightweight and single-threaded; it is
	preemptible while checking.

o	The state machine is a work_struct, so no thread is occupied by
	SRCU while the srcu_struct is not active (has no callbacks).  It
	also never sleeps, so it does not tie up a thread while waiting.

o	The state machine is the only thread that may flip/check/write(*)
	the srcu_struct, so we don't need any mutex.
	(write(*): except ->per_cpu_ref, ->running, ->batch_queue)

o	synchronize_srcu() always calls call_srcu(), and so does
	synchronize_srcu_expedited().  This is OK because the mb()-based
	SRCU grace periods are extremely fast.

o	In the current kernel we can expect only one callback per grace
	period, so a callback is probably invoked on the same CPU on which
	it was queued.

The trip of a callback:
	1) ->batch_queue when call_srcu() is invoked

	2) ->batch_check0 when it starts its first check_zero

	3) ->batch_check1 after it finishes its first check_zero and the flip

	4) ->batch_done after it finishes its second check_zero

The current requirements on the callbacks:
	The callback will be called inside process context.
	The callback should be fast, without any sleeping path.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 include/linux/rcupdate.h |    2 +-
 include/linux/srcu.h     |   28 +++++-
 kernel/rcupdate.c        |   24 ++++-
 kernel/rcutorture.c      |   44 ++++++++-
 kernel/srcu.c            |  238 ++++++++++++++++++++++++++++++++-------------
 5 files changed, 259 insertions(+), 77 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 9372174..d98eab2 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -222,7 +222,7 @@ extern void rcu_irq_exit(void);
  * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
  */
 
-typedef void call_rcu_func_t(struct rcu_head *head,
+typedef void (*call_rcu_func_t)(struct rcu_head *head,
 			     void (*func)(struct rcu_head *head));
 void wait_rcu_gp(call_rcu_func_t crf);
 
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index df8f5f7..56cb774 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -29,6 +29,7 @@
 
 #include <linux/mutex.h>
 #include <linux/rcupdate.h>
+#include <linux/workqueue.h>
 
 struct srcu_struct_array {
 	unsigned long c[2];
@@ -39,10 +40,23 @@ struct srcu_struct_array {
 #define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
 #define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
 
+struct rcu_batch {
+	struct rcu_head *head, **tail;
+};
+
 struct srcu_struct {
 	unsigned completed;
 	struct srcu_struct_array __percpu *per_cpu_ref;
-	struct mutex mutex;
+	spinlock_t queue_lock; /* protect ->batch_queue, ->running */
+	bool running;
+	/* callbacks just queued */
+	struct rcu_batch batch_queue;
+	/* callbacks try to do the first check_zero */
+	struct rcu_batch batch_check0;
+	/* callbacks done with the first check_zero and the flip */
+	struct rcu_batch batch_check1;
+	struct rcu_batch batch_done;
+	struct delayed_work work;
 	unsigned long snap[NR_CPUS];
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map dep_map;
@@ -67,12 +81,24 @@ int init_srcu_struct(struct srcu_struct *sp);
 
 #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 
+/* draft
+ * queue callbacks which will be invoked after grace period.
+ * The callback will be called inside process context.
+ * The callback should be fast without any sleeping path.
+ */
+void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
+		void (*func)(struct rcu_head *head));
+
+typedef void (*call_srcu_func_t)(struct srcu_struct *sp, struct rcu_head *head,
+		void (*func)(struct rcu_head *head));
+void __wait_srcu_gp(struct srcu_struct *sp, call_srcu_func_t crf);
 void cleanup_srcu_struct(struct srcu_struct *sp);
 int __srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
 void __srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
 void synchronize_srcu(struct srcu_struct *sp);
 void synchronize_srcu_expedited(struct srcu_struct *sp);
 long srcu_batches_completed(struct srcu_struct *sp);
+void srcu_barrier(struct srcu_struct *sp);
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
index a86f174..f9b551f 100644
--- a/kernel/rcupdate.c
+++ b/kernel/rcupdate.c
@@ -45,6 +45,7 @@
 #include <linux/mutex.h>
 #include <linux/export.h>
 #include <linux/hardirq.h>
+#include <linux/srcu.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/rcu.h>
@@ -123,20 +124,39 @@ static void wakeme_after_rcu(struct rcu_head  *head)
 	complete(&rcu->completion);
 }
 
-void wait_rcu_gp(call_rcu_func_t crf)
+static void __wait_rcu_gp(void *domain, void *func)
 {
 	struct rcu_synchronize rcu;
 
 	init_rcu_head_on_stack(&rcu.head);
 	init_completion(&rcu.completion);
+
 	/* Will wake me after RCU finished. */
-	crf(&rcu.head, wakeme_after_rcu);
+	if (!domain) {
+		call_rcu_func_t crf = func;
+		crf(&rcu.head, wakeme_after_rcu);
+	} else {
+		call_srcu_func_t crf = func;
+		crf(domain, &rcu.head, wakeme_after_rcu);
+	}
+
 	/* Wait for it. */
 	wait_for_completion(&rcu.completion);
 	destroy_rcu_head_on_stack(&rcu.head);
 }
+
+void wait_rcu_gp(call_rcu_func_t crf)
+{
+	__wait_rcu_gp(NULL, crf);
+}
 EXPORT_SYMBOL_GPL(wait_rcu_gp);
 
+/* srcu.c internel */
+void __wait_srcu_gp(struct srcu_struct *sp, call_srcu_func_t crf)
+{
+	__wait_rcu_gp(sp, crf);
+}
+
 #ifdef CONFIG_PROVE_RCU
 /*
  * wrapper function to avoid #include problems.
diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index 54e5724..40d24d0 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -623,6 +623,11 @@ static int srcu_torture_completed(void)
 	return srcu_batches_completed(&srcu_ctl);
 }
 
+static void srcu_torture_deferred_free(struct rcu_torture *rp)
+{
+	call_srcu(&srcu_ctl, &rp->rtort_rcu, rcu_torture_cb);
+}
+
 static void srcu_torture_synchronize(void)
 {
 	synchronize_srcu(&srcu_ctl);
@@ -652,7 +657,7 @@ static struct rcu_torture_ops srcu_ops = {
 	.read_delay	= srcu_read_delay,
 	.readunlock	= srcu_torture_read_unlock,
 	.completed	= srcu_torture_completed,
-	.deferred_free	= rcu_sync_torture_deferred_free,
+	.deferred_free	= srcu_torture_deferred_free,
 	.sync		= srcu_torture_synchronize,
 	.call		= NULL,
 	.cb_barrier	= NULL,
@@ -660,6 +665,21 @@ static struct rcu_torture_ops srcu_ops = {
 	.name		= "srcu"
 };
 
+static struct rcu_torture_ops srcu_sync_ops = {
+	.init		= srcu_torture_init,
+	.cleanup	= srcu_torture_cleanup,
+	.readlock	= srcu_torture_read_lock,
+	.read_delay	= srcu_read_delay,
+	.readunlock	= srcu_torture_read_unlock,
+	.completed	= srcu_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= srcu_torture_synchronize,
+	.call		= NULL,
+	.cb_barrier	= NULL,
+	.stats		= srcu_torture_stats,
+	.name		= "srcu_sync"
+};
+
 static int srcu_torture_read_lock_raw(void) __acquires(&srcu_ctl)
 {
 	return srcu_read_lock_raw(&srcu_ctl);
@@ -677,7 +697,7 @@ static struct rcu_torture_ops srcu_raw_ops = {
 	.read_delay	= srcu_read_delay,
 	.readunlock	= srcu_torture_read_unlock_raw,
 	.completed	= srcu_torture_completed,
-	.deferred_free	= rcu_sync_torture_deferred_free,
+	.deferred_free	= srcu_torture_deferred_free,
 	.sync		= srcu_torture_synchronize,
 	.call		= NULL,
 	.cb_barrier	= NULL,
@@ -685,6 +705,21 @@ static struct rcu_torture_ops srcu_raw_ops = {
 	.name		= "srcu_raw"
 };
 
+static struct rcu_torture_ops srcu_raw_sync_ops = {
+	.init		= srcu_torture_init,
+	.cleanup	= srcu_torture_cleanup,
+	.readlock	= srcu_torture_read_lock_raw,
+	.read_delay	= srcu_read_delay,
+	.readunlock	= srcu_torture_read_unlock_raw,
+	.completed	= srcu_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= srcu_torture_synchronize,
+	.call		= NULL,
+	.cb_barrier	= NULL,
+	.stats		= srcu_torture_stats,
+	.name		= "srcu_raw_sync"
+};
+
 static void srcu_torture_synchronize_expedited(void)
 {
 	synchronize_srcu_expedited(&srcu_ctl);
@@ -1673,7 +1708,7 @@ static int rcu_torture_barrier_init(void)
 	for (i = 0; i < n_barrier_cbs; i++) {
 		init_waitqueue_head(&barrier_cbs_wq[i]);
 		barrier_cbs_tasks[i] = kthread_run(rcu_torture_barrier_cbs,
-						   (void *)i,
+						   (void *)(long)i,
 						   "rcu_torture_barrier_cbs");
 		if (IS_ERR(barrier_cbs_tasks[i])) {
 			ret = PTR_ERR(barrier_cbs_tasks[i]);
@@ -1857,7 +1892,8 @@ rcu_torture_init(void)
 	static struct rcu_torture_ops *torture_ops[] =
 		{ &rcu_ops, &rcu_sync_ops, &rcu_expedited_ops,
 		  &rcu_bh_ops, &rcu_bh_sync_ops, &rcu_bh_expedited_ops,
-		  &srcu_ops, &srcu_raw_ops, &srcu_expedited_ops,
+		  &srcu_ops, &srcu_sync_ops, &srcu_raw_ops,
+		  &srcu_raw_sync_ops, &srcu_expedited_ops,
 		  &sched_ops, &sched_sync_ops, &sched_expedited_ops, };
 
 	mutex_lock(&fullstop_mutex);
diff --git a/kernel/srcu.c b/kernel/srcu.c
index d101ed5..532f890 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -34,10 +34,60 @@
 #include <linux/delay.h>
 #include <linux/srcu.h>
 
+static inline void rcu_batch_init(struct rcu_batch *b)
+{
+	b->head = NULL;
+	b->tail = &b->head;
+}
+
+static inline void rcu_batch_queue(struct rcu_batch *b, struct rcu_head *head)
+{
+	*b->tail = head;
+	b->tail = &head->next;
+}
+
+static inline bool rcu_batch_empty(struct rcu_batch *b)
+{
+	return b->tail == &b->head;
+}
+
+static inline struct rcu_head *rcu_batch_dequeue(struct rcu_batch *b)
+{
+	struct rcu_head *head;
+
+	if (rcu_batch_empty(b))
+		return NULL;
+
+	head = b->head;
+	b->head = head->next;
+	if (b->tail == &head->next)
+		rcu_batch_init(b);
+
+	return head;
+}
+
+static inline void rcu_batch_move(struct rcu_batch *to, struct rcu_batch *from)
+{
+	if (!rcu_batch_empty(from)) {
+		*to->tail = from->head;
+		to->tail = from->tail;
+		rcu_batch_init(from);
+	}
+}
+
+/* single-thread state-machine */
+static void process_srcu(struct work_struct *work);
+
 static int init_srcu_struct_fields(struct srcu_struct *sp)
 {
 	sp->completed = 0;
-	mutex_init(&sp->mutex);
+	spin_lock_init(&sp->queue_lock);
+	sp->running = false;
+	rcu_batch_init(&sp->batch_queue);
+	rcu_batch_init(&sp->batch_check0);
+	rcu_batch_init(&sp->batch_check1);
+	rcu_batch_init(&sp->batch_done);
+	INIT_DELAYED_WORK(&sp->work, process_srcu);
 	sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
 	return sp->per_cpu_ref ? 0 : -ENOMEM;
 }
@@ -254,11 +304,9 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
  * we repeatedly block for 1-millisecond time periods.  This approach
  * has done well in testing, so there is no need for a config parameter.
  */
-#define SYNCHRONIZE_SRCU_READER_DELAY	5
-#define SYNCHRONIZE_SRCU_TRYCOUNT	2
-#define SYNCHRONIZE_SRCU_EXP_TRYCOUNT	12
+#define SRCU_RETRY_CHECK_DELAY	5
 
-static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
+static bool try_check_zero(struct srcu_struct *sp, int idx, int trycount)
 {
 	/*
 	 * If a reader fetches the index before the ->completed increment,
@@ -271,19 +319,12 @@ static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
 	 */
 	smp_mb(); /* D */
 
-	/*
-	 * SRCU read-side critical sections are normally short, so wait
-	 * a small amount of time before possibly blocking.
-	 */
-	if (!srcu_readers_active_idx_check(sp, idx)) {
-		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
-		while (!srcu_readers_active_idx_check(sp, idx)) {
-			if (trycount > 0) {
-				trycount--;
-				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
-			} else
-				schedule_timeout_interruptible(1);
-		}
+	for (;;) {
+		if (srcu_readers_active_idx_check(sp, idx))
+			break;
+		if (--trycount <= 0)
+			return false;
+		udelay(SRCU_RETRY_CHECK_DELAY);
 	}
 
 	/*
@@ -297,6 +338,8 @@ static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
 	 * the next flipping.
 	 */
 	smp_mb(); /* E */
+
+	return true;
 }
 
 /*
@@ -308,10 +351,27 @@ static void srcu_flip(struct srcu_struct *sp)
 	ACCESS_ONCE(sp->completed)++;
 }
 
+void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
+		void (*func)(struct rcu_head *head))
+{
+	unsigned long flags;
+
+	head->next = NULL;
+	head->func = func;
+	spin_lock_irqsave(&sp->queue_lock, flags);
+	rcu_batch_queue(&sp->batch_queue, head);
+	if (!sp->running) {
+		sp->running = true;
+		queue_delayed_work(system_nrt_wq, &sp->work, 0);
+	}
+	spin_unlock_irqrestore(&sp->queue_lock, flags);
+}
+EXPORT_SYMBOL_GPL(call_srcu);
+
 /*
  * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
  */
-static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
+static void __synchronize_srcu(struct srcu_struct *sp)
 {
 	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
 			   !lock_is_held(&rcu_bh_lock_map) &&
@@ -319,54 +379,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
 			   !lock_is_held(&rcu_sched_lock_map),
 			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
 
-	mutex_lock(&sp->mutex);
-
-	/*
-	 * Suppose that during the previous grace period, a reader
-	 * picked up the old value of the index, but did not increment
-	 * its counter until after the previous instance of
-	 * __synchronize_srcu() did the counter summation and recheck.
-	 * That previous grace period was OK because the reader did
-	 * not start until after the grace period started, so the grace
-	 * period was not obligated to wait for that reader.
-	 *
-	 * However, the current SRCU grace period does have to wait for
-	 * that reader.  This is handled by invoking wait_idx() on the
-	 * non-active set of counters (hence sp->completed - 1).  Once
-	 * wait_idx() returns, we know that all readers that picked up
-	 * the old value of ->completed and that already incremented their
-	 * counter will have completed.
-	 *
-	 * But what about readers that picked up the old value of
-	 * ->completed, but -still- have not managed to increment their
-	 * counter?  We do not need to wait for those readers, because
-	 * they will have started their SRCU read-side critical section
-	 * after the current grace period starts.
-	 *
-	 * Because it is unlikely that readers will be preempted between
-	 * fetching ->completed and incrementing their counter, wait_idx()
-	 * will normally not need to wait.
-	 */
-	wait_idx(sp, (sp->completed - 1) & 0x1, trycount);
-
-	/*
-	 * Now that wait_idx() has waited for the really old readers,
-	 *
-	 * Flip the readers' index by incrementing ->completed, then wait
-	 * until there are no more readers using the counters referenced by
-	 * the old index value.  (Recall that the index is the bottom bit
-	 * of ->completed.)
-	 *
-	 * Of course, it is possible that a reader might be delayed for the
-	 * full duration of flip_idx_and_wait() between fetching the
-	 * index and incrementing its counter.  This possibility is handled
-	 * by the next __synchronize_srcu() invoking wait_idx() for such
-	 * readers before starting a new grace period.
-	 */
-	srcu_flip(sp);
-	wait_idx(sp, (sp->completed - 1) & 0x1, trycount);
-
-	mutex_unlock(&sp->mutex);
+	__wait_srcu_gp(sp, call_srcu);
 }
 
 /**
@@ -385,7 +398,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
  */
 void synchronize_srcu(struct srcu_struct *sp)
 {
-	__synchronize_srcu(sp, SYNCHRONIZE_SRCU_TRYCOUNT);
+	__synchronize_srcu(sp);
 }
 EXPORT_SYMBOL_GPL(synchronize_srcu);
 
@@ -406,10 +419,16 @@ EXPORT_SYMBOL_GPL(synchronize_srcu);
  */
 void synchronize_srcu_expedited(struct srcu_struct *sp)
 {
-	__synchronize_srcu(sp, SYNCHRONIZE_SRCU_EXP_TRYCOUNT);
+	__synchronize_srcu(sp);
 }
 EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);
 
+void srcu_barrier(struct srcu_struct *sp)
+{
+	__synchronize_srcu(sp);
+}
+EXPORT_SYMBOL_GPL(srcu_barrier);
+
 /**
  * srcu_batches_completed - return batches completed.
  * @sp: srcu_struct on which to report batch completion.
@@ -423,3 +442,84 @@ long srcu_batches_completed(struct srcu_struct *sp)
 	return sp->completed;
 }
 EXPORT_SYMBOL_GPL(srcu_batches_completed);
+
+#define SRCU_CALLBACK_BATCH	10
+#define SRCU_INTERVAL		1
+
+static void srcu_collect_new(struct srcu_struct *sp)
+{
+	if (!rcu_batch_empty(&sp->batch_queue)) {
+		spin_lock_irq(&sp->queue_lock);
+		rcu_batch_move(&sp->batch_check0, &sp->batch_queue);
+		spin_unlock_irq(&sp->queue_lock);
+	}
+}
+
+static void srcu_advance_batches(struct srcu_struct *sp)
+{
+	int idx = 1 - (sp->completed & 0x1UL);
+
+	/*
+	 * SRCU read-side critical sections are normally short, so check
+	 * twice after a flip.
+	 */
+	if (!rcu_batch_empty(&sp->batch_check1) ||
+	    !rcu_batch_empty(&sp->batch_check0)) {
+		if (try_check_zero(sp, idx, 1)) {
+			rcu_batch_move(&sp->batch_done, &sp->batch_check1);
+			rcu_batch_move(&sp->batch_check1, &sp->batch_check0);
+			if (!rcu_batch_empty(&sp->batch_check1)) {
+				srcu_flip(sp);
+				if (try_check_zero(sp, 1 - idx, 2)) {
+					rcu_batch_move(&sp->batch_done,
+						&sp->batch_check1);
+				}
+			}
+		}
+	}
+}
+
+static void srcu_invoke_callbacks(struct srcu_struct *sp)
+{
+	int i;
+	struct rcu_head *head;
+
+	for (i = 0; i < SRCU_CALLBACK_BATCH; i++) {
+		head = rcu_batch_dequeue(&sp->batch_done);
+		if (!head)
+			break;
+		head->func(head);
+	}
+}
+
+static void srcu_reschedule(struct srcu_struct *sp)
+{
+	bool running = true;
+
+	if (rcu_batch_empty(&sp->batch_done) &&
+	    rcu_batch_empty(&sp->batch_check1) &&
+	    rcu_batch_empty(&sp->batch_check0) &&
+	    rcu_batch_empty(&sp->batch_queue)) {
+		spin_lock_irq(&sp->queue_lock);
+		if (rcu_batch_empty(&sp->batch_queue)) {
+			sp->running = false;
+			running = false;
+		}
+		spin_unlock_irq(&sp->queue_lock);
+	}
+
+	if (running)
+		queue_delayed_work(system_nrt_wq, &sp->work, SRCU_INTERVAL);
+}
+
+static void process_srcu(struct work_struct *work)
+{
+	struct srcu_struct *sp;
+
+	sp = container_of(work, struct srcu_struct, work.work);
+
+	srcu_collect_new(sp);
+	srcu_advance_batches(sp);
+	srcu_invoke_callbacks(sp);
+	srcu_reschedule(sp);
+}
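
For illustration, a minimal usage sketch of the call_srcu() interface added
above; this is not part of the patch, and "struct foo" and its helpers are
invented names for the example:

#include <linux/slab.h>
#include <linux/srcu.h>

struct foo {
	int data;
	struct rcu_head rcu;
};

static void foo_reclaim(struct rcu_head *head)
{
	/* Runs in process context once a full SRCU grace period has elapsed. */
	kfree(container_of(head, struct foo, rcu));
}

/* Retire @fp; readers inside srcu_read_lock(sp) sections may still use it. */
static void foo_retire(struct srcu_struct *sp, struct foo *fp)
{
	call_srcu(sp, &fp->rcu, foo_reclaim);
}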




^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 15:32                       ` Peter Zijlstra
@ 2012-03-07  6:44                         ` Lai Jiangshan
  0 siblings, 0 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-07  6:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Paul E. McKenney, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On 03/06/2012 11:32 PM, Peter Zijlstra wrote:
> On Tue, 2012-03-06 at 22:44 +0800, Lai Jiangshan wrote:
>> it is srcu_barrier(), it have to wait all callbacks complete for all
>> cpus since it is per-cpu
>> implementation.
>>
> If you enqueue each callback as a work, flush_workqueue() alone should
> do that. It'll wait for completion of all currently enqueued works but
> ignores new works.

Callbacks are not queued on the workqueue at call_srcu() time.
They are delivered (queued on the workqueue) only after a grace period,
so we need to wait until all of them have been delivered before calling
flush_workqueue().



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 14:44                     ` Lai Jiangshan
  2012-03-06 15:31                       ` Peter Zijlstra
  2012-03-06 15:32                       ` Peter Zijlstra
@ 2012-03-07  8:10                       ` Gilad Ben-Yossef
  2012-03-07  9:21                         ` Lai Jiangshan
  2 siblings, 1 reply; 100+ messages in thread
From: Gilad Ben-Yossef @ 2012-03-07  8:10 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Peter Zijlstra, Lai Jiangshan, Paul E. McKenney, linux-kernel,
	mingo, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	rostedt, Valdis.Kletnieks, dhowells, eric.dumazet, darren,
	fweisbec, patches

On Tue, Mar 6, 2012 at 4:44 PM, Lai Jiangshan <eag0628@gmail.com> wrote:
> On Tue, Mar 6, 2012 at 7:52 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
>>> +void srcu_barrier(struct srcu_struct *sp)
>>> +{
>>> +       struct srcu_sync sync;
>>> +       struct srcu_head *head = &sync.head;
>>> +       unsigned long chck_seq; /* snap */
>>> +
>>> +       int idle_loop = 0;
>>> +       int cpu;
>>> +       struct srcu_cpu_struct *scp;
>>> +
>>> +       spin_lock_irq(&sp->gp_lock);
>>> +       chck_seq = sp->chck_seq;
>>> +       for_each_possible_cpu(cpu) {
>>
>> ARGH!! this is really not ok.. so we spend all this time killing
>> srcu_sync_expidited and co because they prod at all cpus for no good
>> reason, and what do you do?
>
> it is srcu_barrier(), it have to wait all callbacks complete for all
> cpus since it is per-cpu
> implementation.

I would say it only needs to wait for callbacks to complete on the
CPUs that have a callback pending.

Unless I misunderstood something, that is what your code already does:
it does not wait for completion of, or schedule work on, a CPU that
does not have a callback pending, right?

>
>>
>> Also, what happens if your cpu isn't actually online?
>
> The workqueue handles it, not here, if a cpu state machine has callbacks, the
> state machine is started, if it has no callback,  srcu_barrier() does
> nothing for
> this cpu

I understand the point is that offline CPUs won't have callbacks, so
nothing would be done for them, but still, is that a reason to even
check?  Why not use for_each_online_cpu()?

I think that if a CPU that was offline went online after your check
and managed to get an SRCU callback pending, that is by definition not
a callback srcu_barrier() needs to wait for, since it went pending
later than srcu_barrier() was called.  Or have I missed something?

Thanks,
Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-07  8:10                       ` Gilad Ben-Yossef
@ 2012-03-07  9:21                         ` Lai Jiangshan
  0 siblings, 0 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-07  9:21 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Lai Jiangshan, Peter Zijlstra, Paul E. McKenney, linux-kernel,
	mingo, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	rostedt, Valdis.Kletnieks, dhowells, eric.dumazet, darren,
	fweisbec, patches

On 03/07/2012 04:10 PM, Gilad Ben-Yossef wrote:
> On Tue, Mar 6, 2012 at 4:44 PM, Lai Jiangshan <eag0628@gmail.com> wrote:
>> On Tue, Mar 6, 2012 at 7:52 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
>>>> +void srcu_barrier(struct srcu_struct *sp)
>>>> +{
>>>> +       struct srcu_sync sync;
>>>> +       struct srcu_head *head = &sync.head;
>>>> +       unsigned long chck_seq; /* snap */
>>>> +
>>>> +       int idle_loop = 0;
>>>> +       int cpu;
>>>> +       struct srcu_cpu_struct *scp;
>>>> +
>>>> +       spin_lock_irq(&sp->gp_lock);
>>>> +       chck_seq = sp->chck_seq;
>>>> +       for_each_possible_cpu(cpu) {
>>>
>>> ARGH!! this is really not ok.. so we spend all this time killing
>>> srcu_sync_expidited and co because they prod at all cpus for no good
>>> reason, and what do you do?
>>
>> it is srcu_barrier(), it have to wait all callbacks complete for all
>> cpus since it is per-cpu
>> implementation.
> 
> I would say it only needs to wait for callbacks to complete for all
> CPUs that has a callback pending.

Right.
The code above the flush_workqueue() call waits until all of them are
delivered; flush_workqueue() then waits until all of them have been
completely invoked.

> 
> Unless I misunderstood something, that is what your code does already
> - it does not wait for completion,
> or schedules a work on a CPU that does not has a callback pending, right?
> 
>>
>>>
>>> Also, what happens if your cpu isn't actually online?
>>
>> The workqueue handles it, not here, if a cpu state machine has callbacks, the
>> state machine is started, if it has no callback,  srcu_barrier() does
>> nothing for
>> this cpu
> 
> I understand the point is that offline cpus wont have callbacks, so
> nothing would be
> done for them, but still, is that a reason to even check? why not use
> for_each_online_cpu

It is possible for offline CPUs to have callbacks pending during hot-plugging.

> 
> I think that if a cpu that was offline went online after your check
> and managed to get an
> SRCU callback pending it is by definition not a callback srcu_barrier
> needs to wait for
> since it went pending at a later time then srcu_barrier was called. Or
> have I missed something?
> 
> Thanks,
> Gilad
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu()
  2012-03-07  3:54                   ` [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu() Lai Jiangshan
@ 2012-03-08 13:04                     ` Peter Zijlstra
  2012-03-08 14:17                       ` Lai Jiangshan
  2012-03-08 13:08                     ` Peter Zijlstra
  2012-03-08 20:35                     ` Paul E. McKenney
  2 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-08 13:04 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Wed, 2012-03-07 at 11:54 +0800, Lai Jiangshan wrote:
> +static void srcu_advance_batches(struct srcu_struct *sp)
> +{
> +       int idx = 1 - (sp->completed & 0x1UL);
> +
> +       /*
> +        * SRCU read-side critical sections are normally short, so check
> +        * twice after a flip.
> +        */
> +       if (!rcu_batch_empty(&sp->batch_check1) ||
> +           !rcu_batch_empty(&sp->batch_check0)) {
> +               if (try_check_zero(sp, idx, 1)) {
> +                       rcu_batch_move(&sp->batch_done, &sp->batch_check1);
> +                       rcu_batch_move(&sp->batch_check1, &sp->batch_check0);
> +                       if (!rcu_batch_empty(&sp->batch_check1)) {
> +                               srcu_flip(sp);
> +                               if (try_check_zero(sp, 1 - idx, 2)) {
> +                                       rcu_batch_move(&sp->batch_done,
> +                                               &sp->batch_check1);
> +                               }
> +                       }
> +               }
> +       }
> +} 

static void srcu_advance_batches(struct srcu_struct *sp)
{
	int idx = 1 - (sp->completed & 1);

	if (rcu_batch_empty(&sp->batch_check0) && 
	    rcu_batch_empty(&sp->batch_check1))
		return;

	if (!try_check_zero(sp, idx, 1))
		return;

	rcu_batch_move(&sp->batch_done,   &sp->batch_check1);
	rcu_batch_move(&sp->batch_check1, &sp->batch_check0);

	if (rcu_batch_empty(&sp->batch_check1))
		return;

	srcu_flip(sp);

	if (!try_check_zero(sp, idx^1, 2))
		return;

	rcu_batch_move(&sp->batch_done, &sp->batch_check1);
}

Seems like a more readable version... do check that I didn't mess up the
logic, though.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu()
  2012-03-07  3:54                   ` [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu() Lai Jiangshan
  2012-03-08 13:04                     ` Peter Zijlstra
@ 2012-03-08 13:08                     ` Peter Zijlstra
  2012-03-08 20:35                     ` Paul E. McKenney
  2 siblings, 0 replies; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-08 13:08 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paul E. McKenney, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Wed, 2012-03-07 at 11:54 +0800, Lai Jiangshan wrote:
> This patch is on the top of the 4 previous patches(1/6, 2/6, 3/6, 4/6).
> 
> o       state machine is light way and single-threaded, it is preemptible when checking.
> 
> o       state machine is a work_struct. So, there is no thread occupied
>         by SRCU when the srcu is not actived(no callback). And it does
>         not sleep(avoid to occupy a thread when sleep).
> 
> o       state machine is the only thread can flip/check/write(*) the srcu_struct,
>         so we don't need any mutex.
>         (write(*): except ->per_cpu_ref, ->running, ->batch_queue)
> 
> o       synchronize_srcu() is always call call_srcu().
>         synchronize_srcu_expedited() is also.
>         It is OK for mb()-based srcu are extremely fast.
> 
> o       In current kernel, we can expect that there are only 1 callback per gp.
>         so callback is probably called in the same CPU when it is queued.
> 
> The trip of a callback:
>         1) ->batch_queue when call_srcu()
> 
>         2) ->batch_check0 when try to do check_zero
> 
>         3) ->batch_check1 after finish its first check_zero and the flip
> 
>         4) ->batch_done after finish its second check_zero
> 
> The current requirement of the callbacks:
>         The callback will be called inside process context.
>         The callback should be fast without any sleeping path.
> 
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> 

Aside from the nit on srcu_advance_batches() this seems like a nice
implementation. Thanks!

I didn't fully verify the srcu state machine, but it looks about
right :-)

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu()
  2012-03-08 13:04                     ` Peter Zijlstra
@ 2012-03-08 14:17                       ` Lai Jiangshan
  0 siblings, 0 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-08 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Paul E. McKenney, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Thu, Mar 8, 2012 at 9:04 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2012-03-07 at 11:54 +0800, Lai Jiangshan wrote:
>> +static void srcu_advance_batches(struct srcu_struct *sp)
>> +{
>> +       int idx = 1 - (sp->completed & 0x1UL);
>> +
>> +       /*
>> +        * SRCU read-side critical sections are normally short, so check
>> +        * twice after a flip.
>> +        */
>> +       if (!rcu_batch_empty(&sp->batch_check1) ||
>> +           !rcu_batch_empty(&sp->batch_check0)) {
>> +               if (try_check_zero(sp, idx, 1)) {
>> +                       rcu_batch_move(&sp->batch_done, &sp->batch_check1);
>> +                       rcu_batch_move(&sp->batch_check1, &sp->batch_check0);
>> +                       if (!rcu_batch_empty(&sp->batch_check1)) {
>> +                               srcu_flip(sp);
>> +                               if (try_check_zero(sp, 1 - idx, 2)) {
>> +                                       rcu_batch_move(&sp->batch_done,
>> +                                               &sp->batch_check1);
>> +                               }
>> +                       }
>> +               }
>> +       }
>> +}
>

Good, thanks.

I'm thinking about how to bring back the comments (originally in
__synchronize_srcu()).  Your code will make that easier.  (The comments
still need to be rewritten before they are brought back; help welcome!)

I will use your code with a few small changes.

> static void srcu_advance_batches(struct srcu_struct *sp)
> {
>        int idx = 1 - (sp->completed & 1);

(sp->completed & 1) ^ 1;


>
>        if (rcu_batch_empty(&sp->batch_check0) &&
>            rcu_batch_empty(&sp->batch_check1))
>                return;
>
>        if (!try_check_zero(sp, idx, 1))
>                return;
>
>        rcu_batch_move(&sp->batch_done,   &sp->batch_check1);

....

>        rcu_batch_move(&sp->batch_check1, &sp->batch_check0);
>
>        if (rcu_batch_empty(&sp->batch_check1))
>                return;
>
>        srcu_flip(sp);

        if (rcu_batch_empty(&sp->batch_check0))
                return;

        srcu_flip(sp);
        rcu_batch_move(&sp->batch_check1, &sp->batch_check0);

make it match the changelog:
3) ->batch_check1 after finish its first check_zero and the flip

>
>        if (!try_check_zero(sp, idx^1, 2))
>                return;
>
>        rcu_batch_move(&sp->batch_done, &sp->batch_check1);
> }
>
> Seems like a more readable version.. do check I didn't mess up the logic
> though.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 1/6] remove unused srcu_barrier()
  2012-03-06  9:57               ` [PATCH 1/6] remove unused srcu_barrier() Lai Jiangshan
                                   ` (4 preceding siblings ...)
  2012-03-06  9:57                 ` [PATCH 6/6] add srcu torture test Lai Jiangshan
@ 2012-03-08 19:03                 ` Paul E. McKenney
  5 siblings, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-08 19:03 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Tue, Mar 06, 2012 at 05:57:33PM +0800, Lai Jiangshan wrote:
> srcu_barrier() is unused now.
> This identifier is need for later.
> 
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>

Queued, thank you!

							Thanx, Paul

> ---
>  include/linux/srcu.h |    6 ------
>  1 files changed, 0 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index 5b49d41..df8f5f7 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -49,12 +49,6 @@ struct srcu_struct {
>  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>  };
> 
> -#ifndef CONFIG_PREEMPT
> -#define srcu_barrier() barrier()
> -#else /* #ifndef CONFIG_PREEMPT */
> -#define srcu_barrier()
> -#endif /* #else #ifndef CONFIG_PREEMPT */
> -
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> 
>  int __init_srcu_struct(struct srcu_struct *sp, const char *name,
> -- 
> 1.7.4.4
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 2/6] Don't touch the snap in srcu_readers_active()
  2012-03-06  9:57                 ` [PATCH 2/6] Don't touch the snap in srcu_readers_active() Lai Jiangshan
@ 2012-03-08 19:14                   ` Paul E. McKenney
  0 siblings, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-08 19:14 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Tue, Mar 06, 2012 at 05:57:34PM +0800, Lai Jiangshan wrote:
> srcu_readers_active() is called without the mutex, but it touch the snap.
> also achieve better cache locality a little
> 
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>

Queued with updated commit-log message, thank you!

							Thanx, Paul

	Inline the calls to srcu_readers_active_idx() into
	srcu_readers_active().  This change improves cache locality
	by iterating over the CPUs once rather than twice.

> ---
>  kernel/srcu.c |    9 ++++++++-
>  1 files changed, 8 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/srcu.c b/kernel/srcu.c
> index b6b9ea2..fbe2d5f 100644
> --- a/kernel/srcu.c
> +++ b/kernel/srcu.c
> @@ -181,7 +181,14 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>   */
>  static int srcu_readers_active(struct srcu_struct *sp)
>  {
> -	return srcu_readers_active_idx(sp, 0) + srcu_readers_active_idx(sp, 1);
> +	int cpu;
> +	unsigned long sum = 0;
> +
> +	for_each_possible_cpu(cpu) {
> +		sum += ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[0]);
> +		sum += ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[1]);
> +	}
> +	return sum & SRCU_REF_MASK;
>  }
> 
>  /**
> -- 
> 1.7.4.4
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 3/6] use "int trycount" instead of "bool expedited"
  2012-03-06  9:57                 ` [PATCH 3/6] use "int trycount" instead of "bool expedited" Lai Jiangshan
@ 2012-03-08 19:25                   ` Paul E. McKenney
  0 siblings, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-08 19:25 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Tue, Mar 06, 2012 at 05:57:35PM +0800, Lai Jiangshan wrote:
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>

This looks good, but does not apply.  I will push to -rcu, and then you
can tell me what I am missing.

							Thanx, Paul

> ---
>  kernel/srcu.c |   27 ++++++++++++++-------------
>  1 files changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/srcu.c b/kernel/srcu.c
> index fbe2d5f..d5f3450 100644
> --- a/kernel/srcu.c
> +++ b/kernel/srcu.c
> @@ -254,12 +254,12 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>   * we repeatedly block for 1-millisecond time periods.  This approach
>   * has done well in testing, so there is no need for a config parameter.
>   */
> -#define SYNCHRONIZE_SRCU_READER_DELAY 5
> +#define SYNCHRONIZE_SRCU_READER_DELAY	5
> +#define SYNCHRONIZE_SRCU_TRYCOUNT	2
> +#define SYNCHRONIZE_SRCU_EXP_TRYCOUNT	12
> 
> -static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> +static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
>  {
> -	int trycount = 0;
> -
>  	/*
>  	 * If a reader fetches the index before the ->completed increment,
>  	 * but increments its counter after srcu_readers_active_idx_check()
> @@ -278,9 +278,10 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>  	if (!srcu_readers_active_idx_check(sp, idx)) {
>  		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
>  		while (!srcu_readers_active_idx_check(sp, idx)) {
> -			if (expedited && ++ trycount < 10)
> +			if (trycount > 0) {
> +				trycount--;
>  				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> -			else
> +			} else
>  				schedule_timeout_interruptible(1);
>  		}
>  	}
> @@ -310,18 +311,18 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>   * by the next __synchronize_srcu() invoking wait_idx() for such readers
>   * before starting a new grace period.
>   */
> -static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited)
> +static void flip_idx_and_wait(struct srcu_struct *sp, int trycount)
>  {
>  	int idx;
> 
>  	idx = sp->completed++ & 0x1;
> -	wait_idx(sp, idx, expedited);
> +	wait_idx(sp, idx, trycount);
>  }
> 
>  /*
>   * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
>   */
> -static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
> +static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
>  {
>  	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
>  			   !lock_is_held(&rcu_bh_lock_map) &&
> @@ -357,14 +358,14 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
>  	 * fetching ->completed and incrementing their counter, wait_idx()
>  	 * will normally not need to wait.
>  	 */
> -	wait_idx(sp, (sp->completed - 1) & 0x1, expedited);
> +	wait_idx(sp, (sp->completed - 1) & 0x1, trycount);
> 
>  	/*
>  	 * Now that wait_idx() has waited for the really old readers,
>  	 * invoke flip_idx_and_wait() to flip the counter and wait
>  	 * for current SRCU readers.
>  	 */
> -	flip_idx_and_wait(sp, expedited);
> +	flip_idx_and_wait(sp, trycount);
> 
>  	mutex_unlock(&sp->mutex);
>  }
> @@ -385,7 +386,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited)
>   */
>  void synchronize_srcu(struct srcu_struct *sp)
>  {
> -	__synchronize_srcu(sp, 0);
> +	__synchronize_srcu(sp, SYNCHRONIZE_SRCU_TRYCOUNT);
>  }
>  EXPORT_SYMBOL_GPL(synchronize_srcu);
> 
> @@ -406,7 +407,7 @@ EXPORT_SYMBOL_GPL(synchronize_srcu);
>   */
>  void synchronize_srcu_expedited(struct srcu_struct *sp)
>  {
> -	__synchronize_srcu(sp, 1);
> +	__synchronize_srcu(sp, SYNCHRONIZE_SRCU_EXP_TRYCOUNT);
>  }
>  EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);
> 
> -- 
> 1.7.4.4
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 10:47                   ` Peter Zijlstra
@ 2012-03-08 19:44                     ` Paul E. McKenney
  0 siblings, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-08 19:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, patches

On Tue, Mar 06, 2012 at 11:47:25AM +0100, Peter Zijlstra wrote:
> On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> > o       The srcu callback is new thing, I hope it is completely preemptible,
> >         even sleepable. It does in this implemetation, I use work_struct
> >         to stand for every srcu callback. 
> 
> I didn't need the callbacks to sleep too, I just needed the read-side
> srcu bit.
> 
> There's an argument against making the callbacks able to sleep like that
> in that you typically want to minimize the amount of work done in the
> callbacks, allowing them to sleep invites to callbacks that do _way_ too
> much work.
> 
> I haven't made my mind up if I care yet.. :-)

I prefer that they don't sleep.  Easy to push anything that needs to
sleep off to a work queue.  And allowing sleeping in an SRCU callback
function sounds like something that could cause serious problems down
the road.
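
As a hedged sketch of that "push it off to a work queue" pattern (the names
below are invented for illustration), an SRCU callback can simply hand the
sleepable part of the teardown to a work item:

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct deferred_release {
	struct rcu_head rcu;
	struct work_struct work;
	/* ... resources whose teardown may sleep ... */
};

static void deferred_release_workfn(struct work_struct *work)
{
	struct deferred_release *d =
		container_of(work, struct deferred_release, work);

	/* Process context: teardown that may sleep (mutexes, vfree(), ...) goes here. */
	kfree(d);
}

static void deferred_release_srcu_cb(struct rcu_head *rcu)
{
	struct deferred_release *d =
		container_of(rcu, struct deferred_release, rcu);

	/* The SRCU callback itself does no sleeping work; it only hands off. */
	INIT_WORK(&d->work, deferred_release_workfn);
	schedule_work(&d->work);
}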


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 15:38                       ` Peter Zijlstra
@ 2012-03-08 19:49                         ` Paul E. McKenney
  2012-03-10 10:12                           ` Peter Zijlstra
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-08 19:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Lai Jiangshan, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Tue, Mar 06, 2012 at 04:38:18PM +0100, Peter Zijlstra wrote:
> On Tue, 2012-03-06 at 23:17 +0800, Lai Jiangshan wrote:
> > On Tue, Mar 6, 2012 at 6:58 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> > >>  /*
> > >> + * 'return left < right;' but handle the overflow issues.
> > >> + * The same as 'return (long)(right - left) > 0;' but it cares more.
> > >
> > > About what? And why? We do the (long)(a - b) thing all over the kernel,
> > > why would you care more?
> > 
> > @left is constants of  the callers(callbacks's snapshot), @right
> > increases very slow.
> > if (long)(right - left) is a big negative, we have to wait for a long
> > time in this kinds of overflow.
> > this kinds of overflow can not happen in this safe_less_than()
> 
> I'm afraid I'm being particularly dense, but what?!

I have been converting the "(long)(a - b)" stuff in RCU to use unsigned
arithmetic.  The ULONG_CMP_GE() and friends in rcupdate.h are for this
purpose.

I too have used (long)(a - b) for a long time, but I saw with my own eyes
the glee in the compiler-writers' eyes when they discussed signed overflow
being undefined in the C standard.  I believe that the reasons for signed
overflow being undefined are long obsolete, but better safe than sorry.
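
For reference, a sketch of the unsigned-comparison idiom being described; the
macro bodies below are written from memory and should be checked against the
actual ULONG_CMP_*() definitions in rcupdate.h:

/*
 * Unsigned overflow is well defined in C, so (a) - (b) wraps modulo
 * ULONG_MAX + 1.  As long as the two counters are less than half the
 * counter space apart, the half-range tests below give the intended
 * "greater-or-equal" / "less-than" answers even across wraparound.
 */
#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))
#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))

/* e.g.: has @completed caught up with a previously taken @snap? */
static inline bool gp_done_sketch(unsigned long completed, unsigned long snap)
{
	return ULONG_CMP_GE(completed, snap);
}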

							Thanx, Paul


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-06 15:34                       ` Peter Zijlstra
@ 2012-03-08 19:58                         ` Paul E. McKenney
  2012-03-10  3:32                           ` Lai Jiangshan
  2012-03-10 10:09                           ` Peter Zijlstra
  0 siblings, 2 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-08 19:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Lai Jiangshan, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches, tj

On Tue, Mar 06, 2012 at 04:34:53PM +0100, Peter Zijlstra wrote:
> On Tue, 2012-03-06 at 23:12 +0800, Lai Jiangshan wrote:
> > On Tue, Mar 6, 2012 at 7:16 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
> > >>         srcu_head is bigger, it is worth, it provides more ability and simplify
> > >>         the srcu code.
> > >
> > > Dubious claim.. memory footprint of various data structures is deemed
> > > important. rcu_head is 16 bytes, srcu_head is 32 bytes. I think it would
> > > be real nice not to have two different callback structures and not grow
> > > them as large.
> > 
> > CC: tj@kernel.org
> > It could be better if workqueue also supports 2*sizeof(long) work callbacks.
> 
> That's going to be very painful if at all possible.
> 
> > I prefer ability/functionality a little more, it eases the caller's pain.
> > preemptible callbacks also eases the pressure of the whole system.
> > But I'm also ok if we limit the srcu-callbacks in softirq.
> 
> You don't have to use softirq, you could run a complete list from a
> single worklet. Just keep the single linked rcu_head list and enqueue a
> static (per-cpu) worker to process the entire list.

I like the idea of SRCU using rcu_head.  I am a little concerned about
what happens when there are lots of SRCU callbacks, but am willing to
wait to solve those problems until the situation arises.

But I guess I should ask...  Peter, what do you expect the maximum
call_srcu() rate to be in your use cases?  If tens of thousands are
possible, some adjustments will be needed.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu()
  2012-03-07  3:54                   ` [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu() Lai Jiangshan
  2012-03-08 13:04                     ` Peter Zijlstra
  2012-03-08 13:08                     ` Peter Zijlstra
@ 2012-03-08 20:35                     ` Paul E. McKenney
  2012-03-10  3:16                       ` Lai Jiangshan
  2 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-08 20:35 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, patches

On Wed, Mar 07, 2012 at 11:54:02AM +0800, Lai Jiangshan wrote:
> This patch is on the top of the 4 previous patches(1/6, 2/6, 3/6, 4/6).
> 
> o	state machine is light way and single-threaded, it is preemptible when checking.
> 
> o	state machine is a work_struct. So, there is no thread occupied
> 	by SRCU when the srcu is not actived(no callback). And it does
> 	not sleep(avoid to occupy a thread when sleep).
> 
> o	state machine is the only thread can flip/check/write(*) the srcu_struct,
> 	so we don't need any mutex.
> 	(write(*): except ->per_cpu_ref, ->running, ->batch_queue)
> 
> o	synchronize_srcu() is always call call_srcu().
> 	synchronize_srcu_expedited() is also.
> 	It is OK for mb()-based srcu are extremely fast.
> 
> o	In current kernel, we can expect that there are only 1 callback per gp.
> 	so callback is probably called in the same CPU when it is queued.
> 
> The trip of a callback:
> 	1) ->batch_queue when call_srcu()
> 
> 	2) ->batch_check0 when try to do check_zero
> 
> 	3) ->batch_check1 after finish its first check_zero and the flip
> 
> 	4) ->batch_done after finish its second check_zero
> 
> The current requirement of the callbacks:
> 	The callback will be called inside process context.
> 	The callback should be fast without any sleeping path.
> 
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> ---
>  include/linux/rcupdate.h |    2 +-
>  include/linux/srcu.h     |   28 +++++-
>  kernel/rcupdate.c        |   24 ++++-
>  kernel/rcutorture.c      |   44 ++++++++-
>  kernel/srcu.c            |  238 ++++++++++++++++++++++++++++++++-------------
>  5 files changed, 259 insertions(+), 77 deletions(-)
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 9372174..d98eab2 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -222,7 +222,7 @@ extern void rcu_irq_exit(void);
>   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
>   */
> 
> -typedef void call_rcu_func_t(struct rcu_head *head,
> +typedef void (*call_rcu_func_t)(struct rcu_head *head,

I don't see what this applies against.  The old patch 5/6 created
a "(*call_rcu_func_t)(struct rcu_head *head," and I don't see what
created the "call_rcu_func_t(struct rcu_head *head,".

>  			     void (*func)(struct rcu_head *head));
>  void wait_rcu_gp(call_rcu_func_t crf);
> 
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index df8f5f7..56cb774 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -29,6 +29,7 @@
> 
>  #include <linux/mutex.h>
>  #include <linux/rcupdate.h>
> +#include <linux/workqueue.h>
> 
>  struct srcu_struct_array {
>  	unsigned long c[2];
> @@ -39,10 +40,23 @@ struct srcu_struct_array {
>  #define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
>  #define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
> 
> +struct rcu_batch {
> +	struct rcu_head *head, **tail;
> +};
> +
>  struct srcu_struct {
>  	unsigned completed;
>  	struct srcu_struct_array __percpu *per_cpu_ref;
> -	struct mutex mutex;
> +	spinlock_t queue_lock; /* protect ->batch_queue, ->running */
> +	bool running;
> +	/* callbacks just queued */
> +	struct rcu_batch batch_queue;
> +	/* callbacks try to do the first check_zero */
> +	struct rcu_batch batch_check0;
> +	/* callbacks done with the first check_zero and the flip */
> +	struct rcu_batch batch_check1;
> +	struct rcu_batch batch_done;
> +	struct delayed_work work;

Why not use your multiple-tail-pointer trick here?  (The one that is
used in treercu.)
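
For readers unfamiliar with that trick, a rough sketch (names invented here,
dequeue-side tail fix-up omitted): one singly linked list segmented by an
array of tail pointers, so advancing a whole batch is a single pointer
assignment rather than a list splice:

enum { SEG_DONE, SEG_CHECK1, SEG_CHECK0, SEG_QUEUE, SEG_NR };

struct seg_cblist {
	struct rcu_head *head;
	/* tails[i] points at the ->next field of the last callback in segment i. */
	struct rcu_head **tails[SEG_NR];
};

static void seg_cblist_init(struct seg_cblist *l)
{
	int i;

	l->head = NULL;
	for (i = 0; i < SEG_NR; i++)
		l->tails[i] = &l->head;
}

static void seg_cblist_enqueue(struct seg_cblist *l, struct rcu_head *rhp)
{
	rhp->next = NULL;
	*l->tails[SEG_QUEUE] = rhp;
	l->tails[SEG_QUEUE] = &rhp->next;
}

/* Callbacks in segment @seg have passed their stage: absorb them into @seg - 1. */
static void seg_cblist_advance_one(struct seg_cblist *l, int seg)
{
	l->tails[seg - 1] = l->tails[seg];
}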

>  	unsigned long snap[NR_CPUS];
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>  	struct lockdep_map dep_map;
> @@ -67,12 +81,24 @@ int init_srcu_struct(struct srcu_struct *sp);
> 
>  #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> 
> +/* draft
> + * queue callbacks which will be invoked after grace period.
> + * The callback will be called inside process context.
> + * The callback should be fast without any sleeping path.
> + */
> +void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
> +		void (*func)(struct rcu_head *head));
> +
> +typedef void (*call_srcu_func_t)(struct srcu_struct *sp, struct rcu_head *head,
> +		void (*func)(struct rcu_head *head));
> +void __wait_srcu_gp(struct srcu_struct *sp, call_srcu_func_t crf);
>  void cleanup_srcu_struct(struct srcu_struct *sp);
>  int __srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
>  void __srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
>  void synchronize_srcu(struct srcu_struct *sp);
>  void synchronize_srcu_expedited(struct srcu_struct *sp);
>  long srcu_batches_completed(struct srcu_struct *sp);
> +void srcu_barrier(struct srcu_struct *sp);
> 
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> 
> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> index a86f174..f9b551f 100644
> --- a/kernel/rcupdate.c
> +++ b/kernel/rcupdate.c
> @@ -45,6 +45,7 @@
>  #include <linux/mutex.h>
>  #include <linux/export.h>
>  #include <linux/hardirq.h>
> +#include <linux/srcu.h>
> 
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/rcu.h>
> @@ -123,20 +124,39 @@ static void wakeme_after_rcu(struct rcu_head  *head)
>  	complete(&rcu->completion);
>  }
> 
> -void wait_rcu_gp(call_rcu_func_t crf)
> +static void __wait_rcu_gp(void *domain, void *func)
>  {
>  	struct rcu_synchronize rcu;
> 
>  	init_rcu_head_on_stack(&rcu.head);
>  	init_completion(&rcu.completion);
> +
>  	/* Will wake me after RCU finished. */
> -	crf(&rcu.head, wakeme_after_rcu);
> +	if (!domain) {
> +		call_rcu_func_t crf = func;
> +		crf(&rcu.head, wakeme_after_rcu);
> +	} else {
> +		call_srcu_func_t crf = func;
> +		crf(domain, &rcu.head, wakeme_after_rcu);
> +	}
> +
>  	/* Wait for it. */
>  	wait_for_completion(&rcu.completion);
>  	destroy_rcu_head_on_stack(&rcu.head);
>  }

Mightn't it be simpler and faster to just have a separate wait_srcu_gp()
that doesn't share code with wait_rcu_gp()?  I am all for sharing code,
but this might be hurting more than helping.
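
A minimal standalone version along those lines might look like the sketch
below (illustrative names; it leaves out the init_rcu_head_on_stack() /
destroy_rcu_head_on_stack() debug-objects hooks that wait_rcu_gp() uses):

struct srcu_synchronize {
	struct rcu_head head;
	struct completion completion;
};

static void wakeme_after_srcu(struct rcu_head *head)
{
	struct srcu_synchronize *s =
		container_of(head, struct srcu_synchronize, head);

	complete(&s->completion);
}

/* Wait for an SRCU grace period on @sp without sharing wait_rcu_gp(). */
static void wait_srcu_gp(struct srcu_struct *sp)
{
	struct srcu_synchronize s;

	init_completion(&s.completion);
	call_srcu(sp, &s.head, wakeme_after_srcu);
	wait_for_completion(&s.completion);
}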

> +
> +void wait_rcu_gp(call_rcu_func_t crf)
> +{
> +	__wait_rcu_gp(NULL, crf);
> +}
>  EXPORT_SYMBOL_GPL(wait_rcu_gp);
> 
> +/* srcu.c internel */
> +void __wait_srcu_gp(struct srcu_struct *sp, call_srcu_func_t crf)
> +{
> +	__wait_rcu_gp(sp, crf);
> +}
> +
>  #ifdef CONFIG_PROVE_RCU
>  /*
>   * wrapper function to avoid #include problems.
> diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
> index 54e5724..40d24d0 100644

OK, so your original patch #6 is folded into this?  I don't have a strong
view either way, just need to know.

> --- a/kernel/rcutorture.c
> +++ b/kernel/rcutorture.c
> @@ -623,6 +623,11 @@ static int srcu_torture_completed(void)
>  	return srcu_batches_completed(&srcu_ctl);
>  }
> 
> +static void srcu_torture_deferred_free(struct rcu_torture *rp)
> +{
> +	call_srcu(&srcu_ctl, &rp->rtort_rcu, rcu_torture_cb);
> +}
> +
>  static void srcu_torture_synchronize(void)
>  {
>  	synchronize_srcu(&srcu_ctl);
> @@ -652,7 +657,7 @@ static struct rcu_torture_ops srcu_ops = {
>  	.read_delay	= srcu_read_delay,
>  	.readunlock	= srcu_torture_read_unlock,
>  	.completed	= srcu_torture_completed,
> -	.deferred_free	= rcu_sync_torture_deferred_free,
> +	.deferred_free	= srcu_torture_deferred_free,
>  	.sync		= srcu_torture_synchronize,
>  	.call		= NULL,
>  	.cb_barrier	= NULL,
> @@ -660,6 +665,21 @@ static struct rcu_torture_ops srcu_ops = {
>  	.name		= "srcu"
>  };
> 
> +static struct rcu_torture_ops srcu_sync_ops = {
> +	.init		= srcu_torture_init,
> +	.cleanup	= srcu_torture_cleanup,
> +	.readlock	= srcu_torture_read_lock,
> +	.read_delay	= srcu_read_delay,
> +	.readunlock	= srcu_torture_read_unlock,
> +	.completed	= srcu_torture_completed,
> +	.deferred_free	= rcu_sync_torture_deferred_free,
> +	.sync		= srcu_torture_synchronize,
> +	.call		= NULL,
> +	.cb_barrier	= NULL,
> +	.stats		= srcu_torture_stats,
> +	.name		= "srcu_sync"
> +};
> +
>  static int srcu_torture_read_lock_raw(void) __acquires(&srcu_ctl)
>  {
>  	return srcu_read_lock_raw(&srcu_ctl);
> @@ -677,7 +697,7 @@ static struct rcu_torture_ops srcu_raw_ops = {
>  	.read_delay	= srcu_read_delay,
>  	.readunlock	= srcu_torture_read_unlock_raw,
>  	.completed	= srcu_torture_completed,
> -	.deferred_free	= rcu_sync_torture_deferred_free,
> +	.deferred_free	= srcu_torture_deferred_free,
>  	.sync		= srcu_torture_synchronize,
>  	.call		= NULL,
>  	.cb_barrier	= NULL,
> @@ -685,6 +705,21 @@ static struct rcu_torture_ops srcu_raw_ops = {
>  	.name		= "srcu_raw"
>  };
> 
> +static struct rcu_torture_ops srcu_raw_sync_ops = {
> +	.init		= srcu_torture_init,
> +	.cleanup	= srcu_torture_cleanup,
> +	.readlock	= srcu_torture_read_lock_raw,
> +	.read_delay	= srcu_read_delay,
> +	.readunlock	= srcu_torture_read_unlock_raw,
> +	.completed	= srcu_torture_completed,
> +	.deferred_free	= rcu_sync_torture_deferred_free,
> +	.sync		= srcu_torture_synchronize,
> +	.call		= NULL,
> +	.cb_barrier	= NULL,
> +	.stats		= srcu_torture_stats,
> +	.name		= "srcu_raw_sync"
> +};
> +
>  static void srcu_torture_synchronize_expedited(void)
>  {
>  	synchronize_srcu_expedited(&srcu_ctl);
> @@ -1673,7 +1708,7 @@ static int rcu_torture_barrier_init(void)
>  	for (i = 0; i < n_barrier_cbs; i++) {
>  		init_waitqueue_head(&barrier_cbs_wq[i]);
>  		barrier_cbs_tasks[i] = kthread_run(rcu_torture_barrier_cbs,
> -						   (void *)i,
> +						   (void *)(long)i,
>  						   "rcu_torture_barrier_cbs");
>  		if (IS_ERR(barrier_cbs_tasks[i])) {
>  			ret = PTR_ERR(barrier_cbs_tasks[i]);
> @@ -1857,7 +1892,8 @@ rcu_torture_init(void)
>  	static struct rcu_torture_ops *torture_ops[] =
>  		{ &rcu_ops, &rcu_sync_ops, &rcu_expedited_ops,
>  		  &rcu_bh_ops, &rcu_bh_sync_ops, &rcu_bh_expedited_ops,
> -		  &srcu_ops, &srcu_raw_ops, &srcu_expedited_ops,
> +		  &srcu_ops, &srcu_sync_ops, &srcu_raw_ops,
> +		  &srcu_raw_sync_ops, &srcu_expedited_ops,
>  		  &sched_ops, &sched_sync_ops, &sched_expedited_ops, };
> 
>  	mutex_lock(&fullstop_mutex);
> diff --git a/kernel/srcu.c b/kernel/srcu.c
> index d101ed5..532f890 100644
> --- a/kernel/srcu.c
> +++ b/kernel/srcu.c
> @@ -34,10 +34,60 @@
>  #include <linux/delay.h>
>  #include <linux/srcu.h>
> 
> +static inline void rcu_batch_init(struct rcu_batch *b)
> +{
> +	b->head = NULL;
> +	b->tail = &b->head;
> +}
> +
> +static inline void rcu_batch_queue(struct rcu_batch *b, struct rcu_head *head)
> +{
> +	*b->tail = head;
> +	b->tail = &head->next;
> +}
> +
> +static inline bool rcu_batch_empty(struct rcu_batch *b)
> +{
> +	return b->tail == &b->head;
> +}
> +
> +static inline struct rcu_head *rcu_batch_dequeue(struct rcu_batch *b)
> +{
> +	struct rcu_head *head;
> +
> +	if (rcu_batch_empty(b))
> +		return NULL;
> +
> +	head = b->head;
> +	b->head = head->next;
> +	if (b->tail == &head->next)
> +		rcu_batch_init(b);
> +
> +	return head;
> +}
> +
> +static inline void rcu_batch_move(struct rcu_batch *to, struct rcu_batch *from)
> +{
> +	if (!rcu_batch_empty(from)) {
> +		*to->tail = from->head;
> +		to->tail = from->tail;
> +		rcu_batch_init(from);
> +	}
> +}

And perhaps this is why you don't want the multi-tailed queue?

> +
> +/* single-thread state-machine */
> +static void process_srcu(struct work_struct *work);
> +
>  static int init_srcu_struct_fields(struct srcu_struct *sp)
>  {
>  	sp->completed = 0;
> -	mutex_init(&sp->mutex);
> +	spin_lock_init(&sp->queue_lock);
> +	sp->running = false;
> +	rcu_batch_init(&sp->batch_queue);
> +	rcu_batch_init(&sp->batch_check0);
> +	rcu_batch_init(&sp->batch_check1);
> +	rcu_batch_init(&sp->batch_done);
> +	INIT_DELAYED_WORK(&sp->work, process_srcu);
>  	sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
>  	return sp->per_cpu_ref ? 0 : -ENOMEM;
>  }
> @@ -254,11 +304,9 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>   * we repeatedly block for 1-millisecond time periods.  This approach
>   * has done well in testing, so there is no need for a config parameter.
>   */
> -#define SYNCHRONIZE_SRCU_READER_DELAY	5
> -#define SYNCHRONIZE_SRCU_TRYCOUNT	2
> -#define SYNCHRONIZE_SRCU_EXP_TRYCOUNT	12
> +#define SRCU_RETRY_CHECK_DELAY	5
> 
> -static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
> +static bool try_check_zero(struct srcu_struct *sp, int idx, int trycount)
>  {
>  	/*
>  	 * If a reader fetches the index before the ->completed increment,
> @@ -271,19 +319,12 @@ static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
>  	 */
>  	smp_mb(); /* D */
> 
> -	/*
> -	 * SRCU read-side critical sections are normally short, so wait
> -	 * a small amount of time before possibly blocking.
> -	 */
> -	if (!srcu_readers_active_idx_check(sp, idx)) {
> -		udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> -		while (!srcu_readers_active_idx_check(sp, idx)) {
> -			if (trycount > 0) {
> -				trycount--;
> -				udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> -			} else
> -				schedule_timeout_interruptible(1);
> -		}
> +	for (;;) {
> +		if (srcu_readers_active_idx_check(sp, idx))
> +			break;
> +		if (--trycount <= 0)
> +			return false;
> +		udelay(SRCU_RETRY_CHECK_DELAY);
>  	}
> 
>  	/*
> @@ -297,6 +338,8 @@ static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
>  	 * the next flipping.
>  	 */
>  	smp_mb(); /* E */
> +
> +	return true;
>  }
> 
>  /*
> @@ -308,10 +351,27 @@ static void srcu_flip(struct srcu_struct *sp)
>  	ACCESS_ONCE(sp->completed)++;
>  }
> 
> +void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
> +		void (*func)(struct rcu_head *head))
> +{
> +	unsigned long flags;
> +
> +	head->next = NULL;
> +	head->func = func;
> +	spin_lock_irqsave(&sp->queue_lock, flags);
> +	rcu_batch_queue(&sp->batch_queue, head);
> +	if (!sp->running) {
> +		sp->running = true;
> +		queue_delayed_work(system_nrt_wq, &sp->work, 0);
> +	}
> +	spin_unlock_irqrestore(&sp->queue_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(call_srcu);
> +
>  /*
>   * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
>   */
> -static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
> +static void __synchronize_srcu(struct srcu_struct *sp)
>  {
>  	rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
>  			   !lock_is_held(&rcu_bh_lock_map) &&
> @@ -319,54 +379,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
>  			   !lock_is_held(&rcu_sched_lock_map),
>  			   "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
> 
> -	mutex_lock(&sp->mutex);
> -
> -	/*
> -	 * Suppose that during the previous grace period, a reader
> -	 * picked up the old value of the index, but did not increment
> -	 * its counter until after the previous instance of
> -	 * __synchronize_srcu() did the counter summation and recheck.
> -	 * That previous grace period was OK because the reader did
> -	 * not start until after the grace period started, so the grace
> -	 * period was not obligated to wait for that reader.
> -	 *
> -	 * However, the current SRCU grace period does have to wait for
> -	 * that reader.  This is handled by invoking wait_idx() on the
> -	 * non-active set of counters (hence sp->completed - 1).  Once
> -	 * wait_idx() returns, we know that all readers that picked up
> -	 * the old value of ->completed and that already incremented their
> -	 * counter will have completed.
> -	 *
> -	 * But what about readers that picked up the old value of
> -	 * ->completed, but -still- have not managed to increment their
> -	 * counter?  We do not need to wait for those readers, because
> -	 * they will have started their SRCU read-side critical section
> -	 * after the current grace period starts.
> -	 *
> -	 * Because it is unlikely that readers will be preempted between
> -	 * fetching ->completed and incrementing their counter, wait_idx()
> -	 * will normally not need to wait.
> -	 */
> -	wait_idx(sp, (sp->completed - 1) & 0x1, trycount);
> -
> -	/*
> -	 * Now that wait_idx() has waited for the really old readers,
> -	 *
> -	 * Flip the readers' index by incrementing ->completed, then wait
> -	 * until there are no more readers using the counters referenced by
> -	 * the old index value.  (Recall that the index is the bottom bit
> -	 * of ->completed.)
> -	 *
> -	 * Of course, it is possible that a reader might be delayed for the
> -	 * full duration of flip_idx_and_wait() between fetching the
> -	 * index and incrementing its counter.  This possibility is handled
> -	 * by the next __synchronize_srcu() invoking wait_idx() for such
> -	 * readers before starting a new grace period.
> -	 */
> -	srcu_flip(sp);
> -	wait_idx(sp, (sp->completed - 1) & 0x1, trycount);
> -
> -	mutex_unlock(&sp->mutex);
> +	__wait_srcu_gp(sp, call_srcu);
>  }
> 
>  /**
> @@ -385,7 +398,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
>   */
>  void synchronize_srcu(struct srcu_struct *sp)
>  {
> -	__synchronize_srcu(sp, SYNCHRONIZE_SRCU_TRYCOUNT);
> +	__synchronize_srcu(sp);
>  }
>  EXPORT_SYMBOL_GPL(synchronize_srcu);
> 
> @@ -406,10 +419,16 @@ EXPORT_SYMBOL_GPL(synchronize_srcu);
>   */
>  void synchronize_srcu_expedited(struct srcu_struct *sp)
>  {
> -	__synchronize_srcu(sp, SYNCHRONIZE_SRCU_EXP_TRYCOUNT);
> +	__synchronize_srcu(sp);
>  }

OK, I'll bite...  Why aren't synchronize_srcu_expedited() and
synchronize_srcu() different?

							Thanx, Paul

>  EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);
> 
> +void srcu_barrier(struct srcu_struct *sp)
> +{
> +	__synchronize_srcu(sp);
> +}
> +EXPORT_SYMBOL_GPL(srcu_barrier);
> +
>  /**
>   * srcu_batches_completed - return batches completed.
>   * @sp: srcu_struct on which to report batch completion.
> @@ -423,3 +442,84 @@ long srcu_batches_completed(struct srcu_struct *sp)
>  	return sp->completed;
>  }
>  EXPORT_SYMBOL_GPL(srcu_batches_completed);
> +
> +#define SRCU_CALLBACK_BATCH	10
> +#define SRCU_INTERVAL		1
> +
> +static void srcu_collect_new(struct srcu_struct *sp)
> +{
> +	if (!rcu_batch_empty(&sp->batch_queue)) {
> +		spin_lock_irq(&sp->queue_lock);
> +		rcu_batch_move(&sp->batch_check0, &sp->batch_queue);
> +		spin_unlock_irq(&sp->queue_lock);
> +	}
> +}
> +
> +static void srcu_advance_batches(struct srcu_struct *sp)
> +{
> +	int idx = 1 - (sp->completed & 0x1UL);
> +
> +	/*
> +	 * SRCU read-side critical sections are normally short, so check
> +	 * twice after a flip.
> +	 */
> +	if (!rcu_batch_empty(&sp->batch_check1) ||
> +	    !rcu_batch_empty(&sp->batch_check0)) {
> +		if (try_check_zero(sp, idx, 1)) {
> +			rcu_batch_move(&sp->batch_done, &sp->batch_check1);
> +			rcu_batch_move(&sp->batch_check1, &sp->batch_check0);
> +			if (!rcu_batch_empty(&sp->batch_check1)) {
> +				srcu_flip(sp);
> +				if (try_check_zero(sp, 1 - idx, 2)) {
> +					rcu_batch_move(&sp->batch_done,
> +						&sp->batch_check1);
> +				}
> +			}
> +		}
> +	}
> +}
> +
> +static void srcu_invoke_callbacks(struct srcu_struct *sp)
> +{
> +	int i;
> +	struct rcu_head *head;
> +
> +	for (i = 0; i < SRCU_CALLBACK_BATCH; i++) {
> +		head = rcu_batch_dequeue(&sp->batch_done);
> +		if (!head)
> +			break;
> +		head->func(head);
> +	}
> +}
> +
> +static void srcu_reschedule(struct srcu_struct *sp)
> +{
> +	bool running = true;
> +
> +	if (rcu_batch_empty(&sp->batch_done) &&
> +	    rcu_batch_empty(&sp->batch_check1) &&
> +	    rcu_batch_empty(&sp->batch_check0) &&
> +	    rcu_batch_empty(&sp->batch_queue)) {
> +		spin_lock_irq(&sp->queue_lock);
> +		if (rcu_batch_empty(&sp->batch_queue)) {
> +			sp->running = false;
> +			running = false;
> +		}
> +		spin_unlock_irq(&sp->queue_lock);
> +	}
> +
> +	if (running)
> +		queue_delayed_work(system_nrt_wq, &sp->work, SRCU_INTERVAL);
> +}
> +
> +static void process_srcu(struct work_struct *work)
> +{
> +	struct srcu_struct *sp;
> +
> +	sp = container_of(work, struct srcu_struct, work.work);
> +
> +	srcu_collect_new(sp);
> +	srcu_advance_batches(sp);
> +	srcu_invoke_callbacks(sp);
> +	srcu_reschedule(sp);
> +}
> 
> 
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu()
  2012-03-08 20:35                     ` Paul E. McKenney
@ 2012-03-10  3:16                       ` Lai Jiangshan
  2012-03-12 18:03                         ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-10  3:16 UTC (permalink / raw)
  To: paulmck
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Fri, Mar 9, 2012 at 4:35 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Wed, Mar 07, 2012 at 11:54:02AM +0800, Lai Jiangshan wrote:
>> This patch is on the top of the 4 previous patches(1/6, 2/6, 3/6, 4/6).
>>
>> o     state machine is light way and single-threaded, it is preemptible when checking.
>>
>> o     state machine is a work_struct. So, there is no thread occupied
>>       by SRCU when the srcu is not actived(no callback). And it does
>>       not sleep(avoid to occupy a thread when sleep).
>>
>> o     state machine is the only thread can flip/check/write(*) the srcu_struct,
>>       so we don't need any mutex.
>>       (write(*): except ->per_cpu_ref, ->running, ->batch_queue)
>>
>> o     synchronize_srcu() is always call call_srcu().
>>       synchronize_srcu_expedited() is also.
>>       It is OK for mb()-based srcu are extremely fast.
>>
>> o     In current kernel, we can expect that there are only 1 callback per gp.
>>       so callback is probably called in the same CPU when it is queued.
>>
>> The trip of a callback:
>>       1) ->batch_queue when call_srcu()
>>
>>       2) ->batch_check0 when try to do check_zero
>>
>>       3) ->batch_check1 after finish its first check_zero and the flip
>>
>>       4) ->batch_done after finish its second check_zero
>>
>> The current requirement of the callbacks:
>>       The callback will be called inside process context.
>>       The callback should be fast without any sleeping path.
>>
>> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
>> ---
>>  include/linux/rcupdate.h |    2 +-
>>  include/linux/srcu.h     |   28 +++++-
>>  kernel/rcupdate.c        |   24 ++++-
>>  kernel/rcutorture.c      |   44 ++++++++-
>>  kernel/srcu.c            |  238 ++++++++++++++++++++++++++++++++-------------
>>  5 files changed, 259 insertions(+), 77 deletions(-)
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index 9372174..d98eab2 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -222,7 +222,7 @@ extern void rcu_irq_exit(void);
>>   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
>>   */
>>
>> -typedef void call_rcu_func_t(struct rcu_head *head,
>> +typedef void (*call_rcu_func_t)(struct rcu_head *head,
>
> I don't see what this applies against.  The old patch 5/6 created
> a "(*call_rcu_func_t)(struct rcu_head *head," and I don't see what
> created the "call_rcu_func_t(struct rcu_head *head,".

typedef void call_rcu_func_t(...) declares a function type, not a
function-pointer type.  I use a line of code like the following:

call_rcu_func_t crf = func;

If call_rcu_func_t is a function type, that line cannot be compiled,
so I need to convert it to a function-pointer type.
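
(An aside, not part of the patch: a minimal user-space sketch of the
difference between the two typedef forms.  The names sig_fn_t, sig_fn_p,
and handler are made up for the example.)

#include <stdio.h>

typedef void sig_fn_t(int);		/* function type          */
typedef void (*sig_fn_p)(int);		/* function-pointer type  */

static void handler(int v)
{
	printf("got %d\n", v);
}

int main(void)
{
	/* sig_fn_t f = handler;  does not compile: a variable cannot
	 * have function type, only pointer-to-function type. */
	sig_fn_t *f1 = handler;		/* OK: explicit pointer          */
	sig_fn_p f2 = handler;		/* OK: the typedef is a pointer  */

	f1(1);
	f2(2);
	return 0;
}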

>
>>                            void (*func)(struct rcu_head *head));
>>  void wait_rcu_gp(call_rcu_func_t crf);
>>
>> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
>> index df8f5f7..56cb774 100644
>> --- a/include/linux/srcu.h
>> +++ b/include/linux/srcu.h
>> @@ -29,6 +29,7 @@
>>
>>  #include <linux/mutex.h>
>>  #include <linux/rcupdate.h>
>> +#include <linux/workqueue.h>
>>
>>  struct srcu_struct_array {
>>       unsigned long c[2];
>> @@ -39,10 +40,23 @@ struct srcu_struct_array {
>>  #define SRCU_REF_MASK                (ULONG_MAX >> SRCU_USAGE_BITS)
>>  #define SRCU_USAGE_COUNT     (SRCU_REF_MASK + 1)
>>
>> +struct rcu_batch {
>> +     struct rcu_head *head, **tail;
>> +};
>> +
>>  struct srcu_struct {
>>       unsigned completed;
>>       struct srcu_struct_array __percpu *per_cpu_ref;
>> -     struct mutex mutex;
>> +     spinlock_t queue_lock; /* protect ->batch_queue, ->running */
>> +     bool running;
>> +     /* callbacks just queued */
>> +     struct rcu_batch batch_queue;
>> +     /* callbacks try to do the first check_zero */
>> +     struct rcu_batch batch_check0;
>> +     /* callbacks done with the first check_zero and the flip */
>> +     struct rcu_batch batch_check1;
>> +     struct rcu_batch batch_done;
>> +     struct delayed_work work;
>
> Why not use your multiple-tail-pointer trick here?  (The one that is
> used in treercu.)

1) It makes the code that advances the batches simpler.
2) batch_queue is protected by a lock, so it would be hard to use the
multiple-tail-pointer trick there.
3) The rcu_batch API does add a little runtime overhead, but it is only
a few CPU instructions, which I think is OK.  It is a good tradeoff
compared with the gain in readability.
I think we could also use rcu_batch for rcutree/rcutiny.
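
(Not from the patch: a rough sketch of the two layouts being compared, so
the tradeoff is concrete.  The seglist structure below only approximates
what treercu does; the real field names and segment count differ.)

/* Per-stage list, as in this patch: each stage owns its head and tail,
 * so moving all callbacks from one stage to the next is rcu_batch_move(). */
struct rcu_batch {
	struct rcu_head *head, **tail;
};

/* Multiple-tail-pointer trick, roughly as in treercu: one list whose
 * stages are delimited by an array of tail pointers.  Advancing a stage
 * is just copying one tail pointer over another, but then every stage
 * shares whatever lock protects the single list. */
struct srcu_seglist {
	struct rcu_head *head;
	struct rcu_head **tails[4];	/* queue, check0, check1, done */
};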

>
>>       unsigned long snap[NR_CPUS];
>>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>       struct lockdep_map dep_map;
>> @@ -67,12 +81,24 @@ int init_srcu_struct(struct srcu_struct *sp);
>>
>>  #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>>
>> +/* draft
>> + * queue callbacks which will be invoked after grace period.
>> + * The callback will be called inside process context.
>> + * The callback should be fast without any sleeping path.
>> + */
>> +void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
>> +             void (*func)(struct rcu_head *head));
>> +
>> +typedef void (*call_srcu_func_t)(struct srcu_struct *sp, struct rcu_head *head,
>> +             void (*func)(struct rcu_head *head));
>> +void __wait_srcu_gp(struct srcu_struct *sp, call_srcu_func_t crf);
>>  void cleanup_srcu_struct(struct srcu_struct *sp);
>>  int __srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
>>  void __srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
>>  void synchronize_srcu(struct srcu_struct *sp);
>>  void synchronize_srcu_expedited(struct srcu_struct *sp);
>>  long srcu_batches_completed(struct srcu_struct *sp);
>> +void srcu_barrier(struct srcu_struct *sp);
>>
>>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>
>> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
>> index a86f174..f9b551f 100644
>> --- a/kernel/rcupdate.c
>> +++ b/kernel/rcupdate.c
>> @@ -45,6 +45,7 @@
>>  #include <linux/mutex.h>
>>  #include <linux/export.h>
>>  #include <linux/hardirq.h>
>> +#include <linux/srcu.h>
>>
>>  #define CREATE_TRACE_POINTS
>>  #include <trace/events/rcu.h>
>> @@ -123,20 +124,39 @@ static void wakeme_after_rcu(struct rcu_head  *head)
>>       complete(&rcu->completion);
>>  }
>>
>> -void wait_rcu_gp(call_rcu_func_t crf)
>> +static void __wait_rcu_gp(void *domain, void *func)
>>  {
>>       struct rcu_synchronize rcu;
>>
>>       init_rcu_head_on_stack(&rcu.head);
>>       init_completion(&rcu.completion);
>> +
>>       /* Will wake me after RCU finished. */
>> -     crf(&rcu.head, wakeme_after_rcu);
>> +     if (!domain) {
>> +             call_rcu_func_t crf = func;
>> +             crf(&rcu.head, wakeme_after_rcu);
>> +     } else {
>> +             call_srcu_func_t crf = func;
>> +             crf(domain, &rcu.head, wakeme_after_rcu);
>> +     }
>> +
>>       /* Wait for it. */
>>       wait_for_completion(&rcu.completion);
>>       destroy_rcu_head_on_stack(&rcu.head);
>>  }
>
> Mightn't it be simpler and faster to just have a separate wait_srcu_gp()
> that doesn't share code with wait_rcu_gp()?  I am all for sharing code,
> but this might be hrting more than helping.
>
>> +
>> +void wait_rcu_gp(call_rcu_func_t crf)
>> +{
>> +     __wait_rcu_gp(NULL, crf);
>> +}
>>  EXPORT_SYMBOL_GPL(wait_rcu_gp);
>>
>> +/* srcu.c internel */
>> +void __wait_srcu_gp(struct srcu_struct *sp, call_srcu_func_t crf)
>> +{
>> +     __wait_rcu_gp(sp, crf);
>> +}
>> +
>>  #ifdef CONFIG_PROVE_RCU
>>  /*
>>   * wrapper function to avoid #include problems.
>> diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
>> index 54e5724..40d24d0 100644
>
> OK, so your original patch #6 is folded into this?  I don't have a strong
> view either way, just need to know.
>
>> --- a/kernel/rcutorture.c
>> +++ b/kernel/rcutorture.c
>> @@ -623,6 +623,11 @@ static int srcu_torture_completed(void)
>>       return srcu_batches_completed(&srcu_ctl);
>>  }
>>
>> +static void srcu_torture_deferred_free(struct rcu_torture *rp)
>> +{
>> +     call_srcu(&srcu_ctl, &rp->rtort_rcu, rcu_torture_cb);
>> +}
>> +
>>  static void srcu_torture_synchronize(void)
>>  {
>>       synchronize_srcu(&srcu_ctl);
>> @@ -652,7 +657,7 @@ static struct rcu_torture_ops srcu_ops = {
>>       .read_delay     = srcu_read_delay,
>>       .readunlock     = srcu_torture_read_unlock,
>>       .completed      = srcu_torture_completed,
>> -     .deferred_free  = rcu_sync_torture_deferred_free,
>> +     .deferred_free  = srcu_torture_deferred_free,
>>       .sync           = srcu_torture_synchronize,
>>       .call           = NULL,
>>       .cb_barrier     = NULL,
>> @@ -660,6 +665,21 @@ static struct rcu_torture_ops srcu_ops = {
>>       .name           = "srcu"
>>  };
>>
>> +static struct rcu_torture_ops srcu_sync_ops = {
>> +     .init           = srcu_torture_init,
>> +     .cleanup        = srcu_torture_cleanup,
>> +     .readlock       = srcu_torture_read_lock,
>> +     .read_delay     = srcu_read_delay,
>> +     .readunlock     = srcu_torture_read_unlock,
>> +     .completed      = srcu_torture_completed,
>> +     .deferred_free  = rcu_sync_torture_deferred_free,
>> +     .sync           = srcu_torture_synchronize,
>> +     .call           = NULL,
>> +     .cb_barrier     = NULL,
>> +     .stats          = srcu_torture_stats,
>> +     .name           = "srcu_sync"
>> +};
>> +
>>  static int srcu_torture_read_lock_raw(void) __acquires(&srcu_ctl)
>>  {
>>       return srcu_read_lock_raw(&srcu_ctl);
>> @@ -677,7 +697,7 @@ static struct rcu_torture_ops srcu_raw_ops = {
>>       .read_delay     = srcu_read_delay,
>>       .readunlock     = srcu_torture_read_unlock_raw,
>>       .completed      = srcu_torture_completed,
>> -     .deferred_free  = rcu_sync_torture_deferred_free,
>> +     .deferred_free  = srcu_torture_deferred_free,
>>       .sync           = srcu_torture_synchronize,
>>       .call           = NULL,
>>       .cb_barrier     = NULL,
>> @@ -685,6 +705,21 @@ static struct rcu_torture_ops srcu_raw_ops = {
>>       .name           = "srcu_raw"
>>  };
>>
>> +static struct rcu_torture_ops srcu_raw_sync_ops = {
>> +     .init           = srcu_torture_init,
>> +     .cleanup        = srcu_torture_cleanup,
>> +     .readlock       = srcu_torture_read_lock_raw,
>> +     .read_delay     = srcu_read_delay,
>> +     .readunlock     = srcu_torture_read_unlock_raw,
>> +     .completed      = srcu_torture_completed,
>> +     .deferred_free  = rcu_sync_torture_deferred_free,
>> +     .sync           = srcu_torture_synchronize,
>> +     .call           = NULL,
>> +     .cb_barrier     = NULL,
>> +     .stats          = srcu_torture_stats,
>> +     .name           = "srcu_raw_sync"
>> +};
>> +
>>  static void srcu_torture_synchronize_expedited(void)
>>  {
>>       synchronize_srcu_expedited(&srcu_ctl);
>> @@ -1673,7 +1708,7 @@ static int rcu_torture_barrier_init(void)
>>       for (i = 0; i < n_barrier_cbs; i++) {
>>               init_waitqueue_head(&barrier_cbs_wq[i]);
>>               barrier_cbs_tasks[i] = kthread_run(rcu_torture_barrier_cbs,
>> -                                                (void *)i,
>> +                                                (void *)(long)i,
>>                                                  "rcu_torture_barrier_cbs");
>>               if (IS_ERR(barrier_cbs_tasks[i])) {
>>                       ret = PTR_ERR(barrier_cbs_tasks[i]);
>> @@ -1857,7 +1892,8 @@ rcu_torture_init(void)
>>       static struct rcu_torture_ops *torture_ops[] =
>>               { &rcu_ops, &rcu_sync_ops, &rcu_expedited_ops,
>>                 &rcu_bh_ops, &rcu_bh_sync_ops, &rcu_bh_expedited_ops,
>> -               &srcu_ops, &srcu_raw_ops, &srcu_expedited_ops,
>> +               &srcu_ops, &srcu_sync_ops, &srcu_raw_ops,
>> +               &srcu_raw_sync_ops, &srcu_expedited_ops,
>>                 &sched_ops, &sched_sync_ops, &sched_expedited_ops, };
>>
>>       mutex_lock(&fullstop_mutex);
>> diff --git a/kernel/srcu.c b/kernel/srcu.c
>> index d101ed5..532f890 100644
>> --- a/kernel/srcu.c
>> +++ b/kernel/srcu.c
>> @@ -34,10 +34,60 @@
>>  #include <linux/delay.h>
>>  #include <linux/srcu.h>
>>
>> +static inline void rcu_batch_init(struct rcu_batch *b)
>> +{
>> +     b->head = NULL;
>> +     b->tail = &b->head;
>> +}
>> +
>> +static inline void rcu_batch_queue(struct rcu_batch *b, struct rcu_head *head)
>> +{
>> +     *b->tail = head;
>> +     b->tail = &head->next;
>> +}
>> +
>> +static inline bool rcu_batch_empty(struct rcu_batch *b)
>> +{
>> +     return b->tail == &b->head;
>> +}
>> +
>> +static inline struct rcu_head *rcu_batch_dequeue(struct rcu_batch *b)
>> +{
>> +     struct rcu_head *head;
>> +
>> +     if (rcu_batch_empty(b))
>> +             return NULL;
>> +
>> +     head = b->head;
>> +     b->head = head->next;
>> +     if (b->tail == &head->next)
>> +             rcu_batch_init(b);
>> +
>> +     return head;
>> +}
>> +
>> +static inline void rcu_batch_move(struct rcu_batch *to, struct rcu_batch *from)
>> +{
>> +     if (!rcu_batch_empty(from)) {
>> +             *to->tail = from->head;
>> +             to->tail = from->tail;
>> +             rcu_batch_init(from);
>> +     }
>> +}
>
> And perhaps this is why you don't want the multi-tailed queue?
>
>> +
>> +/* single-thread state-machine */
>> +static void process_srcu(struct work_struct *work);
>> +
>>  static int init_srcu_struct_fields(struct srcu_struct *sp)
>>  {
>>       sp->completed = 0;
>> -     mutex_init(&sp->mutex);
>> +     spin_lock_init(&sp->queue_lock);
>> +     sp->running = false;
>> +     rcu_batch_init(&sp->batch_queue);
>> +     rcu_batch_init(&sp->batch_check0);
>> +     rcu_batch_init(&sp->batch_check1);
>> +     rcu_batch_init(&sp->batch_done);
>> +     INIT_DELAYED_WORK(&sp->work, process_srcu);
>>       sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
>>       return sp->per_cpu_ref ? 0 : -ENOMEM;
>>  }
>> @@ -254,11 +304,9 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>>   * we repeatedly block for 1-millisecond time periods.  This approach
>>   * has done well in testing, so there is no need for a config parameter.
>>   */
>> -#define SYNCHRONIZE_SRCU_READER_DELAY        5
>> -#define SYNCHRONIZE_SRCU_TRYCOUNT    2
>> -#define SYNCHRONIZE_SRCU_EXP_TRYCOUNT        12
>> +#define SRCU_RETRY_CHECK_DELAY       5
>>
>> -static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
>> +static bool try_check_zero(struct srcu_struct *sp, int idx, int trycount)
>>  {
>>       /*
>>        * If a reader fetches the index before the ->completed increment,
>> @@ -271,19 +319,12 @@ static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
>>        */
>>       smp_mb(); /* D */
>>
>> -     /*
>> -      * SRCU read-side critical sections are normally short, so wait
>> -      * a small amount of time before possibly blocking.
>> -      */
>> -     if (!srcu_readers_active_idx_check(sp, idx)) {
>> -             udelay(SYNCHRONIZE_SRCU_READER_DELAY);
>> -             while (!srcu_readers_active_idx_check(sp, idx)) {
>> -                     if (trycount > 0) {
>> -                             trycount--;
>> -                             udelay(SYNCHRONIZE_SRCU_READER_DELAY);
>> -                     } else
>> -                             schedule_timeout_interruptible(1);
>> -             }
>> +     for (;;) {
>> +             if (srcu_readers_active_idx_check(sp, idx))
>> +                     break;
>> +             if (--trycount <= 0)
>> +                     return false;
>> +             udelay(SRCU_RETRY_CHECK_DELAY);
>>       }
>>
>>       /*
>> @@ -297,6 +338,8 @@ static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
>>        * the next flipping.
>>        */
>>       smp_mb(); /* E */
>> +
>> +     return true;
>>  }
>>
>>  /*
>> @@ -308,10 +351,27 @@ static void srcu_flip(struct srcu_struct *sp)
>>       ACCESS_ONCE(sp->completed)++;
>>  }
>>
>> +void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
>> +             void (*func)(struct rcu_head *head))
>> +{
>> +     unsigned long flags;
>> +
>> +     head->next = NULL;
>> +     head->func = func;
>> +     spin_lock_irqsave(&sp->queue_lock, flags);
>> +     rcu_batch_queue(&sp->batch_queue, head);
>> +     if (!sp->running) {
>> +             sp->running = true;
>> +             queue_delayed_work(system_nrt_wq, &sp->work, 0);
>> +     }
>> +     spin_unlock_irqrestore(&sp->queue_lock, flags);
>> +}
>> +EXPORT_SYMBOL_GPL(call_srcu);
>> +
>>  /*
>>   * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
>>   */
>> -static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
>> +static void __synchronize_srcu(struct srcu_struct *sp)
>>  {
>>       rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
>>                          !lock_is_held(&rcu_bh_lock_map) &&
>> @@ -319,54 +379,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
>>                          !lock_is_held(&rcu_sched_lock_map),
>>                          "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
>>
>> -     mutex_lock(&sp->mutex);
>> -
>> -     /*
>> -      * Suppose that during the previous grace period, a reader
>> -      * picked up the old value of the index, but did not increment
>> -      * its counter until after the previous instance of
>> -      * __synchronize_srcu() did the counter summation and recheck.
>> -      * That previous grace period was OK because the reader did
>> -      * not start until after the grace period started, so the grace
>> -      * period was not obligated to wait for that reader.
>> -      *
>> -      * However, the current SRCU grace period does have to wait for
>> -      * that reader.  This is handled by invoking wait_idx() on the
>> -      * non-active set of counters (hence sp->completed - 1).  Once
>> -      * wait_idx() returns, we know that all readers that picked up
>> -      * the old value of ->completed and that already incremented their
>> -      * counter will have completed.
>> -      *
>> -      * But what about readers that picked up the old value of
>> -      * ->completed, but -still- have not managed to increment their
>> -      * counter?  We do not need to wait for those readers, because
>> -      * they will have started their SRCU read-side critical section
>> -      * after the current grace period starts.
>> -      *
>> -      * Because it is unlikely that readers will be preempted between
>> -      * fetching ->completed and incrementing their counter, wait_idx()
>> -      * will normally not need to wait.
>> -      */
>> -     wait_idx(sp, (sp->completed - 1) & 0x1, trycount);
>> -
>> -     /*
>> -      * Now that wait_idx() has waited for the really old readers,
>> -      *
>> -      * Flip the readers' index by incrementing ->completed, then wait
>> -      * until there are no more readers using the counters referenced by
>> -      * the old index value.  (Recall that the index is the bottom bit
>> -      * of ->completed.)
>> -      *
>> -      * Of course, it is possible that a reader might be delayed for the
>> -      * full duration of flip_idx_and_wait() between fetching the
>> -      * index and incrementing its counter.  This possibility is handled
>> -      * by the next __synchronize_srcu() invoking wait_idx() for such
>> -      * readers before starting a new grace period.
>> -      */
>> -     srcu_flip(sp);
>> -     wait_idx(sp, (sp->completed - 1) & 0x1, trycount);
>> -
>> -     mutex_unlock(&sp->mutex);
>> +     __wait_srcu_gp(sp, call_srcu);
>>  }
>>
>>  /**
>> @@ -385,7 +398,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
>>   */
>>  void synchronize_srcu(struct srcu_struct *sp)
>>  {
>> -     __synchronize_srcu(sp, SYNCHRONIZE_SRCU_TRYCOUNT);
>> +     __synchronize_srcu(sp);
>>  }
>>  EXPORT_SYMBOL_GPL(synchronize_srcu);
>>
>> @@ -406,10 +419,16 @@ EXPORT_SYMBOL_GPL(synchronize_srcu);
>>   */
>>  void synchronize_srcu_expedited(struct srcu_struct *sp)
>>  {
>> -     __synchronize_srcu(sp, SYNCHRONIZE_SRCU_EXP_TRYCOUNT);
>> +     __synchronize_srcu(sp);
>>  }
>
> OK, I'll bite...  Why aren't synchronize_srcu_expedited() and
> synchronize_srcu() different?

In mb()-based SRCU, synchronize_srcu() is very fast, so
synchronize_srcu_expedited() makes less sense than before.

But when wait_srcu_gp() is moved back here, I will use a bigger
"trycount" for synchronize_srcu_expedited().

Do you see any problem with srcu_advance_batches()?
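
(Not part of the posted patch: a sketch, assuming the trycount argument to
__synchronize_srcu() is restored and passed down to try_check_zero(), of
how the expedited variant could simply retry the zero-check harder.  The
macro values are the ones the patch removed.)

#define SYNCHRONIZE_SRCU_TRYCOUNT	2
#define SYNCHRONIZE_SRCU_EXP_TRYCOUNT	12

void synchronize_srcu(struct srcu_struct *sp)
{
	__synchronize_srcu(sp, SYNCHRONIZE_SRCU_TRYCOUNT);
}
EXPORT_SYMBOL_GPL(synchronize_srcu);

void synchronize_srcu_expedited(struct srcu_struct *sp)
{
	__synchronize_srcu(sp, SYNCHRONIZE_SRCU_EXP_TRYCOUNT);
}
EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);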

Thanks.
Lai

>
>                                                        Thanx, Paul
>
>>  EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);
>>
>> +void srcu_barrier(struct srcu_struct *sp)
>> +{
>> +     __synchronize_srcu(sp);
>> +}
>> +EXPORT_SYMBOL_GPL(srcu_barrier);
>> +
>>  /**
>>   * srcu_batches_completed - return batches completed.
>>   * @sp: srcu_struct on which to report batch completion.
>> @@ -423,3 +442,84 @@ long srcu_batches_completed(struct srcu_struct *sp)
>>       return sp->completed;
>>  }
>>  EXPORT_SYMBOL_GPL(srcu_batches_completed);
>> +
>> +#define SRCU_CALLBACK_BATCH  10
>> +#define SRCU_INTERVAL                1
>> +
>> +static void srcu_collect_new(struct srcu_struct *sp)
>> +{
>> +     if (!rcu_batch_empty(&sp->batch_queue)) {
>> +             spin_lock_irq(&sp->queue_lock);
>> +             rcu_batch_move(&sp->batch_check0, &sp->batch_queue);
>> +             spin_unlock_irq(&sp->queue_lock);
>> +     }
>> +}
>> +
>> +static void srcu_advance_batches(struct srcu_struct *sp)
>> +{
>> +     int idx = 1 - (sp->completed & 0x1UL);
>> +
>> +     /*
>> +      * SRCU read-side critical sections are normally short, so check
>> +      * twice after a flip.
>> +      */
>> +     if (!rcu_batch_empty(&sp->batch_check1) ||
>> +         !rcu_batch_empty(&sp->batch_check0)) {
>> +             if (try_check_zero(sp, idx, 1)) {
>> +                     rcu_batch_move(&sp->batch_done, &sp->batch_check1);
>> +                     rcu_batch_move(&sp->batch_check1, &sp->batch_check0);
>> +                     if (!rcu_batch_empty(&sp->batch_check1)) {
>> +                             srcu_flip(sp);
>> +                             if (try_check_zero(sp, 1 - idx, 2)) {
>> +                                     rcu_batch_move(&sp->batch_done,
>> +                                             &sp->batch_check1);
>> +                             }
>> +                     }
>> +             }
>> +     }
>> +}
>> +
>> +static void srcu_invoke_callbacks(struct srcu_struct *sp)
>> +{
>> +     int i;
>> +     struct rcu_head *head;
>> +
>> +     for (i = 0; i < SRCU_CALLBACK_BATCH; i++) {
>> +             head = rcu_batch_dequeue(&sp->batch_done);
>> +             if (!head)
>> +                     break;
>> +             head->func(head);
>> +     }
>> +}
>> +
>> +static void srcu_reschedule(struct srcu_struct *sp)
>> +{
>> +     bool running = true;
>> +
>> +     if (rcu_batch_empty(&sp->batch_done) &&
>> +         rcu_batch_empty(&sp->batch_check1) &&
>> +         rcu_batch_empty(&sp->batch_check0) &&
>> +         rcu_batch_empty(&sp->batch_queue)) {
>> +             spin_lock_irq(&sp->queue_lock);
>> +             if (rcu_batch_empty(&sp->batch_queue)) {
>> +                     sp->running = false;
>> +                     running = false;
>> +             }
>> +             spin_unlock_irq(&sp->queue_lock);
>> +     }
>> +
>> +     if (running)
>> +             queue_delayed_work(system_nrt_wq, &sp->work, SRCU_INTERVAL);
>> +}
>> +
>> +static void process_srcu(struct work_struct *work)
>> +{
>> +     struct srcu_struct *sp;
>> +
>> +     sp = container_of(work, struct srcu_struct, work.work);
>> +
>> +     srcu_collect_new(sp);
>> +     srcu_advance_batches(sp);
>> +     srcu_invoke_callbacks(sp);
>> +     srcu_reschedule(sp);
>> +}
>>
>>
>>
>


* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-08 19:58                         ` Paul E. McKenney
@ 2012-03-10  3:32                           ` Lai Jiangshan
  2012-03-10 10:09                           ` Peter Zijlstra
  1 sibling, 0 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-10  3:32 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Lai Jiangshan, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches, tj

On Fri, Mar 9, 2012 at 3:58 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Tue, Mar 06, 2012 at 04:34:53PM +0100, Peter Zijlstra wrote:
>> On Tue, 2012-03-06 at 23:12 +0800, Lai Jiangshan wrote:
>> > On Tue, Mar 6, 2012 at 7:16 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > > On Tue, 2012-03-06 at 17:57 +0800, Lai Jiangshan wrote:
>> > >>         srcu_head is bigger, it is worth, it provides more ability and simplify
>> > >>         the srcu code.
>> > >
>> > > Dubious claim.. memory footprint of various data structures is deemed
>> > > important. rcu_head is 16 bytes, srcu_head is 32 bytes. I think it would
>> > > be real nice not to have two different callback structures and not grow
>> > > them as large.
>> >
>> > CC: tj@kernel.org
>> > It could be better if workqueue also supports 2*sizeof(long) work callbacks.
>>
>> That's going to be very painful if at all possible.
>>
>> > I prefer ability/functionality a little more, it eases the caller's pain.
>> > preemptible callbacks also eases the pressure of the whole system.
>> > But I'm also ok if we limit the srcu-callbacks in softirq.
>>
>> You don't have to use softirq, you could run a complete list from a
>> single worklet. Just keep the single linked rcu_head list and enqueue a
>> static (per-cpu) worker to process the entire list.
>
> I like the idea of SRCU using rcu_head.  I am a little concerned about
> what happens when there are lots of SRCU callbacks, but am willing to
> wait to solve those problems until the situation arises.
>
> But I guess I should ask...  Peter, what do you expect the maximum
> call_srcu() rate to be in your use cases?  If tens of thousands are
> possible, some adjustments will be needed.

Unlike RCU, which has a single global domain, SRCU is separated into
per-domain instances, so I think the rate will not be high for any given
domain, but the real answer depends on actual practice.
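
(Illustration only, not from the thread: every user keeps its own
srcu_struct, so call_srcu() traffic is confined to that domain.  The
domain and structure names here are made up, and init_srcu_struct() is
assumed to have been called during setup.)

struct my_obj {
	int data;
	struct rcu_head rcu;
};

/* Two independent SRCU domains: callbacks queued on one never pass
 * through the other domain's state machine or workqueue item. */
static struct srcu_struct vma_domain_srcu;
static struct srcu_struct notifier_domain_srcu;

static void my_obj_free_cb(struct rcu_head *head)
{
	kfree(container_of(head, struct my_obj, rcu));
}

static void my_obj_release(struct my_obj *obj)
{
	/* Only vma_domain_srcu's grace-period machinery sees this. */
	call_srcu(&vma_domain_srcu, &obj->rcu, my_obj_free_cb);
}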

>
>                                                        Thanx, Paul
>


* Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm
  2012-03-01 13:20                                 ` Paul E. McKenney
@ 2012-03-10  3:41                                   ` Lai Jiangshan
  0 siblings, 0 replies; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-10  3:41 UTC (permalink / raw)
  To: paulmck
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Thu, Mar 1, 2012 at 9:20 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Thu, Mar 01, 2012 at 10:31:22AM +0800, Lai Jiangshan wrote:
>> On 02/29/2012 09:55 PM, Paul E. McKenney wrote:
>> > On Wed, Feb 29, 2012 at 06:07:32PM +0800, Lai Jiangshan wrote:
>> >> On 02/28/2012 09:47 PM, Paul E. McKenney wrote:
>> >>> On Tue, Feb 28, 2012 at 09:51:22AM +0800, Lai Jiangshan wrote:
>> >>>> On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
>> >>>>> On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
>> >>>>>> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
>> >>>>>> From: Lai Jiangshan <laijs@cn.fujitsu.com>
>> >>>>>> Date: Mon, 27 Feb 2012 14:22:47 +0800
>> >>>>>> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
>> >>>>>>
>> >>>>>> This patch implement the algorithm as Peter's:
>> >>>>>> https://lkml.org/lkml/2012/2/1/119
>> >>>>>>
>> >>>>>> o      Make the checking lock-free and we can perform parallel checking,
>> >>>>>>        Although almost parallel checking makes no sense, but we need it
>> >>>>>>        when 1) the original checking task is preempted for long, 2)
>> >>>>>>        sychronize_srcu_expedited(), 3) avoid lock(see next)
>> >>>>>>
>> >>>>>> o      Since it is lock-free, we save a mutex in state machine for
>> >>>>>>        call_srcu().
>> >>>>>>
>> >>>>>> o      Remove the SRCU_REF_MASK and remove the coupling with the flipping.
>> >>>>>>        (so we can remove the preempt_disable() in future, but use
>> >>>>>>         __this_cpu_inc() instead.)
>> >>>>>>
>> >>>>>> o      reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
>> >>>>>>        more intuitive.
>> >>>>>
>> >>>>> Hello, Lai,
>> >>>>>
>> >>>>> Interesting approach!
>> >>>>>
>> >>>>> What happens given the following sequence of events?
>> >>>>>
>> >>>>> o       CPU 0 in srcu_readers_active_idx_check() invokes
>> >>>>>         srcu_readers_seq_idx(), getting some number back.
>> >>>>>
>> >>>>> o       CPU 0 invokes srcu_readers_active_idx(), summing the
>> >>>>>         ->c[] array up through CPU 3.
>> >>>>>
>> >>>>> o       CPU 1 invokes __srcu_read_lock(), and increments its counter
>> >>>>>         but not yet its ->seq[] element.
>> >>>>
>> >>>>
>> >>>> Any __srcu_read_lock() whose increment of active counter is not seen
>> >>>> by srcu_readers_active_idx() is considerred as
>> >>>> "reader-started-after-this-srcu_readers_active_idx_check()",
>> >>>> We don't need to wait.
>> >>>>
>> >>>> As you said, this srcu C.S 's increment seq is not seen by above
>> >>>> srcu_readers_seq_idx().
>> >>>>
>> >>>>>
>> >>>>> o       CPU 0 completes its summing of the ->c[] array, incorrectly
>> >>>>>         obtaining zero.
>> >>>>>
>> >>>>> o       CPU 0 invokes srcu_readers_seq_idx(), getting the same
>> >>>>>         number back that it got last time.
>> >>>>
>> >>>> If it incorrectly get zero, it means __srcu_read_unlock() is seen
>> >>>> in srcu_readers_active_idx(), and it means the increment of
>> >>>> seq is seen in this srcu_readers_seq_idx(), it is different
>> >>>> from the above seq that it got last time.
>> >>>>
>> >>>> increment of seq is not seen by above srcu_readers_seq_idx(),
>> >>>> but is seen by later one, so the two returned seq is different,
>> >>>> this is the core of Peter's algorithm, and this was written
>> >>>> in the comments(Sorry for my bad English). Or maybe I miss
>> >>>> your means in this mail.
>> >>>
>> >>> OK, good, this analysis agrees with what I was thinking.
>> >>>
>> >>> So my next question is about the lock freedom.  This lock freedom has to
>> >>> be limited in nature and carefully implemented.  The reasons for this are:
>> >>>
>> >>> 1.        Readers can block in any case, which can of course block both
>> >>>   synchronize_srcu_expedited() and synchronize_srcu().
>> >>>
>> >>> 2.        Because only one CPU at a time can be incrementing ->completed,
>> >>>   some sort of lock with preemption disabling will of course be
>> >>>   needed.  Alternatively, an rt_mutex could be used for its
>> >>>   priority-inheritance properties.
>> >>>
>> >>> 3.        Once some CPU has incremented ->completed, all CPUs that might
>> >>>   still be summing up the old indexes must stop.  If they don't,
>> >>>   they might incorrectly call a too-short grace period in case of
>> >>>   ->seq[]-sum overflow on 32-bit systems.
>> >>>
>> >>> Or did you have something else in mind?
>> >>
>> >> When flip happens when check_zero, this check_zero will no be
>> >> committed even it is success.
>> >
>> > But if the CPU in check_zero isn't blocking the grace period, then
>> > ->completed could overflow while that CPU was preempted.  Then how
>> > would this CPU know that the flip had happened?
>>
>> as you said, check the ->completed.
>> but disable the overflow for ->completed.
>>
>> there is a spinlock for srcu_struct(including locking for flipping)
>>
>> 1) assume we need to wait on widx
>> 2) use srcu_read_lock() to hold a reference of the 1-widx active counter
>> 3) release the spinlock
>> 4) do_check_zero
>> 5) gain the spinlock
>> 6) srcu_read_unlock()
>> 7) if ->completed is not changed, and there is no other later check_zero which
>>    is committed earlier than us, we will commit our check_zero if we success.
>>
>> too complicated.
>
> Plus I don't see how it disables overflow for ->completed.

 srcu_read_lock() takes a reader reference, so the state machine running
elsewhere can do the flip at most once while that reference is held, and
therefore ->completed cannot wrap.

Thanks,
Lai

>
> As you said earlier, abandoning the goal of lock freedom sounds like the
> best approach.  Then you can indeed just hold the srcu_struct's mutex
> across the whole thing.
>
>                                                        Thanx, Paul
>
>> Thanks,
>> Lai
>>
>> >
>> >> I play too much with lock-free for call_srcu(), the code becomes complicated,
>> >> I just give up lock-free for call_srcu(), the main aim of call_srcu() is simple.
>> >
>> > Makes sense to me!
>> >
>> >> (But I still like Peter's approach, it has some other good thing
>> >> besides lock-free-checking, if you don't like it, I will send
>> >> another patch to fix srcu_readers_active())
>> >
>> > Try them both and check their performance &c.  If within espilon of
>> > each other, pick whichever one you prefer.
>> >
>> >                                                     Thanx, Paul
>>
>


* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-08 19:58                         ` Paul E. McKenney
  2012-03-10  3:32                           ` Lai Jiangshan
@ 2012-03-10 10:09                           ` Peter Zijlstra
  2012-03-12 17:54                             ` Paul E. McKenney
  1 sibling, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-10 10:09 UTC (permalink / raw)
  To: paulmck
  Cc: Lai Jiangshan, Lai Jiangshan, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches, tj

On Thu, 2012-03-08 at 11:58 -0800, Paul E. McKenney wrote:
> 
> But I guess I should ask...  Peter, what do you expect the maximum
> call_srcu() rate to be in your use cases?  If tens of thousands are
> possible, some adjustments will be needed. 

The one call-site I currently have is linked to vma lifetimes, so yeah,
I guess that that can be lots.


* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-08 19:49                         ` Paul E. McKenney
@ 2012-03-10 10:12                           ` Peter Zijlstra
  2012-03-12 17:52                             ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-10 10:12 UTC (permalink / raw)
  To: paulmck
  Cc: Lai Jiangshan, Lai Jiangshan, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Thu, 2012-03-08 at 11:49 -0800, Paul E. McKenney wrote:
> 
> I too have used (long)(a - b) for a long time, but I saw with my own eyes
> the glee in the compiler-writers' eyes when they discussed signed overflow
> being undefined in the C standard.  I believe that the reasons for signed
> overflow being undefined are long obsolete, but better safe than sorry. 

Thing is, if they break that, the whole kernel comes falling down; I
really wouldn't worry about RCU at that point. But to each their own
pet-paranoia, I guess ;-)


* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-10 10:12                           ` Peter Zijlstra
@ 2012-03-12 17:52                             ` Paul E. McKenney
  0 siblings, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-12 17:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Lai Jiangshan, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Sat, Mar 10, 2012 at 11:12:29AM +0100, Peter Zijlstra wrote:
> On Thu, 2012-03-08 at 11:49 -0800, Paul E. McKenney wrote:
> > 
> > I too have used (long)(a - b) for a long time, but I saw with my own eyes
> > the glee in the compiler-writers' eyes when they discussed signed overflow
> > being undefined in the C standard.  I believe that the reasons for signed
> > overflow being undefined are long obsolete, but better safe than sorry. 
> 
> Thing is, if they break that, the whole kernel comes falling down; I
> really wouldn't worry about RCU at that point. But to each their own
> pet-paranoia, I guess ;-)

But just because I am paranoid doesn't mean that no one is after me!  ;-)

I agree that the compiler guys would need to provide a chicken switch
due to the huge amount of code that relies on (long)(a - b) handling
overflow reasonably.  But avoiding signed integer overflow is pretty
straightforward.  For example, I use the following in RCU:

	#define UINT_CMP_GE(a, b)	(UINT_MAX / 2 >= (a) - (b))
	#define UINT_CMP_LT(a, b)	(UINT_MAX / 2 < (a) - (b))
	#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))
	#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))

But yes, part of the reason for my doing this was to make conversations
with the usual standards-committee suspects go more smoothly.
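
(Illustrative user-space check, not from the thread, of why the unsigned
comparison stays correct across counter wrap; the values are made up.)

#include <assert.h>
#include <limits.h>

#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))

int main(void)
{
	unsigned long snap = ULONG_MAX - 2;	/* just before wrapping */
	unsigned long now = snap + 5;		/* wrapped around to 2  */

	assert(!(now >= snap));			/* naive compare: "older" (wrong) */
	assert(ULONG_CMP_GE(now, snap));	/* wrap-tolerant: "newer" (right) */
	return 0;
}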

							Thanx, Paul



* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-10 10:09                           ` Peter Zijlstra
@ 2012-03-12 17:54                             ` Paul E. McKenney
  2012-03-12 17:58                               ` Peter Zijlstra
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-12 17:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Lai Jiangshan, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches, tj

On Sat, Mar 10, 2012 at 11:09:53AM +0100, Peter Zijlstra wrote:
> On Thu, 2012-03-08 at 11:58 -0800, Paul E. McKenney wrote:
> > 
> > But I guess I should ask...  Peter, what do you expect the maximum
> > call_srcu() rate to be in your use cases?  If tens of thousands are
> > possible, some adjustments will be needed. 
> 
> The one call-site I currently have is linked to vma lifetimes, so yeah,
> I guess that that can be lots.

So the worst case would be if several processes with lots of VMAs were
to exit at about the same time?  If so, my guess is that call_srcu()
needs to handle several thousand callbacks showing up within a few
tens of microseconds.  Is that a reasonable assumption, or am I missing
an order of magnitude or two in either direction?

							Thanx, Paul



* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-12 17:54                             ` Paul E. McKenney
@ 2012-03-12 17:58                               ` Peter Zijlstra
  2012-03-12 18:32                                 ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-12 17:58 UTC (permalink / raw)
  To: paulmck
  Cc: Lai Jiangshan, Lai Jiangshan, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches, tj

On Mon, 2012-03-12 at 10:54 -0700, Paul E. McKenney wrote:
> On Sat, Mar 10, 2012 at 11:09:53AM +0100, Peter Zijlstra wrote:
> > On Thu, 2012-03-08 at 11:58 -0800, Paul E. McKenney wrote:
> > > 
> > > But I guess I should ask...  Peter, what do you expect the maximum
> > > call_srcu() rate to be in your use cases?  If tens of thousands are
> > > possible, some adjustments will be needed. 
> > 
> > The one call-site I currently have is linked to vma lifetimes, so yeah,
> > I guess that that can be lots.
> 
> So the worst case would be if several processes with lots of VMAs were
> to exit at about the same time?  If so, my guess is that call_srcu()
> needs to handle several thousand callbacks showing up within a few
> tens of microseconds.  Is that a reasonable assumption, or am I missing
> an order of magnitude or two in either direction?

That, or a process doing mmap/munmap loops (some file scanners are
known to do this). But yeah, that can be lots.

My current use case doesn't quite trigger it, since it needs another syscall
to attach something to a vma before vma tear-down actually ends up
calling call_srcu(), so in practice it won't be much at all (for now).

Still, I think it's prudent to make it (srcu callbacks) deal with plenty
of callbacks even if initially there won't be many -- who knows what other
people will do while you're not paying attention... :-)


* Re: [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu()
  2012-03-10  3:16                       ` Lai Jiangshan
@ 2012-03-12 18:03                         ` Paul E. McKenney
  2012-03-14  7:47                           ` Lai Jiangshan
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-12 18:03 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Sat, Mar 10, 2012 at 11:16:48AM +0800, Lai Jiangshan wrote:
> On Fri, Mar 9, 2012 at 4:35 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Wed, Mar 07, 2012 at 11:54:02AM +0800, Lai Jiangshan wrote:
> >> This patch is on the top of the 4 previous patches(1/6, 2/6, 3/6, 4/6).
> >>
> >> o     state machine is light way and single-threaded, it is preemptible when checking.
> >>
> >> o     state machine is a work_struct. So, there is no thread occupied
> >>       by SRCU when the srcu is not actived(no callback). And it does
> >>       not sleep(avoid to occupy a thread when sleep).
> >>
> >> o     state machine is the only thread can flip/check/write(*) the srcu_struct,
> >>       so we don't need any mutex.
> >>       (write(*): except ->per_cpu_ref, ->running, ->batch_queue)
> >>
> >> o     synchronize_srcu() is always call call_srcu().
> >>       synchronize_srcu_expedited() is also.
> >>       It is OK for mb()-based srcu are extremely fast.
> >>
> >> o     In current kernel, we can expect that there are only 1 callback per gp.
> >>       so callback is probably called in the same CPU when it is queued.
> >>
> >> The trip of a callback:
> >>       1) ->batch_queue when call_srcu()
> >>
> >>       2) ->batch_check0 when try to do check_zero
> >>
> >>       3) ->batch_check1 after finish its first check_zero and the flip
> >>
> >>       4) ->batch_done after finish its second check_zero
> >>
> >> The current requirement of the callbacks:
> >>       The callback will be called inside process context.
> >>       The callback should be fast without any sleeping path.
> >>
> >> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> >> ---
> >>  include/linux/rcupdate.h |    2 +-
> >>  include/linux/srcu.h     |   28 +++++-
> >>  kernel/rcupdate.c        |   24 ++++-
> >>  kernel/rcutorture.c      |   44 ++++++++-
> >>  kernel/srcu.c            |  238 ++++++++++++++++++++++++++++++++-------------
> >>  5 files changed, 259 insertions(+), 77 deletions(-)
> >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> >> index 9372174..d98eab2 100644
> >> --- a/include/linux/rcupdate.h
> >> +++ b/include/linux/rcupdate.h
> >> @@ -222,7 +222,7 @@ extern void rcu_irq_exit(void);
> >>   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
> >>   */
> >>
> >> -typedef void call_rcu_func_t(struct rcu_head *head,
> >> +typedef void (*call_rcu_func_t)(struct rcu_head *head,
> >
> > I don't see what this applies against.  The old patch 5/6 created
> > a "(*call_rcu_func_t)(struct rcu_head *head," and I don't see what
> > created the "call_rcu_func_t(struct rcu_head *head,".
> 
> typedef void call_rcu_func_t(...) declares a function type, not a
> function pointer
> type. I use a line of code as following:
> 
> call_rcu_func_t crf = func;
> 
> if call_rcu_func_t is a function type, the above code can't be complied,
> I need to covert it to function pointer type.

Got it, thank you!

> >>                            void (*func)(struct rcu_head *head));
> >>  void wait_rcu_gp(call_rcu_func_t crf);
> >>
> >> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> >> index df8f5f7..56cb774 100644
> >> --- a/include/linux/srcu.h
> >> +++ b/include/linux/srcu.h
> >> @@ -29,6 +29,7 @@
> >>
> >>  #include <linux/mutex.h>
> >>  #include <linux/rcupdate.h>
> >> +#include <linux/workqueue.h>
> >>
> >>  struct srcu_struct_array {
> >>       unsigned long c[2];
> >> @@ -39,10 +40,23 @@ struct srcu_struct_array {
> >>  #define SRCU_REF_MASK                (ULONG_MAX >> SRCU_USAGE_BITS)
> >>  #define SRCU_USAGE_COUNT     (SRCU_REF_MASK + 1)
> >>
> >> +struct rcu_batch {
> >> +     struct rcu_head *head, **tail;
> >> +};
> >> +
> >>  struct srcu_struct {
> >>       unsigned completed;
> >>       struct srcu_struct_array __percpu *per_cpu_ref;
> >> -     struct mutex mutex;
> >> +     spinlock_t queue_lock; /* protect ->batch_queue, ->running */
> >> +     bool running;
> >> +     /* callbacks just queued */
> >> +     struct rcu_batch batch_queue;
> >> +     /* callbacks try to do the first check_zero */
> >> +     struct rcu_batch batch_check0;
> >> +     /* callbacks done with the first check_zero and the flip */
> >> +     struct rcu_batch batch_check1;
> >> +     struct rcu_batch batch_done;
> >> +     struct delayed_work work;
> >
> > Why not use your multiple-tail-pointer trick here?  (The one that is
> > used in treercu.)
> 
> 1) Make the code of the advance of batches simpler.
> 2) batch_queue is protected by lock, so it will be hard to use
> multiple-tail-pointer trick.
> 3) rcu_batch API do add a little more runtime overhead, but this
> overhead is just
> several cpu-instructions, I think it is OK. It is good tradeoff when
> compare to the readability.

OK, let's see how it goes.

> I think we can also use rcu_batch for rcutree/rcutiny.

Hmmm...  Readability and speed both improved when moving from something
resembling rcu_batch to the current multi-tailed lists.  ;-)

> >>       unsigned long snap[NR_CPUS];
> >>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> >>       struct lockdep_map dep_map;
> >> @@ -67,12 +81,24 @@ int init_srcu_struct(struct srcu_struct *sp);
> >>
> >>  #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> >>
> >> +/* draft
> >> + * queue callbacks which will be invoked after grace period.
> >> + * The callback will be called inside process context.
> >> + * The callback should be fast without any sleeping path.
> >> + */
> >> +void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
> >> +             void (*func)(struct rcu_head *head));
> >> +
> >> +typedef void (*call_srcu_func_t)(struct srcu_struct *sp, struct rcu_head *head,
> >> +             void (*func)(struct rcu_head *head));
> >> +void __wait_srcu_gp(struct srcu_struct *sp, call_srcu_func_t crf);
> >>  void cleanup_srcu_struct(struct srcu_struct *sp);
> >>  int __srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
> >>  void __srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
> >>  void synchronize_srcu(struct srcu_struct *sp);
> >>  void synchronize_srcu_expedited(struct srcu_struct *sp);
> >>  long srcu_batches_completed(struct srcu_struct *sp);
> >> +void srcu_barrier(struct srcu_struct *sp);
> >>
> >>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> >>
> >> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> >> index a86f174..f9b551f 100644
> >> --- a/kernel/rcupdate.c
> >> +++ b/kernel/rcupdate.c
> >> @@ -45,6 +45,7 @@
> >>  #include <linux/mutex.h>
> >>  #include <linux/export.h>
> >>  #include <linux/hardirq.h>
> >> +#include <linux/srcu.h>
> >>
> >>  #define CREATE_TRACE_POINTS
> >>  #include <trace/events/rcu.h>
> >> @@ -123,20 +124,39 @@ static void wakeme_after_rcu(struct rcu_head  *head)
> >>       complete(&rcu->completion);
> >>  }
> >>
> >> -void wait_rcu_gp(call_rcu_func_t crf)
> >> +static void __wait_rcu_gp(void *domain, void *func)
> >>  {
> >>       struct rcu_synchronize rcu;
> >>
> >>       init_rcu_head_on_stack(&rcu.head);
> >>       init_completion(&rcu.completion);
> >> +
> >>       /* Will wake me after RCU finished. */
> >> -     crf(&rcu.head, wakeme_after_rcu);
> >> +     if (!domain) {
> >> +             call_rcu_func_t crf = func;
> >> +             crf(&rcu.head, wakeme_after_rcu);
> >> +     } else {
> >> +             call_srcu_func_t crf = func;
> >> +             crf(domain, &rcu.head, wakeme_after_rcu);
> >> +     }
> >> +
> >>       /* Wait for it. */
> >>       wait_for_completion(&rcu.completion);
> >>       destroy_rcu_head_on_stack(&rcu.head);
> >>  }
> >
> > Mightn't it be simpler and faster to just have a separate wait_srcu_gp()
> > that doesn't share code with wait_rcu_gp()?  I am all for sharing code,
> > but this might be hrting more than helping.
> >
> >> +
> >> +void wait_rcu_gp(call_rcu_func_t crf)
> >> +{
> >> +     __wait_rcu_gp(NULL, crf);
> >> +}
> >>  EXPORT_SYMBOL_GPL(wait_rcu_gp);
> >>
> >> +/* srcu.c internel */
> >> +void __wait_srcu_gp(struct srcu_struct *sp, call_srcu_func_t crf)
> >> +{
> >> +     __wait_rcu_gp(sp, crf);
> >> +}
> >> +
> >>  #ifdef CONFIG_PROVE_RCU
> >>  /*
> >>   * wrapper function to avoid #include problems.
> >> diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
> >> index 54e5724..40d24d0 100644
> >
> > OK, so your original patch #6 is folded into this?  I don't have a strong
> > view either way, just need to know.
> >
> >> --- a/kernel/rcutorture.c
> >> +++ b/kernel/rcutorture.c
> >> @@ -623,6 +623,11 @@ static int srcu_torture_completed(void)
> >>       return srcu_batches_completed(&srcu_ctl);
> >>  }
> >>
> >> +static void srcu_torture_deferred_free(struct rcu_torture *rp)
> >> +{
> >> +     call_srcu(&srcu_ctl, &rp->rtort_rcu, rcu_torture_cb);
> >> +}
> >> +
> >>  static void srcu_torture_synchronize(void)
> >>  {
> >>       synchronize_srcu(&srcu_ctl);
> >> @@ -652,7 +657,7 @@ static struct rcu_torture_ops srcu_ops = {
> >>       .read_delay     = srcu_read_delay,
> >>       .readunlock     = srcu_torture_read_unlock,
> >>       .completed      = srcu_torture_completed,
> >> -     .deferred_free  = rcu_sync_torture_deferred_free,
> >> +     .deferred_free  = srcu_torture_deferred_free,
> >>       .sync           = srcu_torture_synchronize,
> >>       .call           = NULL,
> >>       .cb_barrier     = NULL,
> >> @@ -660,6 +665,21 @@ static struct rcu_torture_ops srcu_ops = {
> >>       .name           = "srcu"
> >>  };
> >>
> >> +static struct rcu_torture_ops srcu_sync_ops = {
> >> +     .init           = srcu_torture_init,
> >> +     .cleanup        = srcu_torture_cleanup,
> >> +     .readlock       = srcu_torture_read_lock,
> >> +     .read_delay     = srcu_read_delay,
> >> +     .readunlock     = srcu_torture_read_unlock,
> >> +     .completed      = srcu_torture_completed,
> >> +     .deferred_free  = rcu_sync_torture_deferred_free,
> >> +     .sync           = srcu_torture_synchronize,
> >> +     .call           = NULL,
> >> +     .cb_barrier     = NULL,
> >> +     .stats          = srcu_torture_stats,
> >> +     .name           = "srcu_sync"
> >> +};
> >> +
> >>  static int srcu_torture_read_lock_raw(void) __acquires(&srcu_ctl)
> >>  {
> >>       return srcu_read_lock_raw(&srcu_ctl);
> >> @@ -677,7 +697,7 @@ static struct rcu_torture_ops srcu_raw_ops = {
> >>       .read_delay     = srcu_read_delay,
> >>       .readunlock     = srcu_torture_read_unlock_raw,
> >>       .completed      = srcu_torture_completed,
> >> -     .deferred_free  = rcu_sync_torture_deferred_free,
> >> +     .deferred_free  = srcu_torture_deferred_free,
> >>       .sync           = srcu_torture_synchronize,
> >>       .call           = NULL,
> >>       .cb_barrier     = NULL,
> >> @@ -685,6 +705,21 @@ static struct rcu_torture_ops srcu_raw_ops = {
> >>       .name           = "srcu_raw"
> >>  };
> >>
> >> +static struct rcu_torture_ops srcu_raw_sync_ops = {
> >> +     .init           = srcu_torture_init,
> >> +     .cleanup        = srcu_torture_cleanup,
> >> +     .readlock       = srcu_torture_read_lock_raw,
> >> +     .read_delay     = srcu_read_delay,
> >> +     .readunlock     = srcu_torture_read_unlock_raw,
> >> +     .completed      = srcu_torture_completed,
> >> +     .deferred_free  = rcu_sync_torture_deferred_free,
> >> +     .sync           = srcu_torture_synchronize,
> >> +     .call           = NULL,
> >> +     .cb_barrier     = NULL,
> >> +     .stats          = srcu_torture_stats,
> >> +     .name           = "srcu_raw_sync"
> >> +};
> >> +
> >>  static void srcu_torture_synchronize_expedited(void)
> >>  {
> >>       synchronize_srcu_expedited(&srcu_ctl);
> >> @@ -1673,7 +1708,7 @@ static int rcu_torture_barrier_init(void)
> >>       for (i = 0; i < n_barrier_cbs; i++) {
> >>               init_waitqueue_head(&barrier_cbs_wq[i]);
> >>               barrier_cbs_tasks[i] = kthread_run(rcu_torture_barrier_cbs,
> >> -                                                (void *)i,
> >> +                                                (void *)(long)i,
> >>                                                  "rcu_torture_barrier_cbs");
> >>               if (IS_ERR(barrier_cbs_tasks[i])) {
> >>                       ret = PTR_ERR(barrier_cbs_tasks[i]);
> >> @@ -1857,7 +1892,8 @@ rcu_torture_init(void)
> >>       static struct rcu_torture_ops *torture_ops[] =
> >>               { &rcu_ops, &rcu_sync_ops, &rcu_expedited_ops,
> >>                 &rcu_bh_ops, &rcu_bh_sync_ops, &rcu_bh_expedited_ops,
> >> -               &srcu_ops, &srcu_raw_ops, &srcu_expedited_ops,
> >> +               &srcu_ops, &srcu_sync_ops, &srcu_raw_ops,
> >> +               &srcu_raw_sync_ops, &srcu_expedited_ops,
> >>                 &sched_ops, &sched_sync_ops, &sched_expedited_ops, };
> >>
> >>       mutex_lock(&fullstop_mutex);
> >> diff --git a/kernel/srcu.c b/kernel/srcu.c
> >> index d101ed5..532f890 100644
> >> --- a/kernel/srcu.c
> >> +++ b/kernel/srcu.c
> >> @@ -34,10 +34,60 @@
> >>  #include <linux/delay.h>
> >>  #include <linux/srcu.h>
> >>
> >> +static inline void rcu_batch_init(struct rcu_batch *b)
> >> +{
> >> +     b->head = NULL;
> >> +     b->tail = &b->head;
> >> +}
> >> +
> >> +static inline void rcu_batch_queue(struct rcu_batch *b, struct rcu_head *head)
> >> +{
> >> +     *b->tail = head;
> >> +     b->tail = &head->next;
> >> +}
> >> +
> >> +static inline bool rcu_batch_empty(struct rcu_batch *b)
> >> +{
> >> +     return b->tail == &b->head;
> >> +}
> >> +
> >> +static inline struct rcu_head *rcu_batch_dequeue(struct rcu_batch *b)
> >> +{
> >> +     struct rcu_head *head;
> >> +
> >> +     if (rcu_batch_empty(b))
> >> +             return NULL;
> >> +
> >> +     head = b->head;
> >> +     b->head = head->next;
> >> +     if (b->tail == &head->next)
> >> +             rcu_batch_init(b);
> >> +
> >> +     return head;
> >> +}
> >> +
> >> +static inline void rcu_batch_move(struct rcu_batch *to, struct rcu_batch *from)
> >> +{
> >> +     if (!rcu_batch_empty(from)) {
> >> +             *to->tail = from->head;
> >> +             to->tail = from->tail;
> >> +             rcu_batch_init(from);
> >> +     }
> >> +}
> >
> > And perhaps this is why you don't want the multi-tailed queue?
> >
> >> +
> >> +/* single-thread state-machine */
> >> +static void process_srcu(struct work_struct *work);
> >> +
> >>  static int init_srcu_struct_fields(struct srcu_struct *sp)
> >>  {
> >>       sp->completed = 0;
> >> -     mutex_init(&sp->mutex);
> >> +     spin_lock_init(&sp->queue_lock);
> >> +     sp->running = false;
> >> +     rcu_batch_init(&sp->batch_queue);
> >> +     rcu_batch_init(&sp->batch_check0);
> >> +     rcu_batch_init(&sp->batch_check1);
> >> +     rcu_batch_init(&sp->batch_done);
> >> +     INIT_DELAYED_WORK(&sp->work, process_srcu);
> >>       sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
> >>       return sp->per_cpu_ref ? 0 : -ENOMEM;
> >>  }
> >> @@ -254,11 +304,9 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> >>   * we repeatedly block for 1-millisecond time periods.  This approach
> >>   * has done well in testing, so there is no need for a config parameter.
> >>   */
> >> -#define SYNCHRONIZE_SRCU_READER_DELAY        5
> >> -#define SYNCHRONIZE_SRCU_TRYCOUNT    2
> >> -#define SYNCHRONIZE_SRCU_EXP_TRYCOUNT        12
> >> +#define SRCU_RETRY_CHECK_DELAY       5
> >>
> >> -static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
> >> +static bool try_check_zero(struct srcu_struct *sp, int idx, int trycount)
> >>  {
> >>       /*
> >>        * If a reader fetches the index before the ->completed increment,
> >> @@ -271,19 +319,12 @@ static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
> >>        */
> >>       smp_mb(); /* D */
> >>
> >> -     /*
> >> -      * SRCU read-side critical sections are normally short, so wait
> >> -      * a small amount of time before possibly blocking.
> >> -      */
> >> -     if (!srcu_readers_active_idx_check(sp, idx)) {
> >> -             udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> >> -             while (!srcu_readers_active_idx_check(sp, idx)) {
> >> -                     if (trycount > 0) {
> >> -                             trycount--;
> >> -                             udelay(SYNCHRONIZE_SRCU_READER_DELAY);
> >> -                     } else
> >> -                             schedule_timeout_interruptible(1);
> >> -             }
> >> +     for (;;) {
> >> +             if (srcu_readers_active_idx_check(sp, idx))
> >> +                     break;
> >> +             if (--trycount <= 0)
> >> +                     return false;
> >> +             udelay(SRCU_RETRY_CHECK_DELAY);
> >>       }
> >>
> >>       /*
> >> @@ -297,6 +338,8 @@ static void wait_idx(struct srcu_struct *sp, int idx, int trycount)
> >>        * the next flipping.
> >>        */
> >>       smp_mb(); /* E */
> >> +
> >> +     return true;
> >>  }
> >>
> >>  /*
> >> @@ -308,10 +351,27 @@ static void srcu_flip(struct srcu_struct *sp)
> >>       ACCESS_ONCE(sp->completed)++;
> >>  }
> >>
> >> +void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
> >> +             void (*func)(struct rcu_head *head))
> >> +{
> >> +     unsigned long flags;
> >> +
> >> +     head->next = NULL;
> >> +     head->func = func;
> >> +     spin_lock_irqsave(&sp->queue_lock, flags);
> >> +     rcu_batch_queue(&sp->batch_queue, head);
> >> +     if (!sp->running) {
> >> +             sp->running = true;
> >> +             queue_delayed_work(system_nrt_wq, &sp->work, 0);
> >> +     }
> >> +     spin_unlock_irqrestore(&sp->queue_lock, flags);
> >> +}
> >> +EXPORT_SYMBOL_GPL(call_srcu);
> >> +
> >>  /*
> >>   * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
> >>   */
> >> -static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
> >> +static void __synchronize_srcu(struct srcu_struct *sp)
> >>  {
> >>       rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
> >>                          !lock_is_held(&rcu_bh_lock_map) &&
> >> @@ -319,54 +379,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
> >>                          !lock_is_held(&rcu_sched_lock_map),
> >>                          "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");
> >>
> >> -     mutex_lock(&sp->mutex);
> >> -
> >> -     /*
> >> -      * Suppose that during the previous grace period, a reader
> >> -      * picked up the old value of the index, but did not increment
> >> -      * its counter until after the previous instance of
> >> -      * __synchronize_srcu() did the counter summation and recheck.
> >> -      * That previous grace period was OK because the reader did
> >> -      * not start until after the grace period started, so the grace
> >> -      * period was not obligated to wait for that reader.
> >> -      *
> >> -      * However, the current SRCU grace period does have to wait for
> >> -      * that reader.  This is handled by invoking wait_idx() on the
> >> -      * non-active set of counters (hence sp->completed - 1).  Once
> >> -      * wait_idx() returns, we know that all readers that picked up
> >> -      * the old value of ->completed and that already incremented their
> >> -      * counter will have completed.
> >> -      *
> >> -      * But what about readers that picked up the old value of
> >> -      * ->completed, but -still- have not managed to increment their
> >> -      * counter?  We do not need to wait for those readers, because
> >> -      * they will have started their SRCU read-side critical section
> >> -      * after the current grace period starts.
> >> -      *
> >> -      * Because it is unlikely that readers will be preempted between
> >> -      * fetching ->completed and incrementing their counter, wait_idx()
> >> -      * will normally not need to wait.
> >> -      */
> >> -     wait_idx(sp, (sp->completed - 1) & 0x1, trycount);
> >> -
> >> -     /*
> >> -      * Now that wait_idx() has waited for the really old readers,
> >> -      *
> >> -      * Flip the readers' index by incrementing ->completed, then wait
> >> -      * until there are no more readers using the counters referenced by
> >> -      * the old index value.  (Recall that the index is the bottom bit
> >> -      * of ->completed.)
> >> -      *
> >> -      * Of course, it is possible that a reader might be delayed for the
> >> -      * full duration of flip_idx_and_wait() between fetching the
> >> -      * index and incrementing its counter.  This possibility is handled
> >> -      * by the next __synchronize_srcu() invoking wait_idx() for such
> >> -      * readers before starting a new grace period.
> >> -      */
> >> -     srcu_flip(sp);
> >> -     wait_idx(sp, (sp->completed - 1) & 0x1, trycount);
> >> -
> >> -     mutex_unlock(&sp->mutex);
> >> +     __wait_srcu_gp(sp, call_srcu);
> >>  }
> >>
> >>  /**
> >> @@ -385,7 +398,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
> >>   */
> >>  void synchronize_srcu(struct srcu_struct *sp)
> >>  {
> >> -     __synchronize_srcu(sp, SYNCHRONIZE_SRCU_TRYCOUNT);
> >> +     __synchronize_srcu(sp);
> >>  }
> >>  EXPORT_SYMBOL_GPL(synchronize_srcu);
> >>
> >> @@ -406,10 +419,16 @@ EXPORT_SYMBOL_GPL(synchronize_srcu);
> >>   */
> >>  void synchronize_srcu_expedited(struct srcu_struct *sp)
> >>  {
> >> -     __synchronize_srcu(sp, SYNCHRONIZE_SRCU_EXP_TRYCOUNT);
> >> +     __synchronize_srcu(sp);
> >>  }
> >
> > OK, I'll bite...  Why aren't synchronize_srcu_expedited() and
> > synchronize_srcu() different?
> 
> In mb()-based srcu, synchronize_srcu() is very fast, so
> synchronize_srcu_expedited() makes less sense than before.

I am worried about expedited callbacks getting backed up behind
non-expedited callbacks (especially given Peter's point about per-VMA
SRCU callbacks) and behind other workqueue uses.

> But when wait_srcu_gp() is moved back here, I will use
> a bigger "trycount" for synchronize_srcu_expedited().
> 
> And is there any problem with srcu_advance_batches()?

I prefer the use of "return" that you and Peter discussed later.
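
For concreteness, a sketch of that early-return restructuring, reusing the
names from the quoted patch (sp->batch_*, try_check_zero(), srcu_flip())
and leaving the trycount policy as a placeholder, might look like:

static void srcu_advance_batches(struct srcu_struct *sp)
{
        int idx = 1 - (sp->completed & 0x1UL);

        /* Nothing is waiting for a grace period, so nothing to advance. */
        if (rcu_batch_empty(&sp->batch_check0) &&
            rcu_batch_empty(&sp->batch_check1))
                return;

        /* Readers may still hold the inactive index; retry on the next pass. */
        if (!try_check_zero(sp, idx, 1))
                return;

        /* batch_check1 callbacks have now waited out both index phases. */
        rcu_batch_move(&sp->batch_done, &sp->batch_check1);
        rcu_batch_move(&sp->batch_check1, &sp->batch_check0);

        if (rcu_batch_empty(&sp->batch_check1))
                return;

        /* Start a new grace period for the callbacks that just moved up. */
        srcu_flip(sp);
        if (try_check_zero(sp, 1 - idx, 2))
                rcu_batch_move(&sp->batch_done, &sp->batch_check1);
}

The intent is identical to the nested-if version in the patch; only the
control flow is flattened.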

What sort of testing are you doing?

							Thanx, Paul

> Thanks.
> Lai
> 
> >
> >                                                        Thanx, Paul
> >
> >>  EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);
> >>
> >> +void srcu_barrier(struct srcu_struct *sp)
> >> +{
> >> +     __synchronize_srcu(sp);
> >> +}
> >> +EXPORT_SYMBOL_GPL(srcu_barrier);
> >> +
> >>  /**
> >>   * srcu_batches_completed - return batches completed.
> >>   * @sp: srcu_struct on which to report batch completion.
> >> @@ -423,3 +442,84 @@ long srcu_batches_completed(struct srcu_struct *sp)
> >>       return sp->completed;
> >>  }
> >>  EXPORT_SYMBOL_GPL(srcu_batches_completed);
> >> +
> >> +#define SRCU_CALLBACK_BATCH  10
> >> +#define SRCU_INTERVAL                1
> >> +
> >> +static void srcu_collect_new(struct srcu_struct *sp)
> >> +{
> >> +     if (!rcu_batch_empty(&sp->batch_queue)) {
> >> +             spin_lock_irq(&sp->queue_lock);
> >> +             rcu_batch_move(&sp->batch_check0, &sp->batch_queue);
> >> +             spin_unlock_irq(&sp->queue_lock);
> >> +     }
> >> +}
> >> +
> >> +static void srcu_advance_batches(struct srcu_struct *sp)
> >> +{
> >> +     int idx = 1 - (sp->completed & 0x1UL);
> >> +
> >> +     /*
> >> +      * SRCU read-side critical sections are normally short, so check
> >> +      * twice after a flip.
> >> +      */
> >> +     if (!rcu_batch_empty(&sp->batch_check1) ||
> >> +         !rcu_batch_empty(&sp->batch_check0)) {
> >> +             if (try_check_zero(sp, idx, 1)) {
> >> +                     rcu_batch_move(&sp->batch_done, &sp->batch_check1);
> >> +                     rcu_batch_move(&sp->batch_check1, &sp->batch_check0);
> >> +                     if (!rcu_batch_empty(&sp->batch_check1)) {
> >> +                             srcu_flip(sp);
> >> +                             if (try_check_zero(sp, 1 - idx, 2)) {
> >> +                                     rcu_batch_move(&sp->batch_done,
> >> +                                             &sp->batch_check1);
> >> +                             }
> >> +                     }
> >> +             }
> >> +     }
> >> +}
> >> +
> >> +static void srcu_invoke_callbacks(struct srcu_struct *sp)
> >> +{
> >> +     int i;
> >> +     struct rcu_head *head;
> >> +
> >> +     for (i = 0; i < SRCU_CALLBACK_BATCH; i++) {
> >> +             head = rcu_batch_dequeue(&sp->batch_done);
> >> +             if (!head)
> >> +                     break;
> >> +             head->func(head);
> >> +     }
> >> +}
> >> +
> >> +static void srcu_reschedule(struct srcu_struct *sp)
> >> +{
> >> +     bool running = true;
> >> +
> >> +     if (rcu_batch_empty(&sp->batch_done) &&
> >> +         rcu_batch_empty(&sp->batch_check1) &&
> >> +         rcu_batch_empty(&sp->batch_check0) &&
> >> +         rcu_batch_empty(&sp->batch_queue)) {
> >> +             spin_lock_irq(&sp->queue_lock);
> >> +             if (rcu_batch_empty(&sp->batch_queue)) {
> >> +                     sp->running = false;
> >> +                     running = false;
> >> +             }
> >> +             spin_unlock_irq(&sp->queue_lock);
> >> +     }
> >> +
> >> +     if (running)
> >> +             queue_delayed_work(system_nrt_wq, &sp->work, SRCU_INTERVAL);
> >> +}
> >> +
> >> +static void process_srcu(struct work_struct *work)
> >> +{
> >> +     struct srcu_struct *sp;
> >> +
> >> +     sp = container_of(work, struct srcu_struct, work.work);
> >> +
> >> +     srcu_collect_new(sp);
> >> +     srcu_advance_batches(sp);
> >> +     srcu_invoke_callbacks(sp);
> >> +     srcu_reschedule(sp);
> >> +}
> >>
> >>
> >>
> >
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-12 17:58                               ` Peter Zijlstra
@ 2012-03-12 18:32                                 ` Paul E. McKenney
  2012-03-12 20:25                                   ` Peter Zijlstra
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-12 18:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Lai Jiangshan, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches, tj

On Mon, Mar 12, 2012 at 06:58:17PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-03-12 at 10:54 -0700, Paul E. McKenney wrote:
> > On Sat, Mar 10, 2012 at 11:09:53AM +0100, Peter Zijlstra wrote:
> > > On Thu, 2012-03-08 at 11:58 -0800, Paul E. McKenney wrote:
> > > > 
> > > > But I guess I should ask...  Peter, what do you expect the maximum
> > > > call_srcu() rate to be in your use cases?  If tens of thousands are
> > > > possible, some adjustments will be needed. 
> > > 
> > > The one call-site I currently have is linked to vma lifetimes, so yeah,
> > > I guess that that can be lots.
> > 
> > So the worst case would be if several processes with lots of VMAs were
> > to exit at about the same time?  If so, my guess is that call_srcu()
> > needs to handle several thousand callbacks showing up within a few
> > tens of microseconds.  Is that a reasonable assumption, or am I missing
> > an order of magnitude or two in either direction?
> 
> That or a process is doing mmap/munmap loops (some file scanners are
> known to do this). But yeah, that can be lots.
> 
> My current use case doesn't quite trigger since it needs another syscall
> to attach something to a vma before vma tear-down actually ends up
> calling call_srcu(), so in practice it won't be much at all (for now).
> 
> Still I think it's prudent to make it (srcu callbacks) deal with plenty
> of callbacks even if initially there won't be many -- who knows what other
> people will do while you're not paying attention... :-)

And another question I should have asked to begin with...  Would each
VMA have its own SRCU domain, or are you thinking in terms of one
SRCU domain for all VMAs globally?

If the latter, that pushes pretty strongly for per-CPU SRCU callback
lists.  Which brings up srcu_barrier() scalability (and yes, I am working
on rcu_barrier() scalability).  One way to handle this at least initially
is to have srcu_barrier() avoid enqueueing callbacks on CPUs whose
callback lists are empty.  In addition, if the loop over all CPUs is
preemptible, then there should not be much in the way of realtime issues.

The memory-ordering issues should be OK -- if a given CPU's list was
empty at any time during srcu_barrier()'s execution, then there is no
need for a callback on that CPU.
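
To make that shape concrete, here is a rough sketch only: it assumes a
hypothetical per-CPU callback layout, plus hypothetical helpers
srcu_cpu_cblist_empty() and call_srcu_on_cpu() that do not exist in the
current patch, and it omits serialization against concurrent srcu_barrier()
calls.  The point is just: skip CPUs whose lists are empty, count the rest,
and wait once at the end.

#include <linux/atomic.h>
#include <linux/completion.h>
#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/srcu.h>

struct srcu_barrier_cb {
        struct rcu_head head;
        atomic_t *cpu_count;
        struct completion *done;
};

static DEFINE_PER_CPU(struct srcu_barrier_cb, srcu_barrier_cpu_cb);

static void srcu_barrier_cbf(struct rcu_head *head)
{
        struct srcu_barrier_cb *cb =
                container_of(head, struct srcu_barrier_cb, head);

        /* The last barrier callback to run releases the waiter. */
        if (atomic_dec_and_test(cb->cpu_count))
                complete(cb->done);
}

void srcu_barrier(struct srcu_struct *sp)
{
        DECLARE_COMPLETION_ONSTACK(done);
        atomic_t cpu_count = ATOMIC_INIT(1);    /* bias against early completion */
        int cpu;

        for_each_possible_cpu(cpu) {
                struct srcu_barrier_cb *cb = &per_cpu(srcu_barrier_cpu_cb, cpu);

                if (srcu_cpu_cblist_empty(sp, cpu))     /* hypothetical helper */
                        continue;       /* empty list: nothing to wait for */

                cb->cpu_count = &cpu_count;
                cb->done = &done;
                atomic_inc(&cpu_count);
                /* hypothetical: enqueue at the tail of this CPU's callback list */
                call_srcu_on_cpu(sp, cpu, &cb->head, srcu_barrier_cbf);
                cond_resched();         /* keep the loop preemptible */
        }

        /* Drop the bias; wait only if some CPU actually had callbacks queued. */
        if (!atomic_dec_and_test(&cpu_count))
                wait_for_completion(&done);
}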

							Thanx, Paul


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-12 18:32                                 ` Paul E. McKenney
@ 2012-03-12 20:25                                   ` Peter Zijlstra
  2012-03-12 23:15                                     ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-12 20:25 UTC (permalink / raw)
  To: paulmck
  Cc: Lai Jiangshan, Lai Jiangshan, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches, tj

On Mon, 2012-03-12 at 11:32 -0700, Paul E. McKenney wrote:
> And another question I should have asked to begin with...  Would each
> VMA have its own SRCU domain, or are you thinking in terms of one
> SRCU domain for all VMAs globally?

The latter, single domain for all objects.

> If the latter, that pushes pretty strongly for per-CPU SRCU callback
> lists. 

Agreed. I was under the impression the proposed thing had this, but on
looking at it again it does not. Shouldn't be hard to add though.

>  Which brings up srcu_barrier() scalability (and yes, I am working
> on rcu_barrier() scalability).  One way to handle this at least initially
> is to have srcu_barrier() avoid enqueueing callbacks on CPUs whose
> callback lists are empty.  In addition, if the loop over all CPUs is
> preemptible, then there should not be much in the way of realtime issues.

Why do we have rcu_barrier() and how is it different from
synchronize_rcu()?


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-12 20:25                                   ` Peter Zijlstra
@ 2012-03-12 23:15                                     ` Paul E. McKenney
  2012-03-12 23:18                                       ` Peter Zijlstra
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-12 23:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Lai Jiangshan, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches, tj

On Mon, Mar 12, 2012 at 09:25:16PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-03-12 at 11:32 -0700, Paul E. McKenney wrote:
> > And another question I should have asked to begin with...  Would each
> > VMA have its own SRCU domain, or are you thinking in terms of one
> > SRCU domain for all VMAs globally?
> 
> The latter, single domain for all objects.

OK.

> > If the latter, that pushes pretty strongly for per-CPU SRCU callback
> > lists. 
> 
> Agreed. I was under the impression the proposed thing had this, but on
> looking at it again it does not. Shouldn't be hard to add though.

Agreed, but see srcu_barrier()...

> >  Which brings up srcu_barrier() scalability (and yes, I am working
> > on rcu_barrier() scalability).  One way to handle this at least initially
> > is to have srcu_barrier() avoid enqueueing callbacks on CPUs whose
> > callback lists are empty.  In addition, if the loop over all CPUs is
> > preemptible, then there should not be much in the way of realtime issues.
> 
> Why do we have rcu_barrier() and how is it different from
> synchronize_rcu()?

We need rcu_barrier() in order to be able to safely unload modules that
use call_rcu().  If a module fails to invoke rcu_barrier() between its
last call_rcu() and its unloading, then its RCU callbacks can be fatally
disappointed to learn that their callback functions are no longer in
memory.  See http://lwn.net/Articles/202847/ for more info.

While synchronize_rcu() waits only for a grace period, rcu_barrier()
waits for all pre-existing RCU callbacks to be invoked.  There is also
an rcu_barrier_bh() and rcu_barrier_sched().

Of course, if all uses of call_srcu() were to be from the main kernel
(as opposed to from a module), then there would be no need for a
srcu_barrier().  But it seems quite likely that if a call_srcu() is
available, its use from a module won't be far behind -- especially
given that rcutorture is normally built as a kernel module.
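
As a concrete illustration of why, here is a minimal sketch of the usual
unload-safe pattern, assuming a hypothetical module "foo" with its own
srcu_struct whose objects are freed via call_srcu():

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/srcu.h>

struct foo {
        struct rcu_head rcu;
        int data;
};

static struct srcu_struct foo_srcu;

static void foo_reclaim(struct rcu_head *rcu)
{
        kfree(container_of(rcu, struct foo, rcu));
}

/* Called from the module's normal object-teardown paths. */
static void foo_release(struct foo *p)
{
        call_srcu(&foo_srcu, &p->rcu, foo_reclaim);
}

static int __init foo_init(void)
{
        return init_srcu_struct(&foo_srcu);
}

static void __exit foo_exit(void)
{
        /*
         * Wait for every callback already handed to call_srcu() to be
         * invoked before foo_reclaim()'s text vanishes along with the
         * module.  synchronize_srcu() alone would not guarantee that.
         */
        srcu_barrier(&foo_srcu);
        cleanup_srcu_struct(&foo_srcu);
}

module_init(foo_init);
module_exit(foo_exit);
MODULE_LICENSE("GPL");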

							Thanx, Paul


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-12 23:15                                     ` Paul E. McKenney
@ 2012-03-12 23:18                                       ` Peter Zijlstra
  2012-03-12 23:38                                         ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2012-03-12 23:18 UTC (permalink / raw)
  To: paulmck
  Cc: Lai Jiangshan, Lai Jiangshan, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches, tj

On Mon, 2012-03-12 at 16:15 -0700, Paul E. McKenney wrote:
> We need rcu_barrier() in order to be able to safely unload modules
> that
> use call_rcu(). 

Oh, right.. I always forget about modules. 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/6] implement per-cpu&per-domain state machine call_srcu()
  2012-03-12 23:18                                       ` Peter Zijlstra
@ 2012-03-12 23:38                                         ` Paul E. McKenney
  0 siblings, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-03-12 23:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Lai Jiangshan, linux-kernel, mingo, dipankar,
	akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches, tj

On Tue, Mar 13, 2012 at 12:18:39AM +0100, Peter Zijlstra wrote:
> On Mon, 2012-03-12 at 16:15 -0700, Paul E. McKenney wrote:
> > We need rcu_barrier() in order to be able to safely unload modules
> > that
> > use call_rcu(). 
> 
> Oh, right.. I always forget about modules. 

Lucky you!  ;-)

(Sorry, couldn't resist...)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu()
  2012-03-12 18:03                         ` Paul E. McKenney
@ 2012-03-14  7:47                           ` Lai Jiangshan
  2012-04-10 20:15                             ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Lai Jiangshan @ 2012-03-14  7:47 UTC (permalink / raw)
  To: paulmck
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On 03/13/2012 02:03 AM, Paul E. McKenney wrote:

>>
>> In mb()-based srcu, synchronize_srcu() is very fast, so
>> synchronize_srcu_expedited() makes less sense than before.
> 
> I am worried about expedited callbacks getting backed up behind
> non-expedited callbacks (especially given Peter's point about per-VMA
> SRCU callbacks) and behind other workqueue uses.
> 
>> But when wait_srcu_gp() is moved back here, I will use
>> a bigger "trycount" for synchronize_srcu_expedited().
>>
>> And is there any problem with srcu_advance_batches()?
> 
> I prefer the use of "return" that you and Peter discussed later.
> 
> What sort of testing are you doing?
>

I have been running rcutorture for several days on my daily-use machine.

Which would you prefer for the next round of patches, single-thread or per-cpu?
I will send them soon.
(The per-cpu approach will also be "batches, in-sleepable, reuse rcu_head"....)

I prefer the single-thread approach until the high-callback-rate-per-domain era
comes, but I don't know how long that will take. Peter?

Thanks,
Lai

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu()
  2012-03-14  7:47                           ` Lai Jiangshan
@ 2012-04-10 20:15                             ` Paul E. McKenney
  0 siblings, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2012-04-10 20:15 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-kernel, mingo, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, peterz, rostedt,
	Valdis.Kletnieks, dhowells, eric.dumazet, darren, fweisbec,
	patches

On Wed, Mar 14, 2012 at 03:47:06PM +0800, Lai Jiangshan wrote:
> On 03/13/2012 02:03 AM, Paul E. McKenney wrote:
> 
> >>
> >> In mb()-based srcu, synchronize_srcu() is very fast, so
> >> synchronize_srcu_expedited() makes less sense than before.
> > 
> > I am worried about expedited callbacks getting backed up behind
> > non-expedited callbacks (especially given Peter's point about per-VMA
> > SRCU callbacks) and behind other workqueue uses.
> > 
> >> But when wait_srcu_gp() is moved back here, I will use
> >> a bigger "trycount" for synchronize_srcu_expedited().
> >>
> >> And is there any problem with srcu_advance_batches()?
> > 
> > I prefer the use of "return" that you and Peter discussed later.
> > 
> > What sort of testing are you doing?
> 
> I have been running rcutorture for several days on my daily-use machine.

OK, good!

> Which would you prefer for the next round of patches, single-thread or per-cpu?
> I will send them soon.
> (The per-cpu approach will also be "batches, in-sleepable, reuse rcu_head"....)
> 
> I prefer the single-thread approach until the high-callback-rate-per-domain era
> comes, but I don't know how long that will take. Peter?

Well, the price for sticking with the single-thread approach is
a commitment on your part to create a high-callback-rate-per-domain
version at a moment's notice, should it be needed.

Can you commit to that?  If not, then the initial version needs to be
able to handle a high callback rate.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 100+ messages in thread

end of thread, other threads:[~2012-04-10 20:16 UTC | newest]

Thread overview: 100+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-13  2:09 [PATCH RFC tip/core/rcu] rcu: direct algorithmic SRCU implementation Paul E. McKenney
2012-02-15 12:59 ` Peter Zijlstra
2012-02-16  6:35   ` Paul E. McKenney
2012-02-16 10:50     ` Mathieu Desnoyers
2012-02-16 10:52       ` Peter Zijlstra
2012-02-16 11:14         ` Mathieu Desnoyers
2012-02-15 14:31 ` Mathieu Desnoyers
2012-02-15 14:51   ` Mathieu Desnoyers
2012-02-16  6:38     ` Paul E. McKenney
2012-02-16 11:00       ` Mathieu Desnoyers
2012-02-16 11:51         ` Peter Zijlstra
2012-02-16 12:18           ` Mathieu Desnoyers
2012-02-16 12:44             ` Peter Zijlstra
2012-02-16 14:52               ` Mathieu Desnoyers
2012-02-16 14:58                 ` Peter Zijlstra
2012-02-16 15:13               ` Paul E. McKenney
2012-02-20  7:15 ` Lai Jiangshan
2012-02-20 17:44   ` Paul E. McKenney
2012-02-21  1:11     ` Lai Jiangshan
2012-02-21  1:50       ` Paul E. McKenney
2012-02-21  8:44         ` Lai Jiangshan
2012-02-21 17:24           ` Paul E. McKenney
2012-02-22  9:29             ` [PATCH 1/3 RFC paul/rcu/srcu] srcu: Remove fast check path Lai Jiangshan
2012-02-22  9:29             ` [PATCH 2/3 RFC paul/rcu/srcu] srcu: only increase the upper bit for srcu_read_lock() Lai Jiangshan
2012-02-22  9:50               ` Peter Zijlstra
2012-02-22 21:20               ` Paul E. McKenney
2012-02-22 21:26                 ` Paul E. McKenney
2012-02-22 21:39                   ` Steven Rostedt
2012-02-23  1:01                     ` Paul E. McKenney
2012-02-22  9:29             ` [PATCH 3/3 RFC paul/rcu/srcu] srcu: flip only once for every grace period Lai Jiangshan
2012-02-23  1:01               ` Paul E. McKenney
2012-02-24  8:06               ` Lai Jiangshan
2012-02-24 20:01                 ` Paul E. McKenney
2012-02-27  8:01                   ` [PATCH 1/2 RFC] srcu: change the comments of the wait algorithm Lai Jiangshan
2012-02-27  8:01                   ` [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm Lai Jiangshan
2012-02-27 18:30                     ` Paul E. McKenney
2012-02-28  1:51                       ` Lai Jiangshan
2012-02-28 13:47                         ` Paul E. McKenney
2012-02-29 10:07                           ` Lai Jiangshan
2012-02-29 13:55                             ` Paul E. McKenney
2012-03-01  2:31                               ` Lai Jiangshan
2012-03-01 13:20                                 ` Paul E. McKenney
2012-03-10  3:41                                   ` Lai Jiangshan
2012-03-06  8:42             ` [RFC PATCH 0/6 paul/rcu/srcu] srcu: implement call_srcu() Lai Jiangshan
2012-03-06  9:57               ` [PATCH 1/6] remove unused srcu_barrier() Lai Jiangshan
2012-03-06  9:57                 ` [PATCH 2/6] Don't touch the snap in srcu_readers_active() Lai Jiangshan
2012-03-08 19:14                   ` Paul E. McKenney
2012-03-06  9:57                 ` [PATCH 3/6] use "int trycount" instead of "bool expedited" Lai Jiangshan
2012-03-08 19:25                   ` Paul E. McKenney
2012-03-06  9:57                 ` [PATCH 4/6] remove flip_idx_and_wait() Lai Jiangshan
2012-03-06 10:41                   ` Peter Zijlstra
2012-03-07  3:54                   ` [RFC PATCH 5/5 single-thread-version] implement per-domain single-thread state machine call_srcu() Lai Jiangshan
2012-03-08 13:04                     ` Peter Zijlstra
2012-03-08 14:17                       ` Lai Jiangshan
2012-03-08 13:08                     ` Peter Zijlstra
2012-03-08 20:35                     ` Paul E. McKenney
2012-03-10  3:16                       ` Lai Jiangshan
2012-03-12 18:03                         ` Paul E. McKenney
2012-03-14  7:47                           ` Lai Jiangshan
2012-04-10 20:15                             ` Paul E. McKenney
2012-03-06  9:57                 ` [RFC PATCH 5/6] implement per-cpu&per-domain " Lai Jiangshan
2012-03-06 10:47                   ` Peter Zijlstra
2012-03-08 19:44                     ` Paul E. McKenney
2012-03-06 10:58                   ` Peter Zijlstra
2012-03-06 15:17                     ` Lai Jiangshan
2012-03-06 15:38                       ` Peter Zijlstra
2012-03-08 19:49                         ` Paul E. McKenney
2012-03-10 10:12                           ` Peter Zijlstra
2012-03-12 17:52                             ` Paul E. McKenney
2012-03-06 11:16                   ` Peter Zijlstra
2012-03-06 15:12                     ` Lai Jiangshan
2012-03-06 15:34                       ` Peter Zijlstra
2012-03-08 19:58                         ` Paul E. McKenney
2012-03-10  3:32                           ` Lai Jiangshan
2012-03-10 10:09                           ` Peter Zijlstra
2012-03-12 17:54                             ` Paul E. McKenney
2012-03-12 17:58                               ` Peter Zijlstra
2012-03-12 18:32                                 ` Paul E. McKenney
2012-03-12 20:25                                   ` Peter Zijlstra
2012-03-12 23:15                                     ` Paul E. McKenney
2012-03-12 23:18                                       ` Peter Zijlstra
2012-03-12 23:38                                         ` Paul E. McKenney
2012-03-06 15:26                     ` Lai Jiangshan
2012-03-06 15:37                       ` Peter Zijlstra
2012-03-06 11:17                   ` Peter Zijlstra
2012-03-06 11:22                   ` Peter Zijlstra
2012-03-06 11:35                   ` Peter Zijlstra
2012-03-06 11:36                   ` Peter Zijlstra
2012-03-06 11:39                   ` Peter Zijlstra
2012-03-06 14:50                     ` Lai Jiangshan
2012-03-06 11:52                   ` Peter Zijlstra
2012-03-06 14:44                     ` Lai Jiangshan
2012-03-06 15:31                       ` Peter Zijlstra
2012-03-06 15:32                       ` Peter Zijlstra
2012-03-07  6:44                         ` Lai Jiangshan
2012-03-07  8:10                       ` Gilad Ben-Yossef
2012-03-07  9:21                         ` Lai Jiangshan
2012-03-06 14:47                     ` Lai Jiangshan
2012-03-06  9:57                 ` [PATCH 6/6] add srcu torture test Lai Jiangshan
2012-03-08 19:03                 ` [PATCH 1/6] remove unused srcu_barrier() Paul E. McKenney
