Date: Fri, 18 Nov 2016 14:27:08 +0000 (UTC)
From: Mathieu Desnoyers
To: "Paul E. McKenney"
Cc: ldr709, linux-kernel, Ingo Molnar, Lai Jiangshan, dipankar,
	Andrew Morton, Josh Triplett, Thomas Gleixner, Peter Zijlstra,
	rostedt, David Howells, Eric Dumazet, dvhart, fweisbec,
	Oleg Nesterov, bobby prani
Message-ID: <1917047180.7534.1479479228544.JavaMail.zimbra@efficios.com>
In-Reply-To: <20161118140845.GQ3612@linux.vnet.ibm.com>
References: <20161117220335.20177-1-ldr709@gmail.com> <20161118140845.GQ3612@linux.vnet.ibm.com>
Subject: Re: [RFC PATCH] SRCU: More efficient reader counts.

----- On Nov 18, 2016, at 9:08 AM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:

> On Thu, Nov 17, 2016 at 02:03:35PM -0800, Lance Roy wrote:
>> SRCU uses two per-cpu counters: a nesting counter to count the number of
>> active critical sections, and a sequence counter to ensure that the nesting
>> counters don't change while they are being added together in
>> srcu_readers_active_idx_check().
>>
>> This patch instead uses per-cpu lock and unlock counters. Because both
>> counters only increase and srcu_readers_active_idx_check() reads the unlock
>> counter before the lock counter, this achieves the same end without having
>> to increment two different counters in srcu_read_lock(). This also saves a
>> smp_mb() in srcu_readers_active_idx_check().
>>
>> Possible bug: There is no guarantee that the lock counter won't overflow
>> during srcu_readers_active_idx_check(), as there are no memory barriers
>> around srcu_flip() (see comment in srcu_readers_active_idx_check() for
>> details). However, this problem was already present before this patch.
>
> This patch differs from the previous one in a few (good) code-style
> changes, comment changes, and of course the commit log. Good. Once
> discussion converges, I will apply the agreed-upon commit, which might
> well be this one.
>
> However, let's first take a look at the overflow issue.
>
> If a given program could have ULONG_MAX or more readers at any given
> time, there would of course be overflow. However, each read must have
> an srcu_read_lock() outstanding, and the resulting four-byte return
> value must be stored somewhere. Because the full address space is at
> most ULONG_MAX, the maximum number of outstanding readers is at most
> ULONG_MAX/4, even in the degenerate case where a single CPU/task invokes
> srcu_read_lock() in a tight loop. And even this assumes that the entire
> address space can somehow be devoted to srcu_read_lock() return values.
> ULONG_MAX/4 is therefore a hard upper bound on the number of outstanding
> SRCU readers.

The loop taking srcu_read_lock() seems a bit far-fetched.
More realistically, we could move this bound even lower if we can expect each
thread to nest SRCU read-side critical sections only up to a limited depth
(e.g. 128). The realistic limit on the number of SRCU readers is then bounded
by the maximum number of threads multiplied by the maximum SRCU nesting count.

>
> Now srcu_readers_active_idx_check() checks for strict equality between
> the number of locks and unlocks, so we can in theory tolerate ULONG_MAX-1
> readers. So, the question is whether ULONG_MAX/4 readers can result
> in the updater seeing ULONG_MAX reads, due to memory misordering and
> other issues.
>
> Because there are no memory barriers surrounding srcu_flip(), the updater
> could miss an extremely large number of srcu_read_unlock()s. However,
> each missed srcu_read_unlock() must have a corresponding srcu_read_lock(),
> and there is a full memory barrier between the srcu_flip() and
> the read of the lock count. There is also a full barrier between any
> srcu_read_lock()'s increment of the lock count and that CPU's/task's next
> srcu_read_lock()'s fetch of the index. Therefore, if the updater misses
> counting a given srcu_read_lock(), that CPU's/task's next srcu_read_lock()
> must see the new value of the index. Because srcu_read_lock() disables
> preemption across the index fetch and the lock increment, there can be at
> most NR_CPUS-1 srcu_read_lock() calls that missed the recent srcu_flip()'s
> update to the index. (I said NR_CPUS earlier, but Mathieu is correct
> in pointing out that srcu_flip() has to run somewhere.)
>
> The maximum number of locks that the updater can see is therefore:
>
> o	ULONG_MAX/4 for a full set of missed srcu_read_unlock()s.
>
> o	ULONG_MAX/4 for a full set of srcu_read_lock()s.
>
> o	NR_CPUS-1 for a full set of subsequent srcu_read_lock()s that
>	missed the flip.
>
> This totals to ULONG_MAX/2+NR_CPUS-1. So as long as there are no more
> than ULONG_MAX/2 CPUs, we should be good. And given that the biggest
> system I have hard evidence of is 4K CPUs, we have ample headroom
> compared to the ~2G value of ULONG_MAX/2, even on 32-bit systems.
>
> So, what am I missing here?

Your analysis makes sense. An alternative approach would be to document a
limit on the nesting count that a thread is allowed to reach for a given SRCU
domain, and use it as an even lower upper bound on the number of concurrent
SRCU readers. That would allow us to increase the number of supported CPUs
accordingly on 32-bit systems.
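To make the counting scheme and the wraparound argument easier to follow, here
is a rough user-space model of the idea (a single counter pair instead of the
per-CPU arrays, with GCC __atomic builtins standing in for the kernel's per-cpu
ops and smp_mb(); the names are made up and this is purely illustrative, not
the kernel code):

/*
 * Illustrative user-space model of the lock/unlock counter scheme.
 * Names and primitives here are stand-ins, not the kernel's.
 */
#include <stdbool.h>

static unsigned long lock_count;	/* bumped by the modeled srcu_read_lock() */
static unsigned long unlock_count;	/* bumped by the modeled srcu_read_unlock() */

static void model_read_lock(void)
{
	__atomic_fetch_add(&lock_count, 1, __ATOMIC_RELAXED);
	__atomic_thread_fence(__ATOMIC_SEQ_CST);	/* plays the role of smp_mb() B */
}

static void model_read_unlock(void)
{
	__atomic_thread_fence(__ATOMIC_SEQ_CST);	/* plays the role of smp_mb() C */
	__atomic_fetch_add(&unlock_count, 1, __ATOMIC_RELAXED);
}

/*
 * The updater snapshots the unlock count first, then the lock count.
 * Both counters only ever increase, so equal snapshots mean that every
 * reader whose lock was counted has also been seen to unlock.  Because
 * this is an exact equality test on unsigned values, wrapping of the
 * counters themselves is harmless; a false "no readers" answer would
 * require the number of locks not matched by counted unlocks to be a
 * nonzero multiple of 2^BITS_PER_LONG, which the ULONG_MAX/2 + NR_CPUS - 1
 * bound above rules out.
 */
static bool model_readers_gone(void)
{
	unsigned long unlocks = __atomic_load_n(&unlock_count, __ATOMIC_RELAXED);

	__atomic_thread_fence(__ATOMIC_SEQ_CST);	/* plays the role of smp_mb() A */
	return __atomic_load_n(&lock_count, __ATOMIC_RELAXED) == unlocks;
}

The model also shows why reading the unlock side first matters: together with
fence A, any unlock that makes it into the first snapshot has its matching
lock visible in the second one, which is what makes the plain equality test
sufficient.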
Thanks,

Mathieu

>
> 							Thanx, Paul
>
>> Suggested-by: Mathieu Desnoyers
>> Signed-off-by: Lance Roy
>> ---
>> include/linux/srcu.h | 4 +-
>> kernel/rcu/rcutorture.c | 20 ++++++++-
>> kernel/rcu/srcu.c | 116 ++++++++++++++++++------------------------------
>> 3 files changed, 63 insertions(+), 77 deletions(-)
>>
>> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
>> index dc8eb63..0caea34 100644
>> --- a/include/linux/srcu.h
>> +++ b/include/linux/srcu.h
>> @@ -34,8 +34,8 @@
>> #include
>>
>> struct srcu_struct_array {
>> - unsigned long c[2];
>> - unsigned long seq[2];
>> + unsigned long lock_count[2];
>> + unsigned long unlock_count[2];
>> };
>>
>> struct rcu_batch {
>> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
>> index bf08fee..2450c61 100644
>> --- a/kernel/rcu/rcutorture.c
>> +++ b/kernel/rcu/rcutorture.c
>> @@ -555,10 +555,26 @@ static void srcu_torture_stats(void)
>> pr_alert("%s%s per-CPU(idx=%d):",
>> torture_type, TORTURE_FLAG, idx);
>> for_each_possible_cpu(cpu) {
>> + unsigned long l0, l1;
>> + unsigned long u0, u1;
>> long c0, c1;
>> + struct srcu_struct_array *counts =
>> + per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu);
>>
>> - c0 = (long)per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu)->c[!idx];
>> - c1 = (long)per_cpu_ptr(srcu_ctlp->per_cpu_ref, cpu)->c[idx];
>> + u0 = counts->unlock_count[!idx];
>> + u1 = counts->unlock_count[idx];
>> +
>> + /*
>> + * Make sure that a lock is always counted if the corresponding
>> + * unlock is counted.
>> + */
>> + smp_rmb();
>> +
>> + l0 = counts->lock_count[!idx];
>> + l1 = counts->lock_count[idx];
>> +
>> + c0 = (long)(l0 - u0);
>> + c1 = (long)(l1 - u1);
>> pr_cont(" %d(%ld,%ld)", cpu, c0, c1);
>> }
>> pr_cont("\n");
>> diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
>> index 9b9cdd5..38e9aae 100644
>> --- a/kernel/rcu/srcu.c
>> +++ b/kernel/rcu/srcu.c
>> @@ -141,34 +141,38 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
>> #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>>
>> /*
>> - * Returns approximate total of the readers' ->seq[] values for the
>> + * Returns approximate total of the readers' ->lock_count[] values for the
>> * rank of per-CPU counters specified by idx.
>> */
>> -static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
>> +static unsigned long srcu_readers_lock_idx(struct srcu_struct *sp, int idx)
>> {
>> int cpu;
>> unsigned long sum = 0;
>> unsigned long t;
>>
>> for_each_possible_cpu(cpu) {
>> - t = READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
>> + struct srcu_struct_array *cpu_counts =
>> + per_cpu_ptr(sp->per_cpu_ref, cpu);
>> + t = READ_ONCE(cpu_counts->lock_count[idx]);
>> sum += t;
>> }
>> return sum;
>> }
>>
>> /*
>> - * Returns approximate number of readers active on the specified rank
>> - * of the per-CPU ->c[] counters.
>> + * Returns approximate total of the readers' ->unlock_count[] values for the
>> + * rank of per-CPU counters specified by idx.
>> */
>> -static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>> +static unsigned long srcu_readers_unlock_idx(struct srcu_struct *sp, int idx)
>> {
>> int cpu;
>> unsigned long sum = 0;
>> unsigned long t;
>>
>> for_each_possible_cpu(cpu) {
>> - t = READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
>> + struct srcu_struct_array *cpu_counts =
>> + per_cpu_ptr(sp->per_cpu_ref, cpu);
>> + t = READ_ONCE(cpu_counts->unlock_count[idx]);
>> sum += t;
>> }
>> return sum;
>> @@ -176,79 +180,42 @@ static unsigned long srcu_readers_active_idx(struct
>> srcu_struct *sp, int idx)
>>
>> /*
>> * Return true if the number of pre-existing readers is determined to
>> - * be stably zero. An example unstable zero can occur if the call
>> - * to srcu_readers_active_idx() misses an __srcu_read_lock() increment,
>> - * but due to task migration, sees the corresponding __srcu_read_unlock()
>> - * decrement. This can happen because srcu_readers_active_idx() takes
>> - * time to sum the array, and might in fact be interrupted or preempted
>> - * partway through the summation.
>> + * be zero.
>> */
>> static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>> {
>> - unsigned long seq;
>> + unsigned long unlocks;
>>
>> - seq = srcu_readers_seq_idx(sp, idx);
>> + unlocks = srcu_readers_unlock_idx(sp, idx);
>>
>> /*
>> - * The following smp_mb() A pairs with the smp_mb() B located in
>> - * __srcu_read_lock(). This pairing ensures that if an
>> - * __srcu_read_lock() increments its counter after the summation
>> - * in srcu_readers_active_idx(), then the corresponding SRCU read-side
>> - * critical section will see any changes made prior to the start
>> - * of the current SRCU grace period.
>> + * Make sure that a lock is always counted if the corresponding unlock
>> + * is counted. Needs to be a smp_mb() as the read side may contain a
>> + * read from a variable that is written to before the synchronize_srcu()
>> + * in the write side. In this case smp_mb()s A and B act like the store
>> + * buffering pattern.
>> *
>> - * Also, if the above call to srcu_readers_seq_idx() saw the
>> - * increment of ->seq[], then the call to srcu_readers_active_idx()
>> - * must see the increment of ->c[].
>> + * This smp_mb() also pairs with smp_mb() C to prevent writes after the
>> + * synchronize_srcu() from being executed before the grace period ends.
>> */
>> smp_mb(); /* A */
>>
>> /*
>> - * Note that srcu_readers_active_idx() can incorrectly return
>> - * zero even though there is a pre-existing reader throughout.
>> - * To see this, suppose that task A is in a very long SRCU
>> - * read-side critical section that started on CPU 0, and that
>> - * no other reader exists, so that the sum of the counters
>> - * is equal to one. Then suppose that task B starts executing
>> - * srcu_readers_active_idx(), summing up to CPU 1, and then that
>> - * task C starts reading on CPU 0, so that its increment is not
>> - * summed, but finishes reading on CPU 2, so that its decrement
>> - * -is- summed. Then when task B completes its sum, it will
>> - * incorrectly get zero, despite the fact that task A has been
>> - * in its SRCU read-side critical section the whole time.
>> - *
>> - * We therefore do a validation step should srcu_readers_active_idx()
>> - * return zero.
>> - */
>> - if (srcu_readers_active_idx(sp, idx) != 0)
>> - return false;
>> -
>> - /*
>> - * The remainder of this function is the validation step.
>> - * The following smp_mb() D pairs with the smp_mb() C in
>> - * __srcu_read_unlock(). If the __srcu_read_unlock() was seen
>> - * by srcu_readers_active_idx() above, then any destructive
>> - * operation performed after the grace period will happen after
>> - * the corresponding SRCU read-side critical section.
>> + * If the locks are the same as the unlocks, then there must have
>> + * been no readers on this index at some time in between. This does not
>> + * mean that there are no more readers, as one could have read the
>> + * current index but not yet incremented the lock counter.
>> *
>> - * Note that there can be at most NR_CPUS worth of readers using
>> - * the old index, which is not enough to overflow even a 32-bit
>> - * integer. (Yes, this does mean that systems having more than
>> - * a billion or so CPUs need to be 64-bit systems.) Therefore,
>> - * the sum of the ->seq[] counters cannot possibly overflow.
>> - * Therefore, the only way that the return values of the two
>> - * calls to srcu_readers_seq_idx() can be equal is if there were
>> - * no increments of the corresponding rank of ->seq[] counts
>> - * in the interim. But the missed-increment scenario laid out
>> - * above includes an increment of the ->seq[] counter by
>> - * the corresponding __srcu_read_lock(). Therefore, if this
>> - * scenario occurs, the return values from the two calls to
>> - * srcu_readers_seq_idx() will differ, and thus the validation
>> - * step below suffices.
>> + * Possible bug: There is no guarantee that there haven't been ULONG_MAX
>> + * increments of ->lock_count[] since the unlocks were counted, meaning
>> + * that this could return true even if there are still active readers.
>> + * Since there are no memory barriers around srcu_flip(), the CPU is not
>> + * required to increment ->completed before running
>> + * srcu_readers_unlock_idx(), which means that there could be an
>> + * arbitrarily large number of critical sections that execute after
>> + * srcu_readers_unlock_idx() but use the old value of ->completed.
>> */
>> - smp_mb(); /* D */
>> -
>> - return srcu_readers_seq_idx(sp, idx) == seq;
>> + return srcu_readers_lock_idx(sp, idx) == unlocks;
>> }
>>
>> /**
>> @@ -266,8 +233,12 @@ static bool srcu_readers_active(struct srcu_struct *sp)
>> unsigned long sum = 0;
>>
>> for_each_possible_cpu(cpu) {
>> - sum += READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[0]);
>> - sum += READ_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[1]);
>> + struct srcu_struct_array *cpu_counts =
>> + per_cpu_ptr(sp->per_cpu_ref, cpu);
>> + sum += READ_ONCE(cpu_counts->lock_count[0]);
>> + sum += READ_ONCE(cpu_counts->lock_count[1]);
>> + sum -= READ_ONCE(cpu_counts->unlock_count[0]);
>> + sum -= READ_ONCE(cpu_counts->unlock_count[1]);
>> }
>> return sum;
>> }
>> @@ -298,9 +269,8 @@ int __srcu_read_lock(struct srcu_struct *sp)
>> int idx;
>>
>> idx = READ_ONCE(sp->completed) & 0x1;
>> - __this_cpu_inc(sp->per_cpu_ref->c[idx]);
>> + __this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
>> smp_mb(); /* B */ /* Avoid leaking the critical section. */
>> - __this_cpu_inc(sp->per_cpu_ref->seq[idx]);
>> return idx;
>> }
>> EXPORT_SYMBOL_GPL(__srcu_read_lock);
>> @@ -314,7 +284,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
>> void __srcu_read_unlock(struct srcu_struct *sp, int idx)
>> {
>> smp_mb(); /* C */ /* Avoid leaking the critical section. */
>> - this_cpu_dec(sp->per_cpu_ref->c[idx]);
>> + this_cpu_inc(sp->per_cpu_ref->unlock_count[idx]);
>> }
>> EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>>
>> @@ -349,7 +319,7 @@ static bool try_check_zero(struct srcu_struct *sp, int idx,
>> int trycount)
>>
>> /*
>> * Increment the ->completed counter so that future SRCU readers will
>> - * use the other rank of the ->c[] and ->seq[] arrays. This allows
>> + * use the other rank of the ->(un)lock_count[] arrays. This allows
>> * us to wait for pre-existing readers in a starvation-free manner.
>> */
>> static void srcu_flip(struct srcu_struct *sp)
>> --
>> 2.9.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com