Re: [PATCH RFC tip/core/rcu] SRCU rewrite

From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@kernel.org>, dipankar <dipankar@in.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Josh Triplett <josh@joshtriplett.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	rostedt <rostedt@goodmis.org>,
	David Howells <dhowells@redhat.com>,
	Eric Dumazet <edumazet@google.com>,
	dvhart <dvhart@linux.intel.com>, fweisbec <fweisbec@gmail.com>,
	Oleg Nesterov <oleg@redhat.com>,
	bobby prani <bobby.prani@gmail.com>, ldr709 <ldr709@gmail.com>
Subject: Re: [PATCH RFC tip/core/rcu] SRCU rewrite
Date: Thu, 17 Nov 2016 16:33:14 +0000 (UTC)	[thread overview]
Message-ID: <1091181941.5761.1479400394025.JavaMail.zimbra@efficios.com> (raw)
In-Reply-To: <20161117155326.GZ3612@linux.vnet.ibm.com>

----- On Nov 17, 2016, at 10:53 AM, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:

> On Thu, Nov 17, 2016 at 03:38:08PM +0000, Mathieu Desnoyers wrote:
>> ----- On Nov 17, 2016, at 10:31 AM, Mathieu Desnoyers
>> mathieu.desnoyers@efficios.com wrote:
>> 
>> > ----- On Nov 17, 2016, at 10:07 AM, Lai Jiangshan jiangshanlai@gmail.com wrote:
>> > 
>> >> On Thu, Nov 17, 2016 at 10:31 PM, Boqun Feng <boqun.feng@gmail.com> wrote:
>> >>> On Thu, Nov 17, 2016 at 08:18:51PM +0800, Lai Jiangshan wrote:
>> >>>> On Tue, Nov 15, 2016 at 10:37 PM, Paul E. McKenney
>> >>>> <paulmck@linux.vnet.ibm.com> wrote:
>> >>>> > On Tue, Nov 15, 2016 at 09:44:45AM +0800, Boqun Feng wrote:
>> >>>>
>> >>>> >>
>> >>>> >> __srcu_read_lock() used to be called with preemption disabled. I guess
>> >>>> >> the reason was because we have two percpu variables to increase. So with
>> >>>> >> only one percpu right, could we remove the preempt_{dis,en}able() in
>> >>>> >> srcu_read_lock() and use this_cpu_inc() here?
>> >>>> >
>> >>>> > Quite possibly...
>> >>>> >
>> >>>>
>> >>>
>> >>> Hello, Lai ;-)
>> >>>
>> >>>> it will be nicer if it is removed.
>> >>>>
>> >>>> The reason for the preemption-disabled was also because we
>> >>>> have to disallow any preemption between the fetching of the idx
>> >>>> and the increasement. so that we have at most NR_CPUS worth
>> >>>> of readers using the old index that haven't incremented the counters.
>> >>>>
>> >>>
>> >>> After reading the comment for a while, I actually got a question, maybe
>> >>> I miss something ;-)
>> >>>
>> >>> Why "at most NR_CPUS worth of readers using the old index haven't
>> >>> incremented the counters" could save us from overflow the counter?
>> >>>
>> >>> Please consider the following case in current implementation:
>> >>>
>> >>>
>> >>> {sp->completed = 0} so idx = 1 in srcu_advance_batches(...)
>> >>>
>> >>> one thread A is currently in __srcu_read_lock() and using idx = 1 and
>> >>> about to increase the percpu c[idx], and ULONG_MAX __srcu_read_lock()s
>> >>> have been called and returned with idx = 1, please note I think this is
>> >>> possible because I assume we may have some code like this:
>> >>>
>> >>>         unsigned long i = 0;
>> >>>         for (; i < ULONG_MAX; i++)
>> >>>                 srcu_read_lock(); // return the same idx 1;
>> >> 
>> >> this is the wrong usage of the api.
>> >> 
>> >> 
>> >> you might rewrite it as:
>> >> 
>> >>        unsigned long index[2] = {0, 0};
>> >>        unsigned long i = 0;
>> >>        for (;  index[1] < ULONG_MAX; i++)
>> >>                 index[srcu_read_lock()]++;
>> >> 
>> >> 
>> >> I think we should add document to disallow this kind of usage.
>> >> a reader should eat 4bytes on the memory at least.
>> >> 
>> > 
>> > (the analysis below refers to the rewritten SRCU scheme)
>> > 
>> > Let's presume we use the API correctly, as you describe (saving
>> > the returned index of srcu_read_lock() somewhere).
>> > 
>> > So for the sake of argument, we can either call srcu_read_lock
>> > in a loop (during which we can be migrated), or call it
>> > concurrently from various threads. The effect in terms of
>> > overflow is pretty much the same.
>> > 
>> > What is done here is incrementing per-cpu split-counters. In
>> > the worse-case scenario, let's assume we're incrementing those
>> > counters for a single index (0 or 1).
>> > 
>> > If we think about this system-wide, we don't really care about
>> > the overflow of a single CPU counter, because what matters is
>> > the difference between the overall nr_lock - nr_unlock counts
>> > for a given index, once summed up by synchronize_srcu().
>> > 
>> > So the only situation that could lead to an overflow that matters
>> > is if synchronize_srcu() see ULONG_MAX more increments of nr_lock
>> > than the observed number of nr_unlock increments.
>> > 
>> > So the bound is not only about the number of concurrent SRCU
>> > readers, but also about the number of SRCU readers that may
>> > appear between the moment synchronize_srcu() reads the nr_unlock
>> > per-cpu counters and the moment it reads the nr_lock counters.
>> > 
>> > This maximum bound of ULONG_MAX - 1 therefore applies to the
>> > sum of:
>> > - numner of concurrent SRCU read-side critical sections active
>> >  at the same time,
>> > - number of SRCU read-side critical sections beginning after
>> >  synchronize_srcu() has read the nr_unlock counters, before
>> >  it reads the nr_lock counters.
>> 
>> Now that I think of it, since we flip the period before summing
>> the nr_unlock counter, we cannot have any newcoming readers appearing
>> within the target period while we execute synchronize_srcu().
>> So it ends up being a limit on the number of concurrent SRCU
>> read-side c.s. active at the same time. (you can scratch the
>> second bullet above).
> 
> We can have NR_CPUS worth of them -- those that have fetched the
> index, but not yet incremented their counter.

Ah, yes, due to preemption being disabled across those operations.
More precisely, it would be NR_CPUS - 1, because synchronize_srcu()
needs to run somewhere ;-)

> 
> But if the updater fails to see their counter increment, then
> their next srcu_read_lock() is guaranteed to see the new index.

Got it, thanks!

Mathieu

> 
>							Thanx, Paul
> 
>> Thanks,
>> 
>> Mathieu
>> 
>> 
>> 
>> > You guys seem to see cases that would require a lower max nr
>> > reader bound, but I'm afraid I don't quite understand them.
>> > 
>> > Thanks,
>> > 
>> > Mathieu
>> > 
>> > 
>> >>>
>> >>> And none of the corresponding srcu_read_unlock() has been called;
>> >>>
>> >>> In this case, at the time thread A increases the percpu c[idx], that
>> >>> will result in an overflow, right? So even one reader using old idx will
>> >>> result in overflow.
>> >>>
>> >>>
>> >>> I think we won't be hit by overflow is not because we have few readers
>> >>> using old idx, it's because there are unlikely ULONG_MAX + 1
>> >>> __srcu_read_lock() called for the same idx, right? And the reason of
>> >>> this is much complex: because we won't have a fair mount of threads in
>> >>> the system, because no thread will nest srcu many levels, because there
>> >>> won't be a lot readers using old idx.
>> >>>
>> >>> And this will still be true if we use new mechanism and shrink the
>> >>> preemption disabled section, right?
>> >>>
>> >>> Regards,
>> >>> Boqun
>> >>>
>> >>>> if we remove the preempt_{dis,en}able(). we must change the
>> >>>> "NR_CPUS" in the comment into ULONG_MAX/4. (I assume
>> >>>> one on-going reader needs at least need 4bytes at the stack). it is still safe.
>> >>>>
>> >>>> but we still need to think more if we want to remove the preempt_{dis,en}able().
>> >>>>
>> >>>> Thanks
>> >> >> Lai
>> > 
>> > --
>> > Mathieu Desnoyers
>> > EfficiOS Inc.
>> > http://www.efficios.com
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com