Re: [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()

From: "Paul E. McKenney" <paulmck@linux.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: rcu@vger.kernel.org, linux-kernel@vger.kernel.org,
	mingo@kernel.org, jiangshanlai@gmail.com, dipankar@in.ibm.com,
	akpm@linux-foundation.org, mathieu.desnoyers@efficios.com,
	josh@joshtriplett.org, tglx@linutronix.de, rostedt@goodmis.org,
	dhowells@redhat.com, edumazet@google.com, fweisbec@gmail.com,
	oleg@redhat.com, joel@joelfernandes.org,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: Re: [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
Date: Mon, 1 Apr 2019 12:03:04 -0700
Message-ID: <20190401190304.GA20804@linux.ibm.com>
In-Reply-To: <20190401172257.GN4102@linux.ibm.com>

On Mon, Apr 01, 2019 at 10:22:57AM -0700, Paul E. McKenney wrote:
> On Mon, Apr 01, 2019 at 10:32:11AM +0200, Peter Zijlstra wrote:
> > On Fri, Mar 29, 2019 at 11:26:34AM -0700, Paul E. McKenney wrote:
> > > When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
> > > kthreads, a full and unconditional wakeup is required to initiate RCU
> > > core processing.  In contrast, when RCU core processing is carried
> > > out by RCU_SOFTIRQ, a raise_softirq() suffices.  Of course, there are
> > > situations where raise_softirq() does a full wakeup, but these do not
> > > occur with normal usage of rcu_read_unlock().
> > 
> > Do we have a comment somewhere explaining why?
> 
> First, thank you for reviewing this!
> 
> The "why" is because people normally don't do things like the code
> sequence shown below, but where the scheduler holds locks across the
> second RCU read-side critical section.  (If they did, lockdep would
> complain.  Nevertheless, it is good to avoid this potential problem.)
> 
> > > The initial solution to this problem was to use set_tsk_need_resched() and
> > > set_preempt_need_resched() to force a future context switch, which allows
> > > rcu_preempt_note_context_switch() to report the deferred quiescent state
> > > to RCU's core processing.  Unfortunately for expedited grace periods,
> > > there can be a significant delay between the call for a context switch
> > > and the actual context switch.
> > 
> > This is all PREEMPT=y kernels, right? Where is the latency coming from?
> > Because PREEMPT=y _should_ react quite quickly.
> 
> Yes, PREEMPT=y.  It happens like this:
> 
> 1.	rcu_read_lock() with everything enabled.
> 
> 2.	Preemption then resumption.
> 
> 3.	local_irq_disable().
> 
> 4.	rcu_read_unlock().
> 
> 5.	local_irq_enable().
> 
> From what I know, the scheduler doesn't see anything until the next
> interrupt, local_bh_enable(), return to userspace, etc.  Because this
> is PREEMPT=y, preempt_enable() and cond_resched() do nothing.  So
> it could be some time (milliseconds, depending on HZ, NO_HZ_FULL, and
> so on) until the scheduler responds.  With NO_HZ_FULL, last I knew,
> the delay can be extremely long.
> 
> Or am I missing something that gets the scheduler on the job faster?
> 
> Hmmm...  If your point is that this amount of delay matters only for
> expedited grace periods, you are quite right.  So perhaps I shouldn't be
> doing any of the expensive stuff unless there is an expedited grace period
> in flight.  Or if NO_HZ_FULL.  See below for an updated (and untested)
> patch to this effect.
> 
> > > This commit therefore introduces a ->deferred_qs flag to the task_struct
> > > structure's rcu_special structure.  This flag is initially false, and
> > > is set to true by the first call to rcu_read_unlock() requiring special
> > > attention, then reset back to false when the quiescent state is
> > > finally reported.  Then rcu_read_unlock() attempts full wakeups only when
> > > ->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
> > > special attention.  Note that a chain of RCU readers linked by some other
> > > sort of reader may find that a later rcu_read_unlock() is once again able
> > > to do a full wakeup, courtesy of an intervening preemption:
> > > 
> > > 	rcu_read_lock();
> > > 	/* preempted */
> > > 	local_irq_disable();
> > > 	rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
> > > 	rcu_read_lock();
> > > 	local_irq_enable();
> > > 	preempt_disable();
> > > 	rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
> > > 	rcu_read_lock();
> > > 	preempt_enable();
> > > 	/* preempted, ->deferred_qs reset. */
> > 
> > As it would have without ->deferred_qs and just having done the above
> > which was deemed insufficient.
> > 
> > So I'm really puzzled by the need for all this.
> 
> On the first round, without the ->deferred_qs, we know the scheduler
> cannot be holding any of its pi or rq locks because if it did, it would
> have disabled interrupts across the entire RCU read-side critical section.
> Wakeups are therefore safe in this case, whether via softirq or wakeup.
> Afterwards, we don't have that guarantee because an earlier critical
> section might have been preempted and the scheduler might have held one
> of its locks across the entire just-ended critical section.
> 
> And I believe you are right that we should avoid the wakeups unless
> there is an expedited grace period in flight or in a NO_HZ_FULL kernel.
> Hence the patch shown below.

Which is a stupid patch.  It assumes ancient times when expedited grace
periods were guaranteed to preempt pre-existing RCU read-side critical
sections.  Back to the drawing board...

							Thanx, Paul

> -----------------------------------------------------------------------
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 2e52a77af6be..582c6d88aaa0 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -606,20 +606,26 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  	local_irq_save(flags);
>  	irqs_were_disabled = irqs_disabled_flags(flags);
>  	if (preempt_bh_were_disabled || irqs_were_disabled) {
> +		bool exp;
> +
>  		t->rcu_read_unlock_special.b.exp_hint = false;
> +		exp = !!READ_ONCE(this_cpu_ptr(&rcu_data)->mynode->exp_tasks);
>  		// Need to defer quiescent state until everything is enabled.
> -		if (irqs_were_disabled && use_softirq &&
> +		if ((exp || IS_ENABLED(CONFIG_NO_HZ_FULL)) &&
> +		    irqs_were_disabled && use_softirq &&
>  		    (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
>  			// Using softirq, safe to awaken, and we get
>  			// no help from enabling irqs, unlike bh/preempt.
>  			raise_softirq_irqoff(RCU_SOFTIRQ);
> -		} else if (irqs_were_disabled && !use_softirq &&
> +		} else if ((exp || IS_ENABLED(CONFIG_NO_HZ_FULL)) &&
> +			   irqs_were_disabled && !use_softirq &&
>  			   !t->rcu_read_unlock_special.b.deferred_qs) {
>  			// Safe to awaken and we get no help from enabling
>  			// irqs, unlike bh/preempt.
>  			invoke_rcu_core();
>  		} else {
>  			// Enabling BH or preempt does reschedule, so...
> +			// Also if no expediting or NO_HZ_FULL, slow is OK.
>  			set_tsk_need_resched(current);
>  			set_preempt_need_resched();
>  		}
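
For readers following the branch structure of the (since-retracted)
patch above, here is a minimal userspace sketch of the three-way
decision it makes once a quiescent state has to be deferred.  This is
an illustration only, not kernel code: the enum, the struct, and
choose_action() are hypothetical names introduced here, and the boolean
fields merely stand in for the conditions tested in the diff.  In
particular, "exp" models the patch's ->exp_tasks check, which the reply
above explains is not a reliable indication that an expedited grace
period is in flight.

```c
#include <stdbool.h>

/* Hypothetical model of the actions taken in the patched branch. */
enum deferred_qs_action {
	ACT_RAISE_SOFTIRQ,	/* stands in for raise_softirq_irqoff(RCU_SOFTIRQ) */
	ACT_INVOKE_RCU_CORE,	/* stands in for invoke_rcu_core() */
	ACT_NEED_RESCHED,	/* stands in for set_tsk_need_resched() plus
				 * set_preempt_need_resched() */
};

/* Hypothetical snapshot of the conditions the patch tests. */
struct unlock_state {
	bool exp;		/* models the patch's ->exp_tasks check */
	bool no_hz_full;	/* models IS_ENABLED(CONFIG_NO_HZ_FULL) */
	bool irqs_were_disabled;
	bool use_softirq;
	bool in_irq;
	bool deferred_qs;	/* models ->rcu_read_unlock_special.b.deferred_qs */
};

/*
 * Mirrors the if/else-if/else chain in the diff.  Assumes the caller
 * has already established that preemption, BH, or irqs were disabled,
 * so that the quiescent state must be deferred.
 */
static enum deferred_qs_action choose_action(const struct unlock_state *s)
{
	bool urgent = s->exp || s->no_hz_full;

	if (urgent && s->irqs_were_disabled && s->use_softirq &&
	    (s->in_irq || !s->deferred_qs))
		return ACT_RAISE_SOFTIRQ;
	if (urgent && s->irqs_were_disabled && !s->use_softirq &&
	    !s->deferred_qs)
		return ACT_INVOKE_RCU_CORE;
	/* Slow path: rely on a future context switch. */
	return ACT_NEED_RESCHED;
}
```

In this model, the first rcu_read_unlock() with irqs disabled and "exp"
set selects a wakeup-style action, while a later one with deferred_qs
already set falls back to the resched path unless it runs from a
hardirq handler with use_softirq set.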