LKML Archive on lore.kernel.org
 help / color / Atom feed
From: "Paul E. McKenney" <paulmck@linux.ibm.com>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Ran Rozenstein <ranro@mellanox.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"mingo@kernel.org" <mingo@kernel.org>,
	"jiangshanlai@gmail.com" <jiangshanlai@gmail.com>,
	"dipankar@in.ibm.com" <dipankar@in.ibm.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"mathieu.desnoyers@efficios.com" <mathieu.desnoyers@efficios.com>,
	"josh@joshtriplett.org" <josh@joshtriplett.org>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"rostedt@goodmis.org" <rostedt@goodmis.org>,
	"dhowells@redhat.com" <dhowells@redhat.com>,
	"edumazet@google.com" <edumazet@google.com>,
	"fweisbec@gmail.com" <fweisbec@gmail.com>,
	"oleg@redhat.com" <oleg@redhat.com>,
	Maor Gottlieb <maorg@mellanox.com>,
	Tariq Toukan <tariqt@mellanox.com>,
	Eran Ben Elisha <eranbe@mellanox.com>,
	Leon Romanovsky <leonro@mellanox.com>
Subject: Re: [PATCH tip/core/rcu 02/19] rcu: Defer reporting RCU-preempt quiescent states when disabled
Date: Tue, 30 Oct 2018 05:58:00 -0700
Message-ID: <20181030125800.GE4170@linux.ibm.com> (raw)
In-Reply-To: <20181030034452.GA224709@google.com>

On Mon, Oct 29, 2018 at 08:44:52PM -0700, Joel Fernandes wrote:
> On Mon, Oct 29, 2018 at 07:27:35AM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 29, 2018 at 11:24:42AM +0000, Ran Rozenstein wrote:
> > > Hi Paul and all,
> > > 
> > > > -----Original Message-----
> > > > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> > > > owner@vger.kernel.org] On Behalf Of Paul E. McKenney
> > > > Sent: Thursday, August 30, 2018 01:21
> > > > To: linux-kernel@vger.kernel.org
> > > > Cc: mingo@kernel.org; jiangshanlai@gmail.com; dipankar@in.ibm.com;
> > > > akpm@linux-foundation.org; mathieu.desnoyers@efficios.com;
> > > > josh@joshtriplett.org; tglx@linutronix.de; peterz@infradead.org;
> > > > rostedt@goodmis.org; dhowells@redhat.com; edumazet@google.com;
> > > > fweisbec@gmail.com; oleg@redhat.com; joel@joelfernandes.org; Paul E.
> > > > McKenney <paulmck@linux.vnet.ibm.com>
> > > > Subject: [PATCH tip/core/rcu 02/19] rcu: Defer reporting RCU-preempt
> > > > quiescent states when disabled
> > > > 
> > > > This commit defers reporting of RCU-preempt quiescent states at
> > > > rcu_read_unlock_special() time when any of interrupts, softirq, or
> > > > preemption are disabled.  These deferred quiescent states are reported at a
> > > > later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug offline
> > > > operation.  Of course, if another RCU read-side critical section has started in
> > > > the meantime, the reporting of the quiescent state will be further deferred.
> > > > 
> > > > This also means that disabling preemption, interrupts, and/or softirqs will act
> > > > as an RCU-preempt read-side critical section.
> > > > This is enforced by checking preempt_count() as needed.
> > > > 
> > > > Some special cases must be handled on an ad-hoc basis, for example,
> > > > context switch is a quiescent state even though both the scheduler and
> > > > do_exit() disable preemption.  In these cases, additional calls to
> > > > rcu_preempt_deferred_qs() override the preemption disabling.  Similar logic
> > > > overrides disabled interrupts in rcu_preempt_check_callbacks() because in
> > > > this case the quiescent state happened just before the corresponding
> > > > scheduling-clock interrupt.
> > > > 
> > > > In theory, this change lifts a long-standing restriction that required that if
> > > > interrupts were disabled across a call to rcu_read_unlock() that the matching
> > > > rcu_read_lock() also be contained within that interrupts-disabled region of
> > > > code.  Because the reporting of the corresponding RCU-preempt quiescent
> > > > state is now deferred until after interrupts have been enabled, it is no longer
> > > > possible for this situation to result in deadlocks involving the scheduler's
> > > > runqueue and priority-inheritance locks.  This may allow some code
> > > > simplification that might reduce interrupt latency a bit.  Unfortunately, in
> > > > practice this would also defer deboosting a low-priority task that had been
> > > > subjected to RCU priority boosting, so real-time-response considerations
> > > > might well force this restriction to remain in place.
> > > > 
> > > > Because RCU-preempt grace periods are now blocked not only by RCU read-
> > > > side critical sections, but also by disabling of interrupts, preemption, and
> > > > softirqs, it will be possible to eliminate RCU-bh and RCU-sched in favor of
> > > > RCU-preempt in CONFIG_PREEMPT=y kernels.  This may require some
> > > > additional plumbing to provide the network denial-of-service guarantees
> > > > that have been traditionally provided by RCU-bh.  Once these are in place,
> > > > CONFIG_PREEMPT=n kernels will be able to fold RCU-bh into RCU-sched.
> > > > This would mean that all kernels would have but one flavor of RCU, which
> > > > would open the door to significant code cleanup.
> > > > 
> > > > Moving to a single flavor of RCU would also have the beneficial effect of
> > > > reducing the NOCB kthreads by at least a factor of two.
> > > > 
> > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck:
> > > > Apply rcu_read_unlock_special() preempt_count() feedback
> > > >   from Joel Fernandes. ]
> > > > [ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
> > > >   response to bug reports from kbuild test robot. ] [ paulmck: Fix bug located
> > > > by kbuild test robot involving recursion
> > > >   via rcu_preempt_deferred_qs(). ]
> > > > ---
> > > >  .../RCU/Design/Requirements/Requirements.html |  50 +++---
> > > >  include/linux/rcutiny.h                       |   5 +
> > > >  kernel/rcu/tree.c                             |   9 ++
> > > >  kernel/rcu/tree.h                             |   3 +
> > > >  kernel/rcu/tree_exp.h                         |  71 +++++++--
> > > >  kernel/rcu/tree_plugin.h                      | 144 +++++++++++++-----
> > > >  6 files changed, 205 insertions(+), 77 deletions(-)
> > > > 
> > > 
> > > We started seeing the trace below in our regression system, after I bisected I found this is the offending commit.
> > > This appears immediately on boot. 
> > > Please let me know if you need any additional details.
> > 
> > Interesting.  Here is the offending function:
> > 
> > 	static void rcu_preempt_deferred_qs(struct task_struct *t)
> > 	{
> > 		unsigned long flags;
> > 		bool couldrecurse = t->rcu_read_lock_nesting >= 0;
> > 
> > 		if (!rcu_preempt_need_deferred_qs(t))
> > 			return;
> > 		if (couldrecurse)
> > 			t->rcu_read_lock_nesting -= INT_MIN;
> > 		local_irq_save(flags);
> > 		rcu_preempt_deferred_qs_irqrestore(t, flags);
> > 		if (couldrecurse)
> > 			t->rcu_read_lock_nesting += INT_MIN;
> > 	}
> > 
> > Using twos-complement arithmetic (which the kernel build gcc arguments
> > enforce, last I checked) this does work.  But as UBSAN says, subtracting
> > INT_MIN is unconditionally undefined behavior according to the C standard.
> > 
> > Good catch!!!
> > 
> > So how do I make the above code not simply function, but rather meet
> > the C standard?
> > 
> > One approach to add INT_MIN going in, then add INT_MAX and then add 1
> > coming out.
> > 
> > Another approach is to sacrifice the INT_MAX value (should be plenty
> > safe), thus subtract INT_MAX going in and add INT_MAX coming out.
> > For consistency, I suppose that I should change the INT_MIN in
> > __rcu_read_unlock() to -INT_MAX.
> > 
> > I could also leave __rcu_read_unlock() alone and XOR the top
> > bit of t->rcu_read_lock_nesting on entry and exit to/from
> > rcu_preempt_deferred_qs().
> > 
> > Sacrificing the INT_MIN value seems most maintainable, as in the following
> > patch.  Thoughts?
> 
> The INT_MAX naming could be very confusing for nesting levels, could we not
> instead just define something like:
> #define RCU_NESTING_MIN (INT_MIN - 1)
> #define RCU_NESTING_MAX (INT_MAX)
> 
> and just use that? also one more comment below:

Hmmm...  There is currently no use for RCU_NESTING_MAX, but if the check
at the end of __rcu_read_unlock() were to be extended to check for
too-deep positive nesting, it would need to check for something like
INT_MAX/2.  You could of course argue that the current check against
INT_MIN/2 should instead be against -INT_MAX/2, but there really isn't
much difference between the two.

Another approach would be to convert to unsigned in order to avoid the
overflow problem completely.

For the moment, anyway, I am inclined to leave it as is.

> > ------------------------------------------------------------------------
> > 
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index bd8186d0f4a7..f1b40c6d36e4 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -424,7 +424,7 @@ void __rcu_read_unlock(void)
> >  		--t->rcu_read_lock_nesting;
> >  	} else {
> >  		barrier();  /* critical section before exit code. */
> > -		t->rcu_read_lock_nesting = INT_MIN;
> > +		t->rcu_read_lock_nesting = -INT_MAX;
> >  		barrier();  /* assign before ->rcu_read_unlock_special load */
> >  		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> >  			rcu_read_unlock_special(t);
> > @@ -617,11 +617,11 @@ static void rcu_preempt_deferred_qs(struct task_struct *t)
> >  	if (!rcu_preempt_need_deferred_qs(t))
> >  		return;
> >  	if (couldrecurse)
> > -		t->rcu_read_lock_nesting -= INT_MIN;
> > +		t->rcu_read_lock_nesting -= INT_MAX;
> 
> Shouldn't this be:  t->rcu_read_lock_nesting -= -INT_MAX; ?

Suppose t->rcu_read_lock_nesting is zero.  Then you change would leave the
value a large positive number (INT_MAX, to be precise), which would then
result in signed integer overflow if there was a nested rcu_read_lock().
Worse yet, if t->rcu_read_lock_nesting is any positive number, subtracting
-INT_MAX would immediately result in signed integer overflow.  Please
note that the whole point of this patch in the first place is to avoid
signed integer overflow.

Note that the check at the beginnning of this function never sets
couldrecurse if ->rcu_read_lock_nesting is negative, which avoids the
possibility of overflow given the current code.  Unless you have two
billion nested rcu_read_lock() calls, in which case you broke it, so
you get to buy it.  ;-)

> >  	local_irq_save(flags);
> >  	rcu_preempt_deferred_qs_irqrestore(t, flags);
> >  	if (couldrecurse)
> > -		t->rcu_read_lock_nesting += INT_MIN;
> > +		t->rcu_read_lock_nesting += INT_MAX;
> 
> And t->rcu_read_lock_nesting += -INT_MAX; ?

And this would have similar problems for ->rcu_read_lock_nesting
not equal to zero on entry to this function.

> But apologies if I missed something, thanks,

No need to apologies, especially if it turns out that I am the one
who is confused.  Either way, thank you for looking this over!

							Thanx, Paul


  reply index

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-29 22:20 [PATCH tip/core/rcu 0/19] RCU flavor-consolidation changes for v4.20/v5.0 Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 01/19] rcu: Refactor rcu_{nmi,irq}_{enter,exit}() Paul E. McKenney
2018-08-30 18:10   ` Steven Rostedt
2018-08-30 23:02     ` Paul E. McKenney
2018-08-31  2:25     ` Byungchul Park
2018-08-29 22:20 ` [PATCH tip/core/rcu 02/19] rcu: Defer reporting RCU-preempt quiescent states when disabled Paul E. McKenney
2018-10-29 11:24   ` Ran Rozenstein
2018-10-29 14:27     ` Paul E. McKenney
2018-10-30  3:44       ` Joel Fernandes
2018-10-30 12:58         ` Paul E. McKenney [this message]
2018-10-30 22:21           ` Joel Fernandes
2018-10-31 18:22             ` Paul E. McKenney
2018-11-02 19:43               ` Paul E. McKenney
2018-11-26 13:55                 ` Ran Rozenstein
2018-11-26 19:00                   ` Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 03/19] rcutorture: Test extended "rcu" read-side critical sections Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 04/19] rcu: Allow processing deferred QSes for exiting RCU-preempt readers Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 05/19] rcu: Remove now-unused ->b.exp_need_qs field from the rcu_special union Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 06/19] rcu: Add warning to detect half-interrupts Paul E. McKenney
2019-03-11 13:39   ` Joel Fernandes
2019-03-11 22:29     ` Paul E. McKenney
2019-03-12 15:05       ` Joel Fernandes
2019-03-12 15:20         ` Paul E. McKenney
2019-03-13 15:09           ` Joel Fernandes
2019-03-13 15:27             ` Steven Rostedt
2019-03-13 15:51               ` Paul E. McKenney
2019-03-13 16:51                 ` Steven Rostedt
2019-03-13 18:07                   ` Paul E. McKenney
2019-03-14 12:31                     ` Joel Fernandes
2019-03-14 13:36                       ` Steven Rostedt
2019-03-14 13:37                         ` Steven Rostedt
2019-03-14 21:27                           ` Joel Fernandes
2019-03-15  7:31     ` Byungchul Park
2019-03-15  7:44       ` Byungchul Park
2019-03-15 13:46         ` Joel Fernandes
2018-08-29 22:20 ` [PATCH tip/core/rcu 07/19] rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt when safe Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 08/19] rcu: Report expedited grace periods at context-switch time Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 09/19] rcu: Define RCU-bh update API in terms of RCU Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 10/19] rcu: Update comments and help text for no more RCU-bh updaters Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 11/19] rcu: Drop "wake" parameter from rcu_report_exp_rdp() Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 12/19] rcu: Fix typo in rcu_get_gp_kthreads_prio() header comment Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 13/19] rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 14/19] rcu: Express Tiny RCU updates in terms of RCU rather than RCU-sched Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 15/19] rcu: Remove RCU_STATE_INITIALIZER() Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 16/19] rcu: Eliminate rcu_state structure's ->call field Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 17/19] rcu: Remove rcu_state structure's ->rda field Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 18/19] rcu: Remove rcu_state_p pointer to default rcu_state structure Paul E. McKenney
2018-08-29 22:20 ` [PATCH tip/core/rcu 19/19] rcu: Remove rcu_data_p pointer to default rcu_data structure Paul E. McKenney
2018-08-29 22:22 ` [PATCH tip/core/rcu 0/19] RCU flavor-consolidation changes for v4.20/v5.0 Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181030125800.GE4170@linux.ibm.com \
    --to=paulmck@linux.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=dhowells@redhat.com \
    --cc=dipankar@in.ibm.com \
    --cc=edumazet@google.com \
    --cc=eranbe@mellanox.com \
    --cc=fweisbec@gmail.com \
    --cc=jiangshanlai@gmail.com \
    --cc=joel@joelfernandes.org \
    --cc=josh@joshtriplett.org \
    --cc=leonro@mellanox.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maorg@mellanox.com \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@kernel.org \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=ranro@mellanox.com \
    --cc=rostedt@goodmis.org \
    --cc=tariqt@mellanox.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git
	git clone --mirror https://lore.kernel.org/lkml/9 lkml/git/9.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git