From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org,
	Dipankar Sarma <dipankar@in.ibm.com>
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site
Date: Sat, 10 Aug 2013 08:07:15 -0700	[thread overview]
Message-ID: <20130810150715.GF29406@linux.vnet.ibm.com> (raw)
In-Reply-To: <5205B6FF.7060502@cn.fujitsu.com>

On Sat, Aug 10, 2013 at 11:43:59AM +0800, Lai Jiangshan wrote:
> Hi, Steven
> 
> I had been assuming that rtmutex's lock->wait_lock is a scheduler lock,
> but it is not; it is just a process-context spinlock.
> I hope you change it to an irq-context spinlock.
> 
> 1) It makes the rcu read site more prone to deadlock. Example:
> x is a softirq-context spinlock.
> 
> CPU1					cpu2(rcu boost)
> rcu_read_lock()				rt_mutex_lock()
> <preemption and reschedule back>	raw_spin_lock(lock->wait_lock)
> spin_lock_bh(x)				<interrupt and doing softirq after irq>
> rcu_read_unlock()                         do_softirq()
>   rcu_read_unlock_special()
>     rt_mutex_unlock()
>       raw_spin_lock(lock->wait_lock)	    spin_lock_bh(x)  DEADLOCK
> 
> This example can happen with any one of these code states:
> 	without my patchset
> 	with my patchset
> 	with my patchset but reverted patch2
> 
> 2) Why it makes deadlock more likely: it extends the range of suspect locks.
> Definition of suspect locks: any lock that can be (chained) nested in rcu_read_unlock_special().
> 
> So the suspect locks are: rnp->lock, scheduler locks, rtmutex's lock->wait_lock,
> locks in printk()/WARN_ON(), and the locks which can be chained/indirectly nested
> in the above locks.
> 
> If rtmutex's lock->wait_lock is an irq-context spinlock, all suspect locks are
> irq-context spinlocks.
> 
> If rtmutex's lock->wait_lock is a process-context spinlock, the suspect locks
> extend to: all irq-context spinlocks, all softirq-context spinlocks,
> and all process-context spinlocks nested inside rtmutex's lock->wait_lock.
> 
> We can see from the definition that if rcu_read_unlock_special() is called while
> any suspect lock is held, it may deadlock as in the example. Rtmutex's lock->wait_lock
> extends the range of suspect locks, which makes deadlock more likely.
> 
> 3) How my algorithm works, and why a smaller range of suspect locks helps us.
> Since rcu_read_unlock_special() can't be called from a suspect-lock context,
> we should defer rcu_read_unlock_special() in these contexts.
> It is hard to find out whether the current context is a suspect-lock context,
> but we can approximate it based on irq/softirq/process context.
> 
> if all suspect locks are irq-context spinlocks:
> 	if (irqs_disabled) /* we may be in suspect-lock context */
> 		defer rcu_read_unlock_special().
> 
> if the suspect locks include irq/softirq/process-context spinlocks:
> 	if (irqs_disabled || in_atomic()) /* we may be in suspect-lock context */
> 		defer rcu_read_unlock_special().
> In this case, the deferring becomes far more frequent, which I can't accept.
> So I have to narrow the range of suspect locks. Two choices:
> A) don't call rt_mutex_unlock() from rcu_read_unlock(); only call it
>    from rcu_preempt_note_context_switch(). We would need to rework these
>    two functions, which adds complexity to RCU, and it still leaves
>    some probability of deferring.

One advantage of bh-disable locks is that enabling bh checks
TIF_NEED_RESCHED, so that there is no deferring beyond that
needed by bh disable.  The same of course applies to preempt_disable().

So one approach is to defer when rcu_read_unlock_special() is entered
with either preemption or bh disabled.  Your current set_need_resched()
trick would work fine in this case.  Unfortunately, re-enabling interrupts
does -not- check TIF_NEED_RESCHED, which is why we have latency problems
in that case.  (Hence my earlier question about making self-IPI safe
on all arches, which would result in an interrupt as soon as interrupts
were re-enabled.)

Another possibility is to defer only when preemption or bh are disabled
on entry to rcu_read_unlock_special(), but to retain the current
(admittedly ugly) nesting rules for the scheduler locks.

> B) change rtmutex's lock->wait_lock to irqs-disabled.

I have to defer to Steven on this one.

							Thanx, Paul

> 4) From rtmutex's point of view, I think it would be better if ->wait_lock were irqs-disabled.
>    A) like the trylock of mutex/rw_sem, we may want to call rt_mutex_trylock() from irq context in the future.
>    B) the critical section of ->wait_lock is short,
>       so making it irqs-disabled doesn't hurt responsiveness/latency.
>    C) the critical section of ->wait_lock already runs irqs-disabled almost the
>       whole time (due to task->pi_lock), so I think converting the whole critical
>       section of ->wait_lock to irqs-disabled is OK.
> 
> So I hope you change rtmutex's lock->wait_lock.
> 
> Any feedback from anyone is welcome.
> 
> Thanks,
> Lai
> 
> On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> > On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
> >> Background)
> >>
> >> Although all the articles declare that the rcu read site is deadlock-immune,
> >> this is not true for rcu-preempt: it will deadlock if the rcu read site
> >> overlaps with a scheduler lock.
> >>
> >> ec433f0c, 10f39bb1 and 016a8d5b only partially solve it. The rcu read site
> >> is still not deadlock-immune, and the problem described in 016a8d5b
> >> still exists (rcu_read_unlock_special() calls wake_up).
> >>
> >> Aim)
> >>
> >> We want to fix the problem for good; we want to keep the rcu read site
> >> deadlock-immune, as the books say.
> >>
> >> How)
> >>
> >> The problem is solved by the rule: "if rcu_read_unlock_special() is called
> >> inside any lock which can be (chained) nested in rcu_read_unlock_special(),
> >> we defer rcu_read_unlock_special()".
> >> These locks include rnp->lock, scheduler locks, perf ctx->lock, locks
> >> in printk()/WARN_ON(), and all locks nested in these locks or chained-nested
> >> in these locks.
> >>
> >> The problem is reduced to "how to distinguish all these locks (contexts)".
> >> We don't distinguish all these locks; we know that all these locks
> >> must be nested inside a local_irq_disable() section.
> >>
> >> So we simply consider: if rcu_read_unlock_special() is called in irqs-disabled
> >> context, it may have been called under one of these suspect locks, so we should
> >> defer rcu_read_unlock_special().
> >>
> >> The algorithm increases the probability of deferring, but the probability
> >> is still very low.
> >>
> >> Deferring does add a small overhead, but it offers us:
> >> 	1) true deadlock immunity for the rcu read site
> >> 	2) removal of the overhead of the irq-work (250 times per second on average)
> > 
> > One problem here -- it may take quite some time for a set_need_resched()
> > to take effect.  This is especially a problem for RCU priority boosting,
> > but can also needlessly delay preemptible-RCU grace periods because
> > local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
> > 
> > OK, alternatives...
> > 
> > o	Keep the current rule saying that if the scheduler is going
> > 	to exit an RCU read-side critical section while holding
> > 	one of its spinlocks, preemption has to have been disabled
> > 	throughout the full duration of that critical section.
> > 	Well, we can certainly do this, but it would be nice to get
> > 	rid of this rule.
> > 
> > o	Use per-CPU variables, possibly injecting delay.  This has ugly
> > 	disadvantages as noted above.
> > 
> > o	irq_work_queue() can wait a jiffy (or on some architectures,
> > 	quite a bit longer) before actually doing anything.
> > 
> > o	raise_softirq() is more immediate and is an easy change, but
> > 	adds a softirq vector -- which people are really trying to
> > 	get rid of.  Also, wakeup_softirqd() calls things that acquire
> > 	the scheduler locks, which is exactly what we were trying to
> > 	avoid doing.
> > 
> > o	invoke_rcu_core() can invoke raise_softirq() as above.
> > 
> > o	IPI to self.  From what I can see, not all architectures
> > 	support this.  Easy to fake if you have at least two CPUs,
> > 	but not so good from an OS jitter viewpoint...
> > 
> > o	Add a check to local_irq_disable() and friends.  I would guess
> > 	that this suggestion would not make architecture maintainers
> > 	happy.
> > 
> > Other thoughts?
> > 
> > 							Thanx, Paul
> > 
> >> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> >> ---
> >>  include/linux/rcupdate.h |    2 +-
> >>  kernel/rcupdate.c        |    2 +-
> >>  kernel/rcutree_plugin.h  |   47 +++++++++++++++++++++++++++++++++++++++++----
> >>  3 files changed, 44 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> >> index 4b14bdc..00b4220 100644
> >> --- a/include/linux/rcupdate.h
> >> +++ b/include/linux/rcupdate.h
> >> @@ -180,7 +180,7 @@ extern void synchronize_sched(void);
> >>
> >>  extern void __rcu_read_lock(void);
> >>  extern void __rcu_read_unlock(void);
> >> -extern void rcu_read_unlock_special(struct task_struct *t);
> >> +extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
> >>  void synchronize_rcu(void);
> >>
> >>  /*
> >> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> >> index cce6ba8..33b89a3 100644
> >> --- a/kernel/rcupdate.c
> >> +++ b/kernel/rcupdate.c
> >> @@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
> >>  #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
> >>  		barrier();  /* assign before ->rcu_read_unlock_special load */
> >>  		if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
> >> -			rcu_read_unlock_special(t);
> >> +			rcu_read_unlock_special(t, true);
> >>  		barrier();  /* ->rcu_read_unlock_special load before assign */
> >>  		t->rcu_read_lock_nesting = 0;
> >>  	}
> >> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> >> index fc8b36f..997b424 100644
> >> --- a/kernel/rcutree_plugin.h
> >> +++ b/kernel/rcutree_plugin.h
> >> @@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
> >>  				       ? rnp->gpnum
> >>  				       : rnp->gpnum + 1);
> >>  		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >> -	} else if (t->rcu_read_lock_nesting < 0 &&
> >> -		   !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
> >> -		   t->rcu_read_unlock_special) {
> >> +	} else if (t->rcu_read_lock_nesting == 0 ||
> >> +		   (t->rcu_read_lock_nesting < 0 &&
> >> +		   !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
> >>
> >>  		/*
> >>  		 * Complete exit from RCU read-side critical section on
> >>  		 * behalf of preempted instance of __rcu_read_unlock().
> >>  		 */
> >> -		rcu_read_unlock_special(t);
> >> +		if (t->rcu_read_unlock_special)
> >> +			rcu_read_unlock_special(t, false);
> >>  	}
> >>
> >>  	/*
> >> @@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
> >>   * notify RCU core processing or task having blocked during the RCU
> >>   * read-side critical section.
> >>   */
> >> -void rcu_read_unlock_special(struct task_struct *t)
> >> +void rcu_read_unlock_special(struct task_struct *t, bool unlock)
> >>  {
> >>  	int empty;
> >>  	int empty_exp;
> >> @@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
> >>
> >>  	/* Clean up if blocked during RCU read-side critical section. */
> >>  	if (special & RCU_READ_UNLOCK_BLOCKED) {
> >> +		/*
> >> +		 * If rcu read lock overlaps with scheduler lock,
> >> +		 * rcu_read_unlock_special() may lead to deadlock:
> >> +		 *
> >> +		 * rcu_read_lock();
> >> +		 * preempt_schedule[_irq]() (when preemption)
> >> +		 * scheduler lock; (or some other locks can be (chained) nested
> >> +		 *                  in rcu_read_unlock_special()/rnp->lock)
> >> +		 * access and check rcu data
> >> +		 * rcu_read_unlock();
> >> +		 *   rcu_read_unlock_special();
> >> +		 *     wake_up();                 DEAD LOCK
> >> +		 *
> >> +		 * To avoid all these kinds of deadlock, we should quit
> >> +		 * rcu_read_unlock_special() here and defer it to
> >> +		 * rcu_preempt_note_context_switch() or next outmost
> >> +		 * rcu_read_unlock() if we consider this case may happen.
> >> +		 *
> >> +		 * Although we can't know whether current _special()
> >> +		 * is nested in scheduler lock or not. But we know that
> >> +		 * irqs are always disabled in this case. so we just quit
> >> +		 * and defer it to rcu_preempt_note_context_switch()
> >> +		 * when irqs are disabled.
> >> +		 *
> >> +		 * It means we always defer _special() when it is
> >> +		 * nested in irqs disabled context, but
> >> +		 *	(special & RCU_READ_UNLOCK_BLOCKED) &&
> >> +		 *	irqs_disabled_flags(flags)
> >> +		 * is still unlikely to be true.
> >> +		 */
> >> +		if (unlikely(unlock && irqs_disabled_flags(flags))) {
> >> +			set_need_resched();
> >> +			local_irq_restore(flags);
> >> +			return;
> >> +		}
> >> +
> >>  		t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
> >>
> >>  		/*
> >> -- 
> >> 1.7.4.4
> >>
> > 
> > 
> 

