All of lore.kernel.org
 help / color / mirror / Atom feed
From: Juri Lelli <juri.lelli@arm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: mingo@kernel.org, tglx@linutronix.de, rostedt@goodmis.org,
	xlpang@redhat.com, linux-kernel@vger.kernel.org,
	mathieu.desnoyers@efficios.com, jdesfossez@efficios.com,
	bristot@redhat.com, Ingo Molnar <mingo@redhat.com>
Subject: Re: [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
Date: Tue, 14 Jun 2016 11:21:09 +0100	[thread overview]
Message-ID: <20160614102109.GF5981@e106622-lin> (raw)
In-Reply-To: <20160607200215.719626477@infradead.org>

Hi,

On 07/06/16 21:56, Peter Zijlstra wrote:
> From: Xunlei Pang <xlpang@redhat.com>
> 
> A crash happened while I was playing with deadline PI rtmutex.
> 
>     BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>     IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
>     PGD 232a75067 PUD 230947067 PMD 0
>     Oops: 0000 [#1] SMP
>     CPU: 1 PID: 10994 Comm: a.out Not tainted
> 
>     Call Trace:
>     [<ffffffff810b658c>] enqueue_task+0x2c/0x80
>     [<ffffffff810ba763>] activate_task+0x23/0x30
>     [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
>     [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
>     [<ffffffff8164e783>] __schedule+0xd3/0x900
>     [<ffffffff8164efd9>] schedule+0x29/0x70
>     [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
>     [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
>     [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
>     [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
>     [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
>     [<ffffffff810edd50>] SyS_futex+0x80/0x180
> 

This seems to be caused by the race condition you detail below between
load balancing and PI code. I tried to reproduce the BUG on my box, but
it looks hard to get. Do you have a reproducer I can give a try?

> This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
> are only protected by pi_lock when operating pi waiters, while
> rt_mutex_get_top_task(), will access them with rq lock held but
> not holding pi_lock.
> 
> In order to tackle it, we introduce new "pi_top_task" pointer
> cached in task_struct, and add new rt_mutex_update_top_task()
> to update its value, it can be called by rt_mutex_setprio()
> which held both owner's pi_lock and rq lock. Thus "pi_top_task"
> can be safely accessed by enqueue_task_dl() under rq lock.
> 
> [XXX this next section is unparsable]

Yes, a bit hard to understand. However, am I correct in assuming this
patch and the previous one should fix this problem? Or are there still
other races causing issues?

> One problem is when rt_mutex_adjust_prio()->...->rt_mutex_setprio(),
> at that time rtmutex lock was released and owner was marked off,
> this can cause "pi_top_task" dereferenced to be a running one(as it
> can be falsely woken up by others before rt_mutex_setprio() is
> made to update "pi_top_task"). We solve this by directly calling
> __rt_mutex_adjust_prio() in mark_wakeup_next_waiter() which held
> pi_lock and rtmutex lock, and remove rt_mutex_adjust_prio(). Since
> now we moved the deboost point, in order to avoid current to be
> preempted due to deboost earlier before wake_up_q(), we also moved
> preempt_disable() before unlocking rtmutex.
> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Juri Lelli <juri.lelli@arm.com>
> Originally-From: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Xunlei Pang <xlpang@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1461659449-19497-2-git-send-email-xlpang@redhat.com

The idea of this fix makes sense to me. But, I would like to be able to
see the BUG and test the fix. What I have is a test in which I create N
DEADLINE workers that share a PI mutex. They get migrated around and
seem to stress PI code. But I couldn't hit the BUG yet. Maybe I let it
run for some more time.

Best,

- Juri

> ---
> 
>  include/linux/init_task.h |    1 
>  include/linux/sched.h     |    2 +
>  include/linux/sched/rt.h  |    1 
>  kernel/fork.c             |    1 
>  kernel/locking/rtmutex.c  |   65 +++++++++++++++++++---------------------------
>  kernel/sched/core.c       |    2 +
>  6 files changed, 34 insertions(+), 38 deletions(-)
> 
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -162,6 +162,7 @@ extern struct task_group root_task_group
>  #ifdef CONFIG_RT_MUTEXES
>  # define INIT_RT_MUTEXES(tsk)						\
>  	.pi_waiters = RB_ROOT,						\
> +	.pi_top_task = NULL,						\
>  	.pi_waiters_leftmost = NULL,
>  #else
>  # define INIT_RT_MUTEXES(tsk)
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1681,6 +1681,8 @@ struct task_struct {
>  	/* PI waiters blocked on a rt_mutex held by this task */
>  	struct rb_root pi_waiters;
>  	struct rb_node *pi_waiters_leftmost;
> +	/* Updated under owner's pi_lock and rq lock */
> +	struct task_struct *pi_top_task;
>  	/* Deadlock detection and priority inheritance handling */
>  	struct rt_mutex_waiter *pi_blocked_on;
>  #endif
> --- a/include/linux/sched/rt.h
> +++ b/include/linux/sched/rt.h
> @@ -19,6 +19,7 @@ static inline int rt_task(struct task_st
>  extern int rt_mutex_getprio(struct task_struct *p);
>  extern void rt_mutex_setprio(struct task_struct *p, int prio);
>  extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
> +extern void rt_mutex_update_top_task(struct task_struct *p);
>  extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
>  extern void rt_mutex_adjust_pi(struct task_struct *p);
>  static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1219,6 +1219,7 @@ static void rt_mutex_init_task(struct ta
>  #ifdef CONFIG_RT_MUTEXES
>  	p->pi_waiters = RB_ROOT;
>  	p->pi_waiters_leftmost = NULL;
> +	p->pi_top_task = NULL;
>  	p->pi_blocked_on = NULL;
>  #endif
>  }
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -256,6 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
>  	RB_CLEAR_NODE(&waiter->pi_tree_entry);
>  }
>  
> +void rt_mutex_update_top_task(struct task_struct *p)
> +{
> +	if (!task_has_pi_waiters(p)) {
> +		p->pi_top_task = NULL;
> +		return;
> +	}
> +
> +	p->pi_top_task = task_top_pi_waiter(p)->task;
> +}
> +
>  /*
>   * Calculate task priority from the waiter tree priority
>   *
> @@ -273,10 +283,7 @@ int rt_mutex_getprio(struct task_struct
>  
>  struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
>  {
> -	if (likely(!task_has_pi_waiters(task)))
> -		return NULL;
> -
> -	return task_top_pi_waiter(task)->task;
> +	return task->pi_top_task;
>  }
>  
>  /*
> @@ -285,12 +292,12 @@ struct task_struct *rt_mutex_get_top_tas
>   */
>  int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
>  {
> -	if (!task_has_pi_waiters(task))
> +	struct task_struct *top_task = rt_mutex_get_top_task(task);
> +
> +	if (!top_task)
>  		return newprio;
>  
> -	if (task_top_pi_waiter(task)->task->prio <= newprio)
> -		return task_top_pi_waiter(task)->task->prio;
> -	return newprio;
> +	return min(top_task->prio, newprio);
>  }
>  
>  /*
> @@ -307,24 +314,6 @@ static void __rt_mutex_adjust_prio(struc
>  }
>  
>  /*
> - * Adjust task priority (undo boosting). Called from the exit path of
> - * rt_mutex_slowunlock() and rt_mutex_slowlock().
> - *
> - * (Note: We do this outside of the protection of lock->wait_lock to
> - * allow the lock to be taken while or before we readjust the priority
> - * of task. We do not use the spin_xx_mutex() variants here as we are
> - * outside of the debug path.)
> - */
> -void rt_mutex_adjust_prio(struct task_struct *task)
> -{
> -	unsigned long flags;
> -
> -	raw_spin_lock_irqsave(&task->pi_lock, flags);
> -	__rt_mutex_adjust_prio(task);
> -	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
> -}
> -
> -/*
>   * Deadlock detection is conditional:
>   *
>   * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
> @@ -987,6 +976,7 @@ static void mark_wakeup_next_waiter(stru
>  	 * lock->wait_lock.
>  	 */
>  	rt_mutex_dequeue_pi(current, waiter);
> +	__rt_mutex_adjust_prio(current);
>  
>  	/*
>  	 * As we are waking up the top waiter, and the waiter stays
> @@ -1325,6 +1315,16 @@ static bool __sched rt_mutex_slowunlock(
>  	 */
>  	mark_wakeup_next_waiter(wake_q, lock);
>  
> +	/*
> +	 * We should deboost before waking the top waiter task such that
> +	 * we don't run two tasks with the 'same' priority. This however
> +	 * can lead to prio-inversion if we would get preempted after
> +	 * the deboost but before waking our high-prio task, hence the
> +	 * preempt_disable before unlock. Pairs with preempt_enable() in
> +	 * rt_mutex_postunlock();
> +	 */
> +	preempt_disable();
> +
>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  
>  	/* check PI boosting */
> @@ -1400,20 +1400,9 @@ rt_mutex_fastunlock(struct rt_mutex *loc
>   */
>  void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
>  {
> -	/*
> -	 * We should deboost before waking the top waiter task such that
> -	 * we don't run two tasks with the 'same' priority. This however
> -	 * can lead to prio-inversion if we would get preempted after
> -	 * the deboost but before waking our high-prio task, hence the
> -	 * preempt_disable.
> -	 */
> -	if (deboost) {
> -		preempt_disable();
> -		rt_mutex_adjust_prio(current);
> -	}
> -
>  	wake_up_q(wake_q);
>  
> +	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
>  	if (deboost)
>  		preempt_enable();
>  }
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3568,6 +3568,8 @@ void rt_mutex_setprio(struct task_struct
>  		goto out_unlock;
>  	}
>  
> +	rt_mutex_update_top_task(p);
> +
>  	trace_sched_pi_setprio(p, prio);
>  	oldprio = p->prio;
>  
> 
> 

  reply	other threads:[~2016-06-14 10:21 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-07 19:56 [RFC][PATCH 0/8] PI and assorted failings Peter Zijlstra
2016-06-07 19:56 ` [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
2016-06-14  9:09   ` Juri Lelli
2016-06-14 12:54     ` Peter Zijlstra
2016-06-14 13:20       ` Juri Lelli
2016-06-14 13:59         ` Peter Zijlstra
2016-06-14 16:36     ` Davidlohr Bueso
2016-06-14 17:01       ` Juri Lelli
2016-06-14 18:22   ` Steven Rostedt
2016-06-07 19:56 ` [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
2016-06-14 10:21   ` Juri Lelli [this message]
2016-06-14 12:30     ` Peter Zijlstra
2016-06-14 12:53     ` Xunlei Pang
2016-06-14 13:07       ` Juri Lelli
2016-06-14 16:39         ` Juri Lelli
2016-06-14 18:42   ` Steven Rostedt
2016-06-14 20:28     ` Peter Zijlstra
2016-06-15 16:14       ` Steven Rostedt
2016-06-07 19:56 ` [RFC][PATCH 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
2016-06-14 10:43   ` Juri Lelli
2016-06-14 12:09     ` Peter Zijlstra
2016-06-15 16:30   ` Steven Rostedt
2016-06-15 17:55     ` Peter Zijlstra
2016-06-07 19:56 ` [RFC][PATCH 4/8] rtmutex: Remove rt_mutex_fastunlock() Peter Zijlstra
2016-06-15 16:43   ` Steven Rostedt
2016-06-07 19:56 ` [RFC][PATCH 5/8] rtmutex: Clean up Peter Zijlstra
2016-06-14 12:08   ` Juri Lelli
2016-06-14 12:32     ` Peter Zijlstra
2016-06-14 12:41       ` Juri Lelli
2016-06-07 19:56 ` [RFC][PATCH 6/8] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
2016-06-14 13:14   ` Juri Lelli
2016-06-14 14:08     ` Peter Zijlstra
2016-06-07 19:56 ` [RFC][PATCH 7/8] sched,tracing: Update trace_sched_pi_setprio() Peter Zijlstra
2016-06-07 19:56 ` [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity Peter Zijlstra
2016-06-14 17:39   ` Juri Lelli
2016-06-14 19:44     ` Peter Zijlstra
2016-06-15  7:25       ` Juri Lelli
2016-06-27 12:23         ` Peter Zijlstra
2016-06-27 12:40           ` Thomas Gleixner
2016-06-28  9:05           ` Juri Lelli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160614102109.GF5981@e106622-lin \
    --to=juri.lelli@arm.com \
    --cc=bristot@redhat.com \
    --cc=jdesfossez@efficios.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=xlpang@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.