From: Xunlei Pang <xpang@redhat.com>
To: Juri Lelli <juri.lelli@arm.com>, Peter Zijlstra <peterz@infradead.org>
Cc: mingo@kernel.org, tglx@linutronix.de, rostedt@goodmis.org,
	xlpang@redhat.com, linux-kernel@vger.kernel.org,
	mathieu.desnoyers@efficios.com, jdesfossez@efficios.com,
	bristot@redhat.com, Ingo Molnar <mingo@redhat.com>
Subject: Re: [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
Date: Tue, 14 Jun 2016 20:53:44 +0800	[thread overview]
Message-ID: <575FFE58.1090102@redhat.com> (raw)
In-Reply-To: <20160614102109.GF5981@e106622-lin>

[-- Attachment #1: Type: text/plain, Size: 9770 bytes --]

On 2016/06/14 at 18:21, Juri Lelli wrote:
> Hi,
>
> On 07/06/16 21:56, Peter Zijlstra wrote:
>> From: Xunlei Pang <xlpang@redhat.com>
>>
>> A crash happened while I was playing with deadline PI rtmutex.
>>
>>     BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>>     IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
>>     PGD 232a75067 PUD 230947067 PMD 0
>>     Oops: 0000 [#1] SMP
>>     CPU: 1 PID: 10994 Comm: a.out Not tainted
>>
>>     Call Trace:
>>     [<ffffffff810b658c>] enqueue_task+0x2c/0x80
>>     [<ffffffff810ba763>] activate_task+0x23/0x30
>>     [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
>>     [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
>>     [<ffffffff8164e783>] __schedule+0xd3/0x900
>>     [<ffffffff8164efd9>] schedule+0x29/0x70
>>     [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
>>     [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
>>     [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
>>     [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
>>     [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
>>     [<ffffffff810edd50>] SyS_futex+0x80/0x180
>>
> This seems to be caused by the race condition you detail below between
> load balancing and PI code. I tried to reproduce the BUG on my box, but
> it looks hard to get. Do you have a reproducer I can give a try?
>
>> This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
>> are only protected by pi_lock when operating on pi waiters, while
>> rt_mutex_get_top_task() accesses them with the rq lock held but
>> without holding pi_lock.
>>
>> To tackle this, we introduce a new "pi_top_task" pointer cached
>> in task_struct, and add a new rt_mutex_update_top_task() to update
>> its value. It is called from rt_mutex_setprio(), which holds both
>> the owner's pi_lock and the rq lock, so "pi_top_task" can then be
>> safely accessed by enqueue_task_dl() under the rq lock.
>>
>> [XXX this next section is unparsable]
> Yes, a bit hard to understand. However, am I correct in assuming this
> patch and the previous one should fix this problem? Or are there still
> other races causing issues?

Yes, these two patches can fix the problem.
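
To illustrate the locking rule, here is a simplified sketch condensed from
the hunks below (not the exact kernel code): the cached pointer is only
written with both the owner's pi_lock and the rq lock held, so a reader
that holds only the rq lock (e.g. enqueue_task_dl()) always sees a stable
value:

	/*
	 * Writer side: only called from rt_mutex_setprio(), i.e. with both
	 * the owner's pi_lock and the rq lock held.
	 */
	void rt_mutex_update_top_task(struct task_struct *p)
	{
		p->pi_top_task = task_has_pi_waiters(p) ?
				 task_top_pi_waiter(p)->task : NULL;
	}

	/*
	 * Reader side: e.g. enqueue_task_dl() under the rq lock only; the
	 * cached pointer cannot change underneath it because every writer
	 * also holds the rq lock.
	 */
	struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
	{
		return p->pi_top_task;
	}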

>
>> One problem is the rt_mutex_adjust_prio()->...->rt_mutex_setprio()
>> path: by the time rt_mutex_setprio() runs, the rtmutex lock has been
>> released and the owner marked off, so the task "pi_top_task" points
>> to may already be running (it can be falsely woken up by others
>> before rt_mutex_setprio() gets to update "pi_top_task"). We solve
>> this by calling __rt_mutex_adjust_prio() directly from
>> mark_wakeup_next_waiter(), which holds both pi_lock and the rtmutex
>> lock, and removing rt_mutex_adjust_prio(). Since this moves the
>> deboost point, we also move preempt_disable() before the rtmutex
>> unlock to avoid current being preempted due to the earlier deboost
>> before wake_up_q().
>>
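
(A simplified sketch of the resulting unlock ordering, condensed from the
hunks below rather than the exact code:

	/* rt_mutex_slowunlock(), with lock->wait_lock held: */
	mark_wakeup_next_waiter(wake_q, lock);	/* dequeues waiter, deboosts current */
	preempt_disable();			/* stay on the CPU while deboosted */
	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);

	/* rt_mutex_postunlock(): */
	wake_up_q(wake_q);			/* wake the top waiter first */
	if (deboost)
		preempt_enable();		/* pairs with the preempt_disable() above */

so the deboosted owner cannot be preempted between releasing the lock and
waking its top waiter.)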
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Juri Lelli <juri.lelli@arm.com>
>> Originally-From: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Xunlei Pang <xlpang@redhat.com>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Link: http://lkml.kernel.org/r/1461659449-19497-2-git-send-email-xlpang@redhat.com
> The idea of this fix makes sense to me. But, I would like to be able to
> see the BUG and test the fix. What I have is a test in which I create N
> DEADLINE workers that share a PI mutex. They get migrated around and
> seem to stress PI code. But I couldn't hit the BUG yet. Maybe I'll let
> it run for some more time.

You can use the attached reproducer (build with: gcc crash_deadline_pi.c -lpthread -lrt).
Start multiple instances and it will hit the bug very soon.

Regards,
Xunlei

>
> Best,
>
> - Juri
>
>> ---
>>
>>  include/linux/init_task.h |    1 
>>  include/linux/sched.h     |    2 +
>>  include/linux/sched/rt.h  |    1 
>>  kernel/fork.c             |    1 
>>  kernel/locking/rtmutex.c  |   65 +++++++++++++++++++---------------------------
>>  kernel/sched/core.c       |    2 +
>>  6 files changed, 34 insertions(+), 38 deletions(-)
>>
>> --- a/include/linux/init_task.h
>> +++ b/include/linux/init_task.h
>> @@ -162,6 +162,7 @@ extern struct task_group root_task_group
>>  #ifdef CONFIG_RT_MUTEXES
>>  # define INIT_RT_MUTEXES(tsk)						\
>>  	.pi_waiters = RB_ROOT,						\
>> +	.pi_top_task = NULL,						\
>>  	.pi_waiters_leftmost = NULL,
>>  #else
>>  # define INIT_RT_MUTEXES(tsk)
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1681,6 +1681,8 @@ struct task_struct {
>>  	/* PI waiters blocked on a rt_mutex held by this task */
>>  	struct rb_root pi_waiters;
>>  	struct rb_node *pi_waiters_leftmost;
>> +	/* Updated under owner's pi_lock and rq lock */
>> +	struct task_struct *pi_top_task;
>>  	/* Deadlock detection and priority inheritance handling */
>>  	struct rt_mutex_waiter *pi_blocked_on;
>>  #endif
>> --- a/include/linux/sched/rt.h
>> +++ b/include/linux/sched/rt.h
>> @@ -19,6 +19,7 @@ static inline int rt_task(struct task_st
>>  extern int rt_mutex_getprio(struct task_struct *p);
>>  extern void rt_mutex_setprio(struct task_struct *p, int prio);
>>  extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
>> +extern void rt_mutex_update_top_task(struct task_struct *p);
>>  extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
>>  extern void rt_mutex_adjust_pi(struct task_struct *p);
>>  static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1219,6 +1219,7 @@ static void rt_mutex_init_task(struct ta
>>  #ifdef CONFIG_RT_MUTEXES
>>  	p->pi_waiters = RB_ROOT;
>>  	p->pi_waiters_leftmost = NULL;
>> +	p->pi_top_task = NULL;
>>  	p->pi_blocked_on = NULL;
>>  #endif
>>  }
>> --- a/kernel/locking/rtmutex.c
>> +++ b/kernel/locking/rtmutex.c
>> @@ -256,6 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
>>  	RB_CLEAR_NODE(&waiter->pi_tree_entry);
>>  }
>>  
>> +void rt_mutex_update_top_task(struct task_struct *p)
>> +{
>> +	if (!task_has_pi_waiters(p)) {
>> +		p->pi_top_task = NULL;
>> +		return;
>> +	}
>> +
>> +	p->pi_top_task = task_top_pi_waiter(p)->task;
>> +}
>> +
>>  /*
>>   * Calculate task priority from the waiter tree priority
>>   *
>> @@ -273,10 +283,7 @@ int rt_mutex_getprio(struct task_struct
>>  
>>  struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
>>  {
>> -	if (likely(!task_has_pi_waiters(task)))
>> -		return NULL;
>> -
>> -	return task_top_pi_waiter(task)->task;
>> +	return task->pi_top_task;
>>  }
>>  
>>  /*
>> @@ -285,12 +292,12 @@ struct task_struct *rt_mutex_get_top_tas
>>   */
>>  int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
>>  {
>> -	if (!task_has_pi_waiters(task))
>> +	struct task_struct *top_task = rt_mutex_get_top_task(task);
>> +
>> +	if (!top_task)
>>  		return newprio;
>>  
>> -	if (task_top_pi_waiter(task)->task->prio <= newprio)
>> -		return task_top_pi_waiter(task)->task->prio;
>> -	return newprio;
>> +	return min(top_task->prio, newprio);
>>  }
>>  
>>  /*
>> @@ -307,24 +314,6 @@ static void __rt_mutex_adjust_prio(struc
>>  }
>>  
>>  /*
>> - * Adjust task priority (undo boosting). Called from the exit path of
>> - * rt_mutex_slowunlock() and rt_mutex_slowlock().
>> - *
>> - * (Note: We do this outside of the protection of lock->wait_lock to
>> - * allow the lock to be taken while or before we readjust the priority
>> - * of task. We do not use the spin_xx_mutex() variants here as we are
>> - * outside of the debug path.)
>> - */
>> -void rt_mutex_adjust_prio(struct task_struct *task)
>> -{
>> -	unsigned long flags;
>> -
>> -	raw_spin_lock_irqsave(&task->pi_lock, flags);
>> -	__rt_mutex_adjust_prio(task);
>> -	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
>> -}
>> -
>> -/*
>>   * Deadlock detection is conditional:
>>   *
>>   * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
>> @@ -987,6 +976,7 @@ static void mark_wakeup_next_waiter(stru
>>  	 * lock->wait_lock.
>>  	 */
>>  	rt_mutex_dequeue_pi(current, waiter);
>> +	__rt_mutex_adjust_prio(current);
>>  
>>  	/*
>>  	 * As we are waking up the top waiter, and the waiter stays
>> @@ -1325,6 +1315,16 @@ static bool __sched rt_mutex_slowunlock(
>>  	 */
>>  	mark_wakeup_next_waiter(wake_q, lock);
>>  
>> +	/*
>> +	 * We should deboost before waking the top waiter task such that
>> +	 * we don't run two tasks with the 'same' priority. This however
>> +	 * can lead to prio-inversion if we would get preempted after
>> +	 * the deboost but before waking our high-prio task, hence the
>> +	 * preempt_disable before unlock. Pairs with preempt_enable() in
>> +	 * rt_mutex_postunlock();
>> +	 */
>> +	preempt_disable();
>> +
>>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>>  
>>  	/* check PI boosting */
>> @@ -1400,20 +1400,9 @@ rt_mutex_fastunlock(struct rt_mutex *loc
>>   */
>>  void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
>>  {
>> -	/*
>> -	 * We should deboost before waking the top waiter task such that
>> -	 * we don't run two tasks with the 'same' priority. This however
>> -	 * can lead to prio-inversion if we would get preempted after
>> -	 * the deboost but before waking our high-prio task, hence the
>> -	 * preempt_disable.
>> -	 */
>> -	if (deboost) {
>> -		preempt_disable();
>> -		rt_mutex_adjust_prio(current);
>> -	}
>> -
>>  	wake_up_q(wake_q);
>>  
>> +	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
>>  	if (deboost)
>>  		preempt_enable();
>>  }
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3568,6 +3568,8 @@ void rt_mutex_setprio(struct task_struct
>>  		goto out_unlock;
>>  	}
>>  
>> +	rt_mutex_update_top_task(p);
>> +
>>  	trace_sched_pi_setprio(p, prio);
>>  	oldprio = p->prio;
>>  
>>
>>


[-- Attachment #2: crash_deadline_pi.c --]
[-- Type: text/x-csrc, Size: 4149 bytes --]
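
/*
 * crash_deadline_pi.c - reproducer for the deadline PI crash discussed
 * above: several SCHED_DEADLINE threads randomly take and drop a
 * PTHREAD_PRIO_INHERIT mutex while being migrated around, stressing the
 * PI and deadline load-balancing paths (pull_dl_task() shows up in the
 * oops).
 *
 * Build: gcc crash_deadline_pi.c -lpthread -lrt
 * Run several instances in parallel to hit the BUG sooner.
 */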

#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <linux/unistd.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <sys/syscall.h>
#include <pthread.h>
#include <sched.h>
#include <time.h>

 #define gettid() syscall(__NR_gettid)

 #define SCHED_DEADLINE	6

 /* XXX use the proper syscall numbers */
 #ifdef __x86_64__
 #define __NR_sched_setattr		314
 #define __NR_sched_getattr		315
 #endif

 #ifdef __i386__
 #define __NR_sched_setattr		351
 #define __NR_sched_getattr		352
 #endif

 #ifdef __ppc64__
 #define __NR_sched_setattr		355
 #define __NR_sched_getattr		356
 #endif

#ifdef __s390x__
#define __NR_sched_setattr     345
#define __NR_sched_getattr     346
#endif


static volatile int done;

 struct sched_attr {
	__u32 size;

	__u32 sched_policy;
	__u64 sched_flags;

	/* SCHED_NORMAL, SCHED_BATCH */
	__s32 sched_nice;

	/* SCHED_FIFO, SCHED_RR */
	__u32 sched_priority;

	/* SCHED_DEADLINE (nsec) */
	__u64 sched_runtime;
	__u64 sched_deadline;
	__u64 sched_period;
 };

 int sched_setattr(pid_t pid,
		  const struct sched_attr *attr,
		  unsigned int flags)
 {
	return syscall(__NR_sched_setattr, pid, attr, flags);
 }

 int sched_getattr(pid_t pid,
		  struct sched_attr *attr,
		  unsigned int size,
		  unsigned int flags)
 {
	return syscall(__NR_sched_getattr, pid, attr, size, flags);
 }

pthread_mutex_t mutex_obj;
pthread_mutexattr_t mutex_attr;
int x = 0;

static int decide(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);

	if (ts.tv_nsec & 1)
		return 1;
	else
		return 0;
}

static void mutex_init(void)
{
	pthread_mutexattr_init(&mutex_attr);
	pthread_mutexattr_setprotocol(&mutex_attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&mutex_obj, &mutex_attr);
}

static void deadline_ndelay(unsigned int cnt)
{
	/* volatile so the busy-wait is not optimized away */
	volatile unsigned int i;

	for (i = 0; i < 10000 * cnt; i++)
		;
}

 void *run_deadline_special(void *data)
 {
	struct sched_attr attr;
	int ret, take;
	unsigned int flags = 0;

	attr.size = sizeof(attr);
	attr.sched_flags = 0;
	attr.sched_nice = 0;
	attr.sched_priority = 0;

	/* This creates a 100ms/200ms/300ms (runtime/deadline/period) reservation */
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = 100 * 1000 * 1000;
	attr.sched_deadline = 200 * 1000 * 1000;
	attr.sched_period = 300 * 1000 * 1000;

	ret = sched_setattr(0, &attr, flags);
	if (ret < 0) {
		done = 0;
		perror("sched_setattr");
		exit(-1);
	}

	printf("special deadline thread started [%ld]\n", gettid());
	while (!done) {
		take = decide();
		if (take)
			pthread_mutex_lock(&mutex_obj);

		x++;
		deadline_ndelay((unsigned long) attr.sched_runtime % 7);

		if (take)
			pthread_mutex_unlock(&mutex_obj);
	}

	printf("special deadline thread dies [%ld]\n", gettid());
	return NULL;
 }

 void *run_deadline(void *data)
 {
	struct sched_attr attr;
	int ret, take;
	unsigned int flags = 0;
	static unsigned int delta = 0;


	attr.size = sizeof(attr);
	attr.sched_flags = 0;
	attr.sched_nice = 0;
	attr.sched_priority = 0;

	/* This creates a (20ms + delta)/400ms/400ms (runtime/deadline/period) reservation */
	delta += 1000 * 1000 * 2;
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = 20 * 1000 * 1000 + delta;
	attr.sched_deadline = 400 * 1000 * 1000;
	attr.sched_period = 400 * 1000 * 1000;

	ret = sched_setattr(0, &attr, flags);
	if (ret < 0) {
		done = 0;
		perror("sched_setattr");
		exit(-1);
	}

	printf("deadline thread started [%ld]\n", gettid());
	while (!done) {
		take = decide();
		if (take)
			pthread_mutex_lock(&mutex_obj);

		x++;
		deadline_ndelay((unsigned long) attr.sched_runtime % 7);

		if (take)
			pthread_mutex_unlock(&mutex_obj);
	}

	printf("deadline thread dies [%ld]\n", gettid());
	return NULL;
 }

#define THREAD_NUM 10

 int main (int argc, char **argv)
 {
	pthread_t thread[THREAD_NUM];
	int i;

	mutex_init();

	printf("main thread [%ld]\n", gettid());

	for (i = 0; i < THREAD_NUM-1; i++)
		pthread_create(&thread[i], NULL, run_deadline, NULL);

	pthread_create(&thread[THREAD_NUM-1], NULL, run_deadline_special, NULL);

	sleep(3600*300);

	done = 1;
	for (i = 0; i < THREAD_NUM; i++)
		pthread_join(thread[i], NULL);

	printf("main dies [%ld]\n", gettid());
	return 0;
 }
