* Possible race in rt_mutex_adjust_prio_chain
@ 2023-07-06  6:08 Henry Wu
  2023-07-06 12:01 ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Henry Wu @ 2023-07-06  6:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz

[-- Attachment #1: Type: text/plain, Size: 5634 bytes --]

Hi,

I found that it's not safe to change waiter->prio after the waiter is
dequeued from the mutex's waiter rbtree, because it is still on the
owner's pi_waiters rbtree. From my analysis, waiters on the pi_waiters
rbtree should be protected by the pi_lock of the task that owns
pi_waiters and by the corresponding rt_mutex's wait_lock, but
pi_waiters is shared by all locks owned by this task, so in practice
changes to pi_waiters are serialized only by pi_lock.

`rt_mutex_adjust_prio_chain' changes the key of a node on the
pi_waiters rbtree without holding pi_lock, so the pi_waiters rbtree's
invariant is violated. Meanwhile another CPU may be enqueueing a
waiter, and the pi_waiters rbtree may end up corrupted.
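
(For reference, a trimmed, approximate sketch of struct rt_mutex_waiter
as in the pre-fix v6.x kernel/locking/rtmutex_common.h; some members are
omitted. The point is that one waiter is linked into both rbtrees through
two embedded rb_nodes, while both trees are keyed by the same
prio/deadline fields:)

struct rt_mutex_waiter {
	struct rb_node		tree_entry;	/* node in lock->waiters */
	struct rb_node		pi_tree_entry;	/* node in owner's pi_waiters */
	struct task_struct	*task;
	struct rt_mutex_base	*lock;
	int			prio;		/* key used by BOTH trees */
	u64			deadline;	/* key used by BOTH trees */
};

Updating prio/deadline while the waiter is still linked into the owner's
pi_waiters is what can break that tree's ordering.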

I attach a source file which can trigger this violation. I tested it
on Ubuntu 20.04 LTS with 5.4 kernel.

$ sudo ./a.out
pid: 39949
prio: 39
    PID     LWP TTY          TIME CMD
  39949   39949 pts/1    00:00:00 a.out
  39949   39950 pts/1    00:00:00 waiter-0
  39949   39951 pts/1    00:00:00 waiter-1
.............
prio: 20
prio: 20
prio: 20
prio: 20
prio: 20
prio: 20
prio: 20
.............
prio: 21
prio: 21
prio: 21
prio: 21
prio: 21
prio: 21
prio: 21
prio: 21
prio: 21
prio: 21
prio: 21
    PID     LWP TTY          TIME CMD
  39949   39949 pts/1    00:00:00 a.out
  39949   39950 pts/1    00:00:00 waiter-0
  39949   39951 pts/1    00:00:00 waiter-1
  39949   39952 pts/1    00:00:00 waiter-2
  39949   39953 pts/1    00:00:00 waiter-3
  39949   39954 pts/1    00:00:00 waiter-4
  39949   39955 pts/1    00:00:00 waiter-5
  39949   39956 pts/1    00:00:00 waiter-6
  39949   39957 pts/1    00:00:00 waiter-7
  39949   39958 pts/1    00:00:00 waiter-8
  39949   39959 pts/1    00:00:00 waiter-9
  39949   39960 pts/1    00:00:00 waiter-10
  39949   39961 pts/1    00:00:00 waiter-11
  39949   39962 pts/1    00:00:00 waiter-12
  39949   39963 pts/1    00:00:00 waiter-13
  39949   39964 pts/1    00:00:00 waiter-14
  39949   39965 pts/1    00:00:00 waiter-15
  39949   39966 pts/1    00:00:00 waiter-16
  39949   39967 pts/1    00:00:00 waiter-17
  39949   39968 pts/1    00:00:00 waiter-18
  39949   39969 pts/1    00:00:00 changer-0
  39949   39970 pts/1    00:00:00 changer-1
  39949   39971 pts/1    00:00:00 changer-2
  39949   39972 pts/1    00:00:00 changer-3
  39949   39973 pts/1    00:00:00 changer-4
  39949   39974 pts/1    00:00:00 changer-5
  39949   39975 pts/1    00:00:00 changer-6
  39949   39976 pts/1    00:00:00 changer-7
  39949   39977 pts/1    00:00:00 changer-8
  39949   39978 pts/1    00:00:00 changer-9
  39949   39979 pts/1    00:00:00 changer-10
  39949   39980 pts/1    00:00:00 changer-11
  39949   39981 pts/1    00:00:00 changer-12
  39949   39982 pts/1    00:00:00 changer-13
  39949   39983 pts/1    00:00:00 changer-14
  39949   39984 pts/1    00:00:00 changer-15
  39949   39985 pts/1    00:00:00 changer-16
  39949   39986 pts/1    00:00:00 changer-17
  39949   39987 pts/1    00:00:00 changer-18
found race, hang...

$ crash
crash> task -R prio,normal_prio,pi_waiters 39949
PID: 39949  TASK: ffff956a5c8b2f00  CPU: 2   COMMAND: "a.out"
  prio = 121,
  normal_prio = 139,
  pi_waiters = {
    rb_root = {
      rb_node = 0xffffb24941f43d40
    },
    rb_leftmost = 0xffffb24941f8bd40
  },

crash> print (struct rb_node *)0xffffb24941f43d40
$62 = (struct rb_node *) 0xffffb24941f43d40
crash> print *(struct rt_mutex_waiter *)((void *)$ - 24)
$63 = {
  tree_entry = {
    __rb_parent_color = 1,
    rb_right = 0x0,
    rb_left = 0x0
  },
  pi_tree_entry = {
    __rb_parent_color = 1,
    rb_right = 0xffffb24941f13d40,
    rb_left = 0xffffb24941f4bd40
  },
  task = 0xffff956a6c008000,
  lock = 0xffff956920b80970,
  prio = 130,
  deadline = 0
}
crash> print $62->rb_left
$64 = (struct rb_node *) 0xffffb24941f4bd40
crash> print *(struct rt_mutex_waiter *)((void *)$ - 24)
$65 = {
  tree_entry = {
    __rb_parent_color = 1,
    rb_right = 0x0,
    rb_left = 0x0
  },
  pi_tree_entry = {
    __rb_parent_color = 18446658626441723201,
    rb_right = 0xffffb24941f63d40,
    rb_left = 0xffffb24941f5bd40
  },
  task = 0xffff956a594baf00,
  lock = 0xffff956920b80190,
  prio = 126,
  deadline = 0
}
crash> print $62->rb_left->rb_left
$66 = (struct rb_node *) 0xffffb24941f5bd40
crash> print *(struct rt_mutex_waiter *)((void *)$ - 24)
$67 = {
  tree_entry = {
    __rb_parent_color = 1,
    rb_right = 0x0,
    rb_left = 0x0
  },
  pi_tree_entry = {
    __rb_parent_color = 18446658626441755968,
    rb_right = 0xffffb24941f73d40,
    rb_left = 0xffffb24941f2bd40
  },
  task = 0xffff956a64edaf00,
  lock = 0xffff956a8d481670,
  prio = 123,
  deadline = 0
}
crash> print $62->rb_left->rb_left->rb_left
$68 = (struct rb_node *) 0xffffb24941f2bd40
crash> print *(struct rt_mutex_waiter *)((void *)$ - 24)
$69 = {
  tree_entry = {
    __rb_parent_color = 1,
    rb_right = 0x0,
    rb_left = 0x0
  },
  pi_tree_entry = {
    __rb_parent_color = 18446658626441821505,
    rb_right = 0xffffb24940e47d40,
    rb_left = 0xffffb24941f8bd40
  },
  task = 0xffff956a62ae0000,
  lock = 0xffff956a76c5c250,
  prio = 120,
  deadline = 0
}
crash> print $62->rb_left->rb_left->rb_left->rb_left
$70 = (struct rb_node *) 0xffffb24941f8bd40
crash> print *(struct rt_mutex_waiter *)((void *)$ - 24)
$71 = {
  tree_entry = {
    __rb_parent_color = 1,
    rb_right = 0x0,
    rb_left = 0x0
  },
  pi_tree_entry = {
    __rb_parent_color = 18446658626441624896,
    rb_right = 0x0,
    rb_left = 0x0
  },
  task = 0xffff956a597dc680,
  lock = 0xffff956920b802b0,
  prio = 121,
  deadline = 0
}
crash>

From the last two rt_mutex_waiters we can see that the leftmost
waiter's prio is 121 while its parent's prio is 120; since the tree is
ordered with smaller prio values (higher priority) to the left, the
key order is violated.
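
(A note on the crash commands above: the "- 24" recovers the containing
rt_mutex_waiter from a pi_tree_entry pointer. pi_tree_entry is the second
member, directly after tree_entry, and one struct rb_node is 24 bytes on
x86-64, so the subtraction equals offsetof(struct rt_mutex_waiter,
pi_tree_entry). In kernel C this is just container_of(); a kernel-style
sketch follows, where pi_node_to_waiter() is a made-up helper name used
only for illustration:)

static struct rt_mutex_waiter *pi_node_to_waiter(struct rb_node *node)
{
	/* equivalent to: (struct rt_mutex_waiter *)((void *)node - 24) */
	return container_of(node, struct rt_mutex_waiter, pi_tree_entry);
}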

Sorry for the inconvenience if I made a mistake. Thanks!

Henry

[-- Attachment #2: pi.c --]
[-- Type: text/x-csrc, Size: 5424 bytes --]

#define _GNU_SOURCE

#include <assert.h>
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <sched.h>
#include <errno.h>
#include <sys/syscall.h>
#include <time.h>   // for time(), used to seed srandom()

#define PCOUNT 19

// Copied from related header.
struct sched_attr {
	uint32_t size;

	uint32_t sched_policy;
	uint64_t sched_flags;

	/* SCHED_NORMAL, SCHED_BATCH */
	int32_t sched_nice;

	/* SCHED_FIFO, SCHED_RR */
	uint32_t sched_priority;

	/* SCHED_DEADLINE */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;

	/* Utilization hints */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

static pthread_mutex_t mutexes[PCOUNT];

static void init_mutex(pthread_mutex_t *mutex) {
	pthread_mutexattr_t attr;
	int ret;

	ret = pthread_mutexattr_init(&attr);
	assert(!ret);
	ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	assert(!ret);

	ret = pthread_mutex_init(mutex, &attr);
	assert(!ret);

	pthread_mutexattr_destroy(&attr);
}

enum thread_type {
	WAITER,
	PRIO_CHANGER,
};

static struct thread_spec {
	enum thread_type type;
	int name_index;
	int nice;
	pthread_barrier_t *barrier;

	// Used by waiter.
	pthread_mutex_t *wait_mutex;
	pid_t pid_save;
	
	// Used by prio changer.
	pid_t *waiter_pid;
} thread_specs[PCOUNT * 2];

static pthread_t threads[PCOUNT * 2];
static int thread_count = 0;

static int barrier_wait(pthread_barrier_t *barrier) {
	if (!barrier) {
		return 0;
	}

	int ret = pthread_barrier_wait(barrier);
	assert(!ret || ret == PTHREAD_BARRIER_SERIAL_THREAD);

	return ret;
}

static int sched_getattr(pid_t pid, struct sched_attr *attr, int flags) {
	return syscall(SYS_sched_getattr, pid, attr, sizeof(*attr), flags);
}

static int sched_setattr(pid_t pid, struct sched_attr *attr, int flags) {
	return syscall(SYS_sched_setattr, pid, attr, flags);
}

static void *prio_change_loop(struct thread_spec *spec) {
	struct sched_attr attr;
	int ret;

	for (;;) {
		barrier_wait(spec->barrier);

		ret = sched_getattr(*spec->waiter_pid, &attr, 0);
		if (ret < 0) {
			perror("sched_getattr");
			assert(0);
		}
		
		attr.sched_nice = spec->nice;
	
		ret = sched_setattr(*spec->waiter_pid, &attr, 0);
		if (ret < 0) {
			perror("sched_setattr");
			exit(1);
		}

		barrier_wait(spec->barrier);
	}
}

static void *thread(void *arg) {
	struct thread_spec *spec = arg;
	char name[64];
	int ret;
	
	snprintf(name, sizeof(name), "%s-%d", 
			spec->type == WAITER ? "waiter" : "changer", 
			spec->name_index);
	ret = pthread_setname_np(pthread_self(), name);
	assert(!ret);

	switch (spec->type) {
	case WAITER:
		spec->pid_save = gettid();

		barrier_wait(spec->barrier);

		ret = nice(spec->nice);
		assert(ret >= 0);

		ret = pthread_mutex_lock(spec->wait_mutex);
		assert(!ret);
		break;
	case PRIO_CHANGER:
		prio_change_loop(spec);
		break;
	default:
		assert(0);
		break;
	}

	return NULL;
}

static void create_thread(struct thread_spec *spec) {
	int ret;

	ret = pthread_create(&threads[thread_count++], NULL, thread, spec);
	assert(!ret);
}

static int print_prio() {
	FILE *file = fopen("/proc/self/stat", "r");
	assert(file);

	char comm[64];
	int tmp, prio, ret;
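	// Field 18 of /proc/self/stat is the task's current (possibly
	// PI-boosted) priority; the main loop expects it to read 20 here
	// (kernel prio 120, reported as 120 - 100) once a nice-0 waiter
	// boosts the mutex owner.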
	ret = fscanf(file, "%d %s %c %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d",
			&tmp, comm, comm, &tmp, &tmp, &tmp, &tmp, &tmp, &tmp,
			&tmp, &tmp, &tmp, &tmp, &tmp, &tmp, &tmp, &tmp, &prio);
	assert(ret == 18);

	printf("prio: %d\n", prio);

	fclose(file);

	return prio;
}

static void print_threads(pid_t process_pid) {
	char cmd[128];
	snprintf(cmd, sizeof(cmd), "ps -L -p %d", process_pid);
	system(cmd);
}

int main(void) {
	pid_t pid;
	int ret;

	pid = getpid();
	printf("pid: %d\n", pid);

	for (int i = 0; i < sizeof(mutexes) / sizeof(mutexes[0]); ++i) {
		init_mutex(&mutexes[i]);
		ret = pthread_mutex_lock(&mutexes[i]);
		assert(!ret);
	}

	pthread_barrier_t barrier;
	ret = pthread_barrier_init(&barrier, NULL, PCOUNT + 1);
	assert(!ret);

	for (int i = 0; i < PCOUNT; ++i) {
		struct thread_spec *spec = &thread_specs[i];
		
		spec->type = WAITER;
		spec->name_index = i;
		spec->nice = 18;
		spec->wait_mutex = &mutexes[i];
		spec->barrier = &barrier;

		create_thread(spec);
	}

	// Wait for filling of pid_save.
	barrier_wait(&barrier);

	for (int i = 0; i < PCOUNT; ++i) {
		struct thread_spec *spec = &thread_specs[i + 19];

		spec->type = PRIO_CHANGER; 
		spec->name_index = i;
		spec->nice = 0;
		spec->barrier = &barrier;
		spec->waiter_pid = &thread_specs[i].pid_save;

		create_thread(spec);
	}

	nice(19);

	print_prio();
	print_threads(pid);

	//sleep(60);
	//printf("launch!\n");

	srandom(time(NULL));
	int iter = 0;
	for (;;) {
		++iter;
		for (int i = 0; i < PCOUNT; ++i) {
			thread_specs[i + 19].nice = (iter % 2) ? i : 18 - i;
		}

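		// Shuffle the nice values among the changers so waiter priorities
		// change in a different order each iteration, then release all
		// changers at the first barrier so their sched_setattr() calls
		// (and the resulting rt_mutex_adjust_prio_chain() walks) race with
		// each other; the second barrier waits for them to finish before
		// the boosted priority is sampled.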
		for (int i = 0; i < PCOUNT; ++i) {
			int pos = random() % PCOUNT;
			int tmp = thread_specs[pos + 19].nice;
			thread_specs[pos + 19].nice = thread_specs[19].nice;
			thread_specs[19].nice = tmp;
		}

		barrier_wait(&barrier);
		barrier_wait(&barrier);

		//system("echo iteration > /sys/kernel/debug/tracing/trace_marker");

		int prio, unexpected_times = 0;
		do {
			prio = print_prio();
			if (prio != 20) {
				++unexpected_times;
				if (unexpected_times > 10) {
					print_threads(pid);
					printf("found race, hang...\n");
					while (1) {
						sleep(3600);
					}
				}
			}
		} while (prio != 20);
	}

	for (int i = 0; i < thread_count; ++i) {
		pthread_join(threads[i], NULL);
	}

	return 0;
}



* Re: Possible race in rt_mutex_adjust_prio_chain
  2023-07-06  6:08 Possible race in rt_mutex_adjust_prio_chain Henry Wu
@ 2023-07-06 12:01 ` Peter Zijlstra
  2023-07-06 19:58   ` Peter Zijlstra
       [not found]   ` <CAG-UpRTQ-ovVO02HRd5z-9oZGCPvZ0vODJse8oyuA9L-3NV_pQ@mail.gmail.com>
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2023-07-06 12:01 UTC (permalink / raw)
  To: Henry Wu; +Cc: linux-kernel, Thomas Gleixner

On Thu, Jul 06, 2023 at 02:08:20PM +0800, Henry Wu wrote:
> Hi,
> 
> I found that it's not safe to change waiter->prio after the waiter is
> dequeued from the mutex's waiter rbtree, because it is still on the
> owner's pi_waiters rbtree. From my analysis, waiters on the pi_waiters
> rbtree should be protected by the pi_lock of the task that owns
> pi_waiters and by the corresponding rt_mutex's wait_lock, but
> pi_waiters is shared by all locks owned by this task, so in practice
> changes to pi_waiters are serialized only by pi_lock.
> 
> `rt_mutex_adjust_prio_chain' changes the key of a node on the
> pi_waiters rbtree without holding pi_lock, so the pi_waiters rbtree's
> invariant is violated. Meanwhile another CPU may be enqueueing a
> waiter, and the pi_waiters rbtree may end up corrupted.

Are you talking about [7];

Where we do waiter_update_prio() while only
holding [L] rtmutex->wait_lock.

VS

rt_mutex_adjust_prio() / task_top_pi_waiter() that accesses ->pi_waiters
while holding [P] task->pi_lock.

?

I'll go stare at that in more detail -- but I wanted to verify that's
what you're talking about.
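
(For context, the pre-fix reader side, reconstructed from the hunks later
in this thread; a sketch, not verbatim source:

static __always_inline void rt_mutex_adjust_prio(struct task_struct *p)
{
	struct task_struct *pi_task = NULL;

	lockdep_assert_held(&p->pi_lock);

	if (task_has_pi_waiters(p))
		pi_task = task_top_pi_waiter(p)->task;

	rt_mutex_setprio(p, pi_task);
}

task_top_pi_waiter() returns the leftmost entry of p->pi_waiters, so it
depends on that tree being correctly ordered.)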

> I attach a source file which can trigger this violation. I tested it
> on Ubuntu 20.04 LTS with 5.4 kernel.

Well, that's a horribly old kernel :-( Please double check on v6.4 and
consult that code for the discussion above -- I'm really not too
interested in debugging something ancient.


* Re: Possible race in rt_mutex_adjust_prio_chain
  2023-07-06 12:01 ` Peter Zijlstra
@ 2023-07-06 19:58   ` Peter Zijlstra
  2023-07-07  2:27     ` Henry Wu
       [not found]   ` <CAG-UpRTQ-ovVO02HRd5z-9oZGCPvZ0vODJse8oyuA9L-3NV_pQ@mail.gmail.com>
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2023-07-06 19:58 UTC (permalink / raw)
  To: Henry Wu; +Cc: linux-kernel, Thomas Gleixner

On Thu, Jul 06, 2023 at 02:01:03PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 06, 2023 at 02:08:20PM +0800, Henry Wu wrote:
> > Hi,
> > 
> > I found that it's not safe to change waiter->prio after the waiter is
> > dequeued from the mutex's waiter rbtree, because it is still on the
> > owner's pi_waiters rbtree. From my analysis, waiters on the pi_waiters
> > rbtree should be protected by the pi_lock of the task that owns
> > pi_waiters and by the corresponding rt_mutex's wait_lock, but
> > pi_waiters is shared by all locks owned by this task, so in practice
> > changes to pi_waiters are serialized only by pi_lock.
> > 
> > `rt_mutex_adjust_prio_chain' changes the key of a node on the
> > pi_waiters rbtree without holding pi_lock, so the pi_waiters rbtree's
> > invariant is violated. Meanwhile another CPU may be enqueueing a
> > waiter, and the pi_waiters rbtree may end up corrupted.
> 
> Are you talking about [7];
> 
> Where we do waiter_update_prio() while only
> holding [L] rtmutex->wait_lock.
> 
> VS
> 
> rt_mutex_adjust_prio() / task_top_pi_waiter() that accesses ->pi_waiters
> while holding [P] task->pi_lock.
> 
> ?
> 
> I'll go stare at that in more detail -- but I wanted to verify that's
> what you're talking about.

Current notes...

We hold [L] from 5-13; we hold [P1] 1-8 and [P2] 10-13

  P1 - blocked task
  P2 - lock owner

  7  holds [L]+[P1]
     modifies the values, which, temporarily, messes up the pi_waiters tree

  11 holds [L]+[P2]
     requeues the waiter on pi_waiter, restoring pi_waiters tree

pi_waiters is modified by:

  - rt_mutex_{en,de}queue_pi(); which are used:

   - [11] (holds [L]+[P2])
   - try_to_take_rt_mutex() [L]+[P3] ?!? (P3 will be owner,
					  but is not yet)
   - task_blocks_on_rt_mutex() [L]+[P2]
   - mark_wakeup_next_waiter() [L]+[P2] (current is owner,
					 gives up lock)
   - remove_waiter() [L]+[P2]

pi_waiters is used by:

  - task_top_pi_waiter(), asserts [P], this is used:

    - rt_mutex_adjust_prio(), which asserts [P2] (*), is used:

      - [11] (holds [L])
      - task_blocks_on_rt_mutex() (holds [L])
      - mark_wakeup_next_waiter() (holds [L])
      - remove_waiter() (holds [L])

      (*)(see patch below -- adding more assertions)

    - [3] -- does *NOT* hold [L], does hold [P]


Now, [3] doesn't hold [L], but since 'all' modifications are done under
[L]+[P], just holding [P] should provide a stable read. Except [7], that
messes up the tree while not holding (the right!) [P].
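
For context, the step [7] region of the pre-fix rt_mutex_adjust_prio_chain()
looks roughly like this (a sketch reconstructed from the hunks below, not
verbatim source):

	/* [7] Requeue the waiter in the lock waiter tree. */
	rt_mutex_dequeue(lock, waiter);

	/*
	 * Update the waiter prio fields now that we're dequeued.
	 * ...
	 * Even though pi_waiters also uses these fields, and that tree is only
	 * updated in [11], we can do this here, since we hold [L], which
	 * serializes all pi_waiters access and rb_erase() does not care about
	 * the values of the node being removed.
	 */
	waiter_update_prio(waiter, task);

	rt_mutex_enqueue(lock, waiter);

Only [L] and the blocked task's [P1] are held here, yet waiter_update_prio()
rewrites the key that the owner's pi_waiters tree is sorted by.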

It's late, I'll have to continue staring at this tomorrow!


---
 kernel/locking/rtmutex.c        | 17 +++++++++++------
 kernel/locking/rtmutex_common.h |  1 +
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 728f434de2bb..ff64054fc24c 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -472,10 +472,13 @@ rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
 	RB_CLEAR_NODE(&waiter->pi_tree_entry);
 }
 
-static __always_inline void rt_mutex_adjust_prio(struct task_struct *p)
+static __always_inline void rt_mutex_adjust_prio(struct rt_mutex_base *lock,
+						 struct task_struct *p)
 {
 	struct task_struct *pi_task = NULL;
 
+	lockdep_assert_held(&lock->wait_lock);
+	lockdep_assert(rt_mutex_owner(lock) == p);
 	lockdep_assert_held(&p->pi_lock);
 
 	if (task_has_pi_waiters(p))
@@ -922,7 +925,7 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 		 */
 		rt_mutex_dequeue_pi(task, prerequeue_top_waiter);
 		rt_mutex_enqueue_pi(task, waiter);
-		rt_mutex_adjust_prio(task);
+		rt_mutex_adjust_prio(lock, task);
 
 	} else if (prerequeue_top_waiter == waiter) {
 		/*
@@ -938,7 +941,7 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 		rt_mutex_dequeue_pi(task, waiter);
 		waiter = rt_mutex_top_waiter(lock);
 		rt_mutex_enqueue_pi(task, waiter);
-		rt_mutex_adjust_prio(task);
+		rt_mutex_adjust_prio(lock, task);
 	} else {
 		/*
 		 * Nothing changed. No need to do any priority
@@ -1187,7 +1190,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
 		rt_mutex_dequeue_pi(owner, top_waiter);
 		rt_mutex_enqueue_pi(owner, waiter);
 
-		rt_mutex_adjust_prio(owner);
+		rt_mutex_adjust_prio(lock, owner);
 		if (owner->pi_blocked_on)
 			chain_walk = 1;
 	} else if (rt_mutex_cond_detect_deadlock(waiter, chwalk)) {
@@ -1234,6 +1237,8 @@ static void __sched mark_wakeup_next_waiter(struct rt_wake_q_head *wqh,
 {
 	struct rt_mutex_waiter *waiter;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	raw_spin_lock(&current->pi_lock);
 
 	waiter = rt_mutex_top_waiter(lock);
@@ -1246,7 +1251,7 @@ static void __sched mark_wakeup_next_waiter(struct rt_wake_q_head *wqh,
 	 * task unblocks.
 	 */
 	rt_mutex_dequeue_pi(current, waiter);
-	rt_mutex_adjust_prio(current);
+	rt_mutex_adjust_prio(lock, current);
 
 	/*
 	 * As we are waking up the top waiter, and the waiter stays
@@ -1482,7 +1487,7 @@ static void __sched remove_waiter(struct rt_mutex_base *lock,
 	if (rt_mutex_has_waiters(lock))
 		rt_mutex_enqueue_pi(owner, rt_mutex_top_waiter(lock));
 
-	rt_mutex_adjust_prio(owner);
+	rt_mutex_adjust_prio(lock, owner);
 
 	/* Store the lock on which owner is blocked or NULL */
 	next_lock = task_blocked_on_lock(owner);
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index c47e8361bfb5..a71cd0f2eea9 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -127,6 +127,7 @@ static inline int task_has_pi_waiters(struct task_struct *p)
 
 static inline struct rt_mutex_waiter *task_top_pi_waiter(struct task_struct *p)
 {
+	lockdep_assert_held(&p->pi_lock);
 	return rb_entry(p->pi_waiters.rb_leftmost, struct rt_mutex_waiter,
 			pi_tree_entry);
 }


* Fwd: Possible race in rt_mutex_adjust_prio_chain
       [not found]   ` <CAG-UpRTQ-ovVO02HRd5z-9oZGCPvZ0vODJse8oyuA9L-3NV_pQ@mail.gmail.com>
@ 2023-07-07  2:09     ` Henry Wu
  2023-07-07 12:59       ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Henry Wu @ 2023-07-07  2:09 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7482 bytes --]

I forgot to CC linux-kernel, sorry for the duplicate mail.

Hi, Peter.

Peter Zijlstra <peterz@infradead.org> wrote on Thu, Jul 6, 2023 at 20:01:
>
> On Thu, Jul 06, 2023 at 02:08:20PM +0800, Henry Wu wrote:
> > Hi,
> >
> > I found that it's not safe to change waiter->prio after the waiter is
> > dequeued from the mutex's waiter rbtree, because it is still on the
> > owner's pi_waiters rbtree. From my analysis, waiters on the pi_waiters
> > rbtree should be protected by the pi_lock of the task that owns
> > pi_waiters and by the corresponding rt_mutex's wait_lock, but
> > pi_waiters is shared by all locks owned by this task, so in practice
> > changes to pi_waiters are serialized only by pi_lock.
> >
> > `rt_mutex_adjust_prio_chain' changes the key of a node on the
> > pi_waiters rbtree without holding pi_lock, so the pi_waiters rbtree's
> > invariant is violated. Meanwhile another CPU may be enqueueing a
> > waiter, and the pi_waiters rbtree may end up corrupted.
>
> Are you talking about [7];
>
> Where we do waiter_update_prio() while only
> holding [L] rtmutex->wait_lock.
>
> VS
>
> rt_mutex_adjust_prio() / task_top_pi_waiter() that accesses ->pi_waiters
> while holding [P] task->pi_lock.
>
> ?
>
> I'll go stare at that in more detail -- but I wanted to verify that's
> what you're talking about.
>

I referred to step [7]. I checked every call site of rt_mutex_adjust_prio()
and all of them are protected by the right pi_lock.

Imagine a scenario where we have two rt_mutexes (M1, M2) and three
threads (A, B, C). Both rt_mutexes are owned by thread A. B is blocked
acquiring M1 and C is blocked acquiring M2. We use sched_setattr() to
change the priorities of B and C.

CPU0                              CPU1
.........                         ..........
                                  rt_mutex_adjust_prio_chain(C)
rt_mutex_adjust_prio_chain(B)
......
[7] update waiter->prio
(Now A's pi_waiters rbtree is
 corrupted temporarily)
......                            [11] enqueue operation may select
                                  insert position according to corrupted
                                  waiter node CPU0 created.
[11] Even though we fixed
corrupted waiter now, we
are not sure about pi_waiters's
sanity because other cpu may
create new invariant violation
based on our violation.

> > I attach a source file which can trigger this violation. I tested it
> > on Ubuntu 20.04 LTS with 5.4 kernel.
>
> Well, that's a horribly old kernel :-( Please double check on v6.4 and
> consult that code for the discussion above -- I'm really not too
> interested in debugging something ancient.

I revised my code to make it work on 6.4.2 and fixed one logic error.
I tested it on Fedora 38 with kernel 6.4.2-858.vanilla.fc38.x86_64.
You can find the new code in the attachment.

$ sudo ./a.out
.........................
prio: -21
prio: -21
prio: -21
prio: -21
prio: -21
prio: -21
prio: -20
prio: -20
prio: -20
prio: -20
prio: -20
prio: -20
prio: -20
prio: -20
prio: -20
prio: -20
prio: -20
    PID     LWP TTY          TIME CMD
   4564    4564 pts/0    00:00:01 a.out
   4564    4565 pts/0    00:00:00 waiter-0
   4564    4566 pts/0    00:00:00 waiter-1
   4564    4567 pts/0    00:00:00 waiter-2
   4564    4568 pts/0    00:00:00 waiter-3
   4564    4569 pts/0    00:00:00 waiter-4
   4564    4570 pts/0    00:00:00 waiter-5
   4564    4571 pts/0    00:00:00 waiter-6
   4564    4572 pts/0    00:00:00 waiter-7
   4564    4573 pts/0    00:00:00 waiter-8
   4564    4574 pts/0    00:00:00 waiter-9
   4564    4575 pts/0    00:00:00 waiter-10
   4564    4576 pts/0    00:00:00 waiter-11
   4564    4577 pts/0    00:00:00 waiter-12
   4564    4578 pts/0    00:00:00 waiter-13
   4564    4579 pts/0    00:00:00 waiter-14
   4564    4580 pts/0    00:00:00 waiter-15
   4564    4581 pts/0    00:00:00 waiter-16
   4564    4582 pts/0    00:00:00 waiter-17
   4564    4583 pts/0    00:00:00 waiter-18
   4564    4584 pts/0    00:00:00 waiter-19
   4564    4585 pts/0    00:00:00 changer-0
   4564    4586 pts/0    00:00:00 changer-1
   4564    4587 pts/0    00:00:00 changer-2
   4564    4588 pts/0    00:00:00 changer-3
   4564    4589 pts/0    00:00:00 changer-4
   4564    4590 pts/0    00:00:00 changer-5
   4564    4591 pts/0    00:00:00 changer-6
   4564    4592 pts/0    00:00:00 changer-7
   4564    4593 pts/0    00:00:00 changer-8
   4564    4594 pts/0    00:00:00 changer-9
   4564    4595 pts/0    00:00:00 changer-10
   4564    4596 pts/0    00:00:00 changer-11
   4564    4597 pts/0    00:00:00 changer-12
   4564    4598 pts/0    00:00:00 changer-13
   4564    4599 pts/0    00:00:00 changer-14
   4564    4600 pts/0    00:00:00 changer-15
   4564    4601 pts/0    00:00:00 changer-16
   4564    4602 pts/0    00:00:00 changer-17
   4564    4603 pts/0    00:00:00 changer-18
   4564    4604 pts/0    00:00:00 changer-19
found race, hang...

$ sudo crash --no_module
.....................
crash> task -R prio,normal_prio,rt_priority,pi_waiters 4564
PID: 4564     TASK: ffff9b7c8480a8c0  CPU: 3    COMMAND: "a.out"
  prio = 80,
  normal_prio = 120,
  rt_priority = 0,
  pi_waiters = {
    rb_root = {
      rb_node = 0xffffb5bad2ddfcf8
    },
    rb_leftmost = 0xffffb5bad2da7d98
  },

crash> print (struct rb_node *)0xffffb5bad2ddfcf8
$1 = (struct rb_node *) 0xffffb5bad2ddfcf8
crash> print *(struct rt_mutex_waiter *)((void *)$ - 24)
$2 = {
  tree_entry = {
    __rb_parent_color = 1,
    rb_right = 0x0,
    rb_left = 0x0
  },
  pi_tree_entry = {
    __rb_parent_color = 1,
    rb_right = 0xffffb5bad2df7d28,
    rb_left = 0xffffb5bad2dafd68
  },
  task = 0xffff9b7c80388000,
  lock = 0xffff9b7caa2cceb0,
  wake_state = 3,
  prio = 89,
  deadline = 0,
  ww_ctx = 0x0
}
crash> print $1->rb_left
$3 = (struct rb_node *) 0xffffb5bad2dafd68
crash> print *(struct rt_mutex_waiter *)((void *)$ - 24)
$4 = {
  tree_entry = {
    __rb_parent_color = 1,
    rb_right = 0x0,
    rb_left = 0x0
  },
  pi_tree_entry = {
    __rb_parent_color = 18446662412739149049,
    rb_right = 0xffffb5bad2defdb8,
    rb_left = 0xffffb5bad2dbfd30
  },
  task = 0xffff9b7d2bca28c0,
  lock = 0xffff9b7d004a2970,
  wake_state = 3,
  prio = 83,
  deadline = 0,
  ww_ctx = 0x0
}
crash> print $1->rb_left->rb_left
$5 = (struct rb_node *) 0xffffb5bad2dbfd30
crash> print *(struct rt_mutex_waiter *)((void *)$ - 24)
$6 = {
  tree_entry = {
    __rb_parent_color = 1,
    rb_right = 0x0,
    rb_left = 0x0
  },
  pi_tree_entry = {
    __rb_parent_color = 18446662412738952552,
    rb_right = 0xffffb5bad2d87d18,
    rb_left = 0xffffb5bad2da7d98
  },
  task = 0xffff9b7cfd55a8c0,
  lock = 0xffff9b7caa2cc6d0,
  wake_state = 3,
  prio = 79,
  deadline = 0,
  ww_ctx = 0x0
}
crash> print $1->rb_left->rb_left->rb_left
$7 = (struct rb_node *) 0xffffb5bad2da7d98
crash> print *(struct rt_mutex_waiter *)((void *)$ - 24)
$8 = {
  tree_entry = {
    __rb_parent_color = 1,
    rb_right = 0x0,
    rb_left = 0x0
  },
  pi_tree_entry = {
    __rb_parent_color = 18446662412739018033,
    rb_right = 0x0,
    rb_left = 0x0
  },
  task = 0xffff9b7c80bd0000,
  lock = 0xffff9b7caa2ccb50,
  wake_state = 3,
  prio = 80,
  deadline = 0,
  ww_ctx = 0x0
}
crash>

The key-order invariant of pi_waiters has been violated by the last two
waiters above: the leftmost node has prio 80 but sits to the left of
its parent, which has prio 79.

Thanks.

Henry

[-- Attachment #2: pi_642.c --]
[-- Type: text/x-c-code, Size: 5389 bytes --]

#define _GNU_SOURCE

#include <assert.h>
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <sched.h>
#include <errno.h>
#include <sys/syscall.h>
#include <time.h>   // for time(), used to seed srandom()

#define PCOUNT 20
#define EXPECTED_PRIO -21

// Copied from related header.
struct sched_attr {
	uint32_t size;

	uint32_t sched_policy;
	uint64_t sched_flags;

	/* SCHED_NORMAL, SCHED_BATCH */
	int32_t sched_nice;

	/* SCHED_FIFO, SCHED_RR */
	uint32_t sched_priority;

	/* SCHED_DEADLINE */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;

	/* Utilization hints */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

static pthread_mutex_t mutexes[PCOUNT];

static void init_mutex(pthread_mutex_t *mutex) {
	pthread_mutexattr_t attr;
	int ret;

	ret = pthread_mutexattr_init(&attr);
	assert(!ret);
	ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	assert(!ret);

	ret = pthread_mutex_init(mutex, &attr);
	assert(!ret);

	pthread_mutexattr_destroy(&attr);
}

enum thread_type {
	WAITER,
	PRIO_CHANGER,
};

static struct thread_spec {
	enum thread_type type;
	int name_index;
	pthread_barrier_t *barrier;

	// Used by waiter.
	pthread_mutex_t *wait_mutex;
	pid_t pid_save;
	
	// Used by prio changer.
	pid_t *waiter_pid;
	uint32_t prio;
} thread_specs[PCOUNT * 2];

static pthread_t threads[PCOUNT * 2];
static int thread_count = 0;

static int barrier_wait(pthread_barrier_t *barrier) {
	if (!barrier) {
		return 0;
	}

	int ret = pthread_barrier_wait(barrier);
	assert(!ret || ret == PTHREAD_BARRIER_SERIAL_THREAD);

	return ret;
}

static int sched_getattr(pid_t pid, struct sched_attr *attr, int flags) {
	return syscall(SYS_sched_getattr, pid, attr, sizeof(*attr), flags);
}

static int sched_setattr(pid_t pid, struct sched_attr *attr, int flags) {
	return syscall(SYS_sched_setattr, pid, attr, flags);
}

static void *prio_change_loop(struct thread_spec *spec) {
	struct sched_attr attr;
	int ret;

	for (;;) {
		barrier_wait(spec->barrier);

		ret = sched_getattr(*spec->waiter_pid, &attr, 0);
		if (ret < 0) {
			perror("sched_getattr");
			assert(0);
		}
		
		attr.sched_policy = SCHED_RR;
		attr.sched_priority = spec->prio;
	
		ret = sched_setattr(*spec->waiter_pid, &attr, 0);
		if (ret < 0) {
			perror("sched_setattr");
			exit(1);
		}

		barrier_wait(spec->barrier);
	}
}

static void *thread(void *arg) {
	struct thread_spec *spec = arg;
	char name[64];
	int ret;
	
	snprintf(name, sizeof(name), "%s-%d", 
			spec->type == WAITER ? "waiter" : "changer", 
			spec->name_index);
	ret = pthread_setname_np(pthread_self(), name);
	assert(!ret);

	switch (spec->type) {
	case WAITER:
		spec->pid_save = gettid();

		barrier_wait(spec->barrier);

		ret = pthread_mutex_lock(spec->wait_mutex);
		assert(!ret);
		break;
	case PRIO_CHANGER:
		prio_change_loop(spec);
		break;
	default:
		assert(0);
		break;
	}

	return NULL;
}

static void create_thread(struct thread_spec *spec) {
	int ret;

	ret = pthread_create(&threads[thread_count++], NULL, thread, spec);
	assert(!ret);
}

static int print_prio() {
	FILE *file = fopen("/proc/self/stat", "r");
	assert(file);

	char comm[64];
	int tmp, prio, ret;
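	// Field 18 of /proc/self/stat is the task's current (possibly
	// PI-boosted) priority; with a SCHED_RR priority-20 top waiter the
	// boosted owner should read EXPECTED_PRIO (-21), i.e. kernel prio
	// 79 reported as 79 - 100.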
	ret = fscanf(file, "%d %s %c %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d",
			&tmp, comm, comm, &tmp, &tmp, &tmp, &tmp, &tmp, &tmp,
			&tmp, &tmp, &tmp, &tmp, &tmp, &tmp, &tmp, &tmp, &prio);
	assert(ret == 18);

	printf("prio: %d\n", prio);

	fclose(file);

	return prio;
}

static void print_threads(pid_t process_pid) {
	char cmd[128];
	snprintf(cmd, sizeof(cmd), "ps -L -p %d", process_pid);
	system(cmd);
}

int main(void) {
	pid_t pid;
	int ret;

	pid = getpid();
	printf("pid: %d\n", pid);

	for (int i = 0; i < sizeof(mutexes) / sizeof(mutexes[0]); ++i) {
		init_mutex(&mutexes[i]);
		ret = pthread_mutex_lock(&mutexes[i]);
		assert(!ret);
	}

	pthread_barrier_t barrier;
	ret = pthread_barrier_init(&barrier, NULL, PCOUNT + 1);
	assert(!ret);

	for (int i = 0; i < PCOUNT; ++i) {
		struct thread_spec *spec = &thread_specs[i];
		
		spec->type = WAITER;
		spec->name_index = i;
		spec->wait_mutex = &mutexes[i];
		spec->barrier = &barrier;

		create_thread(spec);
	}

	// Wait for filling of pid_save.
	barrier_wait(&barrier);

	for (int i = 0; i < PCOUNT; ++i) {
		struct thread_spec *spec = &thread_specs[i + PCOUNT];

		spec->type = PRIO_CHANGER; 
		spec->name_index = i;
		spec->prio = 99;
		spec->barrier = &barrier;
		spec->waiter_pid = &thread_specs[i].pid_save;

		create_thread(spec);
	}

	print_prio();
	print_threads(pid);

	srandom(time(NULL));
	int iter = 0;
	for (;;) {
		++iter;
		for (int i = 0; i < PCOUNT; ++i) {
			thread_specs[i + PCOUNT].prio = (iter % 2) ? i + 1 : PCOUNT - i;
		}

		for (int i = 0; i < PCOUNT; ++i) {
			int pos = random() % PCOUNT;
			int tmp = thread_specs[pos + PCOUNT].prio;
			thread_specs[pos + PCOUNT].prio = thread_specs[PCOUNT].prio;
			thread_specs[PCOUNT].prio = tmp;
		}

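		// Release all changers at once so their sched_setattr() calls (and
		// the resulting chain walks) race with each other, then wait for
		// them to finish before sampling the owner's boosted priority.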
		barrier_wait(&barrier);
		barrier_wait(&barrier);

		//system("echo iteration > /sys/kernel/debug/tracing/trace_marker");

		for (int try = 0; ; ++try) {
			int prio = print_prio();
			if (prio == EXPECTED_PRIO) {
				// Try a new iteration.
				break;
			}

			if (try >= 10) {
				print_threads(pid);
				printf("found race, hang...\n");
				while (1) {
					sleep(3600);
				}
			}
		}
	}

	for (int i = 0; i < thread_count; ++i) {
		pthread_join(threads[i], NULL);
	}

	return 0;
}



* Re: Possible race in rt_mutex_adjust_prio_chain
  2023-07-06 19:58   ` Peter Zijlstra
@ 2023-07-07  2:27     ` Henry Wu
  0 siblings, 0 replies; 11+ messages in thread
From: Henry Wu @ 2023-07-07  2:27 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel

Hi, Peter.

Peter Zijlstra <peterz@infradead.org> wrote on Fri, Jul 7, 2023 at 03:58:
>
> Current notes...
>
> We hold [L] from 5-13; we hold [P1] 1-8 and [P2] 10-13
>
>   P1 - blocked task
>   P2 - lock owner
>
>   7  holds [L]+[P1]
>      modifies the values, which, temporarily, messes up the pi_waiters tree
>
>   11 holds [L]+[P2]
>      requeues the waiter on pi_waiter, restoring pi_waiters tree
>
> pi_waiters is modified by:
>
>   - rt_mutex_{en,de}queue_pi(); which are used:
>
>    - [11] (holds [L]+[P2])
>    - try_to_take_rt_mutex() [L]+[P3] ?!? (P3 will be owner,
>                                           but is not yet)
>    - task_blocks_on_rt_mutex() [L]+[P2]
>    - mark_wakeup_next_waiter() [L]+[P2] (current is owner,
>                                          gives up lock)
>    - remove_waiter() [L]+[P2]
>
> pi_waiters is used by:
>
>   - task_top_pi_waiter(), asserts [P], this is used:
>
>     - rt_mutex_adjust_prio(), which asserts [P2] (*), is used:
>
>       - [11] (holds [L])
>       - task_blocks_on_rt_mutex() (holds [L])
>       - mark_wakeup_next_waiter() (holds [L])
>       - remove_waiter() (holds [L])
>
>       (*)(see patch below -- adding more assertions)
>
>     - [3] -- does *NOT* hold [L], does hold [P]
>
>
> Now, [3] doesn't hold [L], but since 'all' modifications are done under
> [L]+[P], just holding [P] should provide a stable read. Except [7], that
> messes up the tree while not holding (the right!) [P].
>
> It's late, I'll have to continue staring at this tomorrow!
>

Your analysis is correct! Let me quote part of my previous reply in the
correct format:

CPU0                              CPU1
.........                         ..........
                                  rt_mutex_adjust_prio_chain(C)
rt_mutex_adjust_prio_chain(B)
......
[7] update waiter->prio
(Now A's pi_waiters rbtree is
 corrupted temporarily)
......                            [11] enqueue operation may select
                                  insert position according to corrupted
                                  waiter node CPU0 created.
[11] Even though we fixed
corrupted waiter now, we
are not sure about pi_waiters's
sanity because other cpu may
create new invariant violation
based on our violation.


* Re: Fwd: Possible race in rt_mutex_adjust_prio_chain
  2023-07-07  2:09     ` Fwd: " Henry Wu
@ 2023-07-07 12:59       ` Peter Zijlstra
  2023-07-07 15:39         ` Mike Galbraith
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2023-07-07 12:59 UTC (permalink / raw)
  To: Henry Wu; +Cc: linux-kernel, Thomas Gleixner

On Fri, Jul 07, 2023 at 10:09:04AM +0800, Henry Wu wrote:

> Imagine a scenario where we have two rt_mutexes (M1, M2) and three
> threads (A, B, C). Both rt_mutexes are owned by thread A. B is blocked
> acquiring M1 and C is blocked acquiring M2. We use sched_setattr() to
> change the priorities of B and C.

      A
     / \
    M1  M2
    |   |
    B   C

And in that case L will be two different locks, which I overlooked
yesterday. So L can't save the day.

This means I either need to take P2 earlier in overlap with P1 -- which
is tricky at best -- or duplicate the data to retain consistency.

Duplicating the data would still leave the logical race condition in
that a concurrent observer will not observe the waiter in the 'right'
location, however it should be harmless since both will continue to
propagate their respective state.

The below implements this duplication and seems to not insta-crash.

I'll endeavour to write a few comments and a Changelog to go with it.

---
 kernel/locking/rtmutex.c        | 93 ++++++++++++++++++++++++-----------------
 kernel/locking/rtmutex_api.c    |  2 +-
 kernel/locking/rtmutex_common.h | 34 +++++++++------
 3 files changed, 76 insertions(+), 53 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 728f434de2bb..3af6772cebb6 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -336,18 +336,27 @@ static __always_inline int __waiter_prio(struct task_struct *task)
 static __always_inline void
 waiter_update_prio(struct rt_mutex_waiter *waiter, struct task_struct *task)
 {
-	waiter->prio = __waiter_prio(task);
-	waiter->deadline = task->dl.deadline;
+	waiter->tree.prio = __waiter_prio(task);
+	waiter->tree.deadline = task->dl.deadline;
+}
+
+static __always_inline void
+waiter_clone_prio(struct rt_mutex_waiter *waiter)
+{
+	waiter->pi_tree.prio = waiter->tree.prio;
+	waiter->pi_tree.deadline = waiter->tree.deadline;
 }
 
 /*
- * Only use with rt_mutex_waiter_{less,equal}()
+ * Only use with rt_waiter_node_{less,equal}()
  */
+#define task_to_waiter_node(p) \
+	&(struct rt_waiter_node){ .prio = __waiter_prio(p), .deadline = (p)->dl.deadline }
 #define task_to_waiter(p)	\
-	&(struct rt_mutex_waiter){ .prio = __waiter_prio(p), .deadline = (p)->dl.deadline }
+	&(struct rt_mutex_waiter){ .tree = *task_to_waiter_node(p) }
 
-static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
-						struct rt_mutex_waiter *right)
+static __always_inline int rt_waiter_node_less(struct rt_waiter_node *left,
+					       struct rt_waiter_node *right)
 {
 	if (left->prio < right->prio)
 		return 1;
@@ -364,8 +373,8 @@ static __always_inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left,
 	return 0;
 }
 
-static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
-						 struct rt_mutex_waiter *right)
+static __always_inline int rt_waiter_node_equal(struct rt_waiter_node *left,
+						 struct rt_waiter_node *right)
 {
 	if (left->prio != right->prio)
 		return 0;
@@ -385,7 +394,7 @@ static __always_inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
 static inline bool rt_mutex_steal(struct rt_mutex_waiter *waiter,
 				  struct rt_mutex_waiter *top_waiter)
 {
-	if (rt_mutex_waiter_less(waiter, top_waiter))
+	if (rt_waiter_node_less(&waiter->tree, &top_waiter->tree))
 		return true;
 
 #ifdef RT_MUTEX_BUILD_SPINLOCKS
@@ -393,30 +402,30 @@ static inline bool rt_mutex_steal(struct rt_mutex_waiter *waiter,
 	 * Note that RT tasks are excluded from same priority (lateral)
 	 * steals to prevent the introduction of an unbounded latency.
 	 */
-	if (rt_prio(waiter->prio) || dl_prio(waiter->prio))
+	if (rt_prio(waiter->tree.prio) || dl_prio(waiter->tree.prio))
 		return false;
 
-	return rt_mutex_waiter_equal(waiter, top_waiter);
+	return rt_waiter_node_equal(&waiter->tree, &top_waiter->tree);
 #else
 	return false;
 #endif
 }
 
 #define __node_2_waiter(node) \
-	rb_entry((node), struct rt_mutex_waiter, tree_entry)
+	rb_entry((node), struct rt_mutex_waiter, tree.entry)
 
 static __always_inline bool __waiter_less(struct rb_node *a, const struct rb_node *b)
 {
 	struct rt_mutex_waiter *aw = __node_2_waiter(a);
 	struct rt_mutex_waiter *bw = __node_2_waiter(b);
 
-	if (rt_mutex_waiter_less(aw, bw))
+	if (rt_waiter_node_less(&aw->tree, &bw->tree))
 		return 1;
 
 	if (!build_ww_mutex())
 		return 0;
 
-	if (rt_mutex_waiter_less(bw, aw))
+	if (rt_waiter_node_less(&bw->tree, &aw->tree))
 		return 0;
 
 	/* NOTE: relies on waiter->ww_ctx being set before insertion */
@@ -434,48 +443,54 @@ static __always_inline bool __waiter_less(struct rb_node *a, const struct rb_nod
 static __always_inline void
 rt_mutex_enqueue(struct rt_mutex_base *lock, struct rt_mutex_waiter *waiter)
 {
-	rb_add_cached(&waiter->tree_entry, &lock->waiters, __waiter_less);
+	rb_add_cached(&waiter->tree.entry, &lock->waiters, __waiter_less);
 }
 
 static __always_inline void
 rt_mutex_dequeue(struct rt_mutex_base *lock, struct rt_mutex_waiter *waiter)
 {
-	if (RB_EMPTY_NODE(&waiter->tree_entry))
+	if (RB_EMPTY_NODE(&waiter->tree.entry))
 		return;
 
-	rb_erase_cached(&waiter->tree_entry, &lock->waiters);
-	RB_CLEAR_NODE(&waiter->tree_entry);
+	rb_erase_cached(&waiter->tree.entry, &lock->waiters);
+	RB_CLEAR_NODE(&waiter->tree.entry);
 }
 
-#define __node_2_pi_waiter(node) \
-	rb_entry((node), struct rt_mutex_waiter, pi_tree_entry)
+#define __node_2_rt_node(node) \
+	rb_entry((node), struct rt_waiter_node, entry)
 
-static __always_inline bool
-__pi_waiter_less(struct rb_node *a, const struct rb_node *b)
+static __always_inline bool __pi_waiter_less(struct rb_node *a, const struct rb_node *b)
 {
-	return rt_mutex_waiter_less(__node_2_pi_waiter(a), __node_2_pi_waiter(b));
+	return rt_waiter_node_less(__node_2_rt_node(a), __node_2_rt_node(b));
 }
 
 static __always_inline void
 rt_mutex_enqueue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
 {
-	rb_add_cached(&waiter->pi_tree_entry, &task->pi_waiters, __pi_waiter_less);
+	lockdep_assert_held(&task->pi_lock);
+
+	rb_add_cached(&waiter->pi_tree.entry, &task->pi_waiters, __pi_waiter_less);
 }
 
 static __always_inline void
 rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
 {
-	if (RB_EMPTY_NODE(&waiter->pi_tree_entry))
+	lockdep_assert_held(&task->pi_lock);
+
+	if (RB_EMPTY_NODE(&waiter->pi_tree.entry))
 		return;
 
-	rb_erase_cached(&waiter->pi_tree_entry, &task->pi_waiters);
-	RB_CLEAR_NODE(&waiter->pi_tree_entry);
+	rb_erase_cached(&waiter->pi_tree.entry, &task->pi_waiters);
+	RB_CLEAR_NODE(&waiter->pi_tree.entry);
 }
 
-static __always_inline void rt_mutex_adjust_prio(struct task_struct *p)
+static __always_inline void rt_mutex_adjust_prio(struct rt_mutex_base *lock,
+						 struct task_struct *p)
 {
 	struct task_struct *pi_task = NULL;
 
+	lockdep_assert_held(&lock->wait_lock);
+	lockdep_assert(rt_mutex_owner(lock) == p);
 	lockdep_assert_held(&p->pi_lock);
 
 	if (task_has_pi_waiters(p))
@@ -756,7 +771,7 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * enabled we continue, but stop the requeueing in the chain
 	 * walk.
 	 */
-	if (rt_mutex_waiter_equal(waiter, task_to_waiter(task))) {
+	if (rt_waiter_node_equal(&waiter->tree, task_to_waiter_node(task))) {
 		if (!detect_deadlock)
 			goto out_unlock_pi;
 		else
@@ -874,11 +889,6 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * or
 	 *
 	 *   DL CBS enforcement advancing the effective deadline.
-	 *
-	 * Even though pi_waiters also uses these fields, and that tree is only
-	 * updated in [11], we can do this here, since we hold [L], which
-	 * serializes all pi_waiters access and rb_erase() does not care about
-	 * the values of the node being removed.
 	 */
 	waiter_update_prio(waiter, task);
 
@@ -921,8 +931,9 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 		 * and adjust the priority of the owner.
 		 */
 		rt_mutex_dequeue_pi(task, prerequeue_top_waiter);
+		waiter_clone_prio(waiter);
 		rt_mutex_enqueue_pi(task, waiter);
-		rt_mutex_adjust_prio(task);
+		rt_mutex_adjust_prio(lock, task);
 
 	} else if (prerequeue_top_waiter == waiter) {
 		/*
@@ -937,8 +948,9 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
 		 */
 		rt_mutex_dequeue_pi(task, waiter);
 		waiter = rt_mutex_top_waiter(lock);
+		waiter_clone_prio(waiter);
 		rt_mutex_enqueue_pi(task, waiter);
-		rt_mutex_adjust_prio(task);
+		rt_mutex_adjust_prio(lock, task);
 	} else {
 		/*
 		 * Nothing changed. No need to do any priority
@@ -1154,6 +1166,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
 	waiter->task = task;
 	waiter->lock = lock;
 	waiter_update_prio(waiter, task);
+	waiter_clone_prio(waiter);
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
@@ -1187,7 +1200,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
 		rt_mutex_dequeue_pi(owner, top_waiter);
 		rt_mutex_enqueue_pi(owner, waiter);
 
-		rt_mutex_adjust_prio(owner);
+		rt_mutex_adjust_prio(lock, owner);
 		if (owner->pi_blocked_on)
 			chain_walk = 1;
 	} else if (rt_mutex_cond_detect_deadlock(waiter, chwalk)) {
@@ -1234,6 +1247,8 @@ static void __sched mark_wakeup_next_waiter(struct rt_wake_q_head *wqh,
 {
 	struct rt_mutex_waiter *waiter;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	raw_spin_lock(&current->pi_lock);
 
 	waiter = rt_mutex_top_waiter(lock);
@@ -1246,7 +1261,7 @@ static void __sched mark_wakeup_next_waiter(struct rt_wake_q_head *wqh,
 	 * task unblocks.
 	 */
 	rt_mutex_dequeue_pi(current, waiter);
-	rt_mutex_adjust_prio(current);
+	rt_mutex_adjust_prio(lock, current);
 
 	/*
 	 * As we are waking up the top waiter, and the waiter stays
@@ -1482,7 +1497,7 @@ static void __sched remove_waiter(struct rt_mutex_base *lock,
 	if (rt_mutex_has_waiters(lock))
 		rt_mutex_enqueue_pi(owner, rt_mutex_top_waiter(lock));
 
-	rt_mutex_adjust_prio(owner);
+	rt_mutex_adjust_prio(lock, owner);
 
 	/* Store the lock on which owner is blocked or NULL */
 	next_lock = task_blocked_on_lock(owner);
diff --git a/kernel/locking/rtmutex_api.c b/kernel/locking/rtmutex_api.c
index cb9fdff76a8a..a6974d044593 100644
--- a/kernel/locking/rtmutex_api.c
+++ b/kernel/locking/rtmutex_api.c
@@ -459,7 +459,7 @@ void __sched rt_mutex_adjust_pi(struct task_struct *task)
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || rt_mutex_waiter_equal(waiter, task_to_waiter(task))) {
+	if (!waiter || rt_waiter_node_equal(&waiter->tree, task_to_waiter_node(task))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index c47e8361bfb5..3d7457bffd43 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -17,27 +17,34 @@
 #include <linux/rtmutex.h>
 #include <linux/sched/wake_q.h>
 
+
+/*
+ * @prio:		Priority of the waiter
+ * @deadline:		Deadline of the waiter if applicable
+ */
+struct rt_waiter_node {
+	struct rb_node	entry;
+	int		prio;
+	u64		deadline;
+};
+
 /*
  * This is the control structure for tasks blocked on a rt_mutex,
  * which is allocated on the kernel stack on of the blocked task.
  *
- * @tree_entry:		pi node to enqueue into the mutex waiters tree
- * @pi_tree_entry:	pi node to enqueue into the mutex owner waiters tree
+ * @tree:		pi node to enqueue into the mutex waiters tree
+ * @pi_tree:		pi node to enqueue into the mutex owner waiters tree
  * @task:		task reference to the blocked task
  * @lock:		Pointer to the rt_mutex on which the waiter blocks
  * @wake_state:		Wakeup state to use (TASK_NORMAL or TASK_RTLOCK_WAIT)
- * @prio:		Priority of the waiter
- * @deadline:		Deadline of the waiter if applicable
  * @ww_ctx:		WW context pointer
  */
 struct rt_mutex_waiter {
-	struct rb_node		tree_entry;
-	struct rb_node		pi_tree_entry;
+	struct rt_waiter_node	tree;
+	struct rt_waiter_node	pi_tree;
 	struct task_struct	*task;
 	struct rt_mutex_base	*lock;
 	unsigned int		wake_state;
-	int			prio;
-	u64			deadline;
 	struct ww_acquire_ctx	*ww_ctx;
 };
 
@@ -105,7 +112,7 @@ static inline bool rt_mutex_waiter_is_top_waiter(struct rt_mutex_base *lock,
 {
 	struct rb_node *leftmost = rb_first_cached(&lock->waiters);
 
-	return rb_entry(leftmost, struct rt_mutex_waiter, tree_entry) == waiter;
+	return rb_entry(leftmost, struct rt_mutex_waiter, tree.entry) == waiter;
 }
 
 static inline struct rt_mutex_waiter *rt_mutex_top_waiter(struct rt_mutex_base *lock)
@@ -114,7 +121,7 @@ static inline struct rt_mutex_waiter *rt_mutex_top_waiter(struct rt_mutex_base *
 	struct rt_mutex_waiter *w = NULL;
 
 	if (leftmost) {
-		w = rb_entry(leftmost, struct rt_mutex_waiter, tree_entry);
+		w = rb_entry(leftmost, struct rt_mutex_waiter, tree.entry);
 		BUG_ON(w->lock != lock);
 	}
 	return w;
@@ -127,8 +134,9 @@ static inline int task_has_pi_waiters(struct task_struct *p)
 
 static inline struct rt_mutex_waiter *task_top_pi_waiter(struct task_struct *p)
 {
+	lockdep_assert_held(&p->pi_lock);
 	return rb_entry(p->pi_waiters.rb_leftmost, struct rt_mutex_waiter,
-			pi_tree_entry);
+			pi_tree.entry);
 }
 
 #define RT_MUTEX_HAS_WAITERS	1UL
@@ -190,8 +198,8 @@ static inline void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
 static inline void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
 {
 	debug_rt_mutex_init_waiter(waiter);
-	RB_CLEAR_NODE(&waiter->pi_tree_entry);
-	RB_CLEAR_NODE(&waiter->tree_entry);
+	RB_CLEAR_NODE(&waiter->pi_tree.entry);
+	RB_CLEAR_NODE(&waiter->tree.entry);
 	waiter->wake_state = TASK_NORMAL;
 	waiter->task = NULL;
 }


* Re: Fwd: Possible race in rt_mutex_adjust_prio_chain
  2023-07-07 12:59       ` Peter Zijlstra
@ 2023-07-07 15:39         ` Mike Galbraith
  2023-07-07 16:05           ` Peter Zijlstra
  2023-07-07 17:28           ` Henry Wu
  0 siblings, 2 replies; 11+ messages in thread
From: Mike Galbraith @ 2023-07-07 15:39 UTC (permalink / raw)
  To: Peter Zijlstra, Henry Wu; +Cc: linux-kernel, Thomas Gleixner

On Fri, 2023-07-07 at 14:59 +0200, Peter Zijlstra wrote:
>
> The below implements this duplication and seems to not insta-crash.

RT bits of ww_mutex.h needed tree_entry -> tree.entry.  Modulo that, RT
seems content.

	-Mike


* Re: Fwd: Possible race in rt_mutex_adjust_prio_chain
  2023-07-07 15:39         ` Mike Galbraith
@ 2023-07-07 16:05           ` Peter Zijlstra
  2023-07-07 17:28           ` Henry Wu
  1 sibling, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2023-07-07 16:05 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Henry Wu, linux-kernel, Thomas Gleixner

On Fri, Jul 07, 2023 at 05:39:38PM +0200, Mike Galbraith wrote:
> On Fri, 2023-07-07 at 14:59 +0200, Peter Zijlstra wrote:
> >
> > The below implements this duplication and seems to not insta-crash.
> 
> RT bits of ww_mutex.h needed tree_entry -> tree.entry.  Modulo that, RT
> seems content.

Ah indeed, just in time. Added, I'll post a real patch soon.


* Re: Fwd: Possible race in rt_mutex_adjust_prio_chain
  2023-07-07 15:39         ` Mike Galbraith
  2023-07-07 16:05           ` Peter Zijlstra
@ 2023-07-07 17:28           ` Henry Wu
  2023-07-13 12:55             ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Henry Wu @ 2023-07-07 17:28 UTC (permalink / raw)
  To: Mike Galbraith, Peter Zijlstra; +Cc: linux-kernel

Hi, Peter and Mike.

Mike Galbraith <efault@gmx.de> wrote on Fri, Jul 7, 2023 at 23:39:
>
> On Fri, 2023-07-07 at 14:59 +0200, Peter Zijlstra wrote:
> >
> > The below implements this duplication and seems to not insta-crash.
>
> RT bits of ww_mutex.h needed tree_entry -> tree.entry.  Modulo that, RT
> seems content.
>
>         -Mike

I patched my kernel with Peter's patch and tested it with my test
program. I haven't seen any race so far and I will test more tomorrow.
Fedora's default kernel config doesn't enable CONFIG_PREEMPT_RT, so I
didn't run into the compile error. I will test the final patch when it
is available.


* Re: Fwd: Possible race in rt_mutex_adjust_prio_chain
  2023-07-07 17:28           ` Henry Wu
@ 2023-07-13 12:55             ` Peter Zijlstra
  2023-07-14  3:45               ` Henry Wu
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2023-07-13 12:55 UTC (permalink / raw)
  To: Henry Wu; +Cc: Mike Galbraith, linux-kernel

On Sat, Jul 08, 2023 at 01:28:25AM +0800, Henry Wu wrote:

> I will test final patch if available.

https://lkml.kernel.org/r/20230707161052.GF2883469%40hirez.programming.kicks-ass.net

Does that work for you; could you reply with a tested-by if possible?


* Re: Fwd: Possible race in rt_mutex_adjust_prio_chain
  2023-07-13 12:55             ` Peter Zijlstra
@ 2023-07-14  3:45               ` Henry Wu
  0 siblings, 0 replies; 11+ messages in thread
From: Henry Wu @ 2023-07-14  3:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel

Peter:

Peter Zijlstra <peterz@infradead.org> wrote on Thu, Jul 13, 2023 at 20:55:
>
> On Sat, Jul 08, 2023 at 01:28:25AM +0800, Henry Wu wrote:
>
> > I will test final patch if available.
>
> https://lkml.kernel.org/r/20230707161052.GF2883469%40hirez.programming.kicks-ass.net
>
> Does that work for you; could you reply with a tested-by if possible?

Sorry, I missed your patch and I will test it as soon as possible. I
will reply in two days.

Sincerely,

Henry


end of thread

Thread overview: 11+ messages
2023-07-06  6:08 Possible race in rt_mutex_adjust_prio_chain Henry Wu
2023-07-06 12:01 ` Peter Zijlstra
2023-07-06 19:58   ` Peter Zijlstra
2023-07-07  2:27     ` Henry Wu
     [not found]   ` <CAG-UpRTQ-ovVO02HRd5z-9oZGCPvZ0vODJse8oyuA9L-3NV_pQ@mail.gmail.com>
2023-07-07  2:09     ` Fwd: " Henry Wu
2023-07-07 12:59       ` Peter Zijlstra
2023-07-07 15:39         ` Mike Galbraith
2023-07-07 16:05           ` Peter Zijlstra
2023-07-07 17:28           ` Henry Wu
2023-07-13 12:55             ` Peter Zijlstra
2023-07-14  3:45               ` Henry Wu
