linux-api.vger.kernel.org archive mirror
* [RFC PATCH 0/1] Add FUTEX_SPIN operation
@ 2024-04-25 20:43 André Almeida
  2024-04-25 20:43 ` [RFC PATCH 1/1] futex: " André Almeida
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: André Almeida @ 2024-04-25 20:43 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Zijlstra, Thomas Gleixner
  Cc: linux-kernel, Paul E . McKenney, Boqun Feng, H . Peter Anvin,
	Paul Turner, linux-api, Christian Brauner, Florian Weimer,
	David.Laight, carlos, Peter Oskolkov, Alexander Mikhalitsyn,
	Chris Kennelly, Ingo Molnar, Darren Hart, Davidlohr Bueso,
	André Almeida, libc-alpha, Steven Rostedt, Jonathan Corbet,
	Noah Goldstein, Daniel Colascione, longman, kernel-dev

Hi,

At the last LPC, Mathieu Desnoyers and I presented[0] a proposal to extend the
rseq interface so that spin locks can be implemented correctly in userspace. Thomas
Gleixner agreed that this is something that Linux could improve, but asked for
an alternative proposal first: a futex operation that spins on a user
lock inside the kernel. This patchset implements a prototype of this idea for
further discussion.

With the FUTEX2_SPIN flag set during a futex_wait(), the futex value is expected to
be the PID of the lock owner. The kernel then gets the task_struct of the
corresponding PID and checks whether it is running. It spins until the futex
is woken, the lock owner is scheduled out, or a timeout happens. If the lock owner
is scheduled out at any time, the syscall follows the normal path and
sleeps as usual.

If the futex is woken while we are spinning, we can return to userspace quickly
and avoid being scheduled out and back in to wake from a futex_wait(), thus
speeding up the wait operation.
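
The decision order described above can be modeled in plain userspace C. This
is a toy model, not the kernel code: the struct and function names are invented
for illustration, and each array element stands for one observation the kernel
would make per loop iteration. The order of the checks (woken first, then
timeout, then whether the owner is on a CPU) follows the prototype patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome names, for illustration only. */
enum spin_outcome {
	SPIN_WOKEN,		/* futex was woken while spinning */
	SPIN_TIMED_OUT,		/* timeout expired while spinning */
	SPIN_FALL_BACK_TO_SLEEP	/* owner left the CPU: take the normal sleep path */
};

/* One simulated observation per loop iteration. */
struct spin_sample {
	bool woken;
	bool timed_out;
	bool owner_on_cpu;
};

/* Walk the samples in order and report why spinning stopped. */
static enum spin_outcome model_futex_spin(const struct spin_sample *s, size_t n,
					  unsigned int *spins)
{
	assert(spins != NULL);
	*spins = 0;

	for (size_t i = 0; i < n; i++) {
		if (s[i].woken)
			return SPIN_WOKEN;
		if (s[i].timed_out)
			return SPIN_TIMED_OUT;
		if (!s[i].owner_on_cpu)
			return SPIN_FALL_BACK_TO_SLEEP;
		(*spins)++;	/* owner still running: spin once more */
	}

	/* Out of samples: the model simply stops spinning. */
	return SPIN_FALL_BACK_TO_SLEEP;
}
```

Feeding it two "owner running" samples followed by a "woken" sample reports
SPIN_WOKEN after two spins, which corresponds to the "woke after N spins" debug
line.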

I didn't manage to find a good mechanism to prevent race conditions between
setting *futex = PID in userspace and doing find_get_task_by_vpid(PID) in kernel
space, given that there's enough room for the original PID owner to exit and for
that PID to be reallocated to another, unrelated task in the system. I haven't performed
benchmarks so far, as I hope to clarify whether this interface makes sense before
doing measurements on it.

This implementation has some debug prints to make it easy to inspect what the
kernel is doing, so you can check whether the futex was woken during spinning or
just slept via the normal path:

[ 6331] futex_spin: spinned 64738 times, sleeping
[ 6331] futex_spin: woke after 1864606 spins
[ 6332] futex_spin: woke after 1820906 spins
[ 6351] futex_spin: spinned 1603293 times, sleeping
[ 6352] futex_spin: woke after 1848199 spins

[0] https://lpc.events/event/17/contributions/1481/

You can find a small snippet to play with this interface here:

---

/*
 * futex2_spin example, by André Almeida <andrealmeid@igalia.com>
 *
 * gcc spin.c -o spin
 */

#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <linux/futex.h>
#include <sched.h>		/* clone(), CLONE_VM */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>		/* clock_gettime() */
#include <unistd.h>

#define __NR_futex_wake 454
#define __NR_futex_wait 455

#define WAKE_WAIT_US	10000
#define FUTEX2_SPIN	0x08
#define STACK_SIZE	(1024 * 1024)

#define FUTEX2_SIZE_U32	0x02
#define FUTEX2_PRIVATE	FUTEX_PRIVATE_FLAG

#define timeout_ns  30000000

void *futex;

static inline int futex2_wake(volatile void *uaddr, unsigned long mask, int nr, unsigned int flags)
{
	return syscall(__NR_futex_wake, uaddr, mask, nr, flags);
}

static inline int futex2_wait(volatile void *uaddr, unsigned long val, unsigned long mask,
			      unsigned int flags, struct timespec *timo, clockid_t clockid)
{
	return syscall(__NR_futex_wait, uaddr, val, mask, flags, timo, clockid);
}

void waiter_fn(void)
{
	struct timespec to;
	unsigned int flags = FUTEX2_PRIVATE | FUTEX2_SIZE_U32 | FUTEX2_SPIN;

	/* the futex word holds the PID of the lock owner (the child) */
	uint32_t child_pid = *(uint32_t *) futex;

	/* absolute CLOCK_MONOTONIC deadline, timeout_ns from now */
	clock_gettime(CLOCK_MONOTONIC, &to);
	to.tv_nsec += timeout_ns;
	if (to.tv_nsec >= 1000000000) {
		to.tv_sec++;
		to.tv_nsec -= 1000000000;
	}

	printf("waiting on PID %u...\n", child_pid);
	if (futex2_wait(futex, child_pid, ~0U, flags, &to, CLOCK_MONOTONIC))
		printf("waiter failed errno %d\n", errno);

	puts("waiting done");
}

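/* unsigned, so the repeated doubling wraps around instead of being signed-overflow UB */
unsigned int function(unsigned int n)
{
	return n + n;
}

#define CHILD_LOOPS 500000

static int child_fn(void *arg)
{
	unsigned int n = 2;
	int i;

	(void) arg;

	/* busy work, to keep the lock owner on a CPU while the waiter spins */
	for (i = 0; i < CHILD_LOOPS; i++)
		n = function(n);

	futex2_wake(futex, ~0U, 1, FUTEX2_SIZE_U32 | FUTEX2_PRIVATE);

	puts("child thread is done");

	return 0;
}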

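int main(void)
{
	uint32_t child_pid = 0;
	char *stack;
	int ret;

	futex = &child_pid;

	stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

	if (stack == MAP_FAILED)
		err(EXIT_FAILURE, "mmap");

	/* CLONE_VM: the child shares our address space, including the futex word */
	ret = clone(child_fn, stack + STACK_SIZE, CLONE_VM, NULL);
	if (ret == -1)
		err(EXIT_FAILURE, "clone");

	/* the futex word holds the PID of the lock owner */
	child_pid = ret;

	waiter_fn();

	/* give the child time to finish before exiting */
	usleep(WAKE_WAIT_US * 10);

	return 0;
}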

---

André Almeida (1):
  futex: Add FUTEX_SPIN operation

 include/uapi/linux/futex.h |  2 +-
 kernel/futex/futex.h       |  6 ++-
 kernel/futex/waitwake.c    | 79 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 83 insertions(+), 4 deletions(-)

-- 
2.44.0



* [RFC PATCH 1/1] futex: Add FUTEX_SPIN operation
  2024-04-25 20:43 [RFC PATCH 0/1] Add FUTEX_SPIN operation André Almeida
@ 2024-04-25 20:43 ` André Almeida
  2024-04-26  9:43 ` [RFC PATCH 0/1] " Florian Weimer
  2024-04-26 10:26 ` Christian Brauner
  2 siblings, 0 replies; 11+ messages in thread
From: André Almeida @ 2024-04-25 20:43 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Zijlstra, Thomas Gleixner
  Cc: linux-kernel, Paul E . McKenney, Boqun Feng, H . Peter Anvin,
	Paul Turner, linux-api, Christian Brauner, Florian Weimer,
	David.Laight, carlos, Peter Oskolkov, Alexander Mikhalitsyn,
	Chris Kennelly, Ingo Molnar, Darren Hart, Davidlohr Bueso,
	André Almeida, libc-alpha, Steven Rostedt, Jonathan Corbet,
	Noah Goldstein, Daniel Colascione, longman, kernel-dev

Add a new mode for futex wait: the futex spin.

Given the FUTEX2_SPIN flag, parse the futex value as the PID of the lock
owner. Then, before going to the normal wait path, spin while the lock
owner is running on a different CPU, to avoid the whole context switch
operation and to quickly return to userspace. If the lock owner is not
running, just sleep as in the normal futex wait path.

The check for whether the owner is running is important to avoid
spinning on something that won't be released quickly. Userspace is
responsible for providing the proper PID; the kernel only does a basic check.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 include/uapi/linux/futex.h |  2 +-
 kernel/futex/futex.h       |  6 ++-
 kernel/futex/waitwake.c    | 79 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 83 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index d2ee625ea189..d77d692ffac2 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -63,7 +63,7 @@
 #define FUTEX2_SIZE_U32		0x02
 #define FUTEX2_SIZE_U64		0x03
 #define FUTEX2_NUMA		0x04
-			/*	0x08 */
+#define FUTEX2_SPIN		0x08
 			/*	0x10 */
 			/*	0x20 */
 			/*	0x40 */
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 8b195d06f4e8..180c1c10dc81 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -37,6 +37,7 @@
 #define FLAGS_HAS_TIMEOUT	0x0040
 #define FLAGS_NUMA		0x0080
 #define FLAGS_STRICT		0x0100
+#define FLAGS_SPIN		0x0200
 
 /* FUTEX_ to FLAGS_ */
 static inline unsigned int futex_to_flags(unsigned int op)
@@ -52,7 +53,7 @@ static inline unsigned int futex_to_flags(unsigned int op)
 	return flags;
 }
 
-#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_PRIVATE)
+#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_PRIVATE | FUTEX2_SPIN)
 
 /* FUTEX2_ to FLAGS_ */
 static inline unsigned int futex2_to_flags(unsigned int flags2)
@@ -65,6 +66,9 @@ static inline unsigned int futex2_to_flags(unsigned int flags2)
 	if (flags2 & FUTEX2_NUMA)
 		flags |= FLAGS_NUMA;
 
+	if (flags2 & FUTEX2_SPIN)
+		flags |= FLAGS_SPIN;
+
 	return flags;
 }
 
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 3a10375d9521..94feac92cf4f 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -372,6 +372,78 @@ void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
 	__set_current_state(TASK_RUNNING);
 }
 
+static inline bool task_on_cpu(struct task_struct *p)
+{
+#ifdef CONFIG_SMP
+	return !!(p->on_cpu);
+#else
+	return false;
+#endif
+}
+
+static int futex_spin(struct futex_hash_bucket *hb, struct futex_q *q,
+		       struct hrtimer_sleeper *timeout, void __user *uaddr, u32 val)
+{
+	struct task_struct *p;
+	u32 pid, uval;
+	unsigned int i = 0;
+
+	if (futex_get_value_locked(&uval, uaddr))
+		return -EFAULT;
+
+	pid = uval;
+
+	p = find_get_task_by_vpid(pid);
+	if (!p) {
+		printk("%s: no task found with PID %d\n", __func__, pid);
+		return -EAGAIN;
+	}
+
+	if (unlikely(p->flags & PF_KTHREAD)) {
+		put_task_struct(p);
+		printk("%s: can't spin in a kernel task\n", __func__);
+		return -EPERM;
+	}
+
+	futex_queue(q, hb);
+
+	if (timeout)
+		hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
+
+	while (1) {
+		if (likely(!plist_node_empty(&q->list))) {
+			if (timeout && !timeout->task)
+				return 0;
+
+			/* spin */
+			if (task_on_cpu(p)) {
+				i++;
+				continue;
+			/* task is not running, sleep */
+			} else {
+				break;
+			}
+		} else {
+			printk("%s: woke after %d spins\n", __func__, i);
+			return 0;
+		}
+	}
+
+	printk("%s: spinned %d times, sleeping\n", __func__, i);
+
+	/* spinning didn't work, go to the normal path */
+	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+
+	if (likely(!plist_node_empty(&q->list))) {
+		if (!timeout || timeout->task)
+			schedule();
+	}
+
+	__set_current_state(TASK_RUNNING);
+
+	return 0;
+}
+
 /**
  * futex_unqueue_multiple - Remove various futexes from their hash bucket
  * @v:	   The list of futexes to unqueue
@@ -665,8 +737,11 @@ int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 	if (ret)
 		return ret;
 
-	/* futex_queue and wait for wakeup, timeout, or a signal. */
-	futex_wait_queue(hb, &q, to);
+	if (flags & FLAGS_SPIN)
+		futex_spin(hb, &q, to, uaddr, val);
+	else
+		/* futex_queue and wait for wakeup, timeout, or a signal. */
+		futex_wait_queue(hb, &q, to);
 
 	/* If we were woken (and unqueued), we succeeded, whatever. */
 	if (!futex_unqueue(&q))
-- 
2.44.0



* Re: [RFC PATCH 0/1] Add FUTEX_SPIN operation
  2024-04-25 20:43 [RFC PATCH 0/1] Add FUTEX_SPIN operation André Almeida
  2024-04-25 20:43 ` [RFC PATCH 1/1] futex: " André Almeida
@ 2024-04-26  9:43 ` Florian Weimer
  2024-04-26 10:14   ` Peter Zijlstra
  2024-04-26 10:26 ` Christian Brauner
  2 siblings, 1 reply; 11+ messages in thread
From: Florian Weimer @ 2024-04-26  9:43 UTC (permalink / raw)
  To: André Almeida
  Cc: Mathieu Desnoyers, Peter Zijlstra, Thomas Gleixner, linux-kernel,
	Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner,
	linux-api, Christian Brauner, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, libc-alpha,
	Steven Rostedt, Jonathan Corbet, Noah Goldstein,
	Daniel Colascione, longman, kernel-dev

* André Almeida:

> With the FUTEX2_SPIN flag set during a futex_wait(), the futex value is
> expected to be the PID of the lock owner. The kernel then gets the
> task_struct of the corresponding PID and checks whether it is running. It
> spins until the futex is woken, the lock owner is scheduled out, or a
> timeout happens. If the lock owner is scheduled out at any time,
> the syscall follows the normal path and sleeps as usual.

PID or TID?

I think we'd like to have at least one, possibly more, bits for free
use, so the kernel ID comparison should at least mask off the MSB,
possibly more.

I haven't really thought about the proposed locking protocol, sorry.

Thanks,
Florian



* Re: [RFC PATCH 0/1] Add FUTEX_SPIN operation
  2024-04-26  9:43 ` [RFC PATCH 0/1] " Florian Weimer
@ 2024-04-26 10:14   ` Peter Zijlstra
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2024-04-26 10:14 UTC (permalink / raw)
  To: Florian Weimer
  Cc: André Almeida, Mathieu Desnoyers, Thomas Gleixner,
	linux-kernel, Paul E . McKenney, Boqun Feng, H . Peter Anvin,
	Paul Turner, linux-api, Christian Brauner, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, libc-alpha,
	Steven Rostedt, Jonathan Corbet, Noah Goldstein,
	Daniel Colascione, longman, kernel-dev

On Fri, Apr 26, 2024 at 11:43:51AM +0200, Florian Weimer wrote:
> * André Almeida:
> 
> > With the FUTEX2_SPIN flag set during a futex_wait(), the futex value is
> > expected to be the PID of the lock owner. The kernel then gets the
> > task_struct of the corresponding PID and checks whether it is running. It
> > spins until the futex is woken, the lock owner is scheduled out, or a
> > timeout happens. If the lock owner is scheduled out at any time,
> > the syscall follows the normal path and sleeps as usual.
> 
> PID or TID?

TID, just like PI_LOCK I would presume.

> I think we'd like to have at least one, possibly more, bits for free
> use, so the kernel ID comparison should at least mask off the MSB,
> possibly more.

Yeah, it should be using FUTEX_TID_MASK -- just like PI_LOCK :-)

I suppose the question is if this thing should then also imply
FUTEX_WAITERS or not.
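
For reference, a minimal userspace sketch of what masking with FUTEX_TID_MASK
looks like, assuming FUTEX2_SPIN adopted the same futex word layout as the PI
futexes (the thread has not settled this). The constants come from
<linux/futex.h>; the helper names are made up for illustration.

```c
#include <assert.h>
#include <linux/futex.h>	/* FUTEX_TID_MASK, FUTEX_WAITERS */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the owner TID lives in the bits covered by FUTEX_TID_MASK. */
static inline uint32_t futex_word_owner_tid(uint32_t word)
{
	return word & FUTEX_TID_MASK;
}

/* Hypothetical helper: true if the FUTEX_WAITERS bit is set in the word. */
static inline bool futex_word_has_waiters(uint32_t word)
{
	return (word & FUTEX_WAITERS) != 0;
}

/* Hypothetical helper: build a futex word from an owner TID and a waiters flag. */
static inline uint32_t futex_word_make(uint32_t tid, bool waiters)
{
	assert((tid & ~FUTEX_TID_MASK) == 0);	/* the TID must fit inside the mask */
	return tid | (waiters ? FUTEX_WAITERS : 0);
}
```

With this layout the kernel-side lookup would use futex_word_owner_tid(word)
rather than the raw word, leaving the top bits free for flags.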


* Re: [RFC PATCH 0/1] Add FUTEX_SPIN operation
  2024-04-25 20:43 [RFC PATCH 0/1] Add FUTEX_SPIN operation André Almeida
  2024-04-25 20:43 ` [RFC PATCH 1/1] futex: " André Almeida
  2024-04-26  9:43 ` [RFC PATCH 0/1] " Florian Weimer
@ 2024-04-26 10:26 ` Christian Brauner
  2024-05-01 23:44   ` André Almeida
  2 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2024-04-26 10:26 UTC (permalink / raw)
  To: André Almeida
  Cc: Mathieu Desnoyers, Peter Zijlstra, Thomas Gleixner, linux-kernel,
	Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner,
	linux-api, Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, libc-alpha, Steven Rostedt, Jonathan Corbet,
	Noah Goldstein, Daniel Colascione, longman, kernel-dev

On Thu, Apr 25, 2024 at 05:43:31PM -0300, André Almeida wrote:
> Hi,
> 
> At the last LPC, Mathieu Desnoyers and I presented[0] a proposal to extend the
> rseq interface so that spin locks can be implemented correctly in userspace. Thomas
> Gleixner agreed that this is something that Linux could improve, but asked for
> an alternative proposal first: a futex operation that spins on a user
> lock inside the kernel. This patchset implements a prototype of this idea for
> further discussion.
> 
> With the FUTEX2_SPIN flag set during a futex_wait(), the futex value is expected to
> be the PID of the lock owner. The kernel then gets the task_struct of the
> corresponding PID and checks whether it is running. It spins until the futex
> is woken, the lock owner is scheduled out, or a timeout happens. If the lock owner
> is scheduled out at any time, the syscall follows the normal path and
> sleeps as usual.
> 
> If the futex is woken while we are spinning, we can return to userspace quickly
> and avoid being scheduled out and back in to wake from a futex_wait(), thus
> speeding up the wait operation.
> 
> I didn't manage to find a good mechanism to prevent race conditions between
> setting *futex = PID in userspace and doing find_get_task_by_vpid(PID) in kernel
> space, given that there's enough room for the original PID owner to exit and for
> that PID to be reallocated to another, unrelated task in the system. I haven't performed

One option would be to also allow pidfds. Starting with v6.9 they can be
used to reference individual threads.

So for the really fast case where you have multiple threads and you
somehow really do care about the impact of the atomic_long_inc() on
pidfd_file->f_count during fdget() (for the single-threaded case the
increment is elided), callers can pass the TID. But in cases where the
inc and put aren't performance sensitive, you can use pidfds.

So something like the _completely untested_ below:

diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 94feac92cf4f..b842680aa7e0 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -4,6 +4,9 @@
 #include <linux/sched/task.h>
 #include <linux/sched/signal.h>
 #include <linux/freezer.h>
+#include <linux/cleanup.h>
+#include <linux/file.h>
+#include <uapi/linux/pidfd.h>
 
 #include "futex.h"
 
@@ -385,19 +388,29 @@ static int futex_spin(struct futex_hash_bucket *hb, struct futex_q *q,
 		       struct hrtimer_sleeper *timeout, void __user *uaddr, u32 val)
 {
 	struct task_struct *p;
-	u32 pid, uval;
+	struct pid *pid;
+	u32 pidfd, uval;
 	unsigned int i = 0;
 
 	if (futex_get_value_locked(&uval, uaddr))
 		return -EFAULT;
 
-	pid = uval;
+	pidfd = uval;
+	CLASS(fd, f)(pidfd);
 
-	p = find_get_task_by_vpid(pid);
-	if (!p) {
-		printk("%s: no task found with PID %d\n", __func__, pid);
-		return -EAGAIN;
-	}
+	if (!f.file)
+		return -EBADF;
+
+	pid = pidfd_pid(f.file);
+	if (IS_ERR(pid))
+		return PTR_ERR(pid);
+
+	if (f.file->f_flags & PIDFD_THREAD)
+		p = get_pid_task(pid, PIDTYPE_PID); /* individual thread */
+	else
+		p = get_pid_task(pid, PIDTYPE_TGID); /* thread-group leader */
+	if (!p)
+		return -ESRCH;
 
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		put_task_struct(p);


> benchmarks so far, as I hope to clarify if this interface makes sense prior to
> doing measurements on it.


* Re: [RFC PATCH 0/1] Add FUTEX_SPIN operation
  2024-04-26 10:26 ` Christian Brauner
@ 2024-05-01 23:44   ` André Almeida
  2024-05-02  8:45     ` Christian Brauner
  0 siblings, 1 reply; 11+ messages in thread
From: André Almeida @ 2024-05-01 23:44 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Mathieu Desnoyers, Peter Zijlstra, Thomas Gleixner, linux-kernel,
	Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner,
	linux-api, Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, libc-alpha, Steven Rostedt, Jonathan Corbet,
	Noah Goldstein, Daniel Colascione, longman, kernel-dev

Hi Christian,

On 26/04/2024 07:26, Christian Brauner wrote:
> On Thu, Apr 25, 2024 at 05:43:31PM -0300, André Almeida wrote:
>> Hi,
>>
>> At the last LPC, Mathieu Desnoyers and I presented[0] a proposal to extend the
>> rseq interface so that spin locks can be implemented correctly in userspace. Thomas
>> Gleixner agreed that this is something that Linux could improve, but asked for
>> an alternative proposal first: a futex operation that spins on a user
>> lock inside the kernel. This patchset implements a prototype of this idea for
>> further discussion.
>>
>> With the FUTEX2_SPIN flag set during a futex_wait(), the futex value is expected to
>> be the PID of the lock owner. The kernel then gets the task_struct of the
>> corresponding PID and checks whether it is running. It spins until the futex
>> is woken, the lock owner is scheduled out, or a timeout happens. If the lock owner
>> is scheduled out at any time, the syscall follows the normal path and
>> sleeps as usual.
>>
>> If the futex is woken while we are spinning, we can return to userspace quickly
>> and avoid being scheduled out and back in to wake from a futex_wait(), thus
>> speeding up the wait operation.
>>
>> I didn't manage to find a good mechanism to prevent race conditions between
>> setting *futex = PID in userspace and doing find_get_task_by_vpid(PID) in kernel
>> space, given that there's enough room for the original PID owner to exit and for
>> that PID to be reallocated to another, unrelated task in the system. I haven't performed
> 
> One option would be to also allow pidfds. Starting with v6.9 they can be
> used to reference individual threads.
> 
> So for the really fast case where you have multiple threads and you
> somehow really do care about the impact of the atomic_long_inc() on
> pidfd_file->f_count during fdget() (for the single-threaded case the
> increment is elided), callers can pass the TID. But in cases where the
> inc and put aren't performance sensitive, you can use pidfds.
> 

Thank you very much for making the effort here, much appreciated :)

While I agree that pidfds would fix the PID race conditions, I will move 
this interface to support TIDs instead, as noted by Florian and Peter. 
With TID the race conditions are diminished I reckon?


* Re: [RFC PATCH 0/1] Add FUTEX_SPIN operation
  2024-05-01 23:44   ` André Almeida
@ 2024-05-02  8:45     ` Christian Brauner
  2024-05-02  9:51       ` Florian Weimer
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2024-05-02  8:45 UTC (permalink / raw)
  To: André Almeida
  Cc: Mathieu Desnoyers, Peter Zijlstra, Thomas Gleixner, linux-kernel,
	Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner,
	linux-api, Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, libc-alpha, Steven Rostedt, Jonathan Corbet,
	Noah Goldstein, Daniel Colascione, longman, kernel-dev

On Wed, May 01, 2024 at 08:44:36PM -0300, André Almeida wrote:
> Hi Christian,
> 
> On 26/04/2024 07:26, Christian Brauner wrote:
> > On Thu, Apr 25, 2024 at 05:43:31PM -0300, André Almeida wrote:
> > > Hi,
> > > 
> > > At the last LPC, Mathieu Desnoyers and I presented[0] a proposal to extend the
> > > rseq interface so that spin locks can be implemented correctly in userspace. Thomas
> > > Gleixner agreed that this is something that Linux could improve, but asked for
> > > an alternative proposal first: a futex operation that spins on a user
> > > lock inside the kernel. This patchset implements a prototype of this idea for
> > > further discussion.
> > > 
> > > With the FUTEX2_SPIN flag set during a futex_wait(), the futex value is expected to
> > > be the PID of the lock owner. The kernel then gets the task_struct of the
> > > corresponding PID and checks whether it is running. It spins until the futex
> > > is woken, the lock owner is scheduled out, or a timeout happens. If the lock owner
> > > is scheduled out at any time, the syscall follows the normal path and
> > > sleeps as usual.
> > > 
> > > If the futex is woken while we are spinning, we can return to userspace quickly
> > > and avoid being scheduled out and back in to wake from a futex_wait(), thus
> > > speeding up the wait operation.
> > > 
> > > I didn't manage to find a good mechanism to prevent race conditions between
> > > setting *futex = PID in userspace and doing find_get_task_by_vpid(PID) in kernel
> > > space, given that there's enough room for the original PID owner to exit and for
> > > that PID to be reallocated to another, unrelated task in the system. I haven't performed
> > 
> > One option would be to also allow pidfds. Starting with v6.9 they can be
> > used to reference individual threads.
> > 
> > So for the really fast case where you have multiple threads and you
> > somehow really do care about the impact of the atomic_long_inc() on
> > pidfd_file->f_count during fdget() (for the single-threaded case the
> > increment is elided), callers can pass the TID. But in cases where the
> > inc and put aren't performance sensitive, you can use pidfds.
> > 
> 
> Thank you very much for making the effort here, much appreciated :)
> 
> While I agree that pidfds would fix the PID race conditions, I will move
> this interface to support TIDs instead, as noted by Florian and Peter. With
> TID the race conditions are diminished I reckon?

Unless I'm missing something the question here is PID (as in TGID aka
thread-group leader id gotten via getpid()) vs TID (thread specific id
gotten via gettid()). You want the thread-specific id as you want to
interact with the futex state of a specific thread not the thread-group
leader.

Aside from that TIDs are subject to the same race conditions that PIDs
are. They are allocated from the same pool (see alloc_pid()).
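
To make the PID/TID distinction concrete: gettid() returns the per-thread ID,
while getpid() returns the TGID, which equals the TID of the thread-group
leader. A minimal sketch follows; the helper names are hypothetical, and the
raw SYS_gettid syscall is used so the code does not depend on a libc gettid()
wrapper.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: thread-specific ID of the calling thread. */
static pid_t current_tid(void)
{
	return (pid_t)syscall(SYS_gettid);
}

/* Hypothetical helper: true when the caller is the thread-group leader. */
static bool is_thread_group_leader(void)
{
	return current_tid() == getpid();
}
```

Called from the initial thread of a process, is_thread_group_leader() is true.
In any other thread of the same process current_tid() differs from getpid(),
yet both numbers come from the same ID space.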


* Re: [RFC PATCH 0/1] Add FUTEX_SPIN operation
  2024-05-02  8:45     ` Christian Brauner
@ 2024-05-02  9:51       ` Florian Weimer
  2024-05-02 10:14         ` Christian Brauner
  0 siblings, 1 reply; 11+ messages in thread
From: Florian Weimer @ 2024-05-02  9:51 UTC (permalink / raw)
  To: Christian Brauner
  Cc: André Almeida, Mathieu Desnoyers, Peter Zijlstra,
	Thomas Gleixner, linux-kernel, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, libc-alpha,
	Steven Rostedt, Jonathan Corbet, Noah Goldstein,
	Daniel Colascione, longman, kernel-dev

* Christian Brauner:

> Unless I'm missing something the question here is PID (as in TGID aka
> thread-group leader id gotten via getpid()) vs TID (thread specific id
> gotten via gettid()). You want the thread-specific id as you want to
> interact with the futex state of a specific thread not the thread-group
> leader.
>
> Aside from that TIDs are subject to the same race conditions that PIDs
> are. They are allocated from the same pool (see alloc_pid()).

For most mutex types (but not robust mutexes), it is undefined in
userspace if a thread exits while it has locked a mutex.  Such a usage
condition would ensure that the race doesn't happen, I believe.

From a glibc perspective, we typically cannot use long-term file
descriptors (that are kept open across function calls) because some
applications do not expect them, or even close them behind our back.

Thanks,
Florian



* Re: [RFC PATCH 0/1] Add FUTEX_SPIN operation
  2024-05-02  9:51       ` Florian Weimer
@ 2024-05-02 10:14         ` Christian Brauner
  2024-05-02 10:39           ` Florian Weimer
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2024-05-02 10:14 UTC (permalink / raw)
  To: Florian Weimer
  Cc: André Almeida, Mathieu Desnoyers, Peter Zijlstra,
	Thomas Gleixner, linux-kernel, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, libc-alpha,
	Steven Rostedt, Jonathan Corbet, Noah Goldstein,
	Daniel Colascione, longman, kernel-dev

On Thu, May 02, 2024 at 11:51:56AM +0200, Florian Weimer wrote:
> * Christian Brauner:
> 
> > Unless I'm missing something the question here is PID (as in TGID aka
> > thread-group leader id gotten via getpid()) vs TID (thread specific id
> > gotten via gettid()). You want the thread-specific id as you want to
> > interact with the futex state of a specific thread not the thread-group
> > leader.
> >
> > Aside from that TIDs are subject to the same race conditions that PIDs
> > are. They are allocated from the same pool (see alloc_pid()).
> 
> For most mutex types (but not robust mutexes), it is undefined in
> userspace if a thread exits while it has locked a mutex.  Such a usage
> condition would ensure that the race doesn't happen, I believe.

The argument is a bit shaky imho because the race not being able to
happen is predicated on no one being careless enough to exit with a
mutex held. That doesn't do anything against someone doing it on
purpose.

> 
> From a glibc perspective, we typically cannot use long-term file
> descriptors (that are kept open across function calls) because some
> applications do not expect them, or even close them behind our back.

Yeah, good point. Note, I suggested it as an extension not as a
replacement for the TID. I still think it would be a useful extension in
general.


* Re: [RFC PATCH 0/1] Add FUTEX_SPIN operation
  2024-05-02 10:14         ` Christian Brauner
@ 2024-05-02 10:39           ` Florian Weimer
  2024-05-02 13:08             ` Christian Brauner
  0 siblings, 1 reply; 11+ messages in thread
From: Florian Weimer @ 2024-05-02 10:39 UTC (permalink / raw)
  To: Christian Brauner
  Cc: André Almeida, Mathieu Desnoyers, Peter Zijlstra,
	Thomas Gleixner, linux-kernel, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, libc-alpha,
	Steven Rostedt, Jonathan Corbet, Noah Goldstein,
	Daniel Colascione, longman, kernel-dev

* Christian Brauner:

>> From a glibc perspective, we typically cannot use long-term file
>> descriptors (that are kept open across function calls) because some
>> applications do not expect them, or even close them behind our back.
>
> Yeah, good point. Note, I suggested it as an extension not as a
> replacement for the TID. I still think it would be a useful extension in
> general.

Applications will need a way to determine when it is safe to close the
pidfd, though.  If we automate this in glibc (in the same way we handle
thread stack deallocation for example), I think we are essentially back
to square one, except that pidfd collisions are much more likely than
TID collisions, especially on systems that have adjusted kernel.pid_max.
(File descriptor allocation is designed to maximize collisions, after
all.)
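
The point about descriptor numbers is easy to see in practice: POSIX requires
open() to return the lowest-numbered free descriptor, so a number that was just
closed is handed out again right away. A small single-threaded sketch (the
function name is made up for illustration):

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Open /dev/null, close it, and open it again.
 * Returns 1 if the second open reused the same descriptor number,
 * 0 if it did not, and -1 if an open() call failed.
 */
static int fd_number_is_reused(void)
{
	int first, second, reused;

	first = open("/dev/null", O_RDONLY);
	if (first < 0)
		return -1;
	close(first);

	second = open("/dev/null", O_RDONLY);
	if (second < 0)
		return -1;

	reused = (second == first);
	close(second);

	return reused;
}
```

In a single-threaded program the function returns 1, which is exactly why a
stale pidfd number can silently end up naming a different file.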

Thanks,
Florian



* Re: [RFC PATCH 0/1] Add FUTEX_SPIN operation
  2024-05-02 10:39           ` Florian Weimer
@ 2024-05-02 13:08             ` Christian Brauner
  0 siblings, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2024-05-02 13:08 UTC (permalink / raw)
  To: Florian Weimer
  Cc: André Almeida, Mathieu Desnoyers, Peter Zijlstra,
	Thomas Gleixner, linux-kernel, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, libc-alpha,
	Steven Rostedt, Jonathan Corbet, Noah Goldstein,
	Daniel Colascione, longman, kernel-dev

On Thu, May 02, 2024 at 12:39:34PM +0200, Florian Weimer wrote:
> * Christian Brauner:
> 
> >> From a glibc perspective, we typically cannot use long-term file
> >> descriptors (that are kept open across function calls) because some
> >> applications do not expect them, or even close them behind our back.
> >
> > Yeah, good point. Note, I suggested it as an extension not as a
> > replacement for the TID. I still think it would be a useful extension in
> > general.
> 
> Applications will need a way to determine when it is safe to close the
> pidfd, though.  If we automate this in glibc (in the same way we handle
> thread stack deallocation for example), I think we are essentially back
> to square one, except that pidfd collisions are much more likely than
> TID collisions, especially on systems that have adjusted kernel.pid_max.
> (File descriptor allocation is designed to maximize collisions, after
> all.)

(Note that with pidfs (current mainline), pidfds have 64-bit inode
numbers that are unique for the lifetime of the system. So they can
reliably be compared via statx() and so on.)
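
A rough sketch of such a comparison, using fstat() rather than statx() for
brevity. It assumes a kernel with pidfs; on older kernels pidfd inode numbers
may not distinguish tasks, so the result would not be meaningful there. The
helper names are illustrative, and pidfd_open is invoked through syscall() in
case libc provides no wrapper.

```c
#define _GNU_SOURCE
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open a pidfd for @pid, or return -1 on error. */
static int open_pidfd(pid_t pid)
{
	return (int)syscall(SYS_pidfd_open, pid, 0);
}

/*
 * Hypothetical helper: compare two pidfds by device and inode number.
 * Returns 1 if they match, 0 if they differ, -1 if fstat() failed.
 */
static int pidfd_refers_to_same_task(int a, int b)
{
	struct stat sa, sb;

	if (fstat(a, &sa) != 0 || fstat(b, &sb) != 0)
		return -1;

	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Two pidfds opened for the same process compare equal this way even though their
descriptor numbers differ.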
