From: "André Almeida" <andrealmeid@igalia.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org,
	"Paul E . McKenney" <paulmck@kernel.org>,
	"Boqun Feng" <boqun.feng@gmail.com>,
	"H . Peter Anvin" <hpa@zytor.com>, "Paul Turner" <pjt@google.com>,
	linux-api@vger.kernel.org,
	"Christian Brauner" <brauner@kernel.org>,
	"Florian Weimer" <fw@deneb.enyo.de>,
	David.Laight@ACULAB.COM, carlos@redhat.com,
	"Peter Oskolkov" <posk@posk.io>,
	"Alexander Mikhalitsyn" <alexander@mihalicyn.com>,
	"Chris Kennelly" <ckennelly@google.com>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Darren Hart" <dvhart@infradead.org>,
	"Davidlohr Bueso" <dave@stgolabs.net>,
	"André Almeida" <andrealmeid@igalia.com>,
	libc-alpha@sourceware.org, "Steven Rostedt" <rostedt@goodmis.org>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Noah Goldstein" <goldstein.w.n@gmail.com>,
	"Daniel Colascione" <dancol@google.com>,
	longman@redhat.com, kernel-dev@igalia.com
Subject: [RFC PATCH 0/1] Add FUTEX_SPIN operation
Date: Thu, 25 Apr 2024 17:43:31 -0300
Message-ID: <20240425204332.221162-1-andrealmeid@igalia.com>

Hi,

At the last LPC, Mathieu Desnoyers and I presented[0] a proposal to extend the
rseq interface so that spin locks can be implemented correctly in userspace.
Thomas Gleixner agreed that this is something Linux could improve, but asked
for an alternative proposal first: a futex operation that spins on a user lock
inside the kernel. This patchset implements a prototype of that idea for
further discussion.

With the FUTEX2_SPIN flag set during a futex_wait(), the futex value is expected
to be the PID of the lock owner. The kernel then gets the task_struct of the
corresponding PID and checks whether that task is running. It spins until the
futex is woken, the task is scheduled out, or a timeout happens. If the lock
owner is scheduled out at any time, the syscall follows the normal path and
sleeps as usual.
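
As a rough illustration of the policy described above, the following userspace
model walks through the three exit conditions. It is not the code from the
patch: the names spin_scenario, spin_outcome and spin_wait_model are made up
for this sketch, time is modelled as a loop iteration count, and the order in
which the conditions are checked is an assumption. The PID to task_struct
lookup is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcomes of one FUTEX2_SPIN wait, as described in the text. */
enum spin_outcome {
	SPIN_WOKEN,	/* woken while spinning: return to userspace at once */
	SPIN_SLEEP,	/* owner scheduled out: fall back to the normal sleep path */
	SPIN_TIMEOUT,	/* the timeout expired while spinning */
};

/* Scripted scenario: the iteration at which each event becomes true. */
struct spin_scenario {
	uint64_t wake_at;	/* futex gets woken */
	uint64_t desched_at;	/* lock owner leaves the CPU */
	uint64_t timeout_at;	/* timeout expires */
};

/*
 * Spin until one of the three events happens and report which one.
 * *spins receives the number of completed spin iterations.
 * The check order (wake, owner state, timeout) is an assumption.
 */
static enum spin_outcome spin_wait_model(const struct spin_scenario *s,
					 uint64_t *spins)
{
	assert(s && spins);

	for (uint64_t i = 0; ; i++) {
		*spins = i;
		if (i >= s->wake_at)
			return SPIN_WOKEN;
		if (i >= s->desched_at)
			return SPIN_SLEEP;
		if (i >= s->timeout_at)
			return SPIN_TIMEOUT;
	}
}
```

Only the SPIN_WOKEN outcome avoids a trip through the scheduler. In the
SPIN_SLEEP case the waiter has paid for the spinning and still ends up on the
regular futex_wait() sleep path.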

If the futex is woken while we are spinning, we can return to userspace quickly,
avoiding being scheduled out and back in again to wake from a futex_wait(), thus
speeding up the wait operation.

I didn't manage to find a good mechanism to prevent race conditions between
setting *futex = PID in userspace and doing find_get_task_by_vpid(PID) in kernel
space, given that there's enough room for the original PID owner to exit and for
that PID to be reallocated to another, unrelated task in the system. I haven't
performed benchmarks so far, as I hope to clarify whether this interface makes
sense prior to doing measurements on it.
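
To make the userspace half of that race concrete, here is a minimal sketch of
a lock whose word holds the owner's TID, using C11 atomics. The tid_lock type
and its helpers are hypothetical names, not part of the patch or of any
library, and the sketch does nothing to close the race. It only shows where
the owner value is published and where a waiter would sample it before
handing it to the kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical lock word: 0 means unlocked, otherwise the owner's TID. */
struct tid_lock {
	_Atomic uint32_t owner;
};

static uint32_t self_tid(void)
{
	return (uint32_t)syscall(SYS_gettid);
}

/* Try to take the lock by publishing our TID; returns true on success. */
static bool tid_lock_try(struct tid_lock *l)
{
	uint32_t expected = 0;

	return atomic_compare_exchange_strong(&l->owner, &expected, self_tid());
}

/*
 * Sample the current owner. A waiter would pass this value as "val" to a
 * FUTEX2_SPIN wait. The race window opens here: before the kernel looks
 * the TID up, the owner may release the lock and exit, and the TID may be
 * handed to an unrelated task.
 */
static uint32_t tid_lock_owner(struct tid_lock *l)
{
	return atomic_load(&l->owner);
}

/* Release the lock; only the owner may call this. */
static void tid_lock_release(struct tid_lock *l)
{
	assert(atomic_load(&l->owner) == self_tid());
	atomic_store(&l->owner, 0);
	/* A real lock would issue a futex wake here if there are waiters. */
}
```

The futex value check only covers part of this: if the word changes before the
waiter is queued, the wait fails as usual. It does not tell the kernel whether
the task it finds for that TID is still the thread that wrote it.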

This implementation has some debug prints to make it easy to inspect what the
kernel is doing, so you can check whether the futex woke during spinning or
just slept as in the normal path:

[ 6331] futex_spin: spinned 64738 times, sleeping
[ 6331] futex_spin: woke after 1864606 spins
[ 6332] futex_spin: woke after 1820906 spins
[ 6351] futex_spin: spinned 1603293 times, sleeping
[ 6352] futex_spin: woke after 1848199 spins

[0] https://lpc.events/event/17/contributions/1481/

You can find a small snippet to play with this interface here:

---

/*
 * futex2_spin example, by André Almeida <andrealmeid@igalia.com>
 *
 * gcc spin.c -o spin
 */

#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <linux/futex.h>
#include <linux/sched.h>
#include <pthread.h>
#include <sched.h>	/* clone() */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>	/* clock_gettime(), struct timespec */
#include <unistd.h>

#define __NR_futex_wake 454
#define __NR_futex_wait 455

#define WAKE_WAIT_US	10000
#define FUTEX2_SPIN	0x08
#define STACK_SIZE	(1024 * 1024)

#define FUTEX2_SIZE_U32	0x02
#define FUTEX2_PRIVATE	FUTEX_PRIVATE_FLAG

#define timeout_ns  30000000

void *futex;

static inline int futex2_wake(volatile void *uaddr, unsigned long mask, int nr, unsigned int flags)
{
	return syscall(__NR_futex_wake, uaddr, mask, nr, flags);
}

static inline int futex2_wait(volatile void *uaddr, unsigned long val, unsigned long mask,
			      unsigned int flags, struct timespec *timo, clockid_t clockid)
{
	return syscall(__NR_futex_wait, uaddr, val, mask, flags, timo, clockid);
}

void waiter_fn(void)
{
	struct timespec to;
	unsigned int flags = FUTEX2_PRIVATE | FUTEX2_SIZE_U32 | FUTEX2_SPIN;

	/* The futex word holds the PID of the lock owner (the child). */
	uint32_t child_pid = *(uint32_t *) futex;

	/* Build an absolute CLOCK_MONOTONIC deadline timeout_ns from now. */
	clock_gettime(CLOCK_MONOTONIC, &to);
	to.tv_nsec += timeout_ns;
	if (to.tv_nsec >= 1000000000) {
		to.tv_sec++;
		to.tv_nsec -= 1000000000;
	}

	printf("waiting on PID %u...\n", child_pid);
	if (futex2_wait(futex, child_pid, ~0U, flags, &to, CLOCK_MONOTONIC))
		printf("waiter failed errno %d\n", errno);

	puts("waiting done");
}

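/* Unsigned, so the repeated doubling wraps instead of overflowing an int. */
unsigned int function(unsigned int n)
{
	return n + n;
}

#define CHILD_LOOPS 500000

static int child_fn(void *arg)
{
	unsigned int n = 2;
	int i;

	/* Busy work: keep the "lock owner" on a CPU for a while. */
	for (i = 0; i < CHILD_LOOPS; i++)
		n = function(n);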

	futex2_wake(futex, ~0U, 1, FUTEX2_SIZE_U32 | FUTEX_PRIVATE_FLAG);

	puts("child thread is done");

	return 0;
}

int main(void)
{
	uint32_t child_pid = 0;
	char *stack;
	int pid;

	futex = &child_pid;

	stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

	if (stack == MAP_FAILED)
		err(EXIT_FAILURE, "mmap");

	/* The child plays the lock owner; its PID becomes the futex value. */
	pid = clone(child_fn, stack + STACK_SIZE, CLONE_VM, NULL);
	if (pid == -1)
		err(EXIT_FAILURE, "clone");
	child_pid = pid;

	waiter_fn();

	usleep(WAKE_WAIT_US * 10);

	return 0;
}

---
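
The waiter in the snippet builds its absolute CLOCK_MONOTONIC deadline with a
single carry from tv_nsec into tv_sec, which is enough because timeout_ns is
below one second. For larger timeouts, a helper along these lines could be
used instead. timespec_add_ns and NSEC_PER_SEC are hypothetical names
introduced for this sketch and are not part of the snippet or the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Return base + ns with tv_nsec normalized to [0, NSEC_PER_SEC).
 * base must already be normalized.
 */
static struct timespec timespec_add_ns(struct timespec base, uint64_t ns)
{
	assert(base.tv_nsec >= 0 && base.tv_nsec < NSEC_PER_SEC);

	/* Split the offset into whole seconds and a sub-second remainder. */
	base.tv_sec += (time_t)(ns / NSEC_PER_SEC);
	base.tv_nsec += (long)(ns % NSEC_PER_SEC);

	/* Both addends are below one second, so at most one carry is needed. */
	if (base.tv_nsec >= NSEC_PER_SEC) {
		base.tv_sec++;
		base.tv_nsec -= NSEC_PER_SEC;
	}

	return base;
}
```

With this, the deadline setup in waiter_fn() would reduce to a
clock_gettime() call followed by to = timespec_add_ns(to, timeout_ns).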

André Almeida (1):
  futex: Add FUTEX_SPIN operation

 include/uapi/linux/futex.h |  2 +-
 kernel/futex/futex.h       |  6 ++-
 kernel/futex/waitwake.c    | 79 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 83 insertions(+), 4 deletions(-)

-- 
2.44.0

