Linux-kselftest Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v3 0/4] Implement FUTEX_WAIT_MULTIPLE operation 
@ 2020-02-13 21:45 André Almeida
  2020-02-13 21:45 ` [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes André Almeida
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: André Almeida @ 2020-02-13 21:45 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: kernel, krisman, shuah, linux-kselftest, rostedt, ryao, peterz,
	dvhart, mingo, z.figura12, steven, pgriffais, steven,
	André Almeida

Hello,

This patchset implements a new futex operation, called FUTEX_WAIT_MULTIPLE,
which allows a thread to wait on several futexes at the same time, and be
awoken by any of them.

The use case lies in the Wine implementation of the Windows NT interface
WaitMultipleObjects. This Windows API function allows a thread to sleep
waiting on the first of a set of event sources (mutexes, timers, signal,
console input, etc) to signal.  Considering this is a primitive
synchronization operation for Windows applications, being able to quickly
signal events on the producer side, and quickly go to sleep on the
consumer side is essential for good performance of those running over Wine.

Since this API exposes a mechanism to wait on multiple objects, and
we might have multiple waiters for each of these events, a M->N
relationship, the current Linux interfaces fell short on performance
evaluation of large M,N scenarios.  We experimented, for instance, with
eventfd, which has performance problems discussed below, but we also
experimented with userspace solutions, like making each consumer wait on
a condition variable guarding the entire list of objects, and then
waking up multiple variables on the producer side, but this is
prohibitively expensive since we either need to signal many condition
variables or share that condition variable among multiple waiters, and
then verify for the event being signaled in userspace, which means
dealing with often false positive wakes ups.

The natural interface to implement the behavior we want, also
considering that one of the waitable objects is a mutex itself, would be
the futex interface.  Therefore, this patchset proposes a mechanism for
a thread to wait on multiple futexes at once, and wake up on the first
futex that was awaken.

In particular, using futexes in our Wine use case reduced the CPU
utilization by 4% for the game Beat Saber and by 1.5% for the game
Shadow of Tomb Raider, both running over Proton (a Wine based solution
for Windows emulation), when compared to the eventfd interface. This
implementation also doesn't rely of file descriptors, so it doesn't risk
overflowing the resource.

In time, we are also proposing modifications to glibc and libpthread to
make this feature available for Linux native multithreaded applications
using libpthread, which can benefit from the behavior of waiting on any
of a group of futexes.

Technically, the existing FUTEX_WAIT implementation can be easily
reworked by using futex_wait_multiple() with a count of one, and I
have a patch showing how it works.  I'm not proposing it, since
futex is such a tricky code, that I'd be more comfortable to have
FUTEX_WAIT_MULTIPLE running upstream for a couple development cycles,
before considering modifying FUTEX_WAIT.

The patch series includes an extensive set of kselftests validating
the behavior of the interface.  We also implemented support[1] on
Syzkaller and survived the fuzzy testing.

Finally, if you'd rather pull directly a branch with this set you can
find it here:

https://gitlab.collabora.com/tonyk/linux/commits/futex-dev-v3

The RFC for this patch can be found here:

https://lkml.org/lkml/2019/7/30/1399

=== Performance of eventfd ===

Polling on several eventfd contexts with semaphore semantics would
provide us with the exact semantics we are looking for.  However, as
shown below, in a scenario with sufficient producers and consumers, the
eventfd interface itself becomes a bottleneck, in particular because
each thread will compete to acquire a sequence of waitqueue locks for
each eventfd context in the poll list. In addition, in the uncontended
case, where the producer is ready for consumption, eventfd still
requires going into the kernel on the consumer side.  

When a write or a read operation in an eventfd file succeeds, it will try
to wake up all threads that are waiting to perform some operation to
the file. The lock (ctx->wqh.lock) that hold the access to the file value
(ctx->count) is the same lock used to control access the waitqueue. When
all those those thread woke, they will compete to get this lock. Along
with that, the poll() also manipulates the waitqueue and need to hold
this same lock. This lock is specially hard to acquire when poll() calls
poll_freewait(), where it tries to free all waitqueues associated with
this poll. While doing that, it will compete with a lot of read and
write operations that have been waken.

In our use case, with a huge number of parallel reads, writes and polls,
this lock is a bottleneck and hurts the performance of applications. Our
implementation of futex, however, decrease the calls of spin lock by more
than 80% in some user applications.

Finally, eventfd operates on file descriptors, which is a limited
resource that has shown its limitation in our use cases.  Despite the
Windows interface not waiting on more than 64 objects at once, we still
have multiple waiters at the same time, and we were easily able to
exhaust the FD limits on applications like games.

Thanks,
    André

[1] https://github.com/andrealmeid/syzkaller/tree/futex-wait-multiple

Gabriel Krisman Bertazi (4):
  futex: Implement mechanism to wait on any of several futexes
  selftests: futex: Add FUTEX_WAIT_MULTIPLE timeout test
  selftests: futex: Add FUTEX_WAIT_MULTIPLE wouldblock test
  selftests: futex: Add FUTEX_WAIT_MULTIPLE wake up test

 include/uapi/linux/futex.h                    |  20 +
 kernel/futex.c                                | 358 +++++++++++++++++-
 .../selftests/futex/functional/.gitignore     |   1 +
 .../selftests/futex/functional/Makefile       |   3 +-
 .../futex/functional/futex_wait_multiple.c    | 173 +++++++++
 .../futex/functional/futex_wait_timeout.c     |  38 +-
 .../futex/functional/futex_wait_wouldblock.c  |  28 +-
 .../testing/selftests/futex/functional/run.sh |   3 +
 .../selftests/futex/include/futextest.h       |  22 ++
 9 files changed, 639 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/futex/functional/futex_wait_multiple.c

-- 
2.25.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes
  2020-02-13 21:45 [PATCH v3 0/4] Implement FUTEX_WAIT_MULTIPLE operation André Almeida
@ 2020-02-13 21:45 ` André Almeida
  2020-02-28 19:07   ` Peter Zijlstra
  2020-02-13 21:45 ` [PATCH v3 2/4] selftests: futex: Add FUTEX_WAIT_MULTIPLE timeout test André Almeida
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: André Almeida @ 2020-02-13 21:45 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: kernel, krisman, shuah, linux-kselftest, rostedt, ryao, peterz,
	dvhart, mingo, z.figura12, steven, pgriffais, steven,
	André Almeida

From: Gabriel Krisman Bertazi <krisman@collabora.com>

This is a new futex operation, called FUTEX_WAIT_MULTIPLE, which allows
a thread to wait on several futexes at the same time, and be awoken by
any of them.  In a sense, it implements one of the features that was
supported by pooling on the old FUTEX_FD interface.

The use case lies in the Wine implementation of the Windows NT interface
WaitMultipleObjects. This Windows API function allows a thread to sleep
waiting on the first of a set of event sources (mutexes, timers, signal,
console input, etc) to signal.  Considering this is a primitive
synchronization operation for Windows applications, being able to quickly
signal events on the producer side, and quickly go to sleep on the
consumer side is essential for good performance of those running over Wine.

Wine developers have an implementation that uses eventfd, but it suffers
from FD exhaustion (there is applications that go to the order of
multi-milion FDs), and higher CPU utilization than this new operation.

The futex list is passed as an array of `struct futex_wait_block`
(pointer, value, bitset) to the kernel, which will enqueue all of them
and sleep if none was already triggered. It returns a hint of which
futex caused the wake up event to userspace, but the hint doesn't
guarantee that is the only futex triggered.  Before calling the syscall
again, userspace should traverse the list, trying to re-acquire any of
the other futexes, to prevent an immediate -EWOULDBLOCK return code from
the kernel.

This was tested using three mechanisms:

1) By reimplementing FUTEX_WAIT in terms of FUTEX_WAIT_MULTIPLE and
running the unmodified tools/testing/selftests/futex and a full linux
distro on top of this kernel.

2) By an example code that exercises the FUTEX_WAIT_MULTIPLE path on a
multi-threaded, event-handling setup.

3) By running the Wine fsync implementation and executing multi-threaded
applications, in particular modern games, on top of this implementation.

Changes were tested for the following ABIs: x86_64, i386 and x32.
Support for x32 applications is not implemented since it would
take a major rework adding a new entry point and splitting the current
futex 64 entry point in two and we can't change the current x32 syscall
number without breaking user space compatibility.

CC: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Yao <ryao@gentoo.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Zebediah Figura <z.figura12@gmail.com>
Signed-off-by: Zebediah Figura <z.figura12@gmail.com>
Co-developed-by: Steven Noonan <steven@valvesoftware.com>
Signed-off-by: Steven Noonan <steven@valvesoftware.com>
Co-developed-by: Pierre-Loup A. Griffais <pgriffais@valvesoftware.com>
Signed-off-by: Pierre-Loup A. Griffais <pgriffais@valvesoftware.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
[Added compatibility code]
Co-developed-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
Changes since v2:
  - Loop counters are now unsigned
  - Add ifdef around `in_x32_syscall()`, so this function is only compiled
    in architectures that declare it

Changes since RFC:
  - Limit waitlist to 128 futexes
  - Simplify wait loop
  - Document functions
  - Reduce allocated space
  - Return hint if a futex was awoken during setup
  - Check if any futex was awoken prior to sleep
  - Drop relative timer logic
  - Add compatibility struct and entry points
  - Add selftests
---
 include/uapi/linux/futex.h |  20 +++
 kernel/futex.c             | 358 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 376 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index a89eb0accd5e..580001e89c6c 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -21,6 +21,7 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
+#define FUTEX_WAIT_MULTIPLE	13
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
@@ -40,6 +41,8 @@
 					 FUTEX_PRIVATE_FLAG)
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAIT_MULTIPLE_PRIVATE	(FUTEX_WAIT_MULTIPLE | \
+					 FUTEX_PRIVATE_FLAG)
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
@@ -150,4 +153,21 @@ struct robust_list_head {
   (((op & 0xf) << 28) | ((cmp & 0xf) << 24)		\
    | ((oparg & 0xfff) << 12) | (cmparg & 0xfff))
 
+/*
+ * Maximum number of multiple futexes to wait for
+ */
+#define FUTEX_MULTIPLE_MAX_COUNT	128
+
+/**
+ * struct futex_wait_block - Block of futexes to be waited for
+ * @uaddr:	User address of the futex
+ * @val:	Futex value expected by userspace
+ * @bitset:	Bitset for the optional bitmasked wakeup
+ */
+struct futex_wait_block {
+	__u32 __user *uaddr;
+	__u32 val;
+	__u32 bitset;
+};
+
 #endif /* _UAPI_LINUX_FUTEX_H */
diff --git a/kernel/futex.c b/kernel/futex.c
index 0cf84c8664f2..58cf9eb2b851 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -215,6 +215,8 @@ struct futex_pi_state {
  * @rt_waiter:		rt_waiter storage for use with requeue_pi
  * @requeue_pi_key:	the requeue_pi target futex key
  * @bitset:		bitset for the optional bitmasked wakeup
+ * @uaddr:             userspace address of futex
+ * @uval:              expected futex's value
  *
  * We use this hashed waitqueue, instead of a normal wait_queue_entry_t, so
  * we can wake only the relevant ones (hashed queues may be shared).
@@ -237,6 +239,8 @@ struct futex_q {
 	struct rt_mutex_waiter *rt_waiter;
 	union futex_key *requeue_pi_key;
 	u32 bitset;
+	u32 __user *uaddr;
+	u32 uval;
 } __randomize_layout;
 
 static const struct futex_q futex_q_init = {
@@ -2420,6 +2424,29 @@ static int unqueue_me(struct futex_q *q)
 	return ret;
 }
 
+/**
+ * unqueue_multiple() - Remove several futexes from their futex_hash_bucket
+ * @q:	The list of futexes to unqueue
+ * @count: Number of futexes in the list
+ *
+ * Helper to unqueue a list of futexes. This can't fail.
+ *
+ * Return:
+ *  - >=0 - Index of the last futex that was awoken;
+ *  - -1  - If no futex was awoken
+ */
+static int unqueue_multiple(struct futex_q *q, int count)
+{
+	int ret = -1;
+	int i;
+
+	for (i = 0; i < count; i++) {
+		if (!unqueue_me(&q[i]))
+			ret = i;
+	}
+	return ret;
+}
+
 /*
  * PI futexes can not be requeued and must remove themself from the
  * hash bucket. The hash bucket lock (i.e. lock_ptr) is held on entry
@@ -2783,6 +2810,211 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 	return ret;
 }
 
+/**
+ * futex_wait_multiple_setup() - Prepare to wait and enqueue multiple futexes
+ * @qs:		The corresponding futex list
+ * @count:	The size of the lists
+ * @flags:	Futex flags (FLAGS_SHARED, etc.)
+ * @awaken:	Index of the last awoken futex
+ *
+ * Prepare multiple futexes in a single step and enqueue them. This may fail if
+ * the futex list is invalid or if any futex was already awoken. On success the
+ * task is ready to interruptible sleep.
+ *
+ * Return:
+ *  -  1 - One of the futexes was awaken by another thread
+ *  -  0 - Success
+ *  - <0 - -EFAULT, -EWOULDBLOCK or -EINVAL
+ */
+static int futex_wait_multiple_setup(struct futex_q *qs, int count,
+				     unsigned int flags, int *awaken)
+{
+	struct futex_hash_bucket *hb;
+	int ret, i;
+	u32 uval;
+
+	/*
+	 * Enqueuing multiple futexes is tricky, because we need to
+	 * enqueue each futex in the list before dealing with the next
+	 * one to avoid deadlocking on the hash bucket.  But, before
+	 * enqueuing, we need to make sure that current->state is
+	 * TASK_INTERRUPTIBLE, so we don't absorb any awake events, which
+	 * cannot be done before the get_futex_key of the next key,
+	 * because it calls get_user_pages, which can sleep.  Thus, we
+	 * fetch the list of futexes keys in two steps, by first pinning
+	 * all the memory keys in the futex key, and only then we read
+	 * each key and queue the corresponding futex.
+	 */
+retry:
+	for (i = 0; i < count; i++) {
+		qs[i].key = FUTEX_KEY_INIT;
+		ret = get_futex_key(qs[i].uaddr, flags & FLAGS_SHARED,
+				    &qs[i].key, FUTEX_READ);
+		if (unlikely(ret)) {
+			for (--i; i >= 0; i--)
+				put_futex_key(&qs[i].key);
+			return ret;
+		}
+	}
+
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	for (i = 0; i < count; i++) {
+		struct futex_q *q = &qs[i];
+
+		hb = queue_lock(q);
+
+		ret = get_futex_value_locked(&uval, q->uaddr);
+		if (ret) {
+			/*
+			 * We need to try to handle the fault, which
+			 * cannot be done without sleep, so we need to
+			 * undo all the work already done, to make sure
+			 * we don't miss any wake ups.  Therefore, clean
+			 * up, handle the fault and retry from the
+			 * beginning.
+			 */
+			queue_unlock(hb);
+
+			/*
+			 * Keys 0..(i-1) are implicitly put
+			 * on unqueue_multiple.
+			 */
+			put_futex_key(&q->key);
+
+			*awaken = unqueue_multiple(qs, i);
+
+			__set_current_state(TASK_RUNNING);
+
+			/*
+			 * On a real fault, prioritize the error even if
+			 * some other futex was awoken.  Userspace gave
+			 * us a bad address, -EFAULT them.
+			 */
+			ret = get_user(uval, q->uaddr);
+			if (ret)
+				return ret;
+
+			/*
+			 * Even if the page fault was handled, If
+			 * something was already awaken, we can safely
+			 * give up and succeed to give a hint for userspace to
+			 * acquire the right futex faster.
+			 */
+			if (*awaken >= 0)
+				return 1;
+
+			goto retry;
+		}
+
+		if (uval != q->uval) {
+			queue_unlock(hb);
+
+			put_futex_key(&qs[i].key);
+
+			/*
+			 * If something was already awaken, we can
+			 * safely ignore the error and succeed.
+			 */
+			*awaken = unqueue_multiple(qs, i);
+			__set_current_state(TASK_RUNNING);
+			if (*awaken >= 0)
+				return 1;
+
+			return -EWOULDBLOCK;
+		}
+
+		/*
+		 * The bucket lock can't be held while dealing with the
+		 * next futex. Queue each futex at this moment so hb can
+		 * be unlocked.
+		 */
+		queue_me(&qs[i], hb);
+	}
+	return 0;
+}
+
+/**
+ * futex_wait_multiple() - Prepare to wait on and enqueue several futexes
+ * @qs:		The list of futexes to wait on
+ * @op:		Operation code from futex's syscall
+ * @count:	The number of objects
+ * @abs_time:	Timeout before giving up and returning to userspace
+ *
+ * Entry point for the FUTEX_WAIT_MULTIPLE futex operation, this function
+ * sleeps on a group of futexes and returns on the first futex that
+ * triggered, or after the timeout has elapsed.
+ *
+ * Return:
+ *  - >=0 - Hint to the futex that was awoken
+ *  - <0  - On error
+ */
+static int futex_wait_multiple(struct futex_q *qs, int op,
+			       u32 count, ktime_t *abs_time)
+{
+	struct hrtimer_sleeper timeout, *to;
+	int ret, flags = 0, hint = 0;
+	unsigned int i;
+
+	if (!(op & FUTEX_PRIVATE_FLAG))
+		flags |= FLAGS_SHARED;
+
+	if (op & FUTEX_CLOCK_REALTIME)
+		flags |= FLAGS_CLOCKRT;
+
+	to = futex_setup_timer(abs_time, &timeout, flags, 0);
+	while (1) {
+		ret = futex_wait_multiple_setup(qs, count, flags, &hint);
+		if (ret) {
+			if (ret > 0) {
+				/* A futex was awaken during setup */
+				ret = hint;
+			}
+			break;
+		}
+
+		if (to)
+			hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
+
+		/*
+		 * Avoid sleeping if another thread already tried to
+		 * wake us.
+		 */
+		for (i = 0; i < count; i++) {
+			if (plist_node_empty(&qs[i].list))
+				break;
+		}
+
+		if (i == count && (!to || to->task))
+			freezable_schedule();
+
+		ret = unqueue_multiple(qs, count);
+
+		__set_current_state(TASK_RUNNING);
+
+		if (ret >= 0)
+			break;
+		if (to && !to->task) {
+			ret = -ETIMEDOUT;
+			break;
+		} else if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+		/*
+		 * The final case is a spurious wakeup, for
+		 * which just retry.
+		 */
+	}
+
+	if (to) {
+		hrtimer_cancel(&to->timer);
+		destroy_hrtimer_on_stack(&to->timer);
+	}
+
+	return ret;
+}
+
 static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 		      ktime_t *abs_time, u32 bitset)
 {
@@ -3907,6 +4139,43 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	return -ENOSYS;
 }
 
+/**
+ * futex_read_wait_block - Read an array of futex_wait_block from userspace
+ * @uaddr:	Userspace address of the block
+ * @count:	Number of blocks to be read
+ *
+ * This function creates and allocate an array of futex_q (we zero it to
+ * initialize the fields) and then, for each futex_wait_block element from
+ * userspace, fill a futex_q element with proper values.
+ */
+inline struct futex_q *futex_read_wait_block(u32 __user *uaddr, u32 count)
+{
+	unsigned int i;
+	struct futex_q *qs;
+	struct futex_wait_block fwb;
+	struct futex_wait_block __user *entry =
+		(struct futex_wait_block __user *)uaddr;
+
+	if (!count || count > FUTEX_MULTIPLE_MAX_COUNT)
+		return ERR_PTR(-EINVAL);
+
+	qs = kcalloc(count, sizeof(*qs), GFP_KERNEL);
+	if (!qs)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < count; i++) {
+		if (copy_from_user(&fwb, &entry[i], sizeof(fwb))) {
+			kfree(qs);
+			return ERR_PTR(-EFAULT);
+		}
+
+		qs[i].uaddr = fwb.uaddr;
+		qs[i].uval = fwb.val;
+		qs[i].bitset = fwb.bitset;
+	}
+
+	return qs;
+}
 
 SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 		struct __kernel_timespec __user *, utime, u32 __user *, uaddr2,
@@ -3919,7 +4188,8 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 
 	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
 		      cmd == FUTEX_WAIT_BITSET ||
-		      cmd == FUTEX_WAIT_REQUEUE_PI)) {
+		      cmd == FUTEX_WAIT_REQUEUE_PI ||
+		      cmd == FUTEX_WAIT_MULTIPLE)) {
 		if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG))))
 			return -EFAULT;
 		if (get_timespec64(&ts, utime))
@@ -3940,6 +4210,25 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 	    cmd == FUTEX_CMP_REQUEUE_PI || cmd == FUTEX_WAKE_OP)
 		val2 = (u32) (unsigned long) utime;
 
+	if (cmd == FUTEX_WAIT_MULTIPLE) {
+		int ret;
+		struct futex_q *qs;
+
+#ifdef CONFIG_X86_X32
+		if (unlikely(in_x32_syscall()))
+			return -ENOSYS;
+#endif
+		qs = futex_read_wait_block(uaddr, val);
+
+		if (IS_ERR(qs))
+			return PTR_ERR(qs);
+
+		ret = futex_wait_multiple(qs, op, val, tp);
+		kfree(qs);
+
+		return ret;
+	}
+
 	return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
 }
 
@@ -4102,6 +4391,57 @@ COMPAT_SYSCALL_DEFINE3(get_robust_list, int, pid,
 #endif /* CONFIG_COMPAT */
 
 #ifdef CONFIG_COMPAT_32BIT_TIME
+/**
+ * struct compat_futex_wait_block - Block of futexes to be waited for
+ * @uaddr:	User address of the futex (compatible pointer)
+ * @val:	Futex value expected by userspace
+ * @bitset:	Bitset for the optional bitmasked wakeup
+ */
+struct compat_futex_wait_block {
+	compat_uptr_t	uaddr;
+	__u32 val;
+	__u32 bitset;
+};
+
+/**
+ * compat_futex_read_wait_block - Read an array of futex_wait_block from
+ * userspace
+ * @uaddr:	Userspace address of the block
+ * @count:	Number of blocks to be read
+ *
+ * This function does the same as futex_read_wait_block(), except that it
+ * converts the pointer to the futex from the compat version to the regular one.
+ */
+inline struct futex_q *compat_futex_read_wait_block(u32 __user *uaddr,
+						    u32 count)
+{
+	unsigned int i;
+	struct futex_q *qs;
+	struct compat_futex_wait_block fwb;
+	struct compat_futex_wait_block __user *entry =
+		(struct compat_futex_wait_block __user *)uaddr;
+
+	if (!count || count > FUTEX_MULTIPLE_MAX_COUNT)
+		return ERR_PTR(-EINVAL);
+
+	qs = kcalloc(count, sizeof(*qs), GFP_KERNEL);
+	if (!qs)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < count; i++) {
+		if (copy_from_user(&fwb, &entry[i], sizeof(fwb))) {
+			kfree(qs);
+			return ERR_PTR(-EFAULT);
+		}
+
+		qs[i].uaddr = compat_ptr(fwb.uaddr);
+		qs[i].uval = fwb.val;
+		qs[i].bitset = fwb.bitset;
+	}
+
+	return qs;
+}
+
 SYSCALL_DEFINE6(futex_time32, u32 __user *, uaddr, int, op, u32, val,
 		struct old_timespec32 __user *, utime, u32 __user *, uaddr2,
 		u32, val3)
@@ -4113,7 +4453,8 @@ SYSCALL_DEFINE6(futex_time32, u32 __user *, uaddr, int, op, u32, val,
 
 	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
 		      cmd == FUTEX_WAIT_BITSET ||
-		      cmd == FUTEX_WAIT_REQUEUE_PI)) {
+		      cmd == FUTEX_WAIT_REQUEUE_PI ||
+		      cmd == FUTEX_WAIT_MULTIPLE)) {
 		if (get_old_timespec32(&ts, utime))
 			return -EFAULT;
 		if (!timespec64_valid(&ts))
@@ -4128,6 +4469,19 @@ SYSCALL_DEFINE6(futex_time32, u32 __user *, uaddr, int, op, u32, val,
 	    cmd == FUTEX_CMP_REQUEUE_PI || cmd == FUTEX_WAKE_OP)
 		val2 = (int) (unsigned long) utime;
 
+	if (cmd == FUTEX_WAIT_MULTIPLE) {
+		int ret;
+		struct futex_q *qs = compat_futex_read_wait_block(uaddr, val);
+
+		if (IS_ERR(qs))
+			return PTR_ERR(qs);
+
+		ret = futex_wait_multiple(qs, op, val, tp);
+		kfree(qs);
+
+		return ret;
+	}
+
 	return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
 }
 #endif /* CONFIG_COMPAT_32BIT_TIME */
-- 
2.25.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 2/4] selftests: futex: Add FUTEX_WAIT_MULTIPLE timeout test
  2020-02-13 21:45 [PATCH v3 0/4] Implement FUTEX_WAIT_MULTIPLE operation André Almeida
  2020-02-13 21:45 ` [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes André Almeida
@ 2020-02-13 21:45 ` André Almeida
  2020-02-13 21:45 ` [PATCH v3 3/4] selftests: futex: Add FUTEX_WAIT_MULTIPLE wouldblock test André Almeida
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: André Almeida @ 2020-02-13 21:45 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: kernel, krisman, shuah, linux-kselftest, rostedt, ryao, peterz,
	dvhart, mingo, z.figura12, steven, pgriffais, steven,
	André Almeida

From: Gabriel Krisman Bertazi <krisman@collabora.com>

Add test for timeout when waiting for multiple futexes. Skip the test if
it's a x32 application and the kernel returned the approtiaded error,
since this ABI is not supported for this operation.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Co-developed-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
Changes since v2: none
---
 .../futex/functional/futex_wait_timeout.c     | 38 ++++++++++++++++++-
 .../selftests/futex/include/futextest.h       | 22 +++++++++++
 2 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/futex/functional/futex_wait_timeout.c b/tools/testing/selftests/futex/functional/futex_wait_timeout.c
index ee55e6d389a3..2a63e1c2cfb6 100644
--- a/tools/testing/selftests/futex/functional/futex_wait_timeout.c
+++ b/tools/testing/selftests/futex/functional/futex_wait_timeout.c
@@ -11,6 +11,7 @@
  *
  * HISTORY
  *      2009-Nov-6: Initial version by Darren Hart <dvhart@linux.intel.com>
+ *      2019-Dec-13: Add WAIT_MULTIPLE test by Krisman <krisman@collabora.com>
  *
  *****************************************************************************/
 
@@ -41,6 +42,8 @@ int main(int argc, char *argv[])
 {
 	futex_t f1 = FUTEX_INITIALIZER;
 	struct timespec to;
+	time_t secs;
+	struct futex_wait_block fwb = {&f1, f1, 0};
 	int res, ret = RET_PASS;
 	int c;
 
@@ -65,7 +68,7 @@ int main(int argc, char *argv[])
 	}
 
 	ksft_print_header();
-	ksft_set_plan(1);
+	ksft_set_plan(2);
 	ksft_print_msg("%s: Block on a futex and wait for timeout\n",
 	       basename(argv[0]));
 	ksft_print_msg("\tArguments: timeout=%ldns\n", timeout_ns);
@@ -79,8 +82,39 @@ int main(int argc, char *argv[])
 	if (!res || errno != ETIMEDOUT) {
 		fail("futex_wait returned %d\n", ret < 0 ? errno : ret);
 		ret = RET_FAIL;
+	} else
+		ksft_test_result_pass("futex_wait timeout succeeds\n");
+
+	info("Calling futex_wait_multiple on f1: %u @ %p\n", f1, &f1);
+
+	/* Setup absolute time */
+	ret = clock_gettime(CLOCK_REALTIME, &to);
+	secs = (to.tv_nsec + timeout_ns) / 1000000000;
+	to.tv_nsec = ((int64_t)to.tv_nsec + timeout_ns) % 1000000000;
+	to.tv_sec += secs;
+	info("to.tv_sec  = %ld\n", to.tv_sec);
+	info("to.tv_nsec = %ld\n", to.tv_nsec);
+
+	res = futex_wait_multiple(&fwb, 1, &to,
+				  FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME);
+
+#ifdef __ILP32__
+	if (res == -1 && errno == ENOSYS) {
+		ksft_test_result_skip("futex_wait_multiple not supported at x32\n");
+	} else {
+		ksft_test_result_fail("futex_wait_multiple returned %d\n",
+				      res < 0 ? errno : res);
+		ret = RET_FAIL;
 	}
+#else
+	if (!res || errno != ETIMEDOUT) {
+		ksft_test_result_fail("futex_wait_multiple returned %d\n",
+				      res < 0 ? errno : res);
+		ret = RET_FAIL;
+	} else
+		ksft_test_result_pass("futex_wait_multiple timeout succeeds\n");
+#endif /* __ILP32__ */
 
-	print_result(TEST_NAME, ret);
+	ksft_print_cnts();
 	return ret;
 }
diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h
index ddbcfc9b7bac..bb103bef4557 100644
--- a/tools/testing/selftests/futex/include/futextest.h
+++ b/tools/testing/selftests/futex/include/futextest.h
@@ -38,6 +38,14 @@ typedef volatile u_int32_t futex_t;
 #ifndef FUTEX_CMP_REQUEUE_PI
 #define FUTEX_CMP_REQUEUE_PI		12
 #endif
+#ifndef FUTEX_WAIT_MULTIPLE
+#define FUTEX_WAIT_MULTIPLE		13
+struct futex_wait_block {
+	futex_t *uaddr;
+	futex_t val;
+	__u32 bitset;
+};
+#endif
 #ifndef FUTEX_WAIT_REQUEUE_PI_PRIVATE
 #define FUTEX_WAIT_REQUEUE_PI_PRIVATE	(FUTEX_WAIT_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
@@ -80,6 +88,20 @@ futex_wait(futex_t *uaddr, futex_t val, struct timespec *timeout, int opflags)
 	return futex(uaddr, FUTEX_WAIT, val, timeout, NULL, 0, opflags);
 }
 
+/**
+ * futex_wait_multiple() - block on several futexes with optional timeout
+ * @fwb:	wait block user space address
+ * @count:	number of entities at fwb
+ * @timeout:	absolute timeout
+ */
+static inline int
+futex_wait_multiple(struct futex_wait_block *fwb, int count,
+		    struct timespec *timeout, int opflags)
+{
+	return futex(fwb, FUTEX_WAIT_MULTIPLE, count, timeout, NULL, 0,
+		     opflags);
+}
+
 /**
  * futex_wake() - wake one or more tasks blocked on uaddr
  * @nr_wake:	wake up to this many tasks
-- 
2.25.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 3/4] selftests: futex: Add FUTEX_WAIT_MULTIPLE wouldblock test
  2020-02-13 21:45 [PATCH v3 0/4] Implement FUTEX_WAIT_MULTIPLE operation André Almeida
  2020-02-13 21:45 ` [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes André Almeida
  2020-02-13 21:45 ` [PATCH v3 2/4] selftests: futex: Add FUTEX_WAIT_MULTIPLE timeout test André Almeida
@ 2020-02-13 21:45 ` André Almeida
  2020-02-13 21:45 ` [PATCH v3 4/4] selftests: futex: Add FUTEX_WAIT_MULTIPLE wake up test André Almeida
  2020-02-19 16:27 ` [PATCH v3 0/4] Implement FUTEX_WAIT_MULTIPLE operation shuah
  4 siblings, 0 replies; 21+ messages in thread
From: André Almeida @ 2020-02-13 21:45 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: kernel, krisman, shuah, linux-kselftest, rostedt, ryao, peterz,
	dvhart, mingo, z.figura12, steven, pgriffais, steven,
	André Almeida

From: Gabriel Krisman Bertazi <krisman@collabora.com>

Add test for wouldblock return when waiting for multiple futexes. Skip
the test if it's a x32 application and the kernel returned the approtiaded
error, since this ABI is not supported for this operation.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Co-developed-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
Changes since v2: none
---
 .../futex/functional/futex_wait_wouldblock.c  | 28 +++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c b/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c
index 0ae390ff8164..bcbac042992d 100644
--- a/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c
+++ b/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c
@@ -12,6 +12,7 @@
  *
  * HISTORY
  *      2009-Nov-14: Initial version by Gowrishankar <gowrishankar.m@in.ibm.com>
+ *      2019-Dec-13: Add WAIT_MULTIPLE test by Krisman <krisman@collabora.com>
  *
  *****************************************************************************/
 
@@ -40,6 +41,7 @@ int main(int argc, char *argv[])
 {
 	struct timespec to = {.tv_sec = 0, .tv_nsec = timeout_ns};
 	futex_t f1 = FUTEX_INITIALIZER;
+	struct futex_wait_block fwb = {&f1, f1+1, 0};
 	int res, ret = RET_PASS;
 	int c;
 
@@ -61,7 +63,7 @@ int main(int argc, char *argv[])
 	}
 
 	ksft_print_header();
-	ksft_set_plan(1);
+	ksft_set_plan(2);
 	ksft_print_msg("%s: Test the unexpected futex value in FUTEX_WAIT\n",
 	       basename(argv[0]));
 
@@ -71,8 +73,30 @@ int main(int argc, char *argv[])
 		fail("futex_wait returned: %d %s\n",
 		     res ? errno : res, res ? strerror(errno) : "");
 		ret = RET_FAIL;
+	} else
+		ksft_test_result_pass("futex_wait wouldblock succeeds\n");
+
+	info("Calling futex_wait_multiple on f1: %u @ %p with val=%u\n",
+	     f1, &f1, f1+1);
+	res = futex_wait_multiple(&fwb, 1, NULL, FUTEX_PRIVATE_FLAG);
+
+#ifdef __ILP32__
+	if (res != -1 || errno != ENOSYS) {
+		ksft_test_result_fail("futex_wait_multiple returned %d\n",
+				      res < 0 ? errno : res);
+		ret = RET_FAIL;
+	} else {
+		ksft_test_result_skip("futex_wait_multiple not supported at x32\n");
+	}
+#else
+	if (!res || errno != EWOULDBLOCK) {
+		ksft_test_result_fail("futex_wait_multiple returned %d\n",
+				      res < 0 ? errno : res);
+		ret = RET_FAIL;
 	}
+	ksft_test_result_pass("futex_wait_multiple wouldblock succeeds\n");
+#endif /* __ILP32__ */
 
-	print_result(TEST_NAME, ret);
+	ksft_print_cnts();
 	return ret;
 }
-- 
2.25.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 4/4] selftests: futex: Add FUTEX_WAIT_MULTIPLE wake up test
  2020-02-13 21:45 [PATCH v3 0/4] Implement FUTEX_WAIT_MULTIPLE operation André Almeida
                   ` (2 preceding siblings ...)
  2020-02-13 21:45 ` [PATCH v3 3/4] selftests: futex: Add FUTEX_WAIT_MULTIPLE wouldblock test André Almeida
@ 2020-02-13 21:45 ` André Almeida
  2020-02-19 16:27 ` [PATCH v3 0/4] Implement FUTEX_WAIT_MULTIPLE operation shuah
  4 siblings, 0 replies; 21+ messages in thread
From: André Almeida @ 2020-02-13 21:45 UTC (permalink / raw)
  To: linux-kernel, tglx
  Cc: kernel, krisman, shuah, linux-kselftest, rostedt, ryao, peterz,
	dvhart, mingo, z.figura12, steven, pgriffais, steven,
	André Almeida

From: Gabriel Krisman Bertazi <krisman@collabora.com>

Add test for wait at multiple futexes mechanism. Skip the test if it's a
x32 application and the kernel returned the approtiaded error, since this
ABI is not supported for this operation.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Co-developed-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
Changes since v2: none
---
 .../selftests/futex/functional/.gitignore     |   1 +
 .../selftests/futex/functional/Makefile       |   3 +-
 .../futex/functional/futex_wait_multiple.c    | 173 ++++++++++++++++++
 .../testing/selftests/futex/functional/run.sh |   3 +
 4 files changed, 179 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/futex/functional/futex_wait_multiple.c

diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
index a09f57061902..4660128a545e 100644
--- a/tools/testing/selftests/futex/functional/.gitignore
+++ b/tools/testing/selftests/futex/functional/.gitignore
@@ -5,3 +5,4 @@ futex_wait_private_mapped_file
 futex_wait_timeout
 futex_wait_uninitialized_heap
 futex_wait_wouldblock
+futex_wait_multiple
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
index 30996306cabc..75f9fface11f 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -14,7 +14,8 @@ TEST_GEN_FILES := \
 	futex_requeue_pi_signal_restart \
 	futex_requeue_pi_mismatched_ops \
 	futex_wait_uninitialized_heap \
-	futex_wait_private_mapped_file
+	futex_wait_private_mapped_file \
+	futex_wait_multiple
 
 TEST_PROGS := run.sh
 
diff --git a/tools/testing/selftests/futex/functional/futex_wait_multiple.c b/tools/testing/selftests/futex/functional/futex_wait_multiple.c
new file mode 100644
index 000000000000..b48422e79f42
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/futex_wait_multiple.c
@@ -0,0 +1,173 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/******************************************************************************
+ *
+ *   Copyright © Collabora, Ltd., 2019
+ *
+ * DESCRIPTION
+ *      Test basic semantics of FUTEX_WAIT_MULTIPLE
+ *
+ * AUTHOR
+ *      Gabriel Krisman Bertazi <krisman@collabora.com>
+ *
+ * HISTORY
+ *      2019-Dec-13: Initial version by Krisman <krisman@collabora.com>
+ *
+ *****************************************************************************/
+
+#include <errno.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <pthread.h>
+#include "futextest.h"
+#include "logging.h"
+
+#define TEST_NAME "futex-wait-multiple"
+#define timeout_ns 100000
+#define MAX_COUNT 128
+#define WAKE_WAIT_US 3000000
+
+int ret = RET_PASS;
+char *progname;
+futex_t f[MAX_COUNT] = {0};
+struct futex_wait_block fwb[MAX_COUNT];
+
+void usage(char *prog)
+{
+	printf("Usage: %s\n", prog);
+	printf("  -c	Use color\n");
+	printf("  -h	Display this help message\n");
+	printf("  -v L	Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+	       VQUIET, VCRITICAL, VINFO);
+}
+
+void test_count_overflow(void)
+{
+	futex_t f = FUTEX_INITIALIZER;
+	struct futex_wait_block fwb[MAX_COUNT+1];
+	int res, i;
+
+	ksft_print_msg("%s: Test a too big number of futexes\n", progname);
+
+	for (i = 0; i < MAX_COUNT+1; i++) {
+		fwb[i].uaddr = &f;
+		fwb[i].val = f;
+		fwb[i].bitset = 0;
+	}
+
+	res = futex_wait_multiple(fwb, MAX_COUNT+1, NULL, FUTEX_PRIVATE_FLAG);
+
+#ifdef __ILP32__
+	if (res != -1 || errno != ENOSYS) {
+		ksft_test_result_fail("futex_wait_multiple returned %d\n",
+				      res < 0 ? errno : res);
+		ret = RET_FAIL;
+	} else {
+		ksft_test_result_skip("futex_wait_multiple not supported at x32\n");
+	}
+#else
+	if (res != -1 || errno != EINVAL) {
+		ksft_test_result_fail("futex_wait_multiple returned %d\n",
+				      res < 0 ? errno : res);
+		ret = RET_FAIL;
+	} else {
+		ksft_test_result_pass("futex_wait_multiple count overflow succeed\n");
+	}
+
+#endif /* __ILP32__ */
+}
+
+void *waiterfn(void *arg)
+{
+	int res;
+
+	res = futex_wait_multiple(fwb, MAX_COUNT, NULL, FUTEX_PRIVATE_FLAG);
+
+#ifdef __ILP32__
+	if (res != -1 || errno != ENOSYS) {
+		ksft_test_result_fail("futex_wait_multiple returned %d\n",
+				      res < 0 ? errno : res);
+		ret = RET_FAIL;
+	} else {
+		ksft_test_result_skip("futex_wait_multiple not supported at x32\n");
+	}
+#else
+	if (res < 0)
+		ksft_print_msg("waiter failed %d\n", res);
+
+	info("futex_wait_multiple: Got hint futex %d was freed\n", res);
+#endif /* __ILP32__ */
+
+	return NULL;
+}
+
+void test_fwb_wakeup(void)
+{
+	int res, i;
+	pthread_t waiter;
+
+	ksft_print_msg("%s: Test wake up in a list of futex\n", progname);
+
+	for (i = 0; i < MAX_COUNT; i++) {
+		fwb[i].uaddr = &f[i];
+		fwb[i].val = f[i];
+		fwb[i].bitset = 0xffffffff;
+	}
+
+	res = pthread_create(&waiter, NULL, waiterfn, NULL);
+	if (res) {
+		ksft_test_result_fail("Creating waiting thread failed");
+		ksft_exit_fail();
+	}
+
+	usleep(WAKE_WAIT_US);
+	res = futex_wake(&(f[MAX_COUNT-1]), 1, FUTEX_PRIVATE_FLAG);
+	if (res != 1) {
+		ksft_test_result_fail("Failed to wake thread res=%d\n", res);
+		ksft_exit_fail();
+	}
+
+	pthread_join(waiter, NULL);
+	ksft_test_result_pass("%s succeed\n", __func__);
+}
+
+int main(int argc, char *argv[])
+{
+	int c;
+
+	while ((c = getopt(argc, argv, "cht:v:")) != -1) {
+		switch (c) {
+		case 'c':
+			log_color(1);
+			break;
+		case 'h':
+			usage(basename(argv[0]));
+			exit(0);
+		case 'v':
+			log_verbosity(atoi(optarg));
+			break;
+		default:
+			usage(basename(argv[0]));
+			exit(1);
+		}
+	}
+
+	progname = basename(argv[0]);
+
+	ksft_print_header();
+	ksft_set_plan(2);
+
+	test_count_overflow();
+
+#ifdef __ILP32__
+	// if it's a 32x binary, there's no futex to wakeup
+	ksft_test_result_skip("futex_wait_multiple not supported at x32\n");
+#else
+	test_fwb_wakeup();
+#endif /* __ILP32__ */
+
+	ksft_print_cnts();
+	return ret;
+}
diff --git a/tools/testing/selftests/futex/functional/run.sh b/tools/testing/selftests/futex/functional/run.sh
index 1acb6ace1680..a8be94f28ff7 100755
--- a/tools/testing/selftests/futex/functional/run.sh
+++ b/tools/testing/selftests/futex/functional/run.sh
@@ -73,3 +73,6 @@ echo
 echo
 ./futex_wait_uninitialized_heap $COLOR
 ./futex_wait_private_mapped_file $COLOR
+
+echo
+./futex_wait_multiple $COLOR
-- 
2.25.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 0/4] Implement FUTEX_WAIT_MULTIPLE operation
  2020-02-13 21:45 [PATCH v3 0/4] Implement FUTEX_WAIT_MULTIPLE operation André Almeida
                   ` (3 preceding siblings ...)
  2020-02-13 21:45 ` [PATCH v3 4/4] selftests: futex: Add FUTEX_WAIT_MULTIPLE wake up test André Almeida
@ 2020-02-19 16:27 ` shuah
  4 siblings, 0 replies; 21+ messages in thread
From: shuah @ 2020-02-19 16:27 UTC (permalink / raw)
  To: André Almeida, linux-kernel, tglx
  Cc: kernel, krisman, linux-kselftest, rostedt, ryao, peterz, dvhart,
	mingo, z.figura12, steven, pgriffais, steven, shuah

On 2/13/20 2:45 PM, André Almeida wrote:
> Hello,
> 
> This patchset implements a new futex operation, called FUTEX_WAIT_MULTIPLE,
> which allows a thread to wait on several futexes at the same time, and be
> awoken by any of them.
> 
> The use case lies in the Wine implementation of the Windows NT interface
> WaitMultipleObjects. This Windows API function allows a thread to sleep
> waiting on the first of a set of event sources (mutexes, timers, signal,
> console input, etc) to signal.  Considering this is a primitive
> synchronization operation for Windows applications, being able to quickly
> signal events on the producer side, and quickly go to sleep on the
> consumer side is essential for good performance of those running over Wine.
> 
> Since this API exposes a mechanism to wait on multiple objects, and
> we might have multiple waiters for each of these events, a M->N
> relationship, the current Linux interfaces fell short on performance
> evaluation of large M,N scenarios.  We experimented, for instance, with
> eventfd, which has performance problems discussed below, but we also
> experimented with userspace solutions, like making each consumer wait on
> a condition variable guarding the entire list of objects, and then
> waking up multiple variables on the producer side, but this is
> prohibitively expensive since we either need to signal many condition
> variables or share that condition variable among multiple waiters, and
> then verify for the event being signaled in userspace, which means
> dealing with often false positive wakes ups.
> 
> The natural interface to implement the behavior we want, also
> considering that one of the waitable objects is a mutex itself, would be
> the futex interface.  Therefore, this patchset proposes a mechanism for
> a thread to wait on multiple futexes at once, and wake up on the first
> futex that was awaken.
> 
> In particular, using futexes in our Wine use case reduced the CPU
> utilization by 4% for the game Beat Saber and by 1.5% for the game
> Shadow of Tomb Raider, both running over Proton (a Wine based solution
> for Windows emulation), when compared to the eventfd interface. This
> implementation also doesn't rely of file descriptors, so it doesn't risk
> overflowing the resource.
> 
> In time, we are also proposing modifications to glibc and libpthread to
> make this feature available for Linux native multithreaded applications
> using libpthread, which can benefit from the behavior of waiting on any
> of a group of futexes.
> 
> Technically, the existing FUTEX_WAIT implementation can be easily
> reworked by using futex_wait_multiple() with a count of one, and I
> have a patch showing how it works.  I'm not proposing it, since
> futex is such a tricky code, that I'd be more comfortable to have
> FUTEX_WAIT_MULTIPLE running upstream for a couple development cycles,
> before considering modifying FUTEX_WAIT.
> 
> The patch series includes an extensive set of kselftests validating
> the behavior of the interface.  We also implemented support[1] on
> Syzkaller and survived the fuzzy testing.
> 
> Finally, if you'd rather pull directly a branch with this set you can
> find it here:
> 
> https://gitlab.collabora.com/tonyk/linux/commits/futex-dev-v3
> 
> The RFC for this patch can be found here:
> 
> https://lkml.org/lkml/2019/7/30/1399
> 
> === Performance of eventfd ===
> 
> Polling on several eventfd contexts with semaphore semantics would
> provide us with the exact semantics we are looking for.  However, as
> shown below, in a scenario with sufficient producers and consumers, the
> eventfd interface itself becomes a bottleneck, in particular because
> each thread will compete to acquire a sequence of waitqueue locks for
> each eventfd context in the poll list. In addition, in the uncontended
> case, where the producer is ready for consumption, eventfd still
> requires going into the kernel on the consumer side.
> 
> When a write or a read operation in an eventfd file succeeds, it will try
> to wake up all threads that are waiting to perform some operation to
> the file. The lock (ctx->wqh.lock) that hold the access to the file value
> (ctx->count) is the same lock used to control access the waitqueue. When
> all those those thread woke, they will compete to get this lock. Along
> with that, the poll() also manipulates the waitqueue and need to hold
> this same lock. This lock is specially hard to acquire when poll() calls
> poll_freewait(), where it tries to free all waitqueues associated with
> this poll. While doing that, it will compete with a lot of read and
> write operations that have been waken.
> 
> In our use case, with a huge number of parallel reads, writes and polls,
> this lock is a bottleneck and hurts the performance of applications. Our
> implementation of futex, however, decrease the calls of spin lock by more
> than 80% in some user applications.
> 
> Finally, eventfd operates on file descriptors, which is a limited
> resource that has shown its limitation in our use cases.  Despite the
> Windows interface not waiting on more than 64 objects at once, we still
> have multiple waiters at the same time, and we were easily able to
> exhaust the FD limits on applications like games.
> 
> Thanks,
>      André
> 
> [1] https://github.com/andrealmeid/syzkaller/tree/futex-wait-multiple
> 
> Gabriel Krisman Bertazi (4):
>    futex: Implement mechanism to wait on any of several futexes
>    selftests: futex: Add FUTEX_WAIT_MULTIPLE timeout test
>    selftests: futex: Add FUTEX_WAIT_MULTIPLE wouldblock test
>    selftests: futex: Add FUTEX_WAIT_MULTIPLE wake up test
> 
>   include/uapi/linux/futex.h                    |  20 +
>   kernel/futex.c                                | 358 +++++++++++++++++-
>   .../selftests/futex/functional/.gitignore     |   1 +
>   .../selftests/futex/functional/Makefile       |   3 +-
>   .../futex/functional/futex_wait_multiple.c    | 173 +++++++++
>   .../futex/functional/futex_wait_timeout.c     |  38 +-
>   .../futex/functional/futex_wait_wouldblock.c  |  28 +-
>   .../testing/selftests/futex/functional/run.sh |   3 +
>   .../selftests/futex/include/futextest.h       |  22 ++
>   9 files changed, 639 insertions(+), 7 deletions(-)
>   create mode 100644 tools/testing/selftests/futex/functional/futex_wait_multiple.c
> 

For selftests:

Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes
  2020-02-13 21:45 ` [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes André Almeida
@ 2020-02-28 19:07   ` Peter Zijlstra
  2020-02-28 19:49     ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2020-02-28 19:07 UTC (permalink / raw)
  To: André Almeida
  Cc: linux-kernel, tglx, kernel, krisman, shuah, linux-kselftest,
	rostedt, ryao, dvhart, mingo, z.figura12, steven, pgriffais,
	steven

On Thu, Feb 13, 2020 at 06:45:22PM -0300, André Almeida wrote:
> @@ -150,4 +153,21 @@ struct robust_list_head {
>    (((op & 0xf) << 28) | ((cmp & 0xf) << 24)		\
>     | ((oparg & 0xfff) << 12) | (cmparg & 0xfff))
>  
> +/*
> + * Maximum number of multiple futexes to wait for
> + */
> +#define FUTEX_MULTIPLE_MAX_COUNT	128
> +
> +/**
> + * struct futex_wait_block - Block of futexes to be waited for
> + * @uaddr:	User address of the futex
> + * @val:	Futex value expected by userspace
> + * @bitset:	Bitset for the optional bitmasked wakeup
> + */
> +struct futex_wait_block {
> +	__u32 __user *uaddr;
> +	__u32 val;
> +	__u32 bitset;
> +};

So I have a problem with this vector layout, it doesn't allow for
per-futex flags, and esp. with that multi-size futex support that
becomes important, but also with the already extand private/shared and
wait_bitset flags this means you cannot have a vector with mixed wait
types.

>  #endif /* _UAPI_LINUX_FUTEX_H */
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 0cf84c8664f2..58cf9eb2b851 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -215,6 +215,8 @@ struct futex_pi_state {
>   * @rt_waiter:		rt_waiter storage for use with requeue_pi
>   * @requeue_pi_key:	the requeue_pi target futex key
>   * @bitset:		bitset for the optional bitmasked wakeup
> + * @uaddr:             userspace address of futex
> + * @uval:              expected futex's value
>   *
>   * We use this hashed waitqueue, instead of a normal wait_queue_entry_t, so
>   * we can wake only the relevant ones (hashed queues may be shared).
> @@ -237,6 +239,8 @@ struct futex_q {
>  	struct rt_mutex_waiter *rt_waiter;
>  	union futex_key *requeue_pi_key;
>  	u32 bitset;
> +	u32 __user *uaddr;
> +	u32 uval;
>  } __randomize_layout;

That creates a hole for no reason.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes
  2020-02-28 19:07   ` Peter Zijlstra
@ 2020-02-28 19:49     ` Peter Zijlstra
  2020-02-28 21:25       ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2020-02-28 19:49 UTC (permalink / raw)
  To: André Almeida
  Cc: linux-kernel, tglx, kernel, krisman, shuah, linux-kselftest,
	rostedt, ryao, dvhart, mingo, z.figura12, steven, pgriffais,
	steven, malteskarupke

On Fri, Feb 28, 2020 at 08:07:17PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 13, 2020 at 06:45:22PM -0300, André Almeida wrote:
> > @@ -150,4 +153,21 @@ struct robust_list_head {
> >    (((op & 0xf) << 28) | ((cmp & 0xf) << 24)		\
> >     | ((oparg & 0xfff) << 12) | (cmparg & 0xfff))
> >  
> > +/*
> > + * Maximum number of multiple futexes to wait for
> > + */
> > +#define FUTEX_MULTIPLE_MAX_COUNT	128
> > +
> > +/**
> > + * struct futex_wait_block - Block of futexes to be waited for
> > + * @uaddr:	User address of the futex
> > + * @val:	Futex value expected by userspace
> > + * @bitset:	Bitset for the optional bitmasked wakeup
> > + */
> > +struct futex_wait_block {
> > +	__u32 __user *uaddr;
> > +	__u32 val;
> > +	__u32 bitset;
> > +};
> 
> So I have a problem with this vector layout, it doesn't allow for
> per-futex flags, and esp. with that multi-size futex support that
> becomes important, but also with the already extand private/shared and
> wait_bitset flags this means you cannot have a vector with mixed wait
> types.

Alternatively, we throw the entire single-syscall futex interface under
the bus and design a bunch of new syscalls that are natively vectored or
something.

Thomas mentioned something like that, the problem is, ofcourse, that we
then want to fix a whole bunch of historical ills, and the probmem
becomes much bigger.

Thomas?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes
  2020-02-28 19:49     ` Peter Zijlstra
@ 2020-02-28 21:25       ` Thomas Gleixner
  2020-02-29  0:29         ` Pierre-Loup A. Griffais
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2020-02-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, André Almeida
  Cc: linux-kernel, kernel, krisman, shuah, linux-kselftest, rostedt,
	ryao, dvhart, mingo, z.figura12, steven, pgriffais, steven,
	malteskarupke

Peter Zijlstra <peterz@infradead.org> writes:
> On Fri, Feb 28, 2020 at 08:07:17PM +0100, Peter Zijlstra wrote:
>> So I have a problem with this vector layout, it doesn't allow for
>> per-futex flags, and esp. with that multi-size futex support that
>> becomes important, but also with the already extand private/shared and
>> wait_bitset flags this means you cannot have a vector with mixed wait
>> types.
>
> Alternatively, we throw the entire single-syscall futex interface under
> the bus and design a bunch of new syscalls that are natively vectored or
> something.
>
> Thomas mentioned something like that, the problem is, ofcourse, that we
> then want to fix a whole bunch of historical ills, and the probmem
> becomes much bigger.

We keep piling features on top of an interface and mechanism which is
fragile as hell and horrible to maintain. Adding vectoring, multi size
and whatever is not making it any better.

There is also the long standing issue with NUMA, which we can't address
with the current pile at all.

So I'm really advocating that all involved parties sit down ASAP and
hash out a new and less convoluted mechanism where all the magic new
features can be addressed in a sane way so that the 'F' in Futex really
only means Fast and not some other word starting with 'F'.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes
  2020-02-28 21:25       ` Thomas Gleixner
@ 2020-02-29  0:29         ` Pierre-Loup A. Griffais
  2020-02-29 10:27           ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Pierre-Loup A. Griffais @ 2020-02-29  0:29 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, André Almeida
  Cc: linux-kernel, kernel, krisman, shuah, linux-kselftest, rostedt,
	ryao, dvhart, mingo, z.figura12, steven, steven, malteskarupke



On 2/28/20 1:25 PM, Thomas Gleixner wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
>> On Fri, Feb 28, 2020 at 08:07:17PM +0100, Peter Zijlstra wrote:
>>> So I have a problem with this vector layout, it doesn't allow for
>>> per-futex flags, and esp. with that multi-size futex support that
>>> becomes important, but also with the already extand private/shared and
>>> wait_bitset flags this means you cannot have a vector with mixed wait
>>> types.
>>
>> Alternatively, we throw the entire single-syscall futex interface under
>> the bus and design a bunch of new syscalls that are natively vectored or
>> something.
>>
>> Thomas mentioned something like that, the problem is, ofcourse, that we
>> then want to fix a whole bunch of historical ills, and the probmem
>> becomes much bigger.
> 
> We keep piling features on top of an interface and mechanism which is
> fragile as hell and horrible to maintain. Adding vectoring, multi size
> and whatever is not making it any better.
> 
> There is also the long standing issue with NUMA, which we can't address
> with the current pile at all.
> 
> So I'm really advocating that all involved parties sit down ASAP and
> hash out a new and less convoluted mechanism where all the magic new
> features can be addressed in a sane way so that the 'F' in Futex really
> only means Fast and not some other word starting with 'F'.

Are you specifically talking about the interface, or the mechanism 
itself? Would you be OK with a new syscall that calls into the same code 
as this patch? It does seem like that's what we want, so if we rewrote a 
mechanism I'm not convinced it would come out any different. But, the 
interface itself seems fair-game to rewrite, as the current futex 
syscall is turning into an ioctl of sorts.

This solves a real problem with a real usecase; so I'd like to stay 
practical and not go into deeper issues like solving NUMA support for 
all of futex in the interest of users waiting at the other end. Can you 
point us to your preferred approach just for the scope of what we're 
trying to accomplish?

> 
> Thanks,
> 
>          tglx
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes
  2020-02-29  0:29         ` Pierre-Loup A. Griffais
@ 2020-02-29 10:27           ` Thomas Gleixner
  2020-03-03  2:47             ` Pierre-Loup A. Griffais
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2020-02-29 10:27 UTC (permalink / raw)
  To: Pierre-Loup A. Griffais, Peter Zijlstra, André Almeida
  Cc: linux-kernel, kernel, krisman, shuah, linux-kselftest, rostedt,
	ryao, dvhart, mingo, z.figura12, steven, steven, malteskarupke

"Pierre-Loup A. Griffais" <pgriffais@valvesoftware.com> writes:
> On 2/28/20 1:25 PM, Thomas Gleixner wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>>> Thomas mentioned something like that, the problem is, ofcourse, that we
>>> then want to fix a whole bunch of historical ills, and the probmem
>>> becomes much bigger.
>> 
>> We keep piling features on top of an interface and mechanism which is
>> fragile as hell and horrible to maintain. Adding vectoring, multi size
>> and whatever is not making it any better.
>> 
>> There is also the long standing issue with NUMA, which we can't address
>> with the current pile at all.
>> 
>> So I'm really advocating that all involved parties sit down ASAP and
>> hash out a new and less convoluted mechanism where all the magic new
>> features can be addressed in a sane way so that the 'F' in Futex really
>> only means Fast and not some other word starting with 'F'.
>
> Are you specifically talking about the interface, or the mechanism 
> itself? Would you be OK with a new syscall that calls into the same code 
> as this patch? It does seem like that's what we want, so if we rewrote a 
> mechanism I'm not convinced it would come out any different. But, the 
> interface itself seems fair-game to rewrite, as the current futex 
> syscall is turning into an ioctl of sorts.

No, you are misreading what I said. How does a new syscall make any
difference? It still adds new crap to a maze which is already in a state
of dubious maintainability. 

> This solves a real problem with a real usecase; so I'd like to stay 
> practical and not go into deeper issues like solving NUMA support for 
> all of futex in the interest of users waiting at the other end. Can you 
> point us to your preferred approach just for the scope of what we're 
> trying to accomplish?

If we go by the argument that something solves a real use case and take
this as justification to proliferate existing crap, then we never get to
the point where things get redesigned from ground up. Quite the
contrary, we are going to duct tape it to death.

It does not matter at all whether the syscall is multiplexing or split
up into 5 different ones. That's a pure cosmetic exercise.

While all the currently proposed extensions (multiple wait, variable
size) make sense conceptually, I'm really uncomfortable to just cram
them into the existing code. They create an ABI which we have to
maintain forever.

From experience I just know that every time we extended the futex
interface we opened another can of worms which hunted us for years if
not for more then a decade. Guess who has to deal with that. Surely not
the people who drive by and solve their real world usecases. Just go and
read the changelog history of futexes very carefully and you might
understand what kind of complex beasts they are.

At some point we simply have to say stop, sit down and figure out which
kind of functionality we really need in order to solve real world user
space problems and which of the gazillion futex (mis)features are just
there as historical ballast and do not have to be supported in a new
implementation, REQUEUE is just the most obvious example.

I completely understand that you want to stay practical and just want to
solve your particular itch, but please understand that the people who
have to deal with the fallout and have dealt with it for 15+ years have
very practical reasons to say no.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes
  2020-02-29 10:27           ` Thomas Gleixner
@ 2020-03-03  2:47             ` Pierre-Loup A. Griffais
  2020-03-03 12:00               ` 'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes] Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Pierre-Loup A. Griffais @ 2020-03-03  2:47 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, André Almeida
  Cc: linux-kernel, kernel, krisman, shuah, linux-kselftest, rostedt,
	ryao, dvhart, mingo, z.figura12, steven, steven, malteskarupke



On 2/29/20 2:27 AM, Thomas Gleixner wrote:
> "Pierre-Loup A. Griffais" <pgriffais@valvesoftware.com> writes:
>> On 2/28/20 1:25 PM, Thomas Gleixner wrote:
>>> Peter Zijlstra <peterz@infradead.org> writes:
>>>> Thomas mentioned something like that, the problem is, ofcourse, that we
>>>> then want to fix a whole bunch of historical ills, and the probmem
>>>> becomes much bigger.
>>>
>>> We keep piling features on top of an interface and mechanism which is
>>> fragile as hell and horrible to maintain. Adding vectoring, multi size
>>> and whatever is not making it any better.
>>>
>>> There is also the long standing issue with NUMA, which we can't address
>>> with the current pile at all.
>>>
>>> So I'm really advocating that all involved parties sit down ASAP and
>>> hash out a new and less convoluted mechanism where all the magic new
>>> features can be addressed in a sane way so that the 'F' in Futex really
>>> only means Fast and not some other word starting with 'F'.
>>
>> Are you specifically talking about the interface, or the mechanism
>> itself? Would you be OK with a new syscall that calls into the same code
>> as this patch? It does seem like that's what we want, so if we rewrote a
>> mechanism I'm not convinced it would come out any different. But, the
>> interface itself seems fair-game to rewrite, as the current futex
>> syscall is turning into an ioctl of sorts.
> 
> No, you are misreading what I said. How does a new syscall make any
> difference? It still adds new crap to a maze which is already in a state
> of dubious maintainability.

I was just going by the context added by Peter, which seemed to imply 
your concerns were mostly around the interface, because I couldn't 
understand a clear course of action to follow just from your email. And 
frankly, still can't, but hopefully you can help us get there.

> 
>> This solves a real problem with a real usecase; so I'd like to stay
>> practical and not go into deeper issues like solving NUMA support for
>> all of futex in the interest of users waiting at the other end. Can you
>> point us to your preferred approach just for the scope of what we're
>> trying to accomplish?
> 
> If we go by the argument that something solves a real use case and take
> this as justification to proliferate existing crap, then we never get to
> the point where things get redesigned from ground up. Quite the
> contrary, we are going to duct tape it to death.
> 
> It does not matter at all whether the syscall is multiplexing or split
> up into 5 different ones. That's a pure cosmetic exercise.
> 
> While all the currently proposed extensions (multiple wait, variable
> size) make sense conceptually, I'm really uncomfortable to just cram
> them into the existing code. They create an ABI which we have to
> maintain forever.
> 
>  From experience I just know that every time we extended the futex
> interface we opened another can of worms which hunted us for years if
> not for more then a decade. Guess who has to deal with that. Surely not
> the people who drive by and solve their real world usecases. Just go and
> read the changelog history of futexes very carefully and you might
> understand what kind of complex beasts they are.
> 
> At some point we simply have to say stop, sit down and figure out which
> kind of functionality we really need in order to solve real world user
> space problems and which of the gazillion futex (mis)features are just
> there as historical ballast and do not have to be supported in a new
> implementation, REQUEUE is just the most obvious example.
> 
> I completely understand that you want to stay practical and just want to
> solve your particular itch, but please understand that the people who
> have to deal with the fallout and have dealt with it for 15+ years have
> very practical reasons to say no.

Note that it would have been nice to get that high-level feedback on the 
first version; instead we just received back specific feedback on the 
implementation itself, and questions about usecase/motivation that we 
tried to address, but that didn't elicit any follow-ups.

Please bear with me for a second in case you thought you were obviously 
very clear about the path forward, but are you saying that:

  1. Our usecase is valid, but we're not correct about futex being the 
right fit for it, and we should design an implement a new primitive to 
handle it?

  2. Our usecase is valid, and our research showing that futex is the 
optimal right fit for it might be correct, but futex has to be 
significantly refactored before accepting this new feature. (or any new 
feature?)

If it was 1., I think our new solution would either end up looking more 
or less exactly like futex, just with some of the more exotic 
functionality removed (although even that is arguable, since I wouldn't 
be surprised if we ended up using eg. requeue for some of the more 
complex migration scenarios). In which case I assume someone else would 
ask the question on why we're doing this new thing instead of adding to 
futex. OR, if intentionally made not futex-like, would end up not being 
optimal, which would make it not the right solution and a non-started to 
begin with. There's a reason we moved away from eventfd, even ignoring 
the fd exhaustion problem that some problematic apps fall victim to.

If it's 2., then we'd be hard-pressed to proceed forward without your 
guidance.

Conceptually it seems like multiple wait is an important missing feature 
in futex compared to core threading primitives of other platforms. It 
isn't the first time that the lack of it has come up for us and other 
game developers. Due to futex being so central and important, I 
completely understand it is tricky to get right and might be hard to 
maintain if not done correctly. It seems worthwhile to undertake, at 
least from our limited perspective. We'd be glad to help upstream get 
there, if possible.

Thanks,
  - Pierre-Loup


> 
> Thanks,
> 
>          tglx
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* 'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes]
  2020-03-03  2:47             ` Pierre-Loup A. Griffais
@ 2020-03-03 12:00               ` Peter Zijlstra
  2020-03-03 13:00                 ` Florian Weimer
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2020-03-03 12:00 UTC (permalink / raw)
  To: Pierre-Loup A. Griffais
  Cc: Thomas Gleixner, André Almeida, linux-kernel, kernel,
	krisman, shuah, linux-kselftest, rostedt, ryao, dvhart, mingo,
	z.figura12, steven, steven, malteskarupke, carlos,
	adhemerval.zanella, fweimer, libc-alpha

Hi All,

Added some people harvested from glibc.git and added libc-alpha.

We currently have 2 big new futex features proposed, and still have the
whole NUMA thing on the table.

The proposed features are:

 - a vectored FUTEX_WAIT (as per the parent thread); allows userspace to
   wait on up-to 128 futex values.

 - multi-size (8,16,32) futexes (WAIT,WAKE,CMP_REQUEUE).

Both these features are specific to the 'simple' futex interfaces, that
is, they exclude all the PI / robust stuff.

As is; the vectored WAIT doesn't nicely interact with the multi-size
proposal (or for that matter with the already existing PRIVATE flag),
for not allowing to specify flags per WAIT instance, but this should be
fixable with some little changes to the proposed ABI.

The much bigger sticking point; as already noticed by the multi-size
patches; is that the current ABI is a limiting factor. The giant
horrible syscall.

Now, we have a whole bunch of futex ops that are already gone (FD) or
are fundamentally broken (REQUEUE) or partially weird (WAIT_BITSET has
CLOCK selection where WAIT does not) or unused (per glibc, WAKE_OP,
WAKE_BITSET, WAIT_BITSET (except for that CLOCK crud)).

So how about we introduce new syscalls:

  sys_futex_wait(void *uaddr, unsigned long val, unsigned long flags, ktime_t *timo);

  struct futex_wait {
	void *uaddr;
	unsigned long val;
	unsigned long flags;
  };
  sys_futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters,
		  unsigned long flags, ktime_t *timo);

  sys_futex_wake(void *uaddr, unsigned int nr, unsigned long flags);

  sys_futex_cmp_requeue(void *uaddr1, void *uaddr2, unsigned int nr_wake,
			unsigned int nr_requeue, unsigned long cmpval, unsigned long flags);

Where flags:

  - has 2 bits for size: 8,16,32,64
  - has 2 more bits for size (requeue) ??
  - has ... bits for clocks
  - has private/shared
  - has numa


This does not provide BITSET functionality, as I found no use in glibc.
Both wait and wake have arguments left, do we needs this?

For NUMA I propose that when NUMA_FLAG is set, uaddr-4 will be 'int
node_id', with the following semantics:

 - on WAIT, node_id is read and when 0 <= node_id <= nr_nodes, is
   directly used to index into per-node hash-tables. When -1, it is
   replaced by the current node_id and an smp_mb() is issued before we
   load and compare the @uaddr.

 - on WAKE/REQUEUE, it is an immediate index.

Any invalid value with result in EINVAL.


Then later, we can look at doing sys_futex_{,un}lock_{,pi}(), which have
all the mind-meld associated with robust and PI and possibly optimistic
spinning etc.

Opinions?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes]
  2020-03-03 12:00               ` 'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes] Peter Zijlstra
@ 2020-03-03 13:00                 ` Florian Weimer
  2020-03-03 13:21                   ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Florian Weimer @ 2020-03-03 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pierre-Loup A. Griffais, Thomas Gleixner, André Almeida,
	linux-kernel, kernel, krisman, shuah, linux-kselftest, rostedt,
	ryao, dvhart, mingo, z.figura12, steven, steven, malteskarupke,
	carlos, adhemerval.zanella, libc-alpha

* Peter Zijlstra:

> So how about we introduce new syscalls:
>
>   sys_futex_wait(void *uaddr, unsigned long val, unsigned long flags, ktime_t *timo);
>
>   struct futex_wait {
> 	void *uaddr;
> 	unsigned long val;
> 	unsigned long flags;
>   };
>   sys_futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters,
> 		  unsigned long flags, ktime_t *timo);
>
>   sys_futex_wake(void *uaddr, unsigned int nr, unsigned long flags);
>
>   sys_futex_cmp_requeue(void *uaddr1, void *uaddr2, unsigned int nr_wake,
> 			unsigned int nr_requeue, unsigned long cmpval, unsigned long flags);
>
> Where flags:
>
>   - has 2 bits for size: 8,16,32,64
>   - has 2 more bits for size (requeue) ??
>   - has ... bits for clocks
>   - has private/shared
>   - has numa

What's the actual type of *uaddr?  Does it vary by size (which I assume
is in bits?)?  Are there alignment constraints?

These system calls seemed to be type-polymorphic still, which is
problematic for defining a really nice C interface.  I would really like
to have a strongly typed interface for this, with a nice struct futex
wrapper type (even if it means that we need four of them).

Will all architectures support all sizes?  If not, how do we probe which
size/flags combinations are supported?

> For NUMA I propose that when NUMA_FLAG is set, uaddr-4 will be 'int
> node_id', with the following semantics:
>
>  - on WAIT, node_id is read and when 0 <= node_id <= nr_nodes, is
>    directly used to index into per-node hash-tables. When -1, it is
>    replaced by the current node_id and an smp_mb() is issued before we
>    load and compare the @uaddr.
>
>  - on WAKE/REQUEUE, it is an immediate index.

Does this mean the first waiter determines the NUMA index, and all
future waiters use the same chain even if they are on different nodes?

I think documenting this as a node index would be a mistake.  It could
be an arbitrary hint for locating the corresponding kernel data
structures.

> Any invalid value with result in EINVAL.

Using uaddr-4 is slightly tricky with a 64-bit futex value, due to the
need to maintain alignment and avoid padding.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes]
  2020-03-03 13:00                 ` Florian Weimer
@ 2020-03-03 13:21                   ` Peter Zijlstra
  2020-03-03 13:47                     ` Florian Weimer
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2020-03-03 13:21 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Pierre-Loup A. Griffais, Thomas Gleixner, André Almeida,
	linux-kernel, kernel, krisman, shuah, linux-kselftest, rostedt,
	ryao, dvhart, mingo, z.figura12, steven, steven, malteskarupke,
	carlos, adhemerval.zanella, libc-alpha

On Tue, Mar 03, 2020 at 02:00:12PM +0100, Florian Weimer wrote:
> * Peter Zijlstra:
> 
> > So how about we introduce new syscalls:
> >
> >   sys_futex_wait(void *uaddr, unsigned long val, unsigned long flags, ktime_t *timo);
> >
> >   struct futex_wait {
> > 	void *uaddr;
> > 	unsigned long val;
> > 	unsigned long flags;
> >   };
> >   sys_futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters,
> > 		  unsigned long flags, ktime_t *timo);
> >
> >   sys_futex_wake(void *uaddr, unsigned int nr, unsigned long flags);
> >
> >   sys_futex_cmp_requeue(void *uaddr1, void *uaddr2, unsigned int nr_wake,
> > 			unsigned int nr_requeue, unsigned long cmpval, unsigned long flags);
> >
> > Where flags:
> >
> >   - has 2 bits for size: 8,16,32,64
> >   - has 2 more bits for size (requeue) ??
> >   - has ... bits for clocks
> >   - has private/shared
> >   - has numa
> 
> What's the actual type of *uaddr?  Does it vary by size (which I assume
> is in bits?)?  Are there alignment constraints?

Yeah, u8, u16, u32, u64 depending on the size specified in flags.
Naturally aligned.

> These system calls seemed to be type-polymorphic still, which is
> problematic for defining a really nice C interface.  I would really like
> to have a strongly typed interface for this, with a nice struct futex
> wrapper type (even if it means that we need four of them).

You mean like: futex_wait1(u8 *,...) futex_wait2(u16 *,...)
futex_wait4(u32 *,...) etc.. ?

I suppose making it 16 or so syscalls (more if we want WAKE_OP or
requeue across size) is a bit daft, so yeah, sucks.

> Will all architectures support all sizes?  If not, how do we probe which
> size/flags combinations are supported?

Up to the native word size (long), IOW ILP32 will not support u64.

Overlapping futexes are expressly forbidden, that is:

{
	u32 var;
	void *addr = &var;
}

P0()
{
	futex_wait4(addr,...);
}

P1()
{
	futex_wait1(addr+1,...);
}

Will have one of them return something bad.


> > For NUMA I propose that when NUMA_FLAG is set, uaddr-4 will be 'int
> > node_id', with the following semantics:
> >
> >  - on WAIT, node_id is read and when 0 <= node_id <= nr_nodes, is
> >    directly used to index into per-node hash-tables. When -1, it is
> >    replaced by the current node_id and an smp_mb() is issued before we
> >    load and compare the @uaddr.
> >
> >  - on WAKE/REQUEUE, it is an immediate index.
> 
> Does this mean the first waiter determines the NUMA index, and all
> future waiters use the same chain even if they are on different nodes?

Every new waiter could (re)set node_id, after all, when its not actually
waiting, nobody cares what's in that field.

> I think documenting this as a node index would be a mistake.  It could
> be an arbitrary hint for locating the corresponding kernel data
> structures.

Nah, it allows explicit placement, after all, we have set_mempolicy()
and sched_setaffinity() and all the other NUMA crud so that programs
that think they know what they're doing, can do explicit placement.

> > Any invalid value with result in EINVAL.
> 
> Using uaddr-4 is slightly tricky with a 64-bit futex value, due to the
> need to maintain alignment and avoid padding.

Yes, but it works, unlike uaddr+4 :-) Also, 1 and 2 byte futexes and
NUMA_FLAG are incompatible due to this, but I feel short futexes and
NUMA don't really make sense anyway, the only reason to use a short
futex is to save space, so you don't want another 4 bytes for numa on
top of that anyway.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes]
  2020-03-03 13:21                   ` Peter Zijlstra
@ 2020-03-03 13:47                     ` Florian Weimer
  2020-03-03 15:01                       ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Florian Weimer @ 2020-03-03 13:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pierre-Loup A. Griffais, Thomas Gleixner, André Almeida,
	linux-kernel, kernel, krisman, shuah, linux-kselftest, rostedt,
	ryao, dvhart, mingo, z.figura12, steven, steven, malteskarupke,
	carlos, adhemerval.zanella, libc-alpha, linux-api

(added missing Cc: for linux-api, better late than never I guess)

* Peter Zijlstra:

>> What's the actual type of *uaddr?  Does it vary by size (which I assume
>> is in bits?)?  Are there alignment constraints?
>
> Yeah, u8, u16, u32, u64 depending on the size specified in flags.
> Naturally aligned.

So 4-byte alignment for u32 and 8-byte alignment for u64 on all
architectures?

(I really want to nail this down, sorry.)

>> These system calls seemed to be type-polymorphic still, which is
>> problematic for defining a really nice C interface.  I would really like
>> to have a strongly typed interface for this, with a nice struct futex
>> wrapper type (even if it means that we need four of them).
>
> You mean like: futex_wait1(u8 *,...) futex_wait2(u16 *,...)
> futex_wait4(u32 *,...) etc.. ?
>
> I suppose making it 16 or so syscalls (more if we want WAKE_OP or
> requeue across size) is a bit daft, so yeah, sucks.

We could abstract this in the userspace wrapper.  It would help to have
an explicit size argument, or at least an extension-safe way to pass
this information to the kernel.  I guess if everything else fails, we
could use the flags bits for that, as long as it is clear that the
interface will only support these six types (four without NUMA, two with
NUMA).

>> Will all architectures support all sizes?  If not, how do we probe which
>> size/flags combinations are supported?
>
> Up to the native word size (long), IOW ILP32 will not support u64.

Many ILP32 targets could support atomic accesses on 8-byte storage
units, as long as there is 8-byte alignment.  But given how common
4-byte-align u64 is on 32-bit, maybe that's not such a good idea.

> Overlapping futexes are expressly forbidden, that is:
>
> {
> 	u32 var;
> 	void *addr = &var;
> }
>
> P0()
> {
> 	futex_wait4(addr,...);
> }
>
> P1()
> {
> 	futex_wait1(addr+1,...);
> }
>
> Will have one of them return something bad.

That makes sense.  A strongly typed interface would also reflect that in
the types.

>> > For NUMA I propose that when NUMA_FLAG is set, uaddr-4 will be 'int
>> > node_id', with the following semantics:
>> >
>> >  - on WAIT, node_id is read and when 0 <= node_id <= nr_nodes, is
>> >    directly used to index into per-node hash-tables. When -1, it is
>> >    replaced by the current node_id and an smp_mb() is issued before we
>> >    load and compare the @uaddr.
>> >
>> >  - on WAKE/REQUEUE, it is an immediate index.
>> 
>> Does this mean the first waiter determines the NUMA index, and all
>> future waiters use the same chain even if they are on different nodes?
>
> Every new waiter could (re)set node_id, after all, when its not actually
> waiting, nobody cares what's in that field.
>
>> I think documenting this as a node index would be a mistake.  It could
>> be an arbitrary hint for locating the corresponding kernel data
>> structures.
>
> Nah, it allows explicit placement, after all, we have set_mempolicy()
> and sched_setaffinity() and all the other NUMA crud so that programs
> that think they know what they're doing, can do explicit placement.

But I'm not sure if it makes sense to read the node ID from the
neighboring value of a futex used in this way.  Or do you think that
userspace might set the node ID to help the kernel implementation, and
not just relying on it to be set by the kernel after initializing it to
-1?

Conversely, even for non-NUMA systems, a lookup hint that allows to
reduce in-kernel futex contention might be helpful.  If it's documented
to be the NUME node ID, that wouldn't be possible.

>> > Any invalid value with result in EINVAL.
>> 
>> Using uaddr-4 is slightly tricky with a 64-bit futex value, due to the
>> need to maintain alignment and avoid padding.
>
> Yes, but it works, unlike uaddr+4 :-) Also, 1 and 2 byte futexes and
> NUMA_FLAG are incompatible due to this, but I feel short futexes and
> NUMA don't really make sense anyway, the only reason to use a short
> futex is to save space, so you don't want another 4 bytes for numa on
> top of that anyway.

I think it would be much easier to make the NUMA hint the same size of
the futex, so 4 and 8 bytes.  It could also make sense to require 8 and
16 byte alignment, to permit different implementation choices in the
future.

So we'd have:

struct futex8  { u8 value; };
struct futex16 { u16 value __attribute__ ((aligned (2))); };
struct futex32 { u32 value __attribute__ ((aligned (4))); };
struct futex64 { u64 value __attribute__ ((aligned (8))); };
struct futex32_numa { u32 value __attribute__ ((aligned (8))); u32 hint; };
struct futex64_numa { u64 value __attribute__ ((aligned (16))); u64 hint; };

Thanks,
Florian


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes]
  2020-03-03 13:47                     ` Florian Weimer
@ 2020-03-03 15:01                       ` Peter Zijlstra
  2020-03-05 16:14                         ` André Almeida
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2020-03-03 15:01 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Pierre-Loup A. Griffais, Thomas Gleixner, André Almeida,
	linux-kernel, kernel, krisman, shuah, linux-kselftest, rostedt,
	ryao, dvhart, mingo, z.figura12, steven, steven, malteskarupke,
	carlos, adhemerval.zanella, libc-alpha, linux-api

On Tue, Mar 03, 2020 at 02:47:11PM +0100, Florian Weimer wrote:
> (added missing Cc: for linux-api, better late than never I guess)
> 
> * Peter Zijlstra:
> 
> >> What's the actual type of *uaddr?  Does it vary by size (which I assume
> >> is in bits?)?  Are there alignment constraints?
> >
> > Yeah, u8, u16, u32, u64 depending on the size specified in flags.
> > Naturally aligned.
> 
> So 4-byte alignment for u32 and 8-byte alignment for u64 on all
> architectures?
> 
> (I really want to nail this down, sorry.)

Exactly so.

> >> These system calls seemed to be type-polymorphic still, which is
> >> problematic for defining a really nice C interface.  I would really like
> >> to have a strongly typed interface for this, with a nice struct futex
> >> wrapper type (even if it means that we need four of them).
> >
> > You mean like: futex_wait1(u8 *,...) futex_wait2(u16 *,...)
> > futex_wait4(u32 *,...) etc.. ?
> >
> > I suppose making it 16 or so syscalls (more if we want WAKE_OP or
> > requeue across size) is a bit daft, so yeah, sucks.
> 
> We could abstract this in the userspace wrapper.  It would help to have
> an explicit size argument, or at least an extension-safe way to pass
> this information to the kernel.  I guess if everything else fails, we
> could use the flags bits for that, as long as it is clear that the
> interface will only support these six types (four without NUMA, two with
> NUMA).

The problem is the cmp_requeue syscall, that already has 6 arguments. I
don't see where else than the flags field we can stuff this :/

> >> Will all architectures support all sizes?  If not, how do we probe which
> >> size/flags combinations are supported?
> >
> > Up to the native word size (long), IOW ILP32 will not support u64.
> 
> Many ILP32 targets could support atomic accesses on 8-byte storage
> units, as long as there is 8-byte alignment.  But given how common
> 4-byte-align u64 is on 32-bit, maybe that's not such a good idea.

'Many' might be over-stating it, but yeah, there are definitely a bunch
of them that can do it (x86, armv7-lpae, arc, are the ones I know from
memory). The problem is that the syscalls then look like:

  sys_futex_wait(void *uaddr, u64 val, unsigned long flags, ktime_t *timo);
  struct futex_wait {
	  void *uaddr;
	  u64 val;
	  u64 flags;
  };
  sys_futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters,
		  u64 flags, ktime_t *timo);
  sys_futex_wake(void *uaddr, unsigned int nr, u64 flags);
  sys_futex_cmp_requeue(void *uaddr1, void *uaddr2, unsigned int nr_wake,
		  unsigned int nr_requeue, u64 cmpval, unsigned long flags);

And that makes 7 arguments for cmp_requeue, which can't be. Maybe we if
combine nr_wake and nr_requeue in one as 2 u16... ?

And then we need to go detector if the platform supports it or not..

> >> > For NUMA I propose that when NUMA_FLAG is set, uaddr-4 will be 'int
> >> > node_id', with the following semantics:
> >> >
> >> >  - on WAIT, node_id is read and when 0 <= node_id <= nr_nodes, is
> >> >    directly used to index into per-node hash-tables. When -1, it is
> >> >    replaced by the current node_id and an smp_mb() is issued before we
> >> >    load and compare the @uaddr.
> >> >
> >> >  - on WAKE/REQUEUE, it is an immediate index.
> >> 
> >> Does this mean the first waiter determines the NUMA index, and all
> >> future waiters use the same chain even if they are on different nodes?
> >
> > Every new waiter could (re)set node_id, after all, when its not actually
> > waiting, nobody cares what's in that field.
> >
> >> I think documenting this as a node index would be a mistake.  It could
> >> be an arbitrary hint for locating the corresponding kernel data
> >> structures.
> >
> > Nah, it allows explicit placement, after all, we have set_mempolicy()
> > and sched_setaffinity() and all the other NUMA crud so that programs
> > that think they know what they're doing, can do explicit placement.
> 
> But I'm not sure if it makes sense to read the node ID from the
> neighboring value of a futex used in this way.  Or do you think that
> userspace might set the node ID to help the kernel implementation, and
> not just relying on it to be set by the kernel after initializing it to
> -1?

I'm fairly sure that there will be a number of users that will
definitely want to do that; this would be the same people that use
set_mempolicy() and sched_setaffinity() and do all the other numa
binding crud.

HPC, certain database vendors, possibly RT and KVM users.

> Conversely, even for non-NUMA systems, a lookup hint that allows to
> reduce in-kernel futex contention might be helpful.  If it's documented
> to be the NUME node ID, that wouldn't be possible.

Do we really have significant contention on small systems? And how would
increasing the hash-table not solve that?

> >> > Any invalid value with result in EINVAL.
> >> 
> >> Using uaddr-4 is slightly tricky with a 64-bit futex value, due to the
> >> need to maintain alignment and avoid padding.
> >
> > Yes, but it works, unlike uaddr+4 :-) Also, 1 and 2 byte futexes and
> > NUMA_FLAG are incompatible due to this, but I feel short futexes and
> > NUMA don't really make sense anyway, the only reason to use a short
> > futex is to save space, so you don't want another 4 bytes for numa on
> > top of that anyway.
> 
> I think it would be much easier to make the NUMA hint the same size of
> the futex, so 4 and 8 bytes.  It could also make sense to require 8 and
> 16 byte alignment, to permit different implementation choices in the
> future.
> 
> So we'd have:
> 
> struct futex8  { u8 value; };
> struct futex16 { u16 value __attribute__ ((aligned (2))); };
> struct futex32 { u32 value __attribute__ ((aligned (4))); };
> struct futex64 { u64 value __attribute__ ((aligned (8))); };
> struct futex32_numa { u32 value __attribute__ ((aligned (8))); u32 hint; };
> struct futex64_numa { u64 value __attribute__ ((aligned (16))); u64 hint; };

That works, I suppose... although I'm sure someone will curse us for it
when trying to pack some extra things in his cacheline.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes]
  2020-03-03 15:01                       ` Peter Zijlstra
@ 2020-03-05 16:14                         ` André Almeida
  2020-03-05 16:25                           ` Florian Weimer
  2020-03-05 18:51                           ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: André Almeida @ 2020-03-05 16:14 UTC (permalink / raw)
  To: Peter Zijlstra, Florian Weimer
  Cc: Pierre-Loup A. Griffais, Thomas Gleixner, linux-kernel, kernel,
	krisman, shuah, linux-kselftest, rostedt, ryao, dvhart, mingo,
	z.figura12, steven, steven, malteskarupke, carlos,
	adhemerval.zanella, libc-alpha, linux-api

On 3/3/20 12:01 PM, Peter Zijlstra wrote:
> On Tue, Mar 03, 2020 at 02:47:11PM +0100, Florian Weimer wrote:
>> (added missing Cc: for linux-api, better late than never I guess)
>>
>> * Peter Zijlstra:
>>
>>>> What's the actual type of *uaddr?  Does it vary by size (which I assume
>>>> is in bits?)?  Are there alignment constraints?
>>>
>>> Yeah, u8, u16, u32, u64 depending on the size specified in flags.
>>> Naturally aligned.
>>
>> So 4-byte alignment for u32 and 8-byte alignment for u64 on all
>> architectures?
>>
>> (I really want to nail this down, sorry.)
> 
> Exactly so.
> 
>>>> These system calls seemed to be type-polymorphic still, which is
>>>> problematic for defining a really nice C interface.  I would really like
>>>> to have a strongly typed interface for this, with a nice struct futex
>>>> wrapper type (even if it means that we need four of them).
>>>
>>> You mean like: futex_wait1(u8 *,...) futex_wait2(u16 *,...)
>>> futex_wait4(u32 *,...) etc.. ?
>>>
>>> I suppose making it 16 or so syscalls (more if we want WAKE_OP or
>>> requeue across size) is a bit daft, so yeah, sucks.
>>
>> We could abstract this in the userspace wrapper.  It would help to have
>> an explicit size argument, or at least an extension-safe way to pass
>> this information to the kernel.  I guess if everything else fails, we
>> could use the flags bits for that, as long as it is clear that the
>> interface will only support these six types (four without NUMA, two with
>> NUMA).
> 
> The problem is the cmp_requeue syscall, that already has 6 arguments. I
> don't see where else than the flags field we can stuff this :/
> 
>>>> Will all architectures support all sizes?  If not, how do we probe which
>>>> size/flags combinations are supported?
>>>
>>> Up to the native word size (long), IOW ILP32 will not support u64.
>>
>> Many ILP32 targets could support atomic accesses on 8-byte storage
>> units, as long as there is 8-byte alignment.  But given how common
>> 4-byte-align u64 is on 32-bit, maybe that's not such a good idea.
> 
> 'Many' might be over-stating it, but yeah, there are definitely a bunch
> of them that can do it (x86, armv7-lpae, arc, are the ones I know from
> memory). The problem is that the syscalls then look like:
> 
>   sys_futex_wait(void *uaddr, u64 val, unsigned long flags, ktime_t *timo);
>   struct futex_wait {
> 	  void *uaddr;
> 	  u64 val;
> 	  u64 flags;
>   };
>   sys_futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters,
> 		  u64 flags, ktime_t *timo);
>   sys_futex_wake(void *uaddr, unsigned int nr, u64 flags);
>   sys_futex_cmp_requeue(void *uaddr1, void *uaddr2, unsigned int nr_wake,
> 		  unsigned int nr_requeue, u64 cmpval, unsigned long flags);
> 
> And that makes 7 arguments for cmp_requeue, which can't be. Maybe we if
> combine nr_wake and nr_requeue in one as 2 u16... ?
> 
> And then we need to go detector if the platform supports it or not..
> 

Thanks everyone for the feedback around our mechanism. Are the
performance benefits of implementing a syscall to wait on a single futex
significant enough to maintain it instead of just using
`sys_futex_waitv()` with `nr_waiters = 1`? If we join both cases in a
single interface, we may even add a new member for NUMA hint in `struct
futex_wait`.

>>>>> For NUMA I propose that when NUMA_FLAG is set, uaddr-4 will be 'int
>>>>> node_id', with the following semantics:
>>>>>
>>>>>  - on WAIT, node_id is read and when 0 <= node_id <= nr_nodes, is
>>>>>    directly used to index into per-node hash-tables. When -1, it is
>>>>>    replaced by the current node_id and an smp_mb() is issued before we
>>>>>    load and compare the @uaddr.
>>>>>
>>>>>  - on WAKE/REQUEUE, it is an immediate index.
>>>>
>>>> Does this mean the first waiter determines the NUMA index, and all
>>>> future waiters use the same chain even if they are on different nodes?
>>>
>>> Every new waiter could (re)set node_id, after all, when its not actually
>>> waiting, nobody cares what's in that field.
>>>
>>>> I think documenting this as a node index would be a mistake.  It could
>>>> be an arbitrary hint for locating the corresponding kernel data
>>>> structures.
>>>
>>> Nah, it allows explicit placement, after all, we have set_mempolicy()
>>> and sched_setaffinity() and all the other NUMA crud so that programs
>>> that think they know what they're doing, can do explicit placement.
>>
>> But I'm not sure if it makes sense to read the node ID from the
>> neighboring value of a futex used in this way.  Or do you think that
>> userspace might set the node ID to help the kernel implementation, and
>> not just relying on it to be set by the kernel after initializing it to
>> -1?
> 
> I'm fairly sure that there will be a number of users that will
> definitely want to do that; this would be the same people that use
> set_mempolicy() and sched_setaffinity() and do all the other numa
> binding crud.
> 
> HPC, certain database vendors, possibly RT and KVM users.
> 
>> Conversely, even for non-NUMA systems, a lookup hint that allows to
>> reduce in-kernel futex contention might be helpful.  If it's documented
>> to be the NUME node ID, that wouldn't be possible.
> 
> Do we really have significant contention on small systems? And how would
> increasing the hash-table not solve that?
> 
>>>>> Any invalid value with result in EINVAL.
>>>>
>>>> Using uaddr-4 is slightly tricky with a 64-bit futex value, due to the
>>>> need to maintain alignment and avoid padding.
>>>
>>> Yes, but it works, unlike uaddr+4 :-) Also, 1 and 2 byte futexes and
>>> NUMA_FLAG are incompatible due to this, but I feel short futexes and
>>> NUMA don't really make sense anyway, the only reason to use a short
>>> futex is to save space, so you don't want another 4 bytes for numa on
>>> top of that anyway.
>>
>> I think it would be much easier to make the NUMA hint the same size of
>> the futex, so 4 and 8 bytes.  It could also make sense to require 8 and
>> 16 byte alignment, to permit different implementation choices in the
>> future.
>>
>> So we'd have:
>>
>> struct futex8  { u8 value; };
>> struct futex16 { u16 value __attribute__ ((aligned (2))); };
>> struct futex32 { u32 value __attribute__ ((aligned (4))); };
>> struct futex64 { u64 value __attribute__ ((aligned (8))); };
>> struct futex32_numa { u32 value __attribute__ ((aligned (8))); u32 hint; };
>> struct futex64_numa { u64 value __attribute__ ((aligned (16))); u64 hint; };
> 
> That works, I suppose... although I'm sure someone will curse us for it
> when trying to pack some extra things in his cacheline.
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes]
  2020-03-05 16:14                         ` André Almeida
@ 2020-03-05 16:25                           ` Florian Weimer
  2020-03-05 18:51                           ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Florian Weimer @ 2020-03-05 16:25 UTC (permalink / raw)
  To: André Almeida
  Cc: Peter Zijlstra, Pierre-Loup A. Griffais, Thomas Gleixner,
	linux-kernel, kernel, krisman, shuah, linux-kselftest, rostedt,
	ryao, dvhart, mingo, z.figura12, steven, steven, malteskarupke,
	carlos, adhemerval.zanella, libc-alpha, linux-api

* André Almeida:

> Thanks everyone for the feedback around our mechanism. Are the
> performance benefits of implementing a syscall to wait on a single futex
> significant enough to maintain it instead of just using
> `sys_futex_waitv()` with `nr_waiters = 1`? If we join both cases in a
> single interface, we may even add a new member for NUMA hint in `struct
> futex_wait`.

Some seccomp user might want to verify the address, and that's easier if
it's in an argument.  But that's just a rather minor aspect.

Do you propose to drop the storage requirement for the NUMA hint
next to the futex completely?

Thanks,
Florian


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: 'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes]
  2020-03-05 16:14                         ` André Almeida
  2020-03-05 16:25                           ` Florian Weimer
@ 2020-03-05 18:51                           ` Peter Zijlstra
  2020-03-06 16:57                             ` David Laight
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2020-03-05 18:51 UTC (permalink / raw)
  To: André Almeida
  Cc: Florian Weimer, Pierre-Loup A. Griffais, Thomas Gleixner,
	linux-kernel, kernel, krisman, shuah, linux-kselftest, rostedt,
	ryao, dvhart, mingo, z.figura12, steven, steven, malteskarupke,
	carlos, adhemerval.zanella, libc-alpha, linux-api

On Thu, Mar 05, 2020 at 01:14:17PM -0300, André Almeida wrote:

> >   sys_futex_wait(void *uaddr, u64 val, unsigned long flags, ktime_t *timo);
> >   struct futex_wait {
> > 	  void *uaddr;
> > 	  u64 val;
> > 	  u64 flags;
> >   };
> >   sys_futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters,
> > 		  u64 flags, ktime_t *timo);
> >   sys_futex_wake(void *uaddr, unsigned int nr, u64 flags);
> >   sys_futex_cmp_requeue(void *uaddr1, void *uaddr2, unsigned int nr_wake,
> > 		  unsigned int nr_requeue, u64 cmpval, unsigned long flags);
> > 
> > And that makes 7 arguments for cmp_requeue, which can't be. Maybe we if
> > combine nr_wake and nr_requeue in one as 2 u16... ?
> > 
> > And then we need to go detector if the platform supports it or not..
> > 
> 
> Thanks everyone for the feedback around our mechanism. Are the
> performance benefits of implementing a syscall to wait on a single futex
> significant enough to maintain it instead of just using
> `sys_futex_waitv()` with `nr_waiters = 1`? If we join both cases in a
> single interface, we may even add a new member for NUMA hint in `struct
> futex_wait`.

My consideration was that avoiding the get_user/copy_from_user might
become measurable on !PTI systems with SMAP.

But someone would have to build it and measure it before we can be sure
of course.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: 'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes]
  2020-03-05 18:51                           ` Peter Zijlstra
@ 2020-03-06 16:57                             ` David Laight
  0 siblings, 0 replies; 21+ messages in thread
From: David Laight @ 2020-03-06 16:57 UTC (permalink / raw)
  To: 'Peter Zijlstra', André Almeida
  Cc: Florian Weimer, Pierre-Loup A. Griffais, Thomas Gleixner,
	linux-kernel, kernel, krisman, shuah, linux-kselftest, rostedt,
	ryao, dvhart, mingo, z.figura12, steven, steven, malteskarupke,
	carlos, adhemerval.zanella, libc-alpha, linux-api

From: Peter Zijlstra
> Sent: 05 March 2020 18:52
+> On Thu, Mar 05, 2020 at 01:14:17PM -0300, André Almeida wrote:
> 
> > >   sys_futex_wait(void *uaddr, u64 val, unsigned long flags, ktime_t *timo);
> > >   struct futex_wait {
> > > 	  void *uaddr;
> > > 	  u64 val;
> > > 	  u64 flags;
> > >   };
> > >   sys_futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters,
> > > 		  u64 flags, ktime_t *timo);
> > >   sys_futex_wake(void *uaddr, unsigned int nr, u64 flags);
> > >   sys_futex_cmp_requeue(void *uaddr1, void *uaddr2, unsigned int nr_wake,
> > > 		  unsigned int nr_requeue, u64 cmpval, unsigned long flags);
> > >
> > > And that makes 7 arguments for cmp_requeue, which can't be. Maybe we if
> > > combine nr_wake and nr_requeue in one as 2 u16... ?
> > >
> > > And then we need to go detector if the platform supports it or not..
> > >
> >
> > Thanks everyone for the feedback around our mechanism. Are the
> > performance benefits of implementing a syscall to wait on a single futex
> > significant enough to maintain it instead of just using
> > `sys_futex_waitv()` with `nr_waiters = 1`? If we join both cases in a
> > single interface, we may even add a new member for NUMA hint in `struct
> > futex_wait`.
> 
> My consideration was that avoiding the get_user/copy_from_user might
> become measurable on !PTI systems with SMAP.
> 
> But someone would have to build it and measure it before we can be sure
> of course.

An extra copy_from_user is likely to be noticable.
It certainly makes recvmsg() slower than recv().
Especially if the hardended usercopy crap gets involved.

	David
 

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, back to index

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-13 21:45 [PATCH v3 0/4] Implement FUTEX_WAIT_MULTIPLE operation André Almeida
2020-02-13 21:45 ` [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes André Almeida
2020-02-28 19:07   ` Peter Zijlstra
2020-02-28 19:49     ` Peter Zijlstra
2020-02-28 21:25       ` Thomas Gleixner
2020-02-29  0:29         ` Pierre-Loup A. Griffais
2020-02-29 10:27           ` Thomas Gleixner
2020-03-03  2:47             ` Pierre-Loup A. Griffais
2020-03-03 12:00               ` 'simple' futex interface [Was: [PATCH v3 1/4] futex: Implement mechanism to wait on any of several futexes] Peter Zijlstra
2020-03-03 13:00                 ` Florian Weimer
2020-03-03 13:21                   ` Peter Zijlstra
2020-03-03 13:47                     ` Florian Weimer
2020-03-03 15:01                       ` Peter Zijlstra
2020-03-05 16:14                         ` André Almeida
2020-03-05 16:25                           ` Florian Weimer
2020-03-05 18:51                           ` Peter Zijlstra
2020-03-06 16:57                             ` David Laight
2020-02-13 21:45 ` [PATCH v3 2/4] selftests: futex: Add FUTEX_WAIT_MULTIPLE timeout test André Almeida
2020-02-13 21:45 ` [PATCH v3 3/4] selftests: futex: Add FUTEX_WAIT_MULTIPLE wouldblock test André Almeida
2020-02-13 21:45 ` [PATCH v3 4/4] selftests: futex: Add FUTEX_WAIT_MULTIPLE wake up test André Almeida
2020-02-19 16:27 ` [PATCH v3 0/4] Implement FUTEX_WAIT_MULTIPLE operation shuah

Linux-kselftest Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-kselftest/0 linux-kselftest/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-kselftest linux-kselftest/ https://lore.kernel.org/linux-kselftest \
		linux-kselftest@vger.kernel.org
	public-inbox-index linux-kselftest

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kselftest


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git