All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 0/4] futex request support
@ 2021-06-01 14:58 Pavel Begunkov
  2021-06-01 14:58 ` [RFC 1/4] futex: add op wake for a single key Pavel Begunkov
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Pavel Begunkov @ 2021-06-01 14:58 UTC (permalink / raw)
  To: io-uring
  Cc: Jens Axboe, Andres Freund, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Darren Hart, Davidlohr Bueso, linux-kernel

Proof of concept for io_uring futex requests. The wake side does
FUTEX_WAKE_OP type of modify-compare operation but with a single
address. Wait reqs go through io-wq and so slow path.

Should be interesting for a bunch of people, so we should first
outline API and capabilities it should give. As I almost never
had to deal with futexes myself, would especially love to hear
use case, what might be lacking and other blind spots.

1) Do we need PI?
2) Do we need requeue? Anything else?
3) How hot waits are? May be done fully async avoiding io-wq, but
apparently requires more changes in futex code.
4) We can avoid all page locking/unlocking for shared futexes
with pre-registration. How much is it interesting?

For _very_ basic examples see [1]

[1] https://github.com/isilence/liburing/tree/futex_v1

Pavel Begunkov (4):
  futex: add op wake for a single key
  io_uring: frame out futex op
  io_uring: support futex wake requests
  io_uring: implement futex wait

 fs/io_uring.c                 | 91 +++++++++++++++++++++++++++++++++++
 include/linux/futex.h         | 15 ++++++
 include/uapi/linux/io_uring.h | 19 +++++++-
 kernel/futex.c                | 64 +++++++++++++++++++++++-
 4 files changed, 186 insertions(+), 3 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC 1/4] futex: add op wake for a single key
  2021-06-01 14:58 [RFC 0/4] futex request support Pavel Begunkov
@ 2021-06-01 14:58 ` Pavel Begunkov
  2021-06-01 14:58 ` [RFC 2/4] io_uring: frame out futex op Pavel Begunkov
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2021-06-01 14:58 UTC (permalink / raw)
  To: io-uring
  Cc: Jens Axboe, Andres Freund, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Darren Hart, Davidlohr Bueso, linux-kernel

Add a new futex wake function futex_wake_op_single(), which works
similar to futex_wake_op() but only for a single futex address as it
takes too many arguments. Also export it and other functions that will
be used by io_uring.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/futex.h | 15 ++++++++++
 kernel/futex.c        | 64 +++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index b70df27d7e85..04d500ae5983 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -77,6 +77,10 @@ void futex_exec_release(struct task_struct *tsk);
 
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	      u32 __user *uaddr2, u32 val2, u32 val3);
+int futex_wake_op_single(u32 __user *uaddr, int nr_wake, unsigned int op,
+			 bool shared, bool try);
+int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
+	       ktime_t *abs_time, u32 bitset);
 #else
 static inline void futex_init_task(struct task_struct *tsk) { }
 static inline void futex_exit_recursive(struct task_struct *tsk) { }
@@ -88,6 +92,17 @@ static inline long do_futex(u32 __user *uaddr, int op, u32 val,
 {
 	return -EINVAL;
 }
+static inline int futex_wake_op_single(u32 __user *uaddr, int nr_wake,
+				       unsigned int op, bool shared, bool try)
+{
+	return -EINVAL;
+}
+static inline int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
+			     ktime_t *abs_time, u32 bitset)
+{
+	return -EINVAL;
+}
+
 #endif
 
 #endif
diff --git a/kernel/futex.c b/kernel/futex.c
index 4938a00bc785..75dc600062a4 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1681,6 +1681,66 @@ static int futex_atomic_op_inuser(unsigned int encoded_op, u32 __user *uaddr)
 	}
 }
 
+int futex_wake_op_single(u32 __user *uaddr, int nr_wake, unsigned int op,
+			 bool shared, bool try)
+{
+	union futex_key key;
+	struct futex_hash_bucket *hb;
+	struct futex_q *this, *next;
+	int ret, op_ret;
+	DEFINE_WAKE_Q(wake_q);
+
+retry:
+	ret = get_futex_key(uaddr, shared, &key, FUTEX_WRITE);
+	if (unlikely(ret != 0))
+		return ret;
+	hb = hash_futex(&key);
+retry_private:
+	spin_lock(&hb->lock);
+	op_ret = futex_atomic_op_inuser(op, uaddr);
+	if (unlikely(op_ret < 0)) {
+		spin_unlock(&hb->lock);
+
+		if (!IS_ENABLED(CONFIG_MMU) ||
+		    unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
+			/*
+			 * we don't get EFAULT from MMU faults if we don't have
+			 * an MMU, but we might get them from range checking
+			 */
+			ret = op_ret;
+			return ret;
+		}
+		if (try)
+			return -EAGAIN;
+
+		if (op_ret == -EFAULT) {
+			ret = fault_in_user_writeable(uaddr);
+			if (ret)
+				return ret;
+		}
+		cond_resched();
+		if (shared)
+			goto retry;
+		goto retry_private;
+	}
+	if (op_ret) {
+		plist_for_each_entry_safe(this, next, &hb->chain, list) {
+			if (match_futex(&this->key, &key)) {
+				if (this->pi_state || this->rt_waiter) {
+					ret = -EINVAL;
+					break;
+				}
+				mark_wake_futex(&wake_q, this);
+				if (++ret >= nr_wake)
+					break;
+			}
+		}
+	}
+	spin_unlock(&hb->lock);
+	wake_up_q(&wake_q);
+	return ret;
+}
+
 /*
  * Wake up all waiters hashed on the physical page that is mapped
  * to this virtual address:
@@ -2680,8 +2740,8 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 	return ret;
 }
 
-static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
-		      ktime_t *abs_time, u32 bitset)
+int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
+	       ktime_t *abs_time, u32 bitset)
 {
 	struct hrtimer_sleeper timeout, *to;
 	struct restart_block *restart;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC 2/4] io_uring: frame out futex op
  2021-06-01 14:58 [RFC 0/4] futex request support Pavel Begunkov
  2021-06-01 14:58 ` [RFC 1/4] futex: add op wake for a single key Pavel Begunkov
@ 2021-06-01 14:58 ` Pavel Begunkov
  2021-06-01 14:58 ` [RFC 3/4] io_uring: support futex wake requests Pavel Begunkov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2021-06-01 14:58 UTC (permalink / raw)
  To: io-uring
  Cc: Jens Axboe, Andres Freund, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Darren Hart, Davidlohr Bueso, linux-kernel

Add userspace futex request definitions and draft some internal
functions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c                 | 23 +++++++++++++++++++++++
 include/uapi/linux/io_uring.h |  9 +++++++++
 2 files changed, 32 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index fc9325472e8d..2c6b14a3a4f6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -77,6 +77,7 @@
 #include <linux/splice.h>
 #include <linux/task_work.h>
 #include <linux/pagemap.h>
+#include <linux/futex.h>
 #include <linux/io_uring.h>
 
 #define CREATE_TRACE_POINTS
@@ -665,6 +666,10 @@ struct io_unlink {
 	struct filename			*filename;
 };
 
+struct io_futex {
+	struct file			*file;
+};
+
 struct io_completion {
 	struct file			*file;
 	struct list_head		list;
@@ -809,6 +814,7 @@ struct io_kiocb {
 		struct io_shutdown	shutdown;
 		struct io_rename	rename;
 		struct io_unlink	unlink;
+		struct io_futex		futex;
 		/* use only after cleaning per-op data, see io_clean_op() */
 		struct io_completion	compl;
 	};
@@ -1021,6 +1027,7 @@ static const struct io_op_def io_op_defs[] = {
 	},
 	[IORING_OP_RENAMEAT] = {},
 	[IORING_OP_UNLINKAT] = {},
+	[IORING_OP_FUTEX] = {},
 };
 
 static bool io_disarm_next(struct io_kiocb *req);
@@ -5865,6 +5872,16 @@ static int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
 	return 0;
 }
 
+static int io_futex_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	return -EINVAL;
+}
+
+static int io_futex(struct io_kiocb *req, unsigned int issue_flags)
+{
+	return -EINVAL;
+}
+
 static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	switch (req->opcode) {
@@ -5936,6 +5953,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return io_renameat_prep(req, sqe);
 	case IORING_OP_UNLINKAT:
 		return io_unlinkat_prep(req, sqe);
+	case IORING_OP_FUTEX:
+		return io_futex_prep(req, sqe);
 	}
 
 	printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -6203,6 +6222,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	case IORING_OP_UNLINKAT:
 		ret = io_unlinkat(req, issue_flags);
 		break;
+	case IORING_OP_FUTEX:
+		ret = io_futex(req, issue_flags);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -10158,6 +10180,7 @@ static int __init io_uring_init(void)
 	BUILD_BUG_SQE_ELEM(28, __u32,  statx_flags);
 	BUILD_BUG_SQE_ELEM(28, __u32,  fadvise_advice);
 	BUILD_BUG_SQE_ELEM(28, __u32,  splice_flags);
+	BUILD_BUG_SQE_ELEM(28, __u32,  futex_flags);
 	BUILD_BUG_SQE_ELEM(32, __u64,  user_data);
 	BUILD_BUG_SQE_ELEM(40, __u16,  buf_index);
 	BUILD_BUG_SQE_ELEM(42, __u16,  personality);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e1ae46683301..6a1af5bb2ddf 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -44,6 +44,7 @@ struct io_uring_sqe {
 		__u32		splice_flags;
 		__u32		rename_flags;
 		__u32		unlink_flags;
+		__u32		futex_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	union {
@@ -137,6 +138,7 @@ enum {
 	IORING_OP_SHUTDOWN,
 	IORING_OP_RENAMEAT,
 	IORING_OP_UNLINKAT,
+	IORING_OP_FUTEX,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -174,6 +176,13 @@ enum {
 #define IORING_POLL_UPDATE_EVENTS	(1U << 1)
 #define IORING_POLL_UPDATE_USER_DATA	(1U << 2)
 
+enum {
+	IORING_FUTEX_WAKE_OP = 0,
+
+	IORING_FUTEX_LAST,
+};
+
+
 /*
  * IO completion data structure (Completion Queue Entry)
  */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC 3/4] io_uring: support futex wake requests
  2021-06-01 14:58 [RFC 0/4] futex request support Pavel Begunkov
  2021-06-01 14:58 ` [RFC 1/4] futex: add op wake for a single key Pavel Begunkov
  2021-06-01 14:58 ` [RFC 2/4] io_uring: frame out futex op Pavel Begunkov
@ 2021-06-01 14:58 ` Pavel Begunkov
  2021-06-01 14:58 ` [RFC 4/4] io_uring: implement futex wait Pavel Begunkov
  2021-06-03 18:59 ` [RFC 0/4] futex request support Andres Freund
  4 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2021-06-01 14:58 UTC (permalink / raw)
  To: io-uring
  Cc: Jens Axboe, Andres Freund, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Darren Hart, Davidlohr Bueso, linux-kernel

Add support for futex wake requests, which also modifies the addr and
checks against it with encoded operation as FUTEX_WAKE_OP does, but only
operates with a single address as may be problematic to squeeze into SQE
and io_kiocb otherwise.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c                 | 48 +++++++++++++++++++++++++++++++++--
 include/uapi/linux/io_uring.h | 10 +++++++-
 2 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2c6b14a3a4f6..99f4f8d9f685 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -668,6 +668,12 @@ struct io_unlink {
 
 struct io_futex {
 	struct file			*file;
+	unsigned int			futex_op;
+
+	unsigned int			nr_wake;
+	unsigned int			wake_op_arg;
+	unsigned int			flags;
+	void __user			*uaddr;
 };
 
 struct io_completion {
@@ -5874,12 +5880,50 @@ static int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
 
 static int io_futex_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
-	return -EINVAL;
+	struct io_futex *f = &req->futex;
+	u64 v;
+
+	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
+	if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT)))
+		return -EINVAL;
+	if (sqe->len)
+		return -EINVAL;
+	f->flags = READ_ONCE(sqe->futex_flags);
+	if (f->flags & ~IORING_FUTEX_SHARED)
+		return -EINVAL;
+
+	v = READ_ONCE(sqe->off);
+	f->nr_wake = (u32)v;
+	f->wake_op_arg = (u32)(v >> 32);
+	f->futex_op = READ_ONCE(sqe->futex_op);
+	f->uaddr = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	return 0;
 }
 
 static int io_futex(struct io_kiocb *req, unsigned int issue_flags)
 {
-	return -EINVAL;
+	bool nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct io_futex *f = &req->futex;
+	int ret;
+
+	switch (f->futex_op) {
+	case IORING_FUTEX_WAKE_OP:
+		ret = futex_wake_op_single(f->uaddr, f->nr_wake, f->wake_op_arg,
+					   !(f->flags & IORING_FUTEX_SHARED),
+					   nonblock);
+		/* retry from blocking context */
+		if (nonblock && ret == -EAGAIN)
+			return -EAGAIN;
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	if (ret < 0)
+		req_set_fail(req);
+	__io_req_complete(req, issue_flags, ret, 0);
+	return 0;
 }
 
 static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 6a1af5bb2ddf..6fa5a6e59934 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -17,7 +17,10 @@
 struct io_uring_sqe {
 	__u8	opcode;		/* type of operation for this sqe */
 	__u8	flags;		/* IOSQE_ flags */
-	__u16	ioprio;		/* ioprio for the request */
+	union {
+		__u16	ioprio;		/* ioprio for the request */
+		__u16	futex_op;	/* futex operation */
+	} __attribute__((packed));
 	__s32	fd;		/* file descriptor to do IO on */
 	union {
 		__u64	off;	/* offset into file */
@@ -161,6 +164,11 @@ enum {
  */
 #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
 
+/*
+ * sqe->futex_flags
+ */
+#define IORING_FUTEX_SHARED	(1U << 0)
+
 /*
  * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
  * command flags for POLL_ADD are stored in sqe->len.
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC 4/4] io_uring: implement futex wait
  2021-06-01 14:58 [RFC 0/4] futex request support Pavel Begunkov
                   ` (2 preceding siblings ...)
  2021-06-01 14:58 ` [RFC 3/4] io_uring: support futex wake requests Pavel Begunkov
@ 2021-06-01 14:58 ` Pavel Begunkov
  2021-06-01 15:45   ` Jens Axboe
  2021-06-03 18:59 ` [RFC 0/4] futex request support Andres Freund
  4 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2021-06-01 14:58 UTC (permalink / raw)
  To: io-uring
  Cc: Jens Axboe, Andres Freund, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Darren Hart, Davidlohr Bueso, linux-kernel

Add futex wait requests, those always go through io-wq for simplicity.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c                 | 38 ++++++++++++++++++++++++++++-------
 include/uapi/linux/io_uring.h |  2 +-
 2 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 99f4f8d9f685..9c3d075a6647 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -670,8 +670,15 @@ struct io_futex {
 	struct file			*file;
 	unsigned int			futex_op;
 
-	unsigned int			nr_wake;
-	unsigned int			wake_op_arg;
+	union {
+		/* wake */
+		struct {
+			unsigned int	nr_wake;
+			unsigned int	wake_op_arg;
+		};
+		/* wait */
+		u32 			val;
+	};
 	unsigned int			flags;
 	void __user			*uaddr;
 };
@@ -5894,10 +5901,23 @@ static int io_futex_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EINVAL;
 
 	v = READ_ONCE(sqe->off);
-	f->nr_wake = (u32)v;
-	f->wake_op_arg = (u32)(v >> 32);
-	f->futex_op = READ_ONCE(sqe->futex_op);
 	f->uaddr = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	f->futex_op = READ_ONCE(sqe->futex_op);
+
+	switch (f->futex_op) {
+	case IORING_FUTEX_WAKE_OP:
+		f->nr_wake = (u32)v;
+		f->wake_op_arg = (u32)(v >> 32);
+		break;
+	case IORING_FUTEX_WAIT:
+		f->val = (u32)v;
+		if (unlikely(v >> 32))
+			return -EINVAL;
+		break;
+	default:
+		return -EINVAL;
+	}
+
 	return 0;
 }
 
@@ -5916,8 +5936,12 @@ static int io_futex(struct io_kiocb *req, unsigned int issue_flags)
 		if (nonblock && ret == -EAGAIN)
 			return -EAGAIN;
 		break;
-	default:
-		ret = -EINVAL;
+	case IORING_FUTEX_WAIT:
+		if (nonblock)
+			return -EAGAIN;
+		ret = futex_wait(f->uaddr, f->flags, f->val, NULL,
+				 FUTEX_BITSET_MATCH_ANY);
+		break;
 	}
 
 	if (ret < 0)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 6fa5a6e59934..7ab11ee2ce12 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -186,7 +186,7 @@ enum {
 
 enum {
 	IORING_FUTEX_WAKE_OP = 0,
-
+	IORING_FUTEX_WAIT,
 	IORING_FUTEX_LAST,
 };
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-01 14:58 ` [RFC 4/4] io_uring: implement futex wait Pavel Begunkov
@ 2021-06-01 15:45   ` Jens Axboe
  2021-06-01 15:58     ` Pavel Begunkov
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2021-06-01 15:45 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring
  Cc: Andres Freund, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Darren Hart, Davidlohr Bueso, linux-kernel

On 6/1/21 8:58 AM, Pavel Begunkov wrote:
> Add futex wait requests, those always go through io-wq for simplicity.

Not a huge fan of that, I think this should tap into the waitqueue
instead and just rely on the wakeup callback to trigger the event. That
would be a lot more efficient than punting to io-wq, both in terms of
latency on trigger, but also for efficiency if the app is waiting on a
lot of futexes.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-01 15:45   ` Jens Axboe
@ 2021-06-01 15:58     ` Pavel Begunkov
  2021-06-01 16:01       ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2021-06-01 15:58 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: Andres Freund, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Darren Hart, Davidlohr Bueso, linux-kernel

On 6/1/21 4:45 PM, Jens Axboe wrote:
> On 6/1/21 8:58 AM, Pavel Begunkov wrote:
>> Add futex wait requests, those always go through io-wq for simplicity.
> 
> Not a huge fan of that, I think this should tap into the waitqueue
> instead and just rely on the wakeup callback to trigger the event. That
> would be a lot more efficient than punting to io-wq, both in terms of
> latency on trigger, but also for efficiency if the app is waiting on a
> lot of futexes.

Yes, that would be preferable, but looks futexes don't use
waitqueues but some manual enqueuing into a plist_node, see
futex_wait_queue_me() or mark_wake_futex().
Did I miss it somewhere?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-01 15:58     ` Pavel Begunkov
@ 2021-06-01 16:01       ` Jens Axboe
  2021-06-01 16:29         ` Pavel Begunkov
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2021-06-01 16:01 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring
  Cc: Andres Freund, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Darren Hart, Davidlohr Bueso, linux-kernel

On 6/1/21 9:58 AM, Pavel Begunkov wrote:
> On 6/1/21 4:45 PM, Jens Axboe wrote:
>> On 6/1/21 8:58 AM, Pavel Begunkov wrote:
>>> Add futex wait requests, those always go through io-wq for simplicity.
>>
>> Not a huge fan of that, I think this should tap into the waitqueue
>> instead and just rely on the wakeup callback to trigger the event. That
>> would be a lot more efficient than punting to io-wq, both in terms of
>> latency on trigger, but also for efficiency if the app is waiting on a
>> lot of futexes.
> 
> Yes, that would be preferable, but looks futexes don't use
> waitqueues but some manual enqueuing into a plist_node, see
> futex_wait_queue_me() or mark_wake_futex().
> Did I miss it somewhere?

Yes, we'd need to augment that with a callback. I do think that's going
to be necessary, I don't see the io-wq solution working well outside of
the most basic of use cases. And even for that, it won't be particularly
efficient for single waits.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-01 16:01       ` Jens Axboe
@ 2021-06-01 16:29         ` Pavel Begunkov
  2021-06-01 21:53           ` Thomas Gleixner
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2021-06-01 16:29 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: Andres Freund, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Darren Hart, Davidlohr Bueso, linux-kernel

On 6/1/21 5:01 PM, Jens Axboe wrote:
> On 6/1/21 9:58 AM, Pavel Begunkov wrote:
>> On 6/1/21 4:45 PM, Jens Axboe wrote:
>>> On 6/1/21 8:58 AM, Pavel Begunkov wrote:
>>>> Add futex wait requests, those always go through io-wq for simplicity.
>>>
>>> Not a huge fan of that, I think this should tap into the waitqueue
>>> instead and just rely on the wakeup callback to trigger the event. That
>>> would be a lot more efficient than punting to io-wq, both in terms of
>>> latency on trigger, but also for efficiency if the app is waiting on a
>>> lot of futexes.
>>
>> Yes, that would be preferable, but looks futexes don't use
>> waitqueues but some manual enqueuing into a plist_node, see
>> futex_wait_queue_me() or mark_wake_futex().
>> Did I miss it somewhere?
> 
> Yes, we'd need to augment that with a callback. I do think that's going

Yeah, that was the first idea, but it's also more intrusive for the
futex codebase. Can be piled on top for next revision of patches.

A question to futex maintainers, how much resistance to merging
something like that I may expect?

> to be necessary, I don't see the io-wq solution working well outside of
> the most basic of use cases. And even for that, it won't be particularly
> efficient for single waits.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-01 16:29         ` Pavel Begunkov
@ 2021-06-01 21:53           ` Thomas Gleixner
  2021-06-03 10:31             ` Pavel Begunkov
  2021-06-03 19:03             ` Andres Freund
  0 siblings, 2 replies; 23+ messages in thread
From: Thomas Gleixner @ 2021-06-01 21:53 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe, io-uring
  Cc: Andres Freund, Ingo Molnar, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, linux-kernel

Pavel, Jens,

On Tue, Jun 01 2021 at 17:29, Pavel Begunkov wrote:
> On 6/1/21 5:01 PM, Jens Axboe wrote:
>>> Yes, that would be preferable, but looks futexes don't use
>>> waitqueues but some manual enqueuing into a plist_node, see
>>> futex_wait_queue_me() or mark_wake_futex().
>>> Did I miss it somewhere?
>> 
>> Yes, we'd need to augment that with a callback. I do think that's going
>
> Yeah, that was the first idea, but it's also more intrusive for the
> futex codebase. Can be piled on top for next revision of patches.
>
> A question to futex maintainers, how much resistance to merging
> something like that I may expect?

Adding a waitqueue like callback or the proposed thing?

TBH. Neither one has a charm.

1) The proposed solution: I can't figure out from the changelogs or the
   cover letter what kind of problems it solves and what the exact
   semantics are. If you ever consider to submit futex patches, may I
   recommend to study Documentation/process and get some inspiration
   from git-log?

   What are the lifetime rules, what's the interaction with regular
   futexes, what's the interaction with robust list ....? Without
   interaction with regular futexes such a functionality does not make
   any sense at all.

   Also once we'd open that can of worms where is this going to end and
   where can we draw the line? This is going to be a bottomless pit
   because I don't believe for a split second that this simple interface
   is going to be sufficient.

   Aside of that we are definitely _not_ going to expose any of the
   internal functions simply because they evade any sanity check which
   happens at the syscall wrappers and I have zero interest to deal with
   the fallout of unfiltered input which comes via io-uring interfaces
   or try to keep those up to date when the core filtering changes.

2) Adding a waitqueue like callback is daft.

   a) Lifetime rules

      See mark_wake_futex().

   b) Locking

      The wakeup mechanism is designed to avoid hb->lock contention as much
      as possible. The dequeue/mark for wakeup happens under hb->lock
      and the actual wakeup happens after dropping hb->lock.

      This is not going to change. It's not even debatable.

      Aside of that this is optimized for minimal hb->lock hold time in
      general.

   So the only way to do that would be to invoke the callback from
   mark_wake_futex() _before_ invalidating futex_q and the callback plus
   the argument(s) would have to be stored in futex_q.

   Where does this information come from? Which context would invoke the
   wait with callback and callback arguments? User space, io-uring state
   machine or what?

   Aside of the extra storage (on stack) and yet more undefined life
   time rules and no semantics for error cases etc., that also would
   enforce that the callback is invoked with hb->lock held. IOW, it's
   making the hb->lock held time larger, which is exactly what the
   existing code tries to avoid by all means.

But what would that solve?

I can't tell because the provided information is absolutely useless
for anyone not familiar with your great idea:
   
  "Add futex wait requests, those always go through io-wq for
   simplicity."
      
What am I supposed to read out of this? Doing it elsewhere would be
more complex? Really useful information.

And I can't tell either what Jens means here:

  "Not a huge fan of that, I think this should tap into the waitqueue
   instead and just rely on the wakeup callback to trigger the
   event. That would be a lot more efficient than punting to io-wq, both
   in terms of latency on trigger, but also for efficiency if the app is
   waiting on a lot of futexes."

and here:

  "Yes, we'd need to augment that with a callback. I do think that's
   going to be necessary, I don't see the io-wq solution working well
   outside of the most basic of use cases. And even for that, it won't
   be particularly efficient for single waits."

All of these quotes are useless word salad without context and worse
without the minimal understanding how futexes work.

So can you folks please sit down and write up a coherent description of:

 1) The problem you are trying to solve

 2) How this futex functionality should be integrated into io-uring
    including the contexts which invoke it.

 3) Interaction with regular sys_futex() operations.

 4) Lifetime and locking rules.

Unless that materializes any futex related changes are not even going to
be reviewed.

I did not even try to review this stuff, I just tried to make sense out
of it, but while skimming it, it was inevitable to spot this gem:

 +int futex_wake_op_single(u32 __user *uaddr, int nr_wake, unsigned int op,
 +			 bool shared, bool try);
...
 +		ret = futex_wake_op_single(f->uaddr, f->nr_wake, f->wake_op_arg,
 +					   !(f->flags & IORING_FUTEX_SHARED),
 +					   nonblock);

You surely made your point that this is well thought out.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-01 21:53           ` Thomas Gleixner
@ 2021-06-03 10:31             ` Pavel Begunkov
  2021-06-04  9:19               ` Thomas Gleixner
  2021-06-03 19:03             ` Andres Freund
  1 sibling, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2021-06-03 10:31 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe, io-uring
  Cc: Andres Freund, Ingo Molnar, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, linux-kernel

On 6/1/21 10:53 PM, Thomas Gleixner wrote:
> Pavel, Jens,
> 
> On Tue, Jun 01 2021 at 17:29, Pavel Begunkov wrote:
>> On 6/1/21 5:01 PM, Jens Axboe wrote:
>>>> Yes, that would be preferable, but looks futexes don't use
>>>> waitqueues but some manual enqueuing into a plist_node, see
>>>> futex_wait_queue_me() or mark_wake_futex().
>>>> Did I miss it somewhere?
>>>
>>> Yes, we'd need to augment that with a callback. I do think that's going
>>
>> Yeah, that was the first idea, but it's also more intrusive for the
>> futex codebase. Can be piled on top for next revision of patches.
>>
>> A question to futex maintainers, how much resistance to merging
>> something like that I may expect?
> 
> Adding a waitqueue like callback or the proposed thing?
> 
> TBH. Neither one has a charm.
> 
> 1) The proposed solution: I can't figure out from the changelogs or the
>    cover letter what kind of problems it solves and what the exact
>    semantics are. If you ever consider to submit futex patches, may I
>    recommend to study Documentation/process and get some inspiration
>    from git-log?

I'm sorry you're incapable of grasping ideas quick, but may we
stop this stupid galling and switch to a more productive way of
speaking?

>    What are the lifetime rules, what's the interaction with regular>    futexes, what's the interaction with robust list ....? Without

Robust lists are not supported, neither they are mentioned to be.

>    interaction with regular futexes such a functionality does not make
>    any sense at all.

Would have not touched futex code if it was not interoperable.

 
>    Also once we'd open that can of worms where is this going to end and
>    where can we draw the line? This is going to be a bottomless pit
>    because I don't believe for a split second that this simple interface
>    is going to be sufficient.
> 
>    Aside of that we are definitely _not_ going to expose any of the
>    internal functions simply because they evade any sanity check which
>    happens at the syscall wrappers and I have zero interest to deal with
>    the fallout of unfiltered input which comes via io-uring interfaces
>    or try to keep those up to date when the core filtering changes.
> 
> 2) Adding a waitqueue like callback is daft.
> 
>    a) Lifetime rules
> 
>       See mark_wake_futex().
> 
>    b) Locking
> 
>       The wakeup mechanism is designed to avoid hb->lock contention as much
>       as possible. The dequeue/mark for wakeup happens under hb->lock
>       and the actual wakeup happens after dropping hb->lock.
> 
>       This is not going to change. It's not even debatable.
> 
>       Aside of that this is optimized for minimal hb->lock hold time in
>       general.
> 
>    So the only way to do that would be to invoke the callback from
>    mark_wake_futex() _before_ invalidating futex_q and the callback plus
>    the argument(s) would have to be stored in futex_q.
> 
>    Where does this information come from? Which context would invoke the
>    wait with callback and callback arguments? User space, io-uring state
>    machine or what?
> 
>    Aside of the extra storage (on stack) and yet more undefined life
>    time rules and no semantics for error cases etc., that also would
>    enforce that the callback is invoked with hb->lock held. IOW, it's
>    making the hb->lock held time larger, which is exactly what the
>    existing code tries to avoid by all means.
> 
> But what would that solve?
> 
> I can't tell because the provided information is absolutely useless
> for anyone not familiar with your great idea:
>    
>   "Add futex wait requests, those always go through io-wq for
>    simplicity."
>       
> What am I supposed to read out of this? Doing it elsewhere would be
> more complex? Really useful information.

It's io_uring specific, I'd rather assume this information should
be irrelevant for you, but you're welcome to look at io_uring/io-wq
if of any interest.


> And I can't tell either what Jens means here:
> 
>   "Not a huge fan of that, I think this should tap into the waitqueue
>    instead and just rely on the wakeup callback to trigger the
>    event. That would be a lot more efficient than punting to io-wq, both
>    in terms of latency on trigger, but also for efficiency if the app is
>    waiting on a lot of futexes."
> 
> and here:
> 
>   "Yes, we'd need to augment that with a callback. I do think that's
>    going to be necessary, I don't see the io-wq solution working well
>    outside of the most basic of use cases. And even for that, it won't
>    be particularly efficient for single waits."
> 
> All of these quotes are useless word salad without context and worse

Same, it's weird to assume that there won't be io_uring specific parts
and discussions.

> without the minimal understanding how futexes work.

> So can you folks please sit down and write up a coherent description of:
> 
>  1) The problem you are trying to solve

1. allowing to use futex from io_uring without extra context switches,
which is the case if we would have a request doing futex operations.

2. having async futexes in general, again handled by io_uring.
And if I get if right, interaction b/w futexes and other primitives
is limited. Is it? io_uring naturally allows to pass notify eventfd,
via some other read/write mechanism and so on, as per this series.

3. maybe .net would be interested in it as a half-baked solution
for their wait many problem. However, I haven't heard long about it.

4. To use futexes as a synchronisation for io_uring-bpf requests.
Won't be called from a bpf program, but bpf may be responsible for
doing memory modifications and then only a simple wake will be
needed instead of an *op version. But that's orthogonal

> 
>  2) How this futex functionality should be integrated into io-uring
>     including the contexts which invoke it.

If we're not talking about the callback stuff, it's always invoked from
a context of a user task or a worker task that looks almost exactly like
a normal user thread.

>  3) Interaction with regular sys_futex() operations.

Think of it as passing a request to make a sys_futex() call. Any
happens-before / acquire-release is propagated via io_uring submissions
and completions.

Is there some dependency on syscalling itself? E.g. in terms of
synchronisation. I don't think so, but better to clear it up if
you're concerned.

>  4) Lifetime and locking rules.

For the locking, there is only one extra io_uring specific mutex
under which it may happen. Do you think it may be a problem?

Lifetime of what exactly? In any case it may be trickier with
callbacks, e.g. considering futex_exit_release() and so on, but
currently it's just called by one of tasks that die as any other
normal task.

> Unless that materializes any futex related changes are not even going to
> be reviewed.
> 
> I did not even try to review this stuff, I just tried to make sense out
> of it, but while skimming it, it was inevitable to spot this gem:
> 
>  +int futex_wake_op_single(u32 __user *uaddr, int nr_wake, unsigned int op,
>  +			 bool shared, bool try);
> ...
>  +		ret = futex_wake_op_single(f->uaddr, f->nr_wake, f->wake_op_arg,
>  +					   !(f->flags & IORING_FUTEX_SHARED),
>  +					   nonblock);
> 
> You surely made your point that this is well thought out.

Specifically? I may tell what I don't like

1) it being a rather internal helper. Other options I had are
to go through do_futex(FUTEX_WAKE_OP), but I don't see why
io_uring would need two addresses as a default operation,
though open to suggestions and clue bats. Either to add a
new futex opcode doing same thing as this function and
again call do_futex().

2) flags, but I didn't want to export internal futex flags,
especially as it may be not the way to go in the first place.

3) naming. Suggestions?

However, that's exactly the reason why it's an RFC. Would
be stupid rehearsing it if there are deeper problems

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/4] futex request support
  2021-06-01 14:58 [RFC 0/4] futex request support Pavel Begunkov
                   ` (3 preceding siblings ...)
  2021-06-01 14:58 ` [RFC 4/4] io_uring: implement futex wait Pavel Begunkov
@ 2021-06-03 18:59 ` Andres Freund
  2021-06-04 15:26   ` Pavel Begunkov
  4 siblings, 1 reply; 23+ messages in thread
From: Andres Freund @ 2021-06-03 18:59 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: io-uring, Jens Axboe, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Darren Hart, Davidlohr Bueso, linux-kernel

Hi,

On 2021-06-01 15:58:25 +0100, Pavel Begunkov wrote:
> Should be interesting for a bunch of people, so we should first
> outline API and capabilities it should give. As I almost never
> had to deal with futexes myself, would especially love to hear
> use case, what might be lacking and other blind spots.

I did chat with Jens about how useful futex support would be in io_uring, so I
should outline our / my needs. I'm off work this week though, so I don't think
I'll have much time to experiment.

For postgres's AIO support (which I am working on) there are two, largely
independent, use-cases for desiring futex support in io_uring.

The first is the ability to wait for locks (queued r/w locks, blocking
implemented via futexes) and IO at the same time, within one task. Quickly and
efficiently processing IO completions can improve whole-system latency and
throughput substantially in some cases (journalling, indexes and other
high-contention areas - which often have a low queue depth). This is true
*especially* when there also is lock contention, which tends to make efficient
IO scheduling harder.

The second use case is the ability to efficiently wait in several tasks for
one IO to be processed. The prototypical example here is group commit/journal
flush, where each task can only continue once the journal flush has
completed. Typically one of waiters has to do a small amount of work with the
completion (updating a few shared memory variables) before the other waiters
can be released. It is hard to implement this efficiently and race-free with
io_uring right now without adding locking around *waiting* on the completion
side (instead of just consumption of completions). One cannot just wait on the
io_uring, because of a) the obvious race that another process could reap all
completions between check and wait b) there is no good way to wake up other
waiters once the userspace portion of IO completion is through.


All answers for postgres:

> 1) Do we need PI?

Not right now.

Not related to io_uring: I do wish there were a lower overhead (and lower
guarantees) version of PI futexes. Not for correctness reasons, but
performance. Granting the waiter's timeslice to the lock holder would improve
common contention scenarios with more runnable tasks than cores.


> 2) Do we need requeue? Anything else?

I can see requeue being useful, but I haven't thought it through fully.

Do the wake/wait ops as you have them right now support bitsets?


> 3) How hot waits are? May be done fully async avoiding io-wq, but
> apparently requires more changes in futex code.

The waits can be quite hot, most prominently on low latency storage, but not
just.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-01 21:53           ` Thomas Gleixner
  2021-06-03 10:31             ` Pavel Begunkov
@ 2021-06-03 19:03             ` Andres Freund
  2021-06-03 21:10               ` Peter Zijlstra
  2021-06-05  0:43               ` Thomas Gleixner
  1 sibling, 2 replies; 23+ messages in thread
From: Andres Freund @ 2021-06-03 19:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Pavel Begunkov, Jens Axboe, io-uring, Ingo Molnar,
	Peter Zijlstra, Darren Hart, Davidlohr Bueso, linux-kernel

On 2021-06-01 23:53:00 +0200, Thomas Gleixner wrote:
> You surely made your point that this is well thought out.

Really impressed with your effort to generously interpret the first
version of a proof of concept patch that explicitly was aimed at getting
feedback on the basic design and the different use cases.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-03 19:03             ` Andres Freund
@ 2021-06-03 21:10               ` Peter Zijlstra
  2021-06-03 21:21                 ` Andres Freund
  2021-06-05  0:43               ` Thomas Gleixner
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2021-06-03 21:10 UTC (permalink / raw)
  To: Andres Freund
  Cc: Thomas Gleixner, Pavel Begunkov, Jens Axboe, io-uring,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, linux-kernel

On Thu, Jun 03, 2021 at 12:03:38PM -0700, Andres Freund wrote:
> On 2021-06-01 23:53:00 +0200, Thomas Gleixner wrote:
> > You surely made your point that this is well thought out.
> 
> Really impressed with your effort to generously interpret the first
> version of a proof of concept patch that explicitly was aimed at getting
> feedback on the basic design and the different use cases.

You *completely* failed to describe any. I'm with tglx; I see no reason
to even look at the patches. If you don't have a problem, you don't need
a solution.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-03 21:10               ` Peter Zijlstra
@ 2021-06-03 21:21                 ` Andres Freund
  0 siblings, 0 replies; 23+ messages in thread
From: Andres Freund @ 2021-06-03 21:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Pavel Begunkov, Jens Axboe, io-uring,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, linux-kernel

Hi,

On Thu, Jun 3, 2021, at 14:10, Peter Zijlstra wrote:
> On Thu, Jun 03, 2021 at 12:03:38PM -0700, Andres Freund wrote:
> > On 2021-06-01 23:53:00 +0200, Thomas Gleixner wrote:
> > > You surely made your point that this is well thought out.
> > 
> > Really impressed with your effort to generously interpret the first
> > version of a proof of concept patch that explicitly was aimed at getting
> > feedback on the basic design and the different use cases.
> 
> You *completely* failed to describe any. I'm with tglx; I see no reason
> to even look at the patches. If you don't have a problem, you don't need
> a solution.

Note that this is not my patch, and I hadn't posted anything on the thread when Thomas sent that email? So I'm not sure what you're even referring to here as me having failed to do. If you think the explanation of my use cases I have since sent is insufficient, I'm happy to reply to that/expand on them, but you're going to have to be a bit more specific on why I "failed to describe any".

I just don't see why a simple "what's the use case" wouldn't be a lot more productive than this posturing. Particularly on a thread that is explicitly inviting people from outside the core kernel dev community to chime in.

Andres

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-03 10:31             ` Pavel Begunkov
@ 2021-06-04  9:19               ` Thomas Gleixner
  2021-06-04 11:58                 ` Pavel Begunkov
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2021-06-04  9:19 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe, io-uring
  Cc: Andres Freund, Ingo Molnar, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, linux-kernel

Pavel,

On Thu, Jun 03 2021 at 11:31, Pavel Begunkov wrote:
> On 6/1/21 10:53 PM, Thomas Gleixner wrote:
>> 1) The proposed solution: I can't figure out from the changelogs or the
>>    cover letter what kind of problems it solves and what the exact
>>    semantics are. If you ever consider to submit futex patches, may I
>>    recommend to study Documentation/process and get some inspiration
>>    from git-log?
>
> I'm sorry you're incapable of grasping ideas quick, but may we
> stop this stupid galling and switch to a more productive way of
> speaking?

which you just achieved by telling me I'm too stupid to understand your
brilliant idea. Great start for a productive discussion. Try again.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-04  9:19               ` Thomas Gleixner
@ 2021-06-04 11:58                 ` Pavel Begunkov
  2021-06-05  2:09                   ` Thomas Gleixner
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2021-06-04 11:58 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe, io-uring
  Cc: Andres Freund, Ingo Molnar, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, linux-kernel

On 6/4/21 10:19 AM, Thomas Gleixner wrote:
> Pavel,
> 
> On Thu, Jun 03 2021 at 11:31, Pavel Begunkov wrote:
>> On 6/1/21 10:53 PM, Thomas Gleixner wrote:
>>> 1) The proposed solution: I can't figure out from the changelogs or the
>>>    cover letter what kind of problems it solves and what the exact
>>>    semantics are. If you ever consider to submit futex patches, may I
>>>    recommend to study Documentation/process and get some inspiration
>>>    from git-log?
>>
>> I'm sorry you're incapable of grasping ideas quick, but may we
>> stop this stupid galling and switch to a more productive way of
>> speaking?
> 
> which you just achieved by telling me I'm too stupid to understand your
> brilliant idea. Great start for a productive discussion. Try again.

Exactly why there was "we". I have my share of annoyance, which I would
readily put aside if that saves me time. And that's the suggestion
made

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/4] futex request support
  2021-06-03 18:59 ` [RFC 0/4] futex request support Andres Freund
@ 2021-06-04 15:26   ` Pavel Begunkov
  0 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2021-06-04 15:26 UTC (permalink / raw)
  To: Andres Freund
  Cc: io-uring, Jens Axboe, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Darren Hart, Davidlohr Bueso, linux-kernel

On 6/3/21 7:59 PM, Andres Freund wrote:
> Hi,
> 
> On 2021-06-01 15:58:25 +0100, Pavel Begunkov wrote:
>> Should be interesting for a bunch of people, so we should first
>> outline API and capabilities it should give. As I almost never
>> had to deal with futexes myself, would especially love to hear
>> use case, what might be lacking and other blind spots.
> 
> I did chat with Jens about how useful futex support would be in io_uring, so I
> should outline our / my needs. I'm off work this week though, so I don't think
> I'll have much time to experiment.
> 
> For postgres's AIO support (which I am working on) there are two, largely
> independent, use-cases for desiring futex support in io_uring.
> 
> The first is the ability to wait for locks (queued r/w locks, blocking
> implemented via futexes) and IO at the same time, within one task. Quickly and
> efficiently processing IO completions can improve whole-system latency and
> throughput substantially in some cases (journalling, indexes and other
> high-contention areas - which often have a low queue depth). This is true
> *especially* when there also is lock contention, which tends to make efficient
> IO scheduling harder.

Can you give a quick pointer to futex uses in the postgres code or
where they are? Can't find in master but want to see what types of
futex operations are used and how.

> The second use case is the ability to efficiently wait in several tasks for
> one IO to be processed. The prototypical example here is group commit/journal
> flush, where each task can only continue once the journal flush has
> completed. Typically one of waiters has to do a small amount of work with the
> completion (updating a few shared memory variables) before the other waiters
> can be released. It is hard to implement this efficiently and race-free with
> io_uring right now without adding locking around *waiting* on the completion
> side (instead of just consumption of completions). One cannot just wait on the
> io_uring, because of a) the obvious race that another process could reap all
> completions between check and wait b) there is no good way to wake up other
> waiters once the userspace portion of IO completion is through.

IIRC, the idea is to have a link "I/O -> fut_wake(master_task or nr=1)",
and then after getting some work done the woken task does wake(nr=all),
presumably via sys_futex or io_uring. Is that right?

As with this option userspace can't modify the memory on which futex
sits, the wake in the patchset allows to do an atomic add similarly
to FUTEX_WAKE_OP. However, I still have general concerns if that's
a flexible enough way.

When io_uring-BPF is added it can be offloaded to BPF programs
probably together with "updating a few shared memory variables",
but these are just thoughts for the future.

> All answers for postgres:
> 
>> 1) Do we need PI?
> 
> Not right now.
> 
> Not related to io_uring: I do wish there were a lower overhead (and lower
> guarantees) version of PI futexes. Not for correctness reasons, but
> performance. Granting the waiter's timeslice to the lock holder would improve
> common contention scenarios with more runnable tasks than cores.
> 
> 
>> 2) Do we need requeue? Anything else?
> 
> I can see requeue being useful, but I haven't thought it through fully.
> 
> Do the wake/wait ops as you have them right now support bitsets?

No, but trivial to add

>> 3) How hot waits are? May be done fully async avoiding io-wq, but
>> apparently requires more changes in futex code.
> 
> The waits can be quite hot, most prominently on low latency storage, but not
> just.

Thanks Andres, that clears it up. The next step would be to verify
that FUTEX_WAKE_OP-style waking is enough.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-03 19:03             ` Andres Freund
  2021-06-03 21:10               ` Peter Zijlstra
@ 2021-06-05  0:43               ` Thomas Gleixner
  2021-06-07 11:31                 ` Pavel Begunkov
  1 sibling, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2021-06-05  0:43 UTC (permalink / raw)
  To: Andres Freund
  Cc: Pavel Begunkov, Jens Axboe, io-uring, Ingo Molnar,
	Peter Zijlstra, Darren Hart, Davidlohr Bueso, linux-kernel

Andres,

On Thu, Jun 03 2021 at 12:03, Andres Freund wrote:
> On 2021-06-01 23:53:00 +0200, Thomas Gleixner wrote:
>> You surely made your point that this is well thought out.
>
> Really impressed with your effort to generously interpret the first
> version of a proof of concept patch that explicitly was aimed at getting
> feedback on the basic design and the different use cases.

feedback on what?

There is absolutely no description of design and obviously there is no
use case either. So what do you expect me to be generous about?

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-04 11:58                 ` Pavel Begunkov
@ 2021-06-05  2:09                   ` Thomas Gleixner
  2021-06-07 12:14                     ` Pavel Begunkov
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2021-06-05  2:09 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe, io-uring
  Cc: Andres Freund, Ingo Molnar, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, linux-kernel, Chris Mason

Pavel!

On Fri, Jun 04 2021 at 12:58, Pavel Begunkov wrote:
> On 6/4/21 10:19 AM, Thomas Gleixner wrote:
>> On Thu, Jun 03 2021 at 11:31, Pavel Begunkov wrote:
>>> On 6/1/21 10:53 PM, Thomas Gleixner wrote:
>>>> 1) The proposed solution: I can't figure out from the changelogs or the
>>>>    cover letter what kind of problems it solves and what the exact
>>>>    semantics are. If you ever consider to submit futex patches, may I
>>>>    recommend to study Documentation/process and get some inspiration
>>>>    from git-log?
>>>
>>> I'm sorry you're incapable of grasping ideas quick, but may we
>>> stop this stupid galling and switch to a more productive way of
>>> speaking?
>> 
>> which you just achieved by telling me I'm too stupid to understand your
>> brilliant idea. Great start for a productive discussion. Try again.
>
> Exactly why there was "we". I have my share of annoyance, which I would
> readily put aside if that saves me time. And that's the suggestion
> made

"We"?

Here is _all_ the information you provided:

 0) Cover letter:

   > Proof of concept for io_uring futex requests. The wake side does
   > FUTEX_WAKE_OP type of modify-compare operation but with a single
   > address. Wait reqs go through io-wq and so slow path.

  Describes WHAT it is supposed to do, but not at all WHY.

   Plus it describes it in terms which are maybe understandable for
   io-uring aware people, but certainly not for the general audience.
  
   > Should be interesting for a bunch of people, so we should first outline
   > API and capabilities it should give.

  You post patches which _should_ be interesting for a unspecified bunch
  of people, but you have no idea what the API and capabilities should
  be?

  IOW, this follows the design principle of: Throw stuff at the wall and
  see what sticks?

  But at the same time you want feedback from the people responsible for
  the subsystems you are modifying without providing the required
  information and worse:

   > As I almost never had to deal with futexes myself, would especially
   > love to hear use case, what might be lacking and other blind spots.

  So you came up with a solution with no use case and expect the futex
  people or whoever to figure out what you actually want to solve?

 1) futex: add op wake for a single key

   > Add a new futex wake function futex_wake_op_single(), which works
   > similar to futex_wake_op() but only for a single futex address as it
   > takes too many arguments. Also export it and other functions that will
   > be used by io_uring.

  Tells WHAT the patch is doing but not WHY. And "as it takes too many
  arguments" really does not qualify in case that you think so.

 2) io_uring: frame out futex op

   > Add userspace futex request definitions and draft some internal
   > functions.

  Tells WHAT the patch is doing but not WHY.

 3) io_uring: support futex wake requests

   > Add support for futex wake requests, which also modifies the addr and
   > checks against it with encoded operation as FUTEX_WAKE_OP does, but only
   > operates with a single address as may be problematic to squeeze into SQE
   > and io_kiocb otherwise.

  Tells WHAT the patch is doing but not WHY. And of course "may be" is a
  really conclusive technical argument to make.

 4) io_uring: implement futex wait

   > Add futex wait requests, those always go through io-wq for simplicity.

  Tells WHAT the patch is doing but not WHY. Of course the 'for
  simplicity' aspect is not argued because it's obvious for anyone
  except stupid me.

Now let me quote Documentation/process/submitting-patches.rst:

  "Describe your problem.  Whether your patch is a one-line bug fix or
   5000 lines of a new feature, there must be an underlying problem that
   motivated you to do this work.  Convince the reviewer that there is a
   problem worth fixing and that it makes sense for them to read past the
   first paragraph."

Can you seriously point me to a single sentence in the above verbatim
quotes from your cover letter and changelogs which complies with that rule?

It does not matter whether this is RFC or not. You simply ignore well
documented rules and then you get upset because I told you so:

 > 1) The proposed solution: I can't figure out from the changelogs or the
 >    cover letter what kind of problems it solves and what the exact
 >    semantics are. If you ever consider to submit futex patches, may I
 >    recommend to study Documentation/process and get some inspiration
 >    from git-log?

And what's worse, you get impertinent about it:

 > I'm sorry you're incapable of grasping ideas quick

Sure. I'm incompetent and stupid just because I can't figure out your
brilliant ideas which are so well described - let me quote again:

   > Should be interesting for a bunch of people, so we should first outline
   > API and capabilities it should give.
   >
   > As I almost never had to deal with futexes myself, would especially
   > love to hear use case, what might be lacking and other blind spots.

and then you chose to tell me:

 > > Exactly why there was "we". I have my share of annoyance, which I
 > > would readily put aside if that saves me time. And that's the
 > > suggestion made.

Let me digest that. Your full response was:

   > I'm sorry you're incapable of grasping ideas quick, but may we stop
   > this stupid galling and switch to a more productive way of speaking?

Fact is that I told you that neither your cover letter nor your changelogs
give any clue (see above) and therefore violate the well documented patch
submission rules.

What's galling about that?

  - You wasted _my_ time by _not_ providing the information which I need
    to digest your submission.

  - I went way beyond what Documentation/process/ says and read past the
    first paragraph of useless information.

  - I provided you a detailed technical feedback nevertheless

And as a result you attack me at a non-technical level. So where exactly
is the "we" and who started galling?

> Exactly why there was "we". I have my share of annoyance, which I would
> readily put aside if that saves me time.

I grant you to be annoyed as much as you want. But you are getting
something fundamentaly wrong:

  "which I would readily put aside if that saves me time."

As I told you above: You have been already wasting _my_ time by not
providing the information which is required to actually look at what you
propose.

> Exactly why there was "we". I have my share of annoyance, which I would
> readily put aside if that saves me time. And that's the suggestion
> made

In my first reply I made that a recommendation, so let me rephrease
that:

     Read and comply to Documentation/process!

It does not matter at all how brilliant the idea you have is and how
stupid the reviewer at the other end might be. There are still rules to
follow and they apply to the most brilliant people on the planet.

So, as I told you before: Try again.

Yours sincerely,

	Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-05  0:43               ` Thomas Gleixner
@ 2021-06-07 11:31                 ` Pavel Begunkov
  2021-06-07 11:48                   ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2021-06-07 11:31 UTC (permalink / raw)
  To: Thomas Gleixner, Andres Freund
  Cc: Jens Axboe, io-uring, Ingo Molnar, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, linux-kernel

On 6/5/21 1:43 AM, Thomas Gleixner wrote:
> Andres,
> 
> On Thu, Jun 03 2021 at 12:03, Andres Freund wrote:
>> On 2021-06-01 23:53:00 +0200, Thomas Gleixner wrote:
>>> You surely made your point that this is well thought out.
>>
>> Really impressed with your effort to generously interpret the first
>> version of a proof of concept patch that explicitly was aimed at getting
>> feedback on the basic design and the different use cases.
> 
> feedback on what?
> 
> There is absolutely no description of design and obviously there is no
> use case either. So what do you expect me to be generous about?

That's a complete fallacy, the very RFC is about clarifying a
use case that I was hinted about, not mentioning those I described
you in a reply. Obviously

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-07 11:31                 ` Pavel Begunkov
@ 2021-06-07 11:48                   ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2021-06-07 11:48 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Thomas Gleixner, Andres Freund, Jens Axboe, io-uring,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, linux-kernel

On Mon, Jun 07, 2021 at 12:31:48PM +0100, Pavel Begunkov wrote:
> On 6/5/21 1:43 AM, Thomas Gleixner wrote:
> > Andres,
> > 
> > On Thu, Jun 03 2021 at 12:03, Andres Freund wrote:
> >> On 2021-06-01 23:53:00 +0200, Thomas Gleixner wrote:
> >>> You surely made your point that this is well thought out.
> >>
> >> Really impressed with your effort to generously interpret the first
> >> version of a proof of concept patch that explicitly was aimed at getting
> >> feedback on the basic design and the different use cases.
> > 
> > feedback on what?
> > 
> > There is absolutely no description of design and obviously there is no
> > use case either. So what do you expect me to be generous about?
> 
> That's a complete fallacy, the very RFC is about clarifying a
> use case that I was hinted about, not mentioning those I described
> you in a reply. Obviously

Then consider this:

Nacked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

for anything touching futex.c, until such time that you can provide a
coherent description of what and why you're doing things.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 4/4] io_uring: implement futex wait
  2021-06-05  2:09                   ` Thomas Gleixner
@ 2021-06-07 12:14                     ` Pavel Begunkov
  0 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2021-06-07 12:14 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe, io-uring
  Cc: Andres Freund, Ingo Molnar, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, linux-kernel, Chris Mason

On 6/5/21 3:09 AM, Thomas Gleixner wrote:
[...]
> Here is _all_ the information you provided:
> 
>  0) Cover letter:
> 
>    > Proof of concept for io_uring futex requests. The wake side does
>    > FUTEX_WAKE_OP type of modify-compare operation but with a single
>    > address. Wait reqs go through io-wq and so slow path.
> 
>   Describes WHAT it is supposed to do, but not at all WHY.
> 
>    Plus it describes it in terms which are maybe understandable for
>    io-uring aware people, but certainly not for the general audience.

I actually agree with that and going to add it once I get details
I needed.

>    > Should be interesting for a bunch of people, so we should first outline
>    > API and capabilities it should give.
> 
>   You post patches which _should_ be interesting for a unspecified bunch
>   of people, but you have no idea what the API and capabilities should
>   be?

That's word carping. Some of the cases were known, but was more
interested atm in others I heard only a brief idea about, that's
why that person was CC'ed.

>   IOW, this follows the design principle of: Throw stuff at the wall and
>   see what sticks?

Exactly what it is *not*. Emails were chosen to clarify details,
nobody tells it wouldn't be reworked and adjusted. Do you imply
I should discuss ideas privately?

>   But at the same time you want feedback from the people responsible for
>   the subsystems you are modifying without providing the required
>   information and worse:
> 
>    > As I almost never had to deal with futexes myself, would especially
>    > love to hear use case, what might be lacking and other blind spots.
> 
>   So you came up with a solution with no use case and expect the futex
>   people or whoever to figure out what you actually want to solve?

Again, not true. Where did you get that?

> 
[...]
> Now let me quote Documentation/process/submitting-patches.rst:
> 
>   "Describe your problem.  Whether your patch is a one-line bug fix or
>    5000 lines of a new feature, there must be an underlying problem that
>    motivated you to do this work.  Convince the reviewer that there is a
>    problem worth fixing and that it makes sense for them to read past the
>    first paragraph."
> 
> Can you seriously point me to a single sentence in the above verbatim
> quotes from your cover letter and changelogs which complies with that rule?
> 
> It does not matter whether this is RFC or not. You simply ignore well
> documented rules and then you get upset because I told you so:
> 
>  > 1) The proposed solution: I can't figure out from the changelogs or the
>  >    cover letter what kind of problems it solves and what the exact
>  >    semantics are. If you ever consider to submit futex patches, may I
>  >    recommend to study Documentation/process and get some inspiration
>  >    from git-log?
> 
> And what's worse, you get impertinent about it:

Impertinent? Was just keeping up with your nice way of conveying
ideas. FWIW, it's not in particularly related to this small chunk
above at all.

>  > I'm sorry you're incapable of grasping ideas quick
> 
> Sure. I'm incompetent and stupid just because I can't figure out your
> brilliant ideas which are so well described - let me quote again:

That's your own interpretation, can't help you with that

[...]
> What's galling about that?
> 
>   - You wasted _my_ time by _not_ providing the information which I need
>     to digest your submission.
> 
>   - I went way beyond what Documentation/process/ says and read past the
>     first paragraph of useless information.
> 
>   - I provided you a detailed technical feedback nevertheless
> 
> And as a result you attack me at a non-technical level. So where exactly
> is the "we" and who started galling?

If you think it was an attack, your response might have been interpreted
in a such way as well, even though it haven't by me. There are enough of
weird phrases and implications in your reply, but I have no intention
of going through it and picking up on every phrase, would be useless

>> Exactly why there was "we". I have my share of annoyance, which I would
>> readily put aside if that saves me time.
> 
> I grant you to be annoyed as much as you want. But you are getting
> something fundamentaly wrong:
> 
>   "which I would readily put aside if that saves me time."
> 
> As I told you above: You have been already wasting _my_ time by not
> providing the information which is required to actually look at what you
> propose.
> 
>> Exactly why there was "we". I have my share of annoyance, which I would
>> readily put aside if that saves me time. And that's the suggestion
>> made
> 
> In my first reply I made that a recommendation, so let me rephrease
> that:
> 
>      Read and comply to Documentation/process!
> 
> It does not matter at all how brilliant the idea you have is and how
> stupid the reviewer at the other end might be. There are still rules to
> follow and they apply to the most brilliant people on the planet.
> 
> So, as I told you before: Try again.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-06-07 12:15 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-01 14:58 [RFC 0/4] futex request support Pavel Begunkov
2021-06-01 14:58 ` [RFC 1/4] futex: add op wake for a single key Pavel Begunkov
2021-06-01 14:58 ` [RFC 2/4] io_uring: frame out futex op Pavel Begunkov
2021-06-01 14:58 ` [RFC 3/4] io_uring: support futex wake requests Pavel Begunkov
2021-06-01 14:58 ` [RFC 4/4] io_uring: implement futex wait Pavel Begunkov
2021-06-01 15:45   ` Jens Axboe
2021-06-01 15:58     ` Pavel Begunkov
2021-06-01 16:01       ` Jens Axboe
2021-06-01 16:29         ` Pavel Begunkov
2021-06-01 21:53           ` Thomas Gleixner
2021-06-03 10:31             ` Pavel Begunkov
2021-06-04  9:19               ` Thomas Gleixner
2021-06-04 11:58                 ` Pavel Begunkov
2021-06-05  2:09                   ` Thomas Gleixner
2021-06-07 12:14                     ` Pavel Begunkov
2021-06-03 19:03             ` Andres Freund
2021-06-03 21:10               ` Peter Zijlstra
2021-06-03 21:21                 ` Andres Freund
2021-06-05  0:43               ` Thomas Gleixner
2021-06-07 11:31                 ` Pavel Begunkov
2021-06-07 11:48                   ` Peter Zijlstra
2021-06-03 18:59 ` [RFC 0/4] futex request support Andres Freund
2021-06-04 15:26   ` Pavel Begunkov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.