* [POC RFC 0/3] support graph like dependent sqes
@ 2021-12-14  5:57 Hao Xu
  2021-12-14  5:57 ` [PATCH 1/3] io_uring: add data structure for graph sqe feature Hao Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Hao Xu @ 2021-12-14  5:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

This is just a proof of concept and is incomplete; I'm sending it early
for thoughts and suggestions.

We already have IOSQE_IO_LINK to describe a linear dependency
relationship between sqes, while this patchset provides a new feature to
support DAG dependencies. For instance, 4 sqes have a relationship
as below:
      --> 2 --
     /        \
1 ---          ---> 4
     \        /
      --> 3 --
IOSQE_IO_LINK serializes them to 1-->2-->3-->4, which unnecessarily
serializes 2 and 3, but a DAG can fully describe it.
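
For reference, the same four requests expressed with the existing linear
link would look roughly like the liburing sketch below (ring, fd, bufs
and BS are placeholders); 2 and 3 cannot run in parallel this way:

	/* 1 --> 2 --> 3 --> 4, strictly serialized by IOSQE_IO_LINK */
	for (int i = 0; i < 4; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_read(sqe, fd, bufs[i], BS, i * BS);
		if (i != 3)
			sqe->flags |= IOSQE_IO_LINK;
	}
	io_uring_submit(&ring);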

For detailed usage, see the following patches' messages.

Tested it with 100 direct read sqes, each one reading a BS=4k block of
data from the same file; the blocks do not overlap. These sqes form a graph:
      2
      3
1 --> 4 --> 100
     ...
      99

This is an extreme case, just to show the idea.

results below:
io_link:
IOPS: 15898251
graph_link:
IOPS: 29325513
io_link:
IOPS: 16420361
graph_link:
IOPS: 29585798
io_link:
IOPS: 18148820
graph_link:
IOPS: 27932960

Tested many times; the numbers are not very stable but they show the difference.

Things of concern:
1. overhead on the hot path: several IF checks
2. many memory allocations
3. many atomic_read/inc/dec operations

Many things still to be done:
1. cancellation, failure path
2. integration with other features
3. maybe some cache design to overcome the overhead of memory
   allocation
4. something like topological sorting to reject cycles in the graph
   (see the sketch below)
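
For item 4, a userspace sketch of the kind of cycle check (Kahn's
algorithm) that the io_check_graph() stub in patch 2 could grow into;
this only illustrates the algorithm and is not kernel-ready code:

	#include <stdbool.h>
	#include <stdlib.h>

	struct io_graph_edge { int from; int to; };	/* matches patch 2 */

	/* true if the edge list contains a cycle; nodes are numbered 1..nr_nodes */
	static bool graph_has_cycle(const struct io_graph_edge *edges,
				    int nr_edges, int nr_nodes)
	{
		/* allocation checks omitted in this sketch */
		int *indeg = calloc(nr_nodes + 1, sizeof(int));
		int *queue = calloc(nr_nodes + 1, sizeof(int));
		int head = 0, tail = 0, seen = 0, i;
		bool cycle;

		for (i = 0; i < nr_edges; i++)
			indeg[edges[i].to]++;
		for (i = 1; i <= nr_nodes; i++)
			if (!indeg[i])
				queue[tail++] = i;
		while (head < tail) {
			int n = queue[head++];

			seen++;
			/* "remove" n by dropping its outgoing edges */
			for (i = 0; i < nr_edges; i++)
				if (edges[i].from == n && --indeg[edges[i].to] == 0)
					queue[tail++] = edges[i].to;
		}
		cycle = (seen != nr_nodes);	/* leftovers never reached in-degree 0 */
		free(indeg);
		free(queue);
		return cycle;
	}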

Any thoughts?

Hao Xu (3):
  io_uring: add data structure for graph sqe feature
  io_uring: implement new sqe opcode to build graph like links
  io_uring: implement logic of IOSQE_GRAPH request

 fs/io_uring.c                 | 231 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |   9 ++
 2 files changed, 235 insertions(+), 5 deletions(-)

-- 
2.25.1



* [PATCH 1/3] io_uring: add data structure for graph sqe feature
  2021-12-14  5:57 [POC RFC 0/3] support graph like dependent sqes Hao Xu
@ 2021-12-14  5:57 ` Hao Xu
  2021-12-14  5:57 ` [PATCH 2/3] io_uring: implement new sqe opcode to build graph like links Hao Xu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Hao Xu @ 2021-12-14  5:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Add basic data structures for the graph sqe feature.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index d86372664f9f..e96d38d268a8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -326,8 +326,25 @@ struct io_submit_state {
 	struct blk_plug		plug;
 };
 
+struct io_graph_child_node {
+	struct io_graph_node		*to;
+	struct io_graph_child_node	*next;
+};
+
+struct io_graph_node {
+	struct io_kiocb			*req;
+	struct io_graph_child_node	*children;
+	atomic_t			refs;
+};
+
+struct io_graph_info {
+	int				nr_nodes;
+	int				pos;
+	struct io_graph_node		**graph_array;
+};
+
 struct io_ring_ctx {
 	/* const or read-mostly hot data */
 	struct {
 		struct percpu_ref	refs;
 
@@ -459,6 +476,7 @@ struct io_ring_ctx {
 		u32				iowq_limits[2];
 		bool				iowq_limits_set;
 	};
+	struct io_graph_info			graph_info;
 };
 
 struct io_uring_task {
-- 
2.25.1



* [PATCH 2/3] io_uring: implement new sqe opcode to build graph like links
  2021-12-14  5:57 [POC RFC 0/3] support graph like dependent sqes Hao Xu
  2021-12-14  5:57 ` [PATCH 1/3] io_uring: add data structure for graph sqe feature Hao Xu
@ 2021-12-14  5:57 ` Hao Xu
  2021-12-14  5:57 ` [PATCH 3/3] io_uring: implement logic of IOSQE_GRAPH request Hao Xu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Hao Xu @ 2021-12-14  5:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Add a new opcode IORING_OP_BUILD_GRAPH to build a graph dependency
between sqes. The format is:
sqe->fd: number of nodes in the graph
sqe->len: number of edges in the graph
sqe->addr: array of edges; each element is in <a, b> format, which
           represents an edge from a to b. Graph sqes are indexed from 1
           to sqe->fd.
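
A minimal userspace sketch of the intended usage, for the diamond
example from the cover letter (1 --> {2,3} --> 4); ring is an
initialized struct io_uring, error handling is omitted:

	struct io_graph_edge edges[] = {
		{ .from = 1, .to = 2 },
		{ .from = 1, .to = 3 },
		{ .from = 2, .to = 4 },
		{ .from = 3, .to = 4 },
	};
	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_BUILD_GRAPH;
	sqe->fd = 4;			/* nr_nodes */
	sqe->len = 4;			/* nr_edges */
	sqe->addr = (unsigned long)edges;
	/* the 4 graph node sqes follow, flagged with IOSQE_GRAPH (see patch 3) */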

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c                 | 140 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |   6 ++
 2 files changed, 145 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e96d38d268a8..53309eaef69d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -709,6 +709,13 @@ struct io_hardlink {
 	int				flags;
 };
 
+struct io_graph {
+	struct file			*file;
+	int				nr_nodes;
+	int				nr_edges;
+	u64				edges;
+};
+
 struct io_async_connect {
 	struct sockaddr_storage		address;
 };
@@ -783,7 +790,6 @@ enum {
 	REQ_F_BUFFER_SELECT	= BIT(REQ_F_BUFFER_SELECT_BIT),
 	/* IOSQE_CQE_SKIP_SUCCESS */
 	REQ_F_CQE_SKIP		= BIT(REQ_F_CQE_SKIP_BIT),
-
 	/* fail rest of links */
 	REQ_F_FAIL		= BIT(REQ_F_FAIL_BIT),
 	/* on inflight list, should be cancelled and waited on exit reliably */
@@ -874,6 +880,7 @@ struct io_kiocb {
 		struct io_mkdir		mkdir;
 		struct io_symlink	symlink;
 		struct io_hardlink	hardlink;
+		struct io_graph		graph;
 	};
 
 	u8				opcode;
@@ -1123,6 +1130,7 @@ static const struct io_op_def io_op_defs[] = {
 	[IORING_OP_MKDIRAT] = {},
 	[IORING_OP_SYMLINKAT] = {},
 	[IORING_OP_LINKAT] = {},
+	[IORING_OP_BUILD_GRAPH] = {},
 };
 
 /* requests with any of those set should undergo io_disarm_next() */
@@ -5300,6 +5308,131 @@ IO_NETOP_FN(send);
 IO_NETOP_FN(recv);
 #endif /* CONFIG_NET */
 
+#define MAX_GRAPH_NODES 1000
+
+static int io_buildgraph_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	int *nr_nodes = &req->graph.nr_nodes;
+
+	if (req->ctx->graph_info.graph_array)
+		return -EBUSY;
+
+	*nr_nodes = READ_ONCE(sqe->fd);
+	if (*nr_nodes <= 0 || *nr_nodes > MAX_GRAPH_NODES)
+		return -EINVAL;
+
+	req->graph.nr_edges = READ_ONCE(sqe->len);
+	req->graph.edges = READ_ONCE(sqe->addr);
+	return 0;
+}
+
+static int io_add_edge(struct io_graph_node *from, struct io_graph_node *to)
+{
+	struct io_graph_child_node *head = from->children;
+	struct io_graph_child_node *new_edge =
+		kzalloc(sizeof(struct io_graph_child_node), GFP_KERNEL);
+
+	if (!new_edge)
+		return -ENOMEM;
+
+	new_edge->to = to;
+	new_edge->next = head;
+	from->children = new_edge;
+
+	return 0;
+}
+static void io_remove_edges(struct io_graph_node *node)
+{
+	struct io_graph_child_node *next, *edge_node = node->children;
+
+	while (edge_node) {
+		next = edge_node->next;
+		kfree(edge_node);
+		edge_node = next;
+	}
+}
+
+static int __io_buildgraph(struct io_graph_node **graph_array, struct io_graph_edge *io_edges,
+			   int nr_edges, int nr_nodes)
+{
+	int i, ret;
+
+	for (i = 0; i < nr_edges; i++) {
+		int from = io_edges[i].from, to = io_edges[i].to;
+
+		ret = io_add_edge(graph_array[from], graph_array[to]);
+		if (ret)
+			goto fail;
+		atomic_inc(&graph_array[to]->refs);
+	}
+	for (i = 1; i <= nr_nodes; i++) {
+		if (atomic_read(&graph_array[i]->refs))
+			continue;
+
+		ret = io_add_edge(graph_array[0], graph_array[i]);
+		if (ret)
+			goto fail;
+		atomic_set(&graph_array[i]->refs, 1);
+	}
+
+	return 0;
+
+fail:
+	for (i = 0; i <= nr_nodes; i++)
+		io_remove_edges(graph_array[i]);
+	return ret;
+}
+
+static int io_check_graph(struct io_graph_edge *io_edges, int nr_edges, int nr_nodes)
+{
+	return 0;
+}
+
+static int io_buildgraph(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_graph_info *gi = &req->ctx->graph_info;
+	struct io_graph_edge __user *edges, *io_edges = NULL;
+	int i, n = req->graph.nr_nodes, m = req->graph.nr_edges;
+	int ret = -ENOMEM;
+
+	edges = u64_to_user_ptr(req->graph.edges);
+	io_edges = kvmalloc_array(m, sizeof(struct io_graph_edge), GFP_KERNEL);
+	if (!io_edges)
+		goto fail;
+
+	ret = -EFAULT;
+	if (copy_from_user(io_edges, edges, sizeof(struct io_graph_edge) * m))
+		goto fail;
+	ret = io_check_graph(io_edges, m, n);
+	if (ret)
+		goto fail;
+
+	gi->nr_nodes = n;
+	gi->pos = 0;
+	ret = -ENOMEM;
+	gi->graph_array = kmalloc(sizeof(struct io_graph_node *) * (n + 1),  GFP_KERNEL);
+	if (!gi->graph_array)
+		goto fail;
+	for (i = 0; i <= n; i++) {
+		gi->graph_array[i] = kzalloc(sizeof(struct io_graph_node), GFP_KERNEL);
+		if (!gi->graph_array[i])
+			goto fail;
+		atomic_set(&gi->graph_array[i]->refs, 0);
+	}
+	ret = __io_buildgraph(gi->graph_array, io_edges, m, n);
+
+	if (ret) {
+fail:
+		kvfree(io_edges);
+		for (i = 0; i <= n; i++)
+			kfree(gi->graph_array[i]);
+		kfree(gi->graph_array);
+	}
+
+	__io_req_complete(req, issue_flags, ret, 0);
+	return 0;
+}
+
 struct io_poll_table {
 	struct poll_table_struct pt;
 	struct io_kiocb *req;
@@ -6493,6 +6626,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return io_symlinkat_prep(req, sqe);
 	case IORING_OP_LINKAT:
 		return io_linkat_prep(req, sqe);
+	case IORING_OP_BUILD_GRAPH:
+		return io_buildgraph_prep(req, sqe);
 	}
 
 	printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -6776,6 +6911,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	case IORING_OP_LINKAT:
 		ret = io_linkat(req, issue_flags);
 		break;
+	case IORING_OP_BUILD_GRAPH:
+		ret = io_buildgraph(req, issue_flags);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 787f491f0d2a..7ffbd9c4afd9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -143,6 +143,7 @@ enum {
 	IORING_OP_MKDIRAT,
 	IORING_OP_SYMLINKAT,
 	IORING_OP_LINKAT,
+	IORING_OP_BUILD_GRAPH,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -422,4 +423,9 @@ struct io_uring_getevents_arg {
 	__u64	ts;
 };
 
+struct io_graph_edge {
+	int from;
+	int to;
+};
+
 #endif
-- 
2.25.1



* [PATCH 3/3] io_uring: implement logic of IOSQE_GRAPH request
  2021-12-14  5:57 [POC RFC 0/3] support graph like dependent sqes Hao Xu
  2021-12-14  5:57 ` [PATCH 1/3] io_uring: add data structure for graph sqe feature Hao Xu
  2021-12-14  5:57 ` [PATCH 2/3] io_uring: implement new sqe opcode to build graph like links Hao Xu
@ 2021-12-14  5:57 ` Hao Xu
  2021-12-14 15:21 ` [POC RFC 0/3] support graph like dependent sqes Pavel Begunkov
  2021-12-23 10:06 ` Christian Dietrich
  4 siblings, 0 replies; 15+ messages in thread
From: Hao Xu @ 2021-12-14  5:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

After we build a graph, we need to fill the sqe nodes into it. Introduce
a new flag for graph node sqes. The whole graph will be issued when all
the sqes it needs are ready.
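
Continuing the sketch from patch 2: after the IORING_OP_BUILD_GRAPH sqe,
the node sqes are queued in index order (1..nr_nodes) with the new flag
set, and the whole graph is kicked off once the last node is queued.
Rough, untested illustration with liburing (fd, bufs and BS are
placeholders):

	for (int i = 1; i <= 4; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_read(sqe, fd, bufs[i], BS, (i - 1) * BS);
		sqe->flags |= IOSQE_GRAPH;
	}
	/* one submit covers the BUILD_GRAPH sqe plus the 4 node sqes */
	io_uring_submit(&ring);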

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c                 | 71 +++++++++++++++++++++++++++++++++--
 include/uapi/linux/io_uring.h |  3 ++
 2 files changed, 71 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 53309eaef69d..fbe1ab029c4b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -109,7 +109,8 @@
 			  IOSQE_IO_HARDLINK | IOSQE_ASYNC)
 
 #define SQE_VALID_FLAGS	(SQE_COMMON_FLAGS | IOSQE_BUFFER_SELECT | \
-			IOSQE_IO_DRAIN | IOSQE_CQE_SKIP_SUCCESS)
+			IOSQE_IO_DRAIN | IOSQE_CQE_SKIP_SUCCESS | \
+			IOSQE_GRAPH)
 
 #define IO_REQ_CLEAN_FLAGS (REQ_F_BUFFER_SELECTED | REQ_F_NEED_CLEANUP | \
 				REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS | \
@@ -750,6 +751,7 @@ enum {
 	REQ_F_FORCE_ASYNC_BIT	= IOSQE_ASYNC_BIT,
 	REQ_F_BUFFER_SELECT_BIT	= IOSQE_BUFFER_SELECT_BIT,
 	REQ_F_CQE_SKIP_BIT	= IOSQE_CQE_SKIP_SUCCESS_BIT,
+	REQ_F_GRAPH_BIT		= IOSQE_GRAPH_BIT,
 
 	/* first byte is taken by user flags, shift it to not overlap */
 	REQ_F_FAIL_BIT		= 8,
@@ -790,6 +792,8 @@ enum {
 	REQ_F_BUFFER_SELECT	= BIT(REQ_F_BUFFER_SELECT_BIT),
 	/* IOSQE_CQE_SKIP_SUCCESS */
 	REQ_F_CQE_SKIP		= BIT(REQ_F_CQE_SKIP_BIT),
+	/* IOSQE_GRAPH */
+	REQ_F_GRAPH		= BIT(REQ_F_GRAPH_BIT),
 	/* fail rest of links */
 	REQ_F_FAIL		= BIT(REQ_F_FAIL_BIT),
 	/* on inflight list, should be cancelled and waited on exit reliably */
@@ -916,6 +920,8 @@ struct io_kiocb {
 	const struct cred		*creds;
 	/* stores selected buf, valid IFF REQ_F_BUFFER_SELECTED is set */
 	struct io_buffer		*kbuf;
+	/* reverse link to the io_graph_node which points to it */
+	struct io_graph_node            *graph_node;
 };
 
 struct io_tctx_node {
@@ -1916,6 +1922,8 @@ static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data,
 	return __io_fill_cqe(ctx, user_data, res, cflags);
 }
 
+static inline void io_graph_queue_children(struct io_kiocb *req);
+
 static void io_req_complete_post(struct io_kiocb *req, s32 res,
 				 u32 cflags)
 {
@@ -1937,6 +1945,7 @@ static void io_req_complete_post(struct io_kiocb *req, s32 res,
 				req->link = NULL;
 			}
 		}
+		io_graph_queue_children(req);
 		io_req_put_rsrc(req, ctx);
 		io_dismantle_req(req);
 		io_put_task(req->task, 1);
@@ -2100,6 +2109,10 @@ static __cold void __io_free_req(struct io_kiocb *req)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
+	/*
+	 * not a proper place to put this, but..
+	 */
+	io_graph_queue_children(req);
 	io_req_put_rsrc(req, ctx);
 	io_dismantle_req(req);
 	io_put_task(req->task, 1);
@@ -2414,6 +2427,7 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
 
 		io_req_put_rsrc_locked(req, ctx);
 		io_queue_next(req);
+		io_graph_queue_children(req);
 		io_dismantle_req(req);
 
 		if (req->task != task) {
@@ -5341,6 +5355,7 @@ static int io_add_edge(struct io_graph_node *from, struct io_graph_node *to)
 
 	return 0;
 }
+
 static void io_remove_edges(struct io_graph_node *node)
 {
 	struct io_graph_child_node *next, *edge_node = node->children;
@@ -5352,6 +5367,31 @@ static void io_remove_edges(struct io_graph_node *node)
 	}
 }
 
+static void free_graph_node(struct io_graph_node *gn)
+{
+	io_remove_edges(gn);
+	kfree(gn);
+}
+
+static inline void io_graph_queue_children(struct io_kiocb *req)
+{
+	struct io_graph_node *gn;
+	struct io_graph_child_node *node;
+
+	if (!(req->flags & REQ_F_GRAPH))
+		return;
+
+	gn = req->graph_node;
+	node = gn->children;
+	while (node) {
+		gn = node->to;
+		if (atomic_dec_and_test(&gn->refs))
+			io_req_task_queue(gn->req);
+		node = node->next;
+	}
+	free_graph_node(req->graph_node);
+}
+
 static int __io_buildgraph(struct io_graph_node **graph_array, struct io_graph_edge *io_edges,
 			   int nr_edges, int nr_nodes)
 {
@@ -7185,20 +7225,36 @@ static void io_queue_sqe_fallback(struct io_kiocb *req)
 		io_req_complete_fail_submit(req);
 	} else if (unlikely(req->ctx->drain_active)) {
 		io_drain_req(req);
-	} else {
+	} else if (req->flags & REQ_F_FORCE_ASYNC) {
 		int ret = io_req_prep_async(req);
 
 		if (unlikely(ret))
 			io_req_complete_failed(req, ret);
 		else
 			io_queue_async_work(req, NULL);
+	} else {
+		struct io_ring_ctx *ctx = req->ctx;
+		struct io_graph_node **ga = ctx->graph_info.graph_array;
+		struct io_graph_child_node *node;
+
+		if (ctx->graph_info.pos != ctx->graph_info.nr_nodes)
+			return;
+
+		node = ga[0]->children;
+		while (node) {
+			__io_queue_sqe(node->to->req);
+			node = node->next;
+		}
+		free_graph_node(ga[0]);
+		kfree(ctx->graph_info.graph_array);
+		memset(&ctx->graph_info, 0, sizeof(ctx->graph_info));
 	}
 }
 
 static inline void io_queue_sqe(struct io_kiocb *req)
 	__must_hold(&req->ctx->uring_lock)
 {
-	if (likely(!(req->flags & (REQ_F_FORCE_ASYNC | REQ_F_FAIL))))
+	if (likely(!(req->flags & (REQ_F_FORCE_ASYNC | REQ_F_FAIL | REQ_F_GRAPH))))
 		__io_queue_sqe(req);
 	else
 		io_queue_sqe_fallback(req);
@@ -7281,6 +7337,15 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 				return -EOPNOTSUPP;
 			io_init_req_drain(req);
 		}
+		if (sqe_flags & IOSQE_GRAPH) {
+			struct io_graph_node **ga = ctx->graph_info.graph_array;
+
+			if (!ga)
+				return -EINVAL;
+
+			ga[++(ctx->graph_info.pos)]->req = req;
+			req->graph_node = ga[ctx->graph_info.pos];
+		}
 	}
 	if (unlikely(ctx->restricted || ctx->drain_active || ctx->drain_next)) {
 		if (ctx->restricted && !io_check_restriction(ctx, req, sqe_flags))
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 7ffbd9c4afd9..a29503060640 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -71,6 +71,7 @@ enum {
 	IOSQE_ASYNC_BIT,
 	IOSQE_BUFFER_SELECT_BIT,
 	IOSQE_CQE_SKIP_SUCCESS_BIT,
+	IOSQE_GRAPH_BIT,
 };
 
 /*
@@ -90,6 +91,8 @@ enum {
 #define IOSQE_BUFFER_SELECT	(1U << IOSQE_BUFFER_SELECT_BIT)
 /* don't post CQE if request succeeded */
 #define IOSQE_CQE_SKIP_SUCCESS	(1U << IOSQE_CQE_SKIP_SUCCESS_BIT)
+/* sqes that in a graph dependency relationship */
+#define IOSQE_GRAPH		(1U << IOSQE_GRAPH_BIT)
 
 /*
  * io_uring_setup() flags
-- 
2.25.1



* Re: [POC RFC 0/3] support graph like dependent sqes
  2021-12-14  5:57 [POC RFC 0/3] support graph like dependent sqes Hao Xu
                   ` (2 preceding siblings ...)
  2021-12-14  5:57 ` [PATCH 3/3] io_uring: implement logic of IOSQE_GRAPH request Hao Xu
@ 2021-12-14 15:21 ` Pavel Begunkov
  2021-12-14 16:53   ` Hao Xu
  2021-12-23 10:06 ` Christian Dietrich
  4 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2021-12-14 15:21 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 12/14/21 05:57, Hao Xu wrote:
> This is just a proof of concept which is incompleted, send it early for
> thoughts and suggestions.
> 
> We already have IOSQE_IO_LINK to describe linear dependency
> relationship sqes. While this patchset provides a new feature to
> support DAG dependency. For instance, 4 sqes have a relationship
> as below:
>        --> 2 --
>       /        \
> 1 ---          ---> 4
>       \        /
>        --> 3 --
> IOSQE_IO_LINK serializes them to 1-->2-->3-->4, which unneccessarily
> serializes 2 and 3. But a DAG can fully describe it.
> 
> For the detail usage, see the following patches' messages.
> 
> Tested it with 100 direct read sqes, each one reads a BS=4k block data
> in a same file, blocks are not overlapped. These sqes form a graph:
>        2
>        3
> 1 --> 4 --> 100
>       ...
>        99
> 
> This is an extreme case, just to show the idea.
> 
> results below:
> io_link:
> IOPS: 15898251
> graph_link:
> IOPS: 29325513
> io_link:
> IOPS: 16420361
> graph_link:
> IOPS: 29585798
> io_link:
> IOPS: 18148820
> graph_link:
> IOPS: 27932960

Hmm, what do we compare here? IIUC,
"io_link" is a huge link of 100 requests. Around 15898251 IOPS
"graph_link" is a graph of diameter 3. Around 29585798 IOPS

Is that right? If so it'd be more fair to compare with a
similar graph-like scheduling on the userspace side.

submit(req={1});
wait(nr=1);
submit({2-99});
wait(nr=98);
submit(req={100});
wait(nr=1);
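
With liburing, that user-side scheduling could look roughly like the
sketch below (ring set up with enough SQ entries, fd/bufs placeholders,
error handling omitted):

	static void run_stage(struct io_uring *ring, int first, int last,
			      int fd, char *bufs)
	{
		struct io_uring_cqe *cqe;
		unsigned nr = last - first + 1;

		for (int i = first; i <= last; i++) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

			io_uring_prep_read(sqe, fd, bufs + (i - 1) * 4096,
					   4096, (i - 1) * 4096);
		}
		io_uring_submit(ring);
		io_uring_wait_cqe_nr(ring, &cqe, nr);	/* wait for the stage */
		io_uring_cq_advance(ring, nr);
	}

	run_stage(&ring, 1, 1, fd, bufs);	/* 1 */
	run_stage(&ring, 2, 99, fd, bufs);	/* 2..99 in parallel */
	run_stage(&ring, 100, 100, fd, bufs);	/* 100 */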


> Tested many times, numbers are not very stable but shows the difference.
> 
> something to concern:
> 1. overhead to the hot path: several IF checks
> 2. many memory allocations
> 3. many atomic_read/inc/dec stuff
> 
> many things to be done:
> 1. cancellation, failure path
> 2. integrate with other features.
> 3. maybe need some cache design to overcome the overhead of memory
>     allcation
> 4. some thing like topological sorting to avoid rings in the graph
> 
> Any thoughts?
> 
> Hao Xu (3):
>    io_uring: add data structure for graph sqe feature
>    io_uring: implement new sqe opcode to build graph like links
>    io_uring: implement logic of IOSQE_GRAPH request
> 
>   fs/io_uring.c                 | 231 +++++++++++++++++++++++++++++++++-
>   include/uapi/linux/io_uring.h |   9 ++
>   2 files changed, 235 insertions(+), 5 deletions(-)
> 

-- 
Pavel Begunkov


* Re: [POC RFC 0/3] support graph like dependent sqes
  2021-12-14 15:21 ` [POC RFC 0/3] support graph like dependent sqes Pavel Begunkov
@ 2021-12-14 16:53   ` Hao Xu
  2021-12-14 18:16     ` Pavel Begunkov
  0 siblings, 1 reply; 15+ messages in thread
From: Hao Xu @ 2021-12-14 16:53 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi


On 2021/12/14 11:21 PM, Pavel Begunkov wrote:
> On 12/14/21 05:57, Hao Xu wrote:
>> This is just a proof of concept which is incompleted, send it early for
>> thoughts and suggestions.
>>
>> We already have IOSQE_IO_LINK to describe linear dependency
>> relationship sqes. While this patchset provides a new feature to
>> support DAG dependency. For instance, 4 sqes have a relationship
>> as below:
>>        --> 2 --
>>       /        \
>> 1 ---          ---> 4
>>       \        /
>>        --> 3 --
>> IOSQE_IO_LINK serializes them to 1-->2-->3-->4, which unneccessarily
>> serializes 2 and 3. But a DAG can fully describe it.
>>
>> For the detail usage, see the following patches' messages.
>>
>> Tested it with 100 direct read sqes, each one reads a BS=4k block data
>> in a same file, blocks are not overlapped. These sqes form a graph:
>>        2
>>        3
>> 1 --> 4 --> 100
>>       ...
>>        99
>>
>> This is an extreme case, just to show the idea.
>>
>> results below:
>> io_link:
>> IOPS: 15898251
>> graph_link:
>> IOPS: 29325513
>> io_link:
>> IOPS: 16420361
>> graph_link:
>> IOPS: 29585798
>> io_link:
>> IOPS: 18148820
>> graph_link:
>> IOPS: 27932960
>
> Hmm, what do we compare here? IIUC,
> "io_link" is a huge link of 100 requests. Around 15898251 IOPS
> "graph_link" is a graph of diameter 3. Around 29585798 IOPS
>
> Is that right? If so it'd more more fair to compare with a
> similar graph-like scheduling on the userspace side.

The above test is more meant to show the disadvantage of LINK.

But yes, it's better to test against similar userspace scheduling, since
LINK is definitely not a good choice; we have to prove the graph stuff
beats userspace scheduling. Will test that soon. Thanks.

>
> submit(req={1});
> wait(nr=1);
> submit({2-99});
> wait(nr=98);
> submit(req={100});
> wait(nr=1);
>
>
>> Tested many times, numbers are not very stable but shows the difference.
>>
>> something to concern:
>> 1. overhead to the hot path: several IF checks
>> 2. many memory allocations
>> 3. many atomic_read/inc/dec stuff
>>
>> many things to be done:
>> 1. cancellation, failure path
>> 2. integrate with other features.
>> 3. maybe need some cache design to overcome the overhead of memory
>>     allcation
>> 4. some thing like topological sorting to avoid rings in the graph
>>
>> Any thoughts?
>>
>> Hao Xu (3):
>>    io_uring: add data structure for graph sqe feature
>>    io_uring: implement new sqe opcode to build graph like links
>>    io_uring: implement logic of IOSQE_GRAPH request
>>
>>   fs/io_uring.c                 | 231 +++++++++++++++++++++++++++++++++-
>>   include/uapi/linux/io_uring.h |   9 ++
>>   2 files changed, 235 insertions(+), 5 deletions(-)
>>
>


* Re: [POC RFC 0/3] support graph like dependent sqes
  2021-12-14 16:53   ` Hao Xu
@ 2021-12-14 18:16     ` Pavel Begunkov
  2021-12-16 16:55       ` Hao Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2021-12-14 18:16 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 12/14/21 16:53, Hao Xu wrote:
> 
> On 2021/12/14 11:21 PM, Pavel Begunkov wrote:
>> On 12/14/21 05:57, Hao Xu wrote:
>>> This is just a proof of concept which is incompleted, send it early for
>>> thoughts and suggestions.
>>>
>>> We already have IOSQE_IO_LINK to describe linear dependency
>>> relationship sqes. While this patchset provides a new feature to
>>> support DAG dependency. For instance, 4 sqes have a relationship
>>> as below:
>>>        --> 2 --
>>>       /        \
>>> 1 ---          ---> 4
>>>       \        /
>>>        --> 3 --
>>> IOSQE_IO_LINK serializes them to 1-->2-->3-->4, which unneccessarily
>>> serializes 2 and 3. But a DAG can fully describe it.
>>>
>>> For the detail usage, see the following patches' messages.
>>>
>>> Tested it with 100 direct read sqes, each one reads a BS=4k block data
>>> in a same file, blocks are not overlapped. These sqes form a graph:
>>>        2
>>>        3
>>> 1 --> 4 --> 100
>>>       ...
>>>        99
>>>
>>> This is an extreme case, just to show the idea.
>>>
>>> results below:
>>> io_link:
>>> IOPS: 15898251
>>> graph_link:
>>> IOPS: 29325513
>>> io_link:
>>> IOPS: 16420361
>>> graph_link:
>>> IOPS: 29585798
>>> io_link:
>>> IOPS: 18148820
>>> graph_link:
>>> IOPS: 27932960
>>
>> Hmm, what do we compare here? IIUC,
>> "io_link" is a huge link of 100 requests. Around 15898251 IOPS
>> "graph_link" is a graph of diameter 3. Around 29585798 IOPS

Diam 2 graph, my bad


>> Is that right? If so it'd more more fair to compare with a
>> similar graph-like scheduling on the userspace side.
> 
> The above test is more like to show the disadvantage of LINK

Oh yeah, links can be slow, especially when they kill potential
parallelism or need extra allocations for keeping state, like
READV and WRITEV.


> But yes, it's better to test the similar userspace  scheduling since
> 
> LINK is definitely not a good choice so have to prove the graph stuff
> 
> beat the userspace scheduling. Will test that soon. Thanks.

It would also be great if you could post the benchmark once
it's done.


>> submit(req={1});
>> wait(nr=1);
>> submit({2-99});
>> wait(nr=98);
>> submit(req={100});
>> wait(nr=1);
>>

-- 
Pavel Begunkov


* Re: [POC RFC 0/3] support graph like dependent sqes
  2021-12-14 18:16     ` Pavel Begunkov
@ 2021-12-16 16:55       ` Hao Xu
  2021-12-17 19:33         ` Pavel Begunkov
  0 siblings, 1 reply; 15+ messages in thread
From: Hao Xu @ 2021-12-16 16:55 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi


On 2021/12/15 2:16 AM, Pavel Begunkov wrote:
> On 12/14/21 16:53, Hao Xu wrote:
>>
>> On 2021/12/14 11:21 PM, Pavel Begunkov wrote:
>>> On 12/14/21 05:57, Hao Xu wrote:
>>>> This is just a proof of concept which is incompleted, send it early 
>>>> for
>>>> thoughts and suggestions.
>>>>
>>>> We already have IOSQE_IO_LINK to describe linear dependency
>>>> relationship sqes. While this patchset provides a new feature to
>>>> support DAG dependency. For instance, 4 sqes have a relationship
>>>> as below:
>>>>        --> 2 --
>>>>       /        \
>>>> 1 ---          ---> 4
>>>>       \        /
>>>>        --> 3 --
>>>> IOSQE_IO_LINK serializes them to 1-->2-->3-->4, which unneccessarily
>>>> serializes 2 and 3. But a DAG can fully describe it.
>>>>
>>>> For the detail usage, see the following patches' messages.
>>>>
>>>> Tested it with 100 direct read sqes, each one reads a BS=4k block data
>>>> in a same file, blocks are not overlapped. These sqes form a graph:
>>>>        2
>>>>        3
>>>> 1 --> 4 --> 100
>>>>       ...
>>>>        99
>>>>
>>>> This is an extreme case, just to show the idea.
>>>>
>>>> results below:
>>>> io_link:
>>>> IOPS: 15898251
>>>> graph_link:
>>>> IOPS: 29325513
>>>> io_link:
>>>> IOPS: 16420361
>>>> graph_link:
>>>> IOPS: 29585798
>>>> io_link:
>>>> IOPS: 18148820
>>>> graph_link:
>>>> IOPS: 27932960
>>>
>>> Hmm, what do we compare here? IIUC,
>>> "io_link" is a huge link of 100 requests. Around 15898251 IOPS
>>> "graph_link" is a graph of diameter 3. Around 29585798 IOPS
>
> Diam 2 graph, my bad
>
>
>>> Is that right? If so it'd more more fair to compare with a
>>> similar graph-like scheduling on the userspace side.
>>
>> The above test is more like to show the disadvantage of LINK
>
> Oh yeah, links can be slow, especially when it kills potential
> parallelism or need extra allocations for keeping state, like
> READV and WRITEV.
>
>
>> But yes, it's better to test the similar userspace  scheduling since
>>
>> LINK is definitely not a good choice so have to prove the graph stuff
>>
>> beat the userspace scheduling. Will test that soon. Thanks.
>
> Would be also great if you can also post the benchmark once
> it's done

Wrote a new test with nop sqes forming a full binary tree with
(2^10)-1 nodes, which I think is a more general case. Turns out the
result is still not stable and the kernel-side graph link is much
slower. I'll try to optimize it.

Btw, is there any comparison data between the current io link feature
and userspace scheduling?


Regards,

Hao

>
>
>>> submit(req={1});
>>> wait(nr=1);
>>> submit({2-99});
>>> wait(nr=98);
>>> submit(req={100});
>>> wait(nr=1);
>>>
>


* Re: [POC RFC 0/3] support graph like dependent sqes
  2021-12-16 16:55       ` Hao Xu
@ 2021-12-17 19:33         ` Pavel Begunkov
  2021-12-18  6:57           ` Hao Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2021-12-17 19:33 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 12/16/21 16:55, Hao Xu wrote:
> On 2021/12/15 2:16 AM, Pavel Begunkov wrote:
>> On 12/14/21 16:53, Hao Xu wrote:
>>> On 2021/12/14 11:21 PM, Pavel Begunkov wrote:
>>>> On 12/14/21 05:57, Hao Xu wrote:
>>>>> This is just a proof of concept which is incompleted, send it early for
>>>>> thoughts and suggestions.
>>>>>
>>>>> We already have IOSQE_IO_LINK to describe linear dependency
>>>>> relationship sqes. While this patchset provides a new feature to
>>>>> support DAG dependency. For instance, 4 sqes have a relationship
>>>>> as below:
>>>>>        --> 2 --
>>>>>       /        \
>>>>> 1 ---          ---> 4
>>>>>       \        /
>>>>>        --> 3 --
>>>>> IOSQE_IO_LINK serializes them to 1-->2-->3-->4, which unneccessarily
>>>>> serializes 2 and 3. But a DAG can fully describe it.
>>>>>
>>>>> For the detail usage, see the following patches' messages.
>>>>>
>>>>> Tested it with 100 direct read sqes, each one reads a BS=4k block data
>>>>> in a same file, blocks are not overlapped. These sqes form a graph:
>>>>>        2
>>>>>        3
>>>>> 1 --> 4 --> 100
>>>>>       ...
>>>>>        99
>>>>>
>>>>> This is an extreme case, just to show the idea.
>>>>>
>>>>> results below:
>>>>> io_link:
>>>>> IOPS: 15898251
>>>>> graph_link:
>>>>> IOPS: 29325513
>>>>> io_link:
>>>>> IOPS: 16420361
>>>>> graph_link:
>>>>> IOPS: 29585798
>>>>> io_link:
>>>>> IOPS: 18148820
>>>>> graph_link:
>>>>> IOPS: 27932960
>>>>
>>>> Hmm, what do we compare here? IIUC,
>>>> "io_link" is a huge link of 100 requests. Around 15898251 IOPS
>>>> "graph_link" is a graph of diameter 3. Around 29585798 IOPS
>>
>> Diam 2 graph, my bad
>>
>>
>>>> Is that right? If so it'd more more fair to compare with a
>>>> similar graph-like scheduling on the userspace side.
>>>
>>> The above test is more like to show the disadvantage of LINK
>>
>> Oh yeah, links can be slow, especially when it kills potential
>> parallelism or need extra allocations for keeping state, like
>> READV and WRITEV.
>>
>>
>>> But yes, it's better to test the similar userspace  scheduling since
>>>
>>> LINK is definitely not a good choice so have to prove the graph stuff
>>>
>>> beat the userspace scheduling. Will test that soon. Thanks.
>>
>> Would be also great if you can also post the benchmark once
>> it's done
> 
> Wrote a new test to test nop sqes forming a full binary tree with (2^10)-1 nodes,
> which I think it a more general case.  Turns out the result is still not stable and
> the kernel side graph link is much slow. I'll try to optimize it.

That's expected, unfortunately. And without reacting to the results
of previous requests, it's hard to imagine it being useful. BPF may
have helped, e.g. not keeping an explicit graph but just generating
new requests from the kernel... But apparently even with this it's
hard to compete with just leaving it in userspace.

> Btw, is there any comparison data between the current io link feature and the
> userspace scheduling.

Don't remember. I'd try to look up the cover-letter for the patches
implementing it; I believe there should've been some numbers and
hopefully a test description.

fwiw, before the io_uring mailing list got established, patches/etc.
were mostly going through the linux-block mailing list. The links are
old, so the patches might be there.

-- 
Pavel Begunkov


* Re: [POC RFC 0/3] support graph like dependent sqes
  2021-12-17 19:33         ` Pavel Begunkov
@ 2021-12-18  6:57           ` Hao Xu
  2021-12-21 16:19             ` Pavel Begunkov
  0 siblings, 1 reply; 15+ messages in thread
From: Hao Xu @ 2021-12-18  6:57 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi


On 2021/12/18 3:33 AM, Pavel Begunkov wrote:
> On 12/16/21 16:55, Hao Xu wrote:
>> On 2021/12/15 2:16 AM, Pavel Begunkov wrote:
>>> On 12/14/21 16:53, Hao Xu wrote:
>>>> On 2021/12/14 11:21 PM, Pavel Begunkov wrote:
>>>>> On 12/14/21 05:57, Hao Xu wrote:
>>>>>> This is just a proof of concept which is incompleted, send it 
>>>>>> early for
>>>>>> thoughts and suggestions.
>>>>>>
>>>>>> We already have IOSQE_IO_LINK to describe linear dependency
>>>>>> relationship sqes. While this patchset provides a new feature to
>>>>>> support DAG dependency. For instance, 4 sqes have a relationship
>>>>>> as below:
>>>>>>        --> 2 --
>>>>>>       /        \
>>>>>> 1 ---          ---> 4
>>>>>>       \        /
>>>>>>        --> 3 --
>>>>>> IOSQE_IO_LINK serializes them to 1-->2-->3-->4, which unneccessarily
>>>>>> serializes 2 and 3. But a DAG can fully describe it.
>>>>>>
>>>>>> For the detail usage, see the following patches' messages.
>>>>>>
>>>>>> Tested it with 100 direct read sqes, each one reads a BS=4k block 
>>>>>> data
>>>>>> in a same file, blocks are not overlapped. These sqes form a graph:
>>>>>>        2
>>>>>>        3
>>>>>> 1 --> 4 --> 100
>>>>>>       ...
>>>>>>        99
>>>>>>
>>>>>> This is an extreme case, just to show the idea.
>>>>>>
>>>>>> results below:
>>>>>> io_link:
>>>>>> IOPS: 15898251
>>>>>> graph_link:
>>>>>> IOPS: 29325513
>>>>>> io_link:
>>>>>> IOPS: 16420361
>>>>>> graph_link:
>>>>>> IOPS: 29585798
>>>>>> io_link:
>>>>>> IOPS: 18148820
>>>>>> graph_link:
>>>>>> IOPS: 27932960
>>>>>
>>>>> Hmm, what do we compare here? IIUC,
>>>>> "io_link" is a huge link of 100 requests. Around 15898251 IOPS
>>>>> "graph_link" is a graph of diameter 3. Around 29585798 IOPS
>>>
>>> Diam 2 graph, my bad
>>>
>>>
>>>>> Is that right? If so it'd more more fair to compare with a
>>>>> similar graph-like scheduling on the userspace side.
>>>>
>>>> The above test is more like to show the disadvantage of LINK
>>>
>>> Oh yeah, links can be slow, especially when it kills potential
>>> parallelism or need extra allocations for keeping state, like
>>> READV and WRITEV.
>>>
>>>
>>>> But yes, it's better to test the similar userspace  scheduling since
>>>>
>>>> LINK is definitely not a good choice so have to prove the graph stuff
>>>>
>>>> beat the userspace scheduling. Will test that soon. Thanks.
>>>
>>> Would be also great if you can also post the benchmark once
>>> it's done
>>
>> Wrote a new test to test nop sqes forming a full binary tree with 
>> (2^10)-1 nodes,
>> which I think it a more general case.  Turns out the result is still 
>> not stable and
>> the kernel side graph link is much slow. I'll try to optimize it.
>
> That's expected unfortunately. And without reacting on results
> of previous requests, it's hard to imagine to be useful. BPF may
> have helped, e.g. not keeping an explicit graph but just generating
> new requests from the kernel... But apparently even with this it's
> hard to compete with just leaving it in userspace.
>
Tried to exclude the memory allocation stuff; it seems a bit better
than the user graph.

For the result delivery, I was thinking of attaching a BPF program to
a sqe, not creating a single BPF-type sqe. Then we can have data flow
in the graph or link chain. But I haven't had a clear draft for it.

>> Btw, is there any comparison data between the current io link feature 
>> and the
>> userspace scheduling.
>
> Don't remember. I'd try to look up the cover-letter for the patches
> implementing it, I believe there should've been some numbers and
> hopefully test description.
>
> fwiw, before io_uring mailing list got established patches/etc.
> were mostly going through linux-block mailing list. Links are old, so
> patches might be there.
>


* Re: [POC RFC 0/3] support graph like dependent sqes
  2021-12-18  6:57           ` Hao Xu
@ 2021-12-21 16:19             ` Pavel Begunkov
  2021-12-23  4:14               ` Hao Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2021-12-21 16:19 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 12/18/21 06:57, Hao Xu wrote:
> On 2021/12/18 3:33 AM, Pavel Begunkov wrote:
>> On 12/16/21 16:55, Hao Xu wrote:
>>> On 2021/12/15 2:16 AM, Pavel Begunkov wrote:
>>>> On 12/14/21 16:53, Hao Xu wrote:
>>>>> On 2021/12/14 11:21 PM, Pavel Begunkov wrote:
>>>>>> On 12/14/21 05:57, Hao Xu wrote:
>>>>>>> This is just a proof of concept which is incompleted, send it early for
>>>>>>> thoughts and suggestions.
>>>>>>>
>>>>>>> We already have IOSQE_IO_LINK to describe linear dependency
>>>>>>> relationship sqes. While this patchset provides a new feature to
>>>>>>> support DAG dependency. For instance, 4 sqes have a relationship
>>>>>>> as below:
>>>>>>>        --> 2 --
>>>>>>>       /        \
>>>>>>> 1 ---          ---> 4
>>>>>>>       \        /
>>>>>>>        --> 3 --
>>>>>>> IOSQE_IO_LINK serializes them to 1-->2-->3-->4, which unneccessarily
>>>>>>> serializes 2 and 3. But a DAG can fully describe it.
>>>>>>>
>>>>>>> For the detail usage, see the following patches' messages.
>>>>>>>
>>>>>>> Tested it with 100 direct read sqes, each one reads a BS=4k block data
>>>>>>> in a same file, blocks are not overlapped. These sqes form a graph:
>>>>>>>        2
>>>>>>>        3
>>>>>>> 1 --> 4 --> 100
>>>>>>>       ...
>>>>>>>        99
>>>>>>>
>>>>>>> This is an extreme case, just to show the idea.
>>>>>>>
>>>>>>> results below:
>>>>>>> io_link:
>>>>>>> IOPS: 15898251
>>>>>>> graph_link:
>>>>>>> IOPS: 29325513
>>>>>>> io_link:
>>>>>>> IOPS: 16420361
>>>>>>> graph_link:
>>>>>>> IOPS: 29585798
>>>>>>> io_link:
>>>>>>> IOPS: 18148820
>>>>>>> graph_link:
>>>>>>> IOPS: 27932960
>>>>>>
>>>>>> Hmm, what do we compare here? IIUC,
>>>>>> "io_link" is a huge link of 100 requests. Around 15898251 IOPS
>>>>>> "graph_link" is a graph of diameter 3. Around 29585798 IOPS
>>>>
>>>> Diam 2 graph, my bad
>>>>
>>>>
>>>>>> Is that right? If so it'd more more fair to compare with a
>>>>>> similar graph-like scheduling on the userspace side.
>>>>>
>>>>> The above test is more like to show the disadvantage of LINK
>>>>
>>>> Oh yeah, links can be slow, especially when it kills potential
>>>> parallelism or need extra allocations for keeping state, like
>>>> READV and WRITEV.
>>>>
>>>>
>>>>> But yes, it's better to test the similar userspace  scheduling since
>>>>>
>>>>> LINK is definitely not a good choice so have to prove the graph stuff
>>>>>
>>>>> beat the userspace scheduling. Will test that soon. Thanks.
>>>>
>>>> Would be also great if you can also post the benchmark once
>>>> it's done
>>>
>>> Wrote a new test to test nop sqes forming a full binary tree with (2^10)-1 nodes,
>>> which I think it a more general case.  Turns out the result is still not stable and
>>> the kernel side graph link is much slow. I'll try to optimize it.
>>
>> That's expected unfortunately. And without reacting on results
>> of previous requests, it's hard to imagine to be useful. BPF may
>> have helped, e.g. not keeping an explicit graph but just generating
>> new requests from the kernel... But apparently even with this it's
>> hard to compete with just leaving it in userspace.
>>
> Tried to exclude the memory allocation stuff, seems it's a bit better than the user graph.
> 
> For the result delivery, I was thinking of attaching BPF program within a sqe, not creating
> a single BPF type sqe. Then we can have data flow in the graph or linkchain. But I haven't
> had a clear draft for it

Oh, I dismissed this idea before. Even if it can be done in-place without any
additional tw (consider recursion and submit_state not prepared for that), it'll
be a horror to maintain. And I also don't see it being flexible enough.

There is one idea from guys that I have to implement, i.e. having a per-CQ
callback. Might be interesting to experiment with, but I don't see it being viable
in the long run.


>>> Btw, is there any comparison data between the current io link feature and the
>>> userspace scheduling.
>>
>> Don't remember. I'd try to look up the cover-letter for the patches
>> implementing it, I believe there should've been some numbers and
>> hopefully test description.
>>
>> fwiw, before io_uring mailing list got established patches/etc.
>> were mostly going through linux-block mailing list. Links are old, so
>> patches might be there.
>>

-- 
Pavel Begunkov


* Re: [POC RFC 0/3] support graph like dependent sqes
  2021-12-21 16:19             ` Pavel Begunkov
@ 2021-12-23  4:14               ` Hao Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Hao Xu @ 2021-12-23  4:14 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/12/22 12:19 AM, Pavel Begunkov wrote:
> On 12/18/21 06:57, Hao Xu wrote:
>> On 2021/12/18 3:33 AM, Pavel Begunkov wrote:
>>> On 12/16/21 16:55, Hao Xu wrote:
>>>> On 2021/12/15 2:16 AM, Pavel Begunkov wrote:
>>>>> On 12/14/21 16:53, Hao Xu wrote:
>>>>>> On 2021/12/14 11:21 PM, Pavel Begunkov wrote:
>>>>>>> On 12/14/21 05:57, Hao Xu wrote:
>>>>>>>> This is just a proof of concept which is incompleted, send it 
>>>>>>>> early for
>>>>>>>> thoughts and suggestions.
>>>>>>>>
>>>>>>>> We already have IOSQE_IO_LINK to describe linear dependency
>>>>>>>> relationship sqes. While this patchset provides a new feature to
>>>>>>>> support DAG dependency. For instance, 4 sqes have a relationship
>>>>>>>> as below:
>>>>>>>>        --> 2 --
>>>>>>>>       /        \
>>>>>>>> 1 ---          ---> 4
>>>>>>>>       \        /
>>>>>>>>        --> 3 --
>>>>>>>> IOSQE_IO_LINK serializes them to 1-->2-->3-->4, which 
>>>>>>>> unneccessarily
>>>>>>>> serializes 2 and 3. But a DAG can fully describe it.
>>>>>>>>
>>>>>>>> For the detail usage, see the following patches' messages.
>>>>>>>>
>>>>>>>> Tested it with 100 direct read sqes, each one reads a BS=4k 
>>>>>>>> block data
>>>>>>>> in a same file, blocks are not overlapped. These sqes form a graph:
>>>>>>>>        2
>>>>>>>>        3
>>>>>>>> 1 --> 4 --> 100
>>>>>>>>       ...
>>>>>>>>        99
>>>>>>>>
>>>>>>>> This is an extreme case, just to show the idea.
>>>>>>>>
>>>>>>>> results below:
>>>>>>>> io_link:
>>>>>>>> IOPS: 15898251
>>>>>>>> graph_link:
>>>>>>>> IOPS: 29325513
>>>>>>>> io_link:
>>>>>>>> IOPS: 16420361
>>>>>>>> graph_link:
>>>>>>>> IOPS: 29585798
>>>>>>>> io_link:
>>>>>>>> IOPS: 18148820
>>>>>>>> graph_link:
>>>>>>>> IOPS: 27932960
>>>>>>>
>>>>>>> Hmm, what do we compare here? IIUC,
>>>>>>> "io_link" is a huge link of 100 requests. Around 15898251 IOPS
>>>>>>> "graph_link" is a graph of diameter 3. Around 29585798 IOPS
>>>>>
>>>>> Diam 2 graph, my bad
>>>>>
>>>>>
>>>>>>> Is that right? If so it'd more more fair to compare with a
>>>>>>> similar graph-like scheduling on the userspace side.
>>>>>>
>>>>>> The above test is more like to show the disadvantage of LINK
>>>>>
>>>>> Oh yeah, links can be slow, especially when it kills potential
>>>>> parallelism or need extra allocations for keeping state, like
>>>>> READV and WRITEV.
>>>>>
>>>>>
>>>>>> But yes, it's better to test the similar userspace  scheduling since
>>>>>>
>>>>>> LINK is definitely not a good choice so have to prove the graph stuff
>>>>>>
>>>>>> beat the userspace scheduling. Will test that soon. Thanks.
>>>>>
>>>>> Would be also great if you can also post the benchmark once
>>>>> it's done
>>>>
>>>> Wrote a new test to test nop sqes forming a full binary tree with 
>>>> (2^10)-1 nodes,
>>>> which I think it a more general case.  Turns out the result is still 
>>>> not stable and
>>>> the kernel side graph link is much slow. I'll try to optimize it.
>>>
>>> That's expected unfortunately. And without reacting on results
>>> of previous requests, it's hard to imagine to be useful. BPF may
>>> have helped, e.g. not keeping an explicit graph but just generating
>>> new requests from the kernel... But apparently even with this it's
>>> hard to compete with just leaving it in userspace.
Not sure what 'generating new requests by BPF' would look like, since
a sqe may have multiple pre-sqes. Can a BPF program sleep there and
wait for all the pre-sqes to be done? And if it can, there still has
to be some counting and wakeup machinery. Another problem: does the
BPF sleep block the main context? So I'm not sure 'BPF fully controls
graph/link logic' beats 'only let BPF do data flow and build the
graph/link explicitly'.
Have you tested using BPF to form links/graphs? Curious about the real
performance comparison with userspace scheduling.
>>>
>> Tried to exclude the memory allocation stuff, seems it's a bit better 
>> than the user graph.
>>
>> For the result delivery, I was thinking of attaching BPF program 
>> within a sqe, not creating
>> a single BPF type sqe. Then we can have data flow in the graph or 
>> linkchain. But I haven't
>> had a clear draft for it
> 
> Oh, I dismissed this idea before. Even if it can be done in-place 
> without any
> additional tw (consider recursion and submit_state not prepared for 
> that), it'll
> be a horror to maintain. And I also don't see it being flexible enough
> 
> There is one idea from guys that I have to implement, i.e. having a per-CQ
> callback. Might interesting to experiment, but I don't see it being viable
> in the long run.
> 
> 
>>>> Btw, is there any comparison data between the current io link 
>>>> feature and the
>>>> userspace scheduling.
>>>
>>> Don't remember. I'd try to look up the cover-letter for the patches
>>> implementing it, I believe there should've been some numbers and
>>> hopefully test description.
>>>
>>> fwiw, before io_uring mailing list got established patches/etc.
>>> were mostly going through linux-block mailing list. Links are old, so
>>> patches might be there.
>>>
> 
https://lore.kernel.org/linux-block/20190517214131.5925-1-axboe@kernel.dk/
Found the initial patchset here; it seems Jens tested it with
example/iouring-cp, I'll test it later. Also found the same idea of
attaching BPF to a sqe so that it can run at the end of sqe completion.
Curious whether a POC was implemented before, and how its numbers
compare with userspace result delivery.




* Re: [POC RFC 0/3] support graph like dependent sqes
  2021-12-14  5:57 [POC RFC 0/3] support graph like dependent sqes Hao Xu
                   ` (3 preceding siblings ...)
  2021-12-14 15:21 ` [POC RFC 0/3] support graph like dependent sqes Pavel Begunkov
@ 2021-12-23 10:06 ` Christian Dietrich
  2021-12-27  3:27   ` Hao Xu
  4 siblings, 1 reply; 15+ messages in thread
From: Christian Dietrich @ 2021-12-23 10:06 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe
  Cc: io-uring, Joseph Qi, horst.schirmeier, Franz-B. Tuneke, Hendrik Sieck

Hi everyone!

We experimented with the BPF patchset provided by Pavel a few months
ago. And I had the exact same question: How can we compare the benefits
and drawbacks of a more flexible io_uring implementation? In that
specific use case, I wanted to show that a flexible SQE-dependency
generation with BPF could outperform user-space SQE scheduling. From my
experience with BPF, I learned that it is quite hard to beat
io_uring+userspace, if there is enough parallelism in your IO jobs.

For this purpose, I've built a benchmark generator that is able to
produce random dependency graphs of various shapes (isolated nodes,
binary tree, parallel-dependency chains, random DAG) and different
scheduling backends (usual system-call backend, plain io_uring,
BPF-enhanced io_uring) and different workloads.

At this point, I didn't have the time to polish the generator and
publish it, but I put the current state into this git:

https://collaborating.tuhh.de/e-exk4/projects/syscall-graph-generator

After running:

    ./generate.sh
    [sudo modprobe null_blk...]
    ./run.sh
    ./analyze.py

You get the following results (at least if you own my machine):

generator              iouring      syscall      iouring_norm
graph action size
chain read   128    938.563366  2019.199010   46.48%
flat  read   128    922.132673  2011.566337   45.84%
graph read   128   1129.017822  2021.905941   55.84%
rope  read   128   2051.763366  2014.563366  101.85%
tree  read   128   1049.427723  2015.254455   52.07%

For the userspace scheduler, I perform an offline analysis that finds
linear chains of operations that are not (anymore) dependent on other previous
unfinished results. These linear chains are then pushed into io_uring
with a SQE-link chain.
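
A rough sketch of that chain extraction, with hypothetical node
bookkeeping (prep_node_sqe() is a placeholder for preparing the actual
operation):

	struct node {
		int nr_parents_left;	/* unfinished parents */
		int nr_children;
		struct node **children;
	};

	/* the chain continues only while there is exactly one child and
	 * that child waits on the current node alone */
	static struct node *next_in_chain(struct node *n)
	{
		if (n->nr_children == 1 && n->children[0]->nr_parents_left == 1)
			return n->children[0];
		return NULL;
	}

	/* push one maximal ready chain into io_uring as a SQE-link chain */
	static void submit_chain(struct io_uring *ring, struct node *n)
	{
		while (n) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
			struct node *next = next_in_chain(n);

			prep_node_sqe(sqe, n);	/* hypothetical helper */
			if (next)
				sqe->flags |= IOSQE_IO_LINK;
			n = next;
		}
		io_uring_submit(ring);
	}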

As I'm highly interested in this topic of pushing complex
IO-dependencies into the kernel space, I would be delighted to see how
your SQE-graph extension would compare against my rudimentary userspace
scheduler.

@Hao: Do you have a specific use case for your graph-like dependencies
      in mind? If you need assistance with the generator, please feel
      free to contact me.

chris
-- 
Prof. Dr.-Ing. Christian Dietrich
Operating System Group (E-EXK4)
Technische Universität Hamburg
Am Schwarzenberg-Campus 3 (E), 4.092
21073 Hamburg

eMail:  christian.dietrich@tuhh.de
Tel:    +49 40 42878 2188
WWW:    https://osg.tuhh.de/


* Re: [POC RFC 0/3] support graph like dependent sqes
  2021-12-23 10:06 ` Christian Dietrich
@ 2021-12-27  3:27   ` Hao Xu
  2021-12-27  5:49     ` Christian Dietrich
  0 siblings, 1 reply; 15+ messages in thread
From: Hao Xu @ 2021-12-27  3:27 UTC (permalink / raw)
  To: Christian Dietrich, Jens Axboe
  Cc: io-uring, Joseph Qi, horst.schirmeier, Franz-B. Tuneke, Hendrik Sieck

On 2021/12/23 6:06 PM, Christian Dietrich wrote:
> Hi everyone!
> 
> We experimented with the BPF patchset provided by Pavel a few months
> ago. And I had the exact same question: How can we compare the benefits
> and drawbacks of a more flexible io_uring implementation? In that
> specific use case, I wanted to show that a flexible SQE-dependency
> generation with BPF could outperform user-space SQE scheduling. From my
> experience with BPF, I learned that it is quite hard to beat
> io_uring+userspace, if there is enough parallelism in your IO jobs.
> 
> For this purpose, I've built a benchmark generator that is able to
> produce random dependency graphs of various shapes (isolated nodes,
> binary tree, parallel-dependency chains, random DAC) and different
> scheduling backends (usual system-call backend, plain io_uring,
> BPF-enhanced io_uring) and different workloads.
> 
> At this point, I didn't have the time to polish the generator and
> publish it, but I put the current state into this git:
> 
> https://collaborating.tuhh.de/e-exk4/projects/syscall-graph-generator
> 
> After running:
> 
>      ./generate.sh
>      [sudo modprobe null_blk...]
>      ./run.sh
>      ./analyze.py
> 
> You get the following results (at least if you own my machine):
> 
> generator              iouring      syscall      iouring_norm
> graph action size
> chain read   128    938.563366  2019.199010   46.48%
> flat  read   128    922.132673  2011.566337   45.84%
> graph read   128   1129.017822  2021.905941   55.84%
> rope  read   128   2051.763366  2014.563366  101.85%
> tree  read   128   1049.427723  2015.254455   52.07%
> 
Hi Christian,
Great! Thanks for the testing. A question here: does the first generator
'iouring' mean BPF-enhanced io_uring?
> For the userspace scheduler, I perform an offline analysis that finds
> linear chains of operations that are not (anymore) dependent on other previous
> unfinished results. These linear chains are then pushed into io_uring
> with a SQE-link chain.
> 
> As I'm highly interested in this topic of pushing complex
> IO-dependencies into the kernel space, I would be delighted to see how
> your SQE-graph extension would compare against my rudimentary userspace
> scheduler.
> 
> @Hao: Do you have a specific use case for your graph-like dependencies
>        in mind? If you need assistance with the generator, please feel
>        free to contact me.
I currently don't have a specific use case; I just feel this may be
useful since there are simple cases like open --> parallel reads -->
close to which a linear dependency doesn't apply, so this POC is sent
more to get people's thoughts about use cases.
Thanks again for the benchmark; I'll leverage it to test my approach,
though I'm a bit busy with other work recently.

Regards,
Hao
> 
> chris
> 



* Re: [POC RFC 0/3] support graph like dependent sqes
  2021-12-27  3:27   ` Hao Xu
@ 2021-12-27  5:49     ` Christian Dietrich
  0 siblings, 0 replies; 15+ messages in thread
From: Christian Dietrich @ 2021-12-27  5:49 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe
  Cc: io-uring, Joseph Qi, horst.schirmeier, Franz-B. Tuneke, Hendrik Sieck

Hao Xu <haoxu@linux.alibaba.com> [27 December 2021]:

> Hi Christian,
> Great! Thanks for the testing, a question here: the first generator
> iouring means BPF-enhanced iouring?

No, actually not. I've only shown plain io_uring vs. the linear
system-call backend here. Within the generator, I call io_uring+bpf
just 'bpf'.


> I currently don't have a specifuc use case, just feel this may be useful
> since there are simple cases like open-->parallel reads->close that
> linear dependency doesn't apply, so this POC is sent more like to get
> people's thought about user cases..

That is in principle a good idea. However, as Franz can tell you, for
many use cases predefining the follow-up actions does not work. For
example, for reads, it can happen that you get a short read, in which
case you cannot directly continue to the next action.

chris
-- 
Prof. Dr.-Ing. Christian Dietrich
Operating System Group (E-EXK4)
Technische Universität Hamburg
Am Schwarzenberg-Campus 3 (E), 4.092
21073 Hamburg

eMail:  christian.dietrich@tuhh.de
Tel:    +49 40 42878 2188
WWW:    https://osg.tuhh.de/

