* [PATCHSET 0/2] io_uring: improve handling of buffered writes
@ 2019-09-10 16:42 Jens Axboe
  2019-09-10 16:42 ` [PATCH 1/2] io_uring: add io_queue_async_work() helper Jens Axboe
  2019-09-10 16:42 ` [PATCH 2/2] io_uring: limit parallelism of buffered writes Jens Axboe
  0 siblings, 2 replies; 3+ messages in thread
From: Jens Axboe @ 2019-09-10 16:42 UTC (permalink / raw)
  To: linux-block

XFS/ext4/others all need to lock the inode for buffered writes. Since
io_uring handles any IO in an async manner, this means that for higher
queue depth buffered write workloads, we have a lot of workers
hammering on the same mutex.
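
The serialization in question is roughly the following pattern in the
->write_iter() path (an illustrative sketch only; the function name is
made up and the exact locking differs per filesystem):

#include <linux/fs.h>
#include <linux/uio.h>

static ssize_t example_buffered_write(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	/* every io_uring async worker doing a buffered write serializes here */
	inode_lock(inode);
	ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
	inode_unlock(inode);

	return ret;
}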

Running a QD=32 random write workload on my test box yields about 200K
4k random write IOPS with io_uring. Looking at system profiles, we're
spending about half the time contending on the inode mutex. Oof.
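
For reference, a minimal liburing-style sketch of the kind of submission
loop this workload exercises (hypothetical harness, not the actual test
case; the fd, buffer, and sizing are assumed):

#include <liburing.h>
#include <stdlib.h>
#include <sys/uio.h>

#define QD	32
#define BS	4096

/* keep QD buffered 4k writes in flight at random offsets in an open fd */
static int submit_loop(int fd, void *buf, unsigned long nr_blocks)
{
	struct iovec iov = { .iov_base = buf, .iov_len = BS };
	struct io_uring_cqe *cqe;
	struct io_uring ring;
	int i, ret;

	ret = io_uring_queue_init(QD, &ring, 0);
	if (ret < 0)
		return ret;

	/* prime the ring with QD writes */
	for (i = 0; i < QD; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_writev(sqe, fd, &iov, 1,
				     BS * (rand() % nr_blocks));
	}
	io_uring_submit(&ring);

	/* reap one completion, submit one new write, repeat */
	for (;;) {
		struct io_uring_sqe *sqe;

		ret = io_uring_wait_cqe(&ring, &cqe);
		if (ret < 0)
			break;
		io_uring_cqe_seen(&ring, cqe);

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_writev(sqe, fd, &iov, 1,
				     BS * (rand() % nr_blocks));
		io_uring_submit(&ring);
	}

	io_uring_queue_exit(&ring);
	return ret;
}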

For buffered writes, we don't necessarily need a huge number of threads
issuing that IO. If we instead rely on normal flushing to take care of
getting the parallelism we need on the device side, we can limit
ourselves to a much lower queue depth. This still gets us async behavior
on the submission side.

With this small series, my 200K IOPS goes to 370K IOPS for the same
workload.

This issue came out of postgres implementing io_uring support and
reporting some of the issues they ran into.

-- 
Jens Axboe




* [PATCH 1/2] io_uring: add io_queue_async_work() helper
  2019-09-10 16:42 [PATCHSET 0/2] io_uring: improve handling of buffered writes Jens Axboe
@ 2019-09-10 16:42 ` Jens Axboe
  2019-09-10 16:42 ` [PATCH 2/2] io_uring: limit parallelism of buffered writes Jens Axboe
  1 sibling, 0 replies; 3+ messages in thread
From: Jens Axboe @ 2019-09-10 16:42 UTC (permalink / raw)
  To: linux-block; +Cc: Jens Axboe

Add a helper for queueing a request for async execution, in preparation
for optimizing it.

No functional change in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index b2f88c2dc2fd..41840bf26d3b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -443,6 +443,12 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx)
 	}
 }
 
+static inline void io_queue_async_work(struct io_ring_ctx *ctx,
+				       struct io_kiocb *req)
+{
+	queue_work(ctx->sqo_wq, &req->work);
+}
+
 static void io_commit_cqring(struct io_ring_ctx *ctx)
 {
 	struct io_kiocb *req;
@@ -456,7 +462,7 @@ static void io_commit_cqring(struct io_ring_ctx *ctx)
 			continue;
 		}
 		req->flags |= REQ_F_IO_DRAINED;
-		queue_work(ctx->sqo_wq, &req->work);
+		io_queue_async_work(ctx, req);
 	}
 }
 
@@ -619,7 +625,7 @@ static void io_req_link_next(struct io_kiocb *req)
 
 		nxt->flags |= REQ_F_LINK_DONE;
 		INIT_WORK(&nxt->work, io_sq_wq_submit_work);
-		queue_work(req->ctx->sqo_wq, &nxt->work);
+		io_queue_async_work(req->ctx, nxt);
 	}
 }
 
@@ -1519,7 +1525,7 @@ static void io_poll_remove_one(struct io_kiocb *req)
 	WRITE_ONCE(poll->canceled, true);
 	if (!list_empty(&poll->wait.entry)) {
 		list_del_init(&poll->wait.entry);
-		queue_work(req->ctx->sqo_wq, &req->work);
+		io_queue_async_work(req->ctx, req);
 	}
 	spin_unlock(&poll->head->lock);
 
@@ -1633,7 +1639,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 		io_cqring_ev_posted(ctx);
 		io_put_req(req);
 	} else {
-		queue_work(ctx->sqo_wq, &req->work);
+		io_queue_async_work(ctx, req);
 	}
 
 	return 1;
@@ -2073,7 +2079,7 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 				if (list)
 					atomic_inc(&list->cnt);
 				INIT_WORK(&req->work, io_sq_wq_submit_work);
-				queue_work(ctx->sqo_wq, &req->work);
+				io_queue_async_work(ctx, req);
 			}
 
 			/*
-- 
2.17.1



* [PATCH 2/2] io_uring: limit parallelism of buffered writes
  2019-09-10 16:42 [PATCHSET 0/2] io_uring: improve handling of buffered writes Jens Axboe
  2019-09-10 16:42 ` [PATCH 1/2] io_uring: add io_queue_async_work() helper Jens Axboe
@ 2019-09-10 16:42 ` Jens Axboe
  1 sibling, 0 replies; 3+ messages in thread
From: Jens Axboe @ 2019-09-10 16:42 UTC (permalink / raw)
  To: linux-block; +Cc: Jens Axboe

All the popular filesystems need to grab the inode lock for buffered
writes. With io_uring punting buffered writes to async context, we
observe a lot of contention, with all the async workers hammering this
mutex.

For buffered writes, we generally don't need a lot of parallelism on
the submission side, as the flushing will take care of that for us.
Hence we don't need a deep queue on the write side, as long as we
can safely punt from the original submission context.

Add a workqueue with a concurrency limit of 2 that we can use for
buffered writes. This greatly improves the performance and efficiency
of higher queue depth buffered async writes with io_uring.
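
For background, the mechanism is just the max_active argument to
alloc_workqueue(); a standalone sketch of the idea (illustrative only,
names invented, not the patch itself):

#include <linux/init.h>
#include <linux/workqueue.h>

/* a cap of 2 bounds how many punted buffered writes execute at once */
static struct workqueue_struct *bwrite_wq;

static int __init example_init(void)
{
	bwrite_wq = alloc_workqueue("example-bwrite-wq",
				    WQ_UNBOUND | WQ_FREEZABLE, 2);
	return bwrite_wq ? 0 : -ENOMEM;
}

static void example_queue_buffered_write(struct work_struct *work)
{
	/* buffered writes go here; everything else keeps the wide default wq */
	queue_work(bwrite_wq, work);
}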

Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 47 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 39 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 41840bf26d3b..03fcd974fd1d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -203,7 +203,7 @@ struct io_ring_ctx {
 	} ____cacheline_aligned_in_smp;
 
 	/* IO offload */
-	struct workqueue_struct	*sqo_wq;
+	struct workqueue_struct	*sqo_wq[2];
 	struct task_struct	*sqo_thread;	/* if using sq thread polling */
 	struct mm_struct	*sqo_mm;
 	wait_queue_head_t	sqo_wait;
@@ -446,7 +446,19 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx)
 static inline void io_queue_async_work(struct io_ring_ctx *ctx,
 				       struct io_kiocb *req)
 {
-	queue_work(ctx->sqo_wq, &req->work);
+	int rw;
+
+	switch (req->submit.sqe->opcode) {
+	case IORING_OP_WRITEV:
+	case IORING_OP_WRITE_FIXED:
+		rw = !(req->rw.ki_flags & IOCB_DIRECT);
+		break;
+	default:
+		rw = 0;
+		break;
+	}
+
+	queue_work(ctx->sqo_wq[rw], &req->work);
 }
 
 static void io_commit_cqring(struct io_ring_ctx *ctx)
@@ -2634,11 +2646,15 @@ static void io_sq_thread_stop(struct io_ring_ctx *ctx)
 
 static void io_finish_async(struct io_ring_ctx *ctx)
 {
+	int i;
+
 	io_sq_thread_stop(ctx);
 
-	if (ctx->sqo_wq) {
-		destroy_workqueue(ctx->sqo_wq);
-		ctx->sqo_wq = NULL;
+	for (i = 0; i < ARRAY_SIZE(ctx->sqo_wq); i++) {
+		if (ctx->sqo_wq[i]) {
+			destroy_workqueue(ctx->sqo_wq[i]);
+			ctx->sqo_wq[i] = NULL;
+		}
 	}
 }
 
@@ -2846,16 +2862,31 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx,
 	}
 
 	/* Do QD, or 2 * CPUS, whatever is smallest */
-	ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
+	ctx->sqo_wq[0] = alloc_workqueue("io_ring-wq",
+			WQ_UNBOUND | WQ_FREEZABLE,
 			min(ctx->sq_entries - 1, 2 * num_online_cpus()));
-	if (!ctx->sqo_wq) {
+	if (!ctx->sqo_wq[0]) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	/*
+	 * This is for buffered writes, where we want to limit the parallelism
+	 * due to file locking in file systems. As "normal" buffered writes
+	 * should parallelize on writeout quite nicely, limit us to having 2
+	 * pending. This avoids massive contention on the inode when doing
+	 * buffered async writes.
+	 */
+	ctx->sqo_wq[1] = alloc_workqueue("io_ring-write-wq",
+						WQ_UNBOUND | WQ_FREEZABLE, 2);
+	if (!ctx->sqo_wq[1]) {
 		ret = -ENOMEM;
 		goto err;
 	}
 
 	return 0;
 err:
-	io_sq_thread_stop(ctx);
+	io_finish_async(ctx);
 	mmdrop(ctx->sqo_mm);
 	ctx->sqo_mm = NULL;
 	return ret;
-- 
2.17.1


