From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9E0BBC43219 for ; Mon, 27 Sep 2021 06:17:32 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 84B256117A for ; Mon, 27 Sep 2021 06:17:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232283AbhI0GTJ (ORCPT ); Mon, 27 Sep 2021 02:19:09 -0400 Received: from out30-44.freemail.mail.aliyun.com ([115.124.30.44]:40879 "EHLO out30-44.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232940AbhI0GTH (ORCPT ); Mon, 27 Sep 2021 02:19:07 -0400 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R171e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04426;MF=haoxu@linux.alibaba.com;NM=1;PH=DS;RN=4;SR=0;TI=SMTPD_---0UpiKFOp_1632723441; Received: from e18g09479.et15sqa.tbsite.net(mailfrom:haoxu@linux.alibaba.com fp:SMTPD_---0UpiKFOp_1632723441) by smtp.aliyun-inc.com(127.0.0.1); Mon, 27 Sep 2021 14:17:28 +0800 From: Hao Xu To: Jens Axboe Cc: io-uring@vger.kernel.org, Pavel Begunkov , Joseph Qi Subject: [PATCH 3/8] io_uring: add a limited tw list for irq completion work Date: Mon, 27 Sep 2021 14:17:16 +0800 Message-Id: <20210927061721.180806-4-haoxu@linux.alibaba.com> X-Mailer: git-send-email 2.24.4 In-Reply-To: <20210927061721.180806-1-haoxu@linux.alibaba.com> References: <20210927061721.180806-1-haoxu@linux.alibaba.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org Now we have a lot of task_work users, some are just to complete a req and generate a cqe. Let's put the work to a new tw list which has a higher priority, so that it can be handled quickly and thus to reduce avg req latency. an explanatory case: origin timeline: submit_sqe-->irq-->add completion task_work -->run heavy work0~n-->run completion task_work now timeline: submit_sqe-->irq-->add completion task_work -->run completion task_work-->run heavy work0~n One thing to watch out is sometimes irq completion TWs comes overwhelmingly, which makes the new tw list grows fast, and TWs in the old list are starved. So we have to limit the length of the new tw list. A practical value is 1/3: len of new tw list < 1/3 * (len of new + old tw list) In this way, the new tw list has a limited length and normal task get there chance to run. Tested this patch(and the following ones) by manually replace __io_queue_sqe() to io_req_task_complete() to construct 'heavy' task works. Then test with fio: ioengine=io_uring thread=1 bs=4k direct=1 rw=randread time_based=1 runtime=600 randrepeat=0 group_reporting=1 filename=/dev/nvme0n1 Tried various iodepth. The peak IOPS for this patch is 314K, while the old one is 249K. For avg latency, difference shows when iodepth grow: depth and avg latency(usec): depth new old 1 22.80 23.77 2 23.48 24.54 4 24.26 25.57 8 29.21 32.89 16 53.61 63.50 32 106.29 131.34 64 217.21 256.33 128 421.59 513.87 256 815.15 1050.99 95%, 99% etc more data in cover letter. Signed-off-by: Hao Xu --- fs/io_uring.c | 44 +++++++++++++++++++++++++++++++------------- 1 file changed, 31 insertions(+), 13 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 8317c360f7a4..9272b2cfcfb7 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -461,6 +461,7 @@ struct io_ring_ctx { }; }; +#define MAX_EMERGENCY_TW_RATIO 3 struct io_uring_task { /* submission side */ int cached_refs; @@ -475,6 +476,9 @@ struct io_uring_task { spinlock_t task_lock; struct io_wq_work_list task_list; struct callback_head task_work; + struct io_wq_work_list prior_task_list; + unsigned int nr; + unsigned int prior_nr; bool task_running; }; @@ -2132,12 +2136,16 @@ static void tctx_task_work(struct callback_head *cb) while (1) { struct io_wq_work_node *node; - if (!tctx->task_list.first && locked) + if (!tctx->prior_task_list.first && + !tctx->task_list.first && locked) io_submit_flush_completions(ctx); spin_lock_irq(&tctx->task_lock); - node = tctx->task_list.first; + wq_list_merge(&tctx->prior_task_list, &tctx->task_list); + node = tctx->prior_task_list.first; INIT_WQ_LIST(&tctx->task_list); + INIT_WQ_LIST(&tctx->prior_task_list); + tctx->nr = tctx->prior_nr = 0; if (!node) tctx->task_running = false; spin_unlock_irq(&tctx->task_lock); @@ -2166,7 +2174,7 @@ static void tctx_task_work(struct callback_head *cb) ctx_flush_and_put(ctx, &locked); } -static void io_req_task_work_add(struct io_kiocb *req) +static void io_req_task_work_add(struct io_kiocb *req, bool emergency) { struct task_struct *tsk = req->task; struct io_uring_task *tctx = tsk->io_uring; @@ -2178,7 +2186,13 @@ static void io_req_task_work_add(struct io_kiocb *req) WARN_ON_ONCE(!tctx); spin_lock_irqsave(&tctx->task_lock, flags); - wq_list_add_tail(&req->io_task_work.node, &tctx->task_list); + if (emergency && tctx->prior_nr * MAX_EMERGENCY_TW_RATIO < tctx->nr) { + wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list); + tctx->prior_nr++; + } else { + wq_list_add_tail(&req->io_task_work.node, &tctx->task_list); + } + tctx->nr++; running = tctx->task_running; if (!running) tctx->task_running = true; @@ -2202,9 +2216,12 @@ static void io_req_task_work_add(struct io_kiocb *req) } spin_lock_irqsave(&tctx->task_lock, flags); + tctx->nr = tctx->prior_nr = 0; tctx->task_running = false; - node = tctx->task_list.first; + wq_list_merge(&tctx->prior_task_list, &tctx->task_list); + node = tctx->prior_task_list.first; INIT_WQ_LIST(&tctx->task_list); + INIT_WQ_LIST(&tctx->prior_task_list); spin_unlock_irqrestore(&tctx->task_lock, flags); while (node) { @@ -2241,19 +2258,19 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret) { req->result = ret; req->io_task_work.func = io_req_task_cancel; - io_req_task_work_add(req); + io_req_task_work_add(req, true); } static void io_req_task_queue(struct io_kiocb *req) { req->io_task_work.func = io_req_task_submit; - io_req_task_work_add(req); + io_req_task_work_add(req, false); } static void io_req_task_queue_reissue(struct io_kiocb *req) { req->io_task_work.func = io_queue_async_work; - io_req_task_work_add(req); + io_req_task_work_add(req, false); } static inline void io_queue_next(struct io_kiocb *req) @@ -2373,7 +2390,7 @@ static inline void io_put_req_deferred(struct io_kiocb *req) { if (req_ref_put_and_test(req)) { req->io_task_work.func = io_free_req_work; - io_req_task_work_add(req); + io_req_task_work_add(req, false); } } @@ -2693,7 +2710,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2) return; req->result = res; req->io_task_work.func = io_req_task_complete; - io_req_task_work_add(req); + io_req_task_work_add(req, true); } static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) @@ -5256,7 +5273,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, * of executing it. We can't safely execute it anyway, as we may not * have the needed state needed for it anyway. */ - io_req_task_work_add(req); + io_req_task_work_add(req, false); return 1; } @@ -5934,7 +5951,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) spin_unlock_irqrestore(&ctx->timeout_lock, flags); req->io_task_work.func = io_req_task_timeout; - io_req_task_work_add(req); + io_req_task_work_add(req, false); return HRTIMER_NORESTART; } @@ -6916,7 +6933,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) spin_unlock_irqrestore(&ctx->timeout_lock, flags); req->io_task_work.func = io_req_task_link_timeout; - io_req_task_work_add(req); + io_req_task_work_add(req, false); return HRTIMER_NORESTART; } @@ -8543,6 +8560,7 @@ static int io_uring_alloc_task_context(struct task_struct *task, task->io_uring = tctx; spin_lock_init(&tctx->task_lock); INIT_WQ_LIST(&tctx->task_list); + INIT_WQ_LIST(&tctx->prior_task_list); init_task_work(&tctx->task_work, tctx_task_work); return 0; } -- 2.24.4