From: Hao Xu
To: Dylan Yudaken, "axboe@kernel.dk", "asml.silence@gmail.com",
 "io-uring@vger.kernel.org"
Cc: Kernel Team
Subject: Re: [PATCH RFC for-next 0/8] io_uring: tw contention improvments
Date: Wed, 22 Jun 2022 19:24:51 +0800
References: <20220620161901.1181971-1-dylany@fb.com>
 <15e36a76-65d5-2acb-8cb7-3952d9d8f7d1@linux.dev>
 <1c29ad13-cc42-8bc5-0f12-3413054a4faf@linux.dev>
 <02e7f2adc191cd207eb17dd84efa10f86d965200.camel@fb.com>

On 6/22/22 19:16, Hao Xu wrote:
> On 6/22/22 17:31, Dylan Yudaken wrote:
>> On Tue, 2022-06-21 at 15:34 +0800, Hao Xu wrote:
>>> On 6/21/22 15:03, Dylan Yudaken wrote:
>>>> On Tue, 2022-06-21 at 13:10 +0800, Hao Xu wrote:
>>>>> On 6/21/22 00:18, Dylan Yudaken wrote:
>>>>>> Task work currently uses a spin lock to guard task_list and
>>>>>> task_running. Some use cases such as networking can trigger
>>>>>> task_work_add from multiple threads all at once, which suffers
>>>>>> from contention here.
>>>>>>
>>>>>> This can be changed to use a lockless list, which seems to have
>>>>>> better performance. Running the micro benchmark in [1] I see a
>>>>>> 20% improvement in multithreaded task work add. It required
>>>>>> removing the priority tw list optimisation; however, it isn't
>>>>>> clear how important that optimisation is. Additionally, its
>>>>>> semantics are fairly easy to break.
>>>>>>
>>>>>> Patches 1-2 remove the priority tw list optimisation
>>>>>> Patches 3-5 add lockless lists for task work
>>>>>> Patch 6 fixes a bug I noticed in io_uring event tracing
>>>>>> Patches 7-8 add tracing for task_work_run
>>>>>>
>>>>>
>>>>> Compared to the spinlock overhead, the prio task list optimization
>>>>> is definitely unimportant, so I agree with removing it here.
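
For context, IIUC the add/run pattern in patches 3-5 is roughly the one
below. This is a simplified, untested sketch with made-up names
(tw_ctx/tw_item/cb), not the real io_uring code, just to show where the
contention win and the ordering question come from:

#include <linux/llist.h>

/* toy stand-ins for the real io_uring structures */
struct tw_item {
        struct llist_node node;
        void (*cb)(struct tw_item *item);
};

struct tw_ctx {
        struct llist_head list;         /* replaces spinlock + list_head */
};

/* add side: may run from many threads at once */
static void tw_add(struct tw_ctx *ctx, struct tw_item *item)
{
        /* one lock-free push instead of spin_lock_irqsave() + list_add() */
        llist_add(&item->node, &ctx->list);
}

/* run side: the owning task draining its task_work */
static void tw_run(struct tw_ctx *ctx)
{
        /* take the whole pending batch with a single xchg */
        struct llist_node *node = llist_del_all(&ctx->list);

        /*
         * llist hands the batch back newest-first, i.e. as a stack: either
         * walk it in that reverse order (as below), or pay an extra
         * llist_reverse_order() pass per batch to restore FIFO order.
         */
        while (node) {
                struct tw_item *item = llist_entry(node, struct tw_item, node);

                node = node->next;
                item->cb(item);
        }
}

The single llist_add() on the add side is where the contention
improvement comes from; the stack order on the run side is what the
reverse-order concern below is about.
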
>>>>> Replacing the task list with llist was something I considered, but
>>>>> I gave it up since it changes the list to a stack, which means we
>>>>> have to handle the tasks in reverse order. This may affect the
>>>>> latency; do you have some numbers for it, like avg and 99%/95%
>>>>> latency?
>>>>>
>>>>
>>>> Do you have an idea for how to test that? I used a microbenchmark
>>>> as well as a network benchmark [1] to verify that overall
>>>> throughput is higher. TW latency sounds a lot more complicated to
>>>> measure as it's difficult to trigger accurately.
>>>>
>>>> My feeling is that with reasonable batching (say 8-16 items) the
>>>> latency will be low as TW is generally very quick, but if you have
>>>> an idea for benchmarking I can take a look.
>>>>
>>>> [1]: https://github.com/DylanZA/netbench
>>>
>>> It can be normal IO requests, I think. We can test the latency with
>>> fio doing small-size IO to a fast block device (like nvme) in SQPOLL
>>> mode (since for non-SQPOLL it doesn't make a difference). This way
>>> we can see the influence of reverse-order handling.
>>>
>>> Regards,
>>> Hao
>>
>> I see little difference locally, but there is quite a big stdev, so
>> it's possible my test setup is a bit wonky
>>
>> new:
>>      clat (msec): min=2027, max=10544, avg=6347.10, stdev=2458.20
>>       lat (nsec): min=1440, max=16719k, avg=119714.72, stdev=153571.49
>> old:
>>      clat (msec): min=2738, max=10550, avg=6700.68, stdev=2251.77
>>       lat (nsec): min=1278, max=16610k, avg=121025.73, stdev=211896.14
>>
>
> Hi Dylan,
>
> Could you post the arguments you use and the 99%/95% latency as well?
>
> Regards,
> Hao
>

One thing I'm worried about is that under heavy workloads, TWs keep
coming in continuously, so the TWs at the end of the TW list don't get
a chance to run, which leads to high latency for those requests.
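
To make that concrete, a deliberately degenerate consumer (reusing the
made-up tw_ctx/tw_item types from the sketch above, not anything from
the series) would look like this: if the newest entry is always popped
first while producers keep pushing, the oldest entries only run once
the producers pause. Batching with llist_del_all() shrinks the window,
but within a large batch the oldest requests still run last unless the
batch is reversed.

/* hypothetical worst case, not the patched code */
static void tw_run_newest_first(struct tw_ctx *ctx)
{
        struct llist_node *node;

        /* always pop the most recently added entry (single consumer only) */
        while ((node = llist_del_first(&ctx->list))) {
                struct tw_item *item = llist_entry(node, struct tw_item, node);

                /* old entries wait for as long as new ones keep arriving */
                item->cb(item);
        }
}

That's why I think the 99%/95% latency numbers are worth watching here.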