From: Christian Schoenebeck
To: qemu-devel@nongnu.org
Cc: Greg Kurz
Subject: Re: [PATCH v4 10/11] 9pfs: T_readdir latency optimization
Date: Tue, 10 Mar 2020 16:10:41 +0100
Message-ID: <1651612.bE1MAdl4rA@silver>
In-Reply-To: <20200309164259.2affe4bc@bahia.home>
References: <1706149.tZr2OiJYdg@silver> <20200309164259.2affe4bc@bahia.home>
On Monday, 9 March 2020 16:42:59 CET Greg Kurz wrote:
> The main issue I see with this patch is that it is too big. At
> least from my POV, with the little time I have for 9p, that's
> the first thing that pushes me back... So if you could split
> this patch down to some reviewable chunks, this may help
> going forward.

This patch is also too big for my preference, but I don't see a viable way
to split it further into separate patches. I already separated all the
patches I could. If you have suggestions, they are very much appreciated!

The reason for this is that fixing these issues in the current T_readdir
implementation requires separating what's currently one task (i.e. one
function) into two separate tasks (i.e. two functions), and there is no
sensible way to do that in smaller steps.

But don't be too scared about this patch; if you look just at the diff
directly, then yes, the diff is not very human readable. However, if you
apply the patch and look at the resulting code, you'll notice that there
are actually only 2 functions (v9fs_do_readdir() in 9p.c and
do_readdir_lowlat() in codir.c) which require careful reviewing, and that
their resulting code is actually very straightforward, and not long either.

Current code on master is much more tricky and error prone due to the huge
amount of potential branches, individual error/cleanup handlers, the high
amount of thread dispatching, and much more. In the patched version all
these code complexities and error sources are removed.
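To give a rough idea of the resulting structure, here is a condensed sketch
of the idea (illustration only, not the literal patch code): the whole
directory scan is dispatched to a background worker thread exactly once per
T_readdir request, instead of once per directory entry:

/* Condensed sketch (illustration only, not the literal patch code): the
 * whole directory scan runs inside a single worker thread dispatch, so
 * the coroutine <-> worker round trip is paid once per T_readdir request
 * instead of once per directory entry. */
int coroutine_fn v9fs_co_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
                                        struct V9fsDirEnt **entries,
                                        int32_t maxsize, bool dostat)
{
    int err;

    if (v9fs_request_cancelled(pdu)) {
        return -EINTR;
    }
    v9fs_co_run_in_worker({
        /* loops over s->ops->readdir() on the worker thread and collects
         * entries until maxsize bytes are gathered, i.e. a single fs
         * driver request for the whole directory listing */
        err = do_readdir_lowlat(pdu, fidp, entries, maxsize, dostat);
    });
    return err;
}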
> > > diff --git a/hw/9pfs/codir.c b/hw/9pfs/codir.c
> > > index 73f9a751e1..6ce7dc8cde 100644
> > > --- a/hw/9pfs/codir.c
> > > +++ b/hw/9pfs/codir.c
> > > @@ -18,28 +18,189 @@
> > > 
> > >  #include "qemu/main-loop.h"
> > >  #include "coth.h"
> > > 
> > > +/*
> > > + * This is solely executed on a background IO thread.
> > > + */
> > > +static int do_readdir(V9fsPDU *pdu, V9fsFidState *fidp, struct dirent **dent)
> > > +{
> > > +    int err = 0;
> > > +    V9fsState *s = pdu->s;
> > > +    struct dirent *entry;
> > > +
> > > +    errno = 0;
> > > +    entry = s->ops->readdir(&s->ctx, &fidp->fs);
> > > +    if (!entry && errno) {
> > > +        *dent = NULL;
> > > +        err = -errno;
> > > +    } else {
> > > +        *dent = entry;
> > > +    }
> > > +    return err;
> > > +}
> > > +
> > > +/*
> > > + * TODO: This will be removed for performance reasons.
> > > + * Use v9fs_co_readdir_lowlat() instead.
> 
> Oh, so we'd still have the current implementation being used, even
> with this patch applied... This would be okay for a RFC patch but
> in the end I'd really like to merge something that also converts
> v9fs_do_readdir_with_stat().

Yes, I know, but I would not recommend mixing these things at this point,
because it would be an entire effort on its own.

v9fs_do_readdir_with_stat() is used for 9P2000.u, while v9fs_do_readdir()
is used for 9P2000.L. They behave very differently, so it would not only
require me to update v9fs_do_readdir_with_stat() and v9fs_read(), I would
also need to write separate test cases (plural, since there are none at
all yet) and benchmarks, and of course somebody would need to review all
that additional amount of code. And who would actually test it? I am
actively testing my 9P2000.L changes, but I am not actually using
9P2000.u, so I could not guarantee the same quality.

Considering the current commitment situation regarding 9pfs, we can only
bring things forward with compromises. Hence I suggest I concentrate on
fixing the worst performance issues on the 9P2000.L side first, and later
on, if there really are people interested in seeing these improvements on
the 9P2000.u side and willing to at least help out with testing, then I
can definitely adjust the 9P2000.u side accordingly as well.

But to also make this clear: this patch 10 is not introducing any
behaviour change for the 9P2000.u side.

> > >  int coroutine_fn v9fs_co_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
> > >                                   struct dirent **dent)
> > >  {
> > >      int err;
> > > 
> > > -    V9fsState *s = pdu->s;
> > > 
> > >      if (v9fs_request_cancelled(pdu)) {
> > >          return -EINTR;
> > >      }
> > > 
> > > -    v9fs_co_run_in_worker(
> > > -        {
> > > -            struct dirent *entry;
> > > +    v9fs_co_run_in_worker({
> > > +        err = do_readdir(pdu, fidp, dent);
> > > +    });
> > > +    return err;
> > > +}
> > > +
> > > +/*
> > > + * This is solely executed on a background IO thread.
> > > + *
> > > + * See v9fs_co_readdir_lowlat() (as its only user) below for details.
> 
> Since the only user is quite small, maybe just open-code this.

Mmm... maybe later on, as a cleanup patch or something. The current
version is tested. I would like to avoid making things more complicated
than they already are (i.e. accidentally introducing some bug just because
of this).

> 
> > > + */
> > > +static int do_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
> > > +                             struct V9fsDirEnt **entries,
> > > +                             int32_t maxsize, bool dostat)
> > > +{
> > > +    V9fsState *s = pdu->s;
> > > +    V9fsString name;
> > > +    int len, err = 0;
> > > +    int32_t size = 0;
> > > +    off_t saved_dir_pos;
> > > +    struct dirent *dent;
> > > +    struct V9fsDirEnt *e = NULL;
> > > +    V9fsPath path;
> > > +    struct stat stbuf;
> > > +
> > > +    *entries = NULL;
> > > +    v9fs_path_init(&path);
> > > +
> > > +    /*
> > > +     * TODO: Here should be a warn_report_once() if lock failed.
> > > +     *
> > > +     * With a good 9p client we should not get into concurrency here,
> > > +     * because a good client would not use the same fid for concurrent
> > > +     * requests. We do the lock here for safety reasons though. However
> > > +     * the client would then suffer performance issues, so better log
> > > +     * that issue here.
> > > +     */
> 
> This should probably be done with qemu_mutex_trylock() in a loop directly
> in v9fs_readdir_lock().

No, definitely not in the loop. That is intentionally *one* lock *outside*
of the loop. The normal case is not requiring a lock at all, like the
comment describes. Taking multiple locks inside the loop would needlessly
reduce performance.

About *_trylock() instead of v9fs_readdir_lock(): it might be a candidate
to address the TODO comment, but I would like to postpone that for the
same reasons: encountering concurrency here is an odd/ill-behaved case. So
for all well-implemented clients this is not an issue at all, and security
(concerning malicious clients) is already provided by this lock.
Addressing this TODO right now would potentially introduce unnecessary
bugs due to added complexity, so I would really postpone that.
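Just to illustrate the direction I have in mind for that TODO (a rough,
untested sketch only, not part of this patch, assuming the existing
QemuMutex-based readdir_mutex in V9fsDir and QEMU's warn_report_once()):
still one blocking lock outside the loop, but with a one-time warning if a
client actually provokes contention:

/* Rough, untested sketch for addressing the TODO later on (not part of
 * this patch): keep the single lock outside the loop, but warn once if
 * an ill-behaved client provokes contention by issuing concurrent
 * requests on the same fid. Needs "qemu/error-report.h" for
 * warn_report_once(). */
static inline void v9fs_readdir_lock(V9fsDir *dir)
{
    if (qemu_mutex_trylock(&dir->readdir_mutex)) {
        /* Contended: only misbehaving clients get here. Warn once,
         * then fall back to the normal blocking lock. */
        warn_report_once("9p: concurrent readdir on the same fid");
        qemu_mutex_lock(&dir->readdir_mutex);
    }
}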
> > > +/**
> > > + * @brief Low latency variant of fs driver readdir handling.
> > > + *
> 
> Not sure it brings much to mention latency here... telling what this
> function does actually, i.e. "read multiple directory entries in a row",
> would be more informative.

NP

> > > + * Retrieves the requested (max. amount of) directory entries from the
> > > + * fs driver. This function must only be called by the main IO thread
> > > + * (top half). Internally this function call will be dispatched to a
> > > + * background IO thread (bottom half) where it is eventually executed
> > > + * by the fs driver.
> > > + *
> > > + * The old readdir implementation above just retrieves always one dir
> > > + * entry per call. The problem of that implementation above is that
> > > + * latency is added for (retrieval of) each directory entry, which in
> > > + * practice lead to latencies of several hundred ms for readdir of
> > > + * only one directory.
> > > + *
> > > + * This is avoided in this function by letting the fs driver retrieve
> > > + * all required directory entries with only one call of this function
> > > + * and hence with only a single fs driver request.
> 
> True but these kind of considerations rather belong to the changelog...
> otherwise, we'll have to not forget to kill these lines when the "old
> readdir implementation" is no more.

Mmm... I do think this overall latency issue should be clarified somewhere
as a comment. Where exactly is not that important to me. For instance it
could also be described on the v9fs_co_run_in_worker() macro definition in
coth.h instead, as a general rule of thumb for dispatching stuff.

The thing is: for more than 10 years several developers obviously did not
realize the severe negative performance impact of frivolously dispatching
tasks to different threads too often. It looks like linear code, right?
But no, it is not.

> > > + *
> > > + * @param pdu - the causing 9p (T_readdir) client request
> > > + * @param fidp - already opened directory where readdir shall be
> > > + *               performed on
> > > + * @param entries - output for directory entries (must not be NULL)
> > > + * @param maxsize - maximum result message body size (in bytes)
> > > + * @param dostat - whether a stat() should be performed and returned
> > > + *                 for each directory entry
> > > + * @returns resulting response message body size (in bytes) on success,
> > > + *          negative error code otherwise
> > > + */
> > > +int coroutine_fn v9fs_co_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
> 
> Not really a fan of the _lowlat (low fat ?) wording.
> 
> v9fs_co_readdir_many() or anything else that suggests we're going to
> read several entries in a row.

NP

> 
> > > +                                        struct V9fsDirEnt **entries,
> > > +                                        int32_t maxsize, bool dostat)
> 
> s/maxsize/max_count since this is the wording used in the caller.

Wouldn't do that. This is the driver side, not the 9p protocol request
handler. As you might have realized before, "max_count" is semantically
completely wrong. This variable does not count integral entries; it is
rather the maximum size (in bytes) of the destination buffer being filled.

On the 9p request handler side it is okay to use "max_count" instead,
because the protocol definition uses exactly this term for the request
parameter, but on the driver side it's IMO inappropriate.

Best regards,
Christian Schoenebeck