From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C33EBC4332F
	for <linux-kernel@archiver.kernel.org>; Tue,  8 Nov 2022 17:25:16 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S234501AbiKHRZP (ORCPT <rfc822;linux-kernel@archiver.kernel.org>);
        Tue, 8 Nov 2022 12:25:15 -0500
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57068 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S234108AbiKHRZM (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Tue, 8 Nov 2022 12:25:12 -0500
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8D5E6EAF
        for <linux-kernel@vger.kernel.org>; Tue,  8 Nov 2022 09:24:17 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
        s=mimecast20190719; t=1667928256;
        h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
         to:to:cc:cc:mime-version:mime-version:content-type:content-type:
         in-reply-to:in-reply-to:references:references;
        bh=jaxJz/zVXTLSHM/FX5ogpiQ4fyPxyR99ZK0ufiiW1NY=;
        b=faXUYWMpiPveNOKcLprQijaYFfSTFkVpeVtVpo7VOF6hDdS1QmtfAr5mTnpSGPK6Y9W8S3
        lG4pq16est6lYulVIBJJqgTTY49j88ixV2KqD3M7lQrSmQM+OPiNakNjt2Akb4D4eHNZsB
        6xBIN9L2alPQlnVpAIhuq7GSDqE9bnM=
Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com
 [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 us-mta-261-zSt8iAHUMaOs1HYu3Wzv0Q-1; Tue, 08 Nov 2022 12:24:13 -0500
X-MC-Unique: zSt8iAHUMaOs1HYu3Wzv0Q-1
Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6])
        (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
        (No client certificate requested)
        by mimecast-mx02.redhat.com (Postfix) with ESMTPS id A2AD43C43B21;
        Tue,  8 Nov 2022 17:24:12 +0000 (UTC)
Received: from localhost (unknown [10.39.195.193])
        by smtp.corp.redhat.com (Postfix) with ESMTP id 1D93C2166B29;
        Tue,  8 Nov 2022 17:24:11 +0000 (UTC)
Date:   Tue, 8 Nov 2022 12:24:10 -0500
From:   Stefan Hajnoczi <stefanha@redhat.com>
To:     Jens Axboe <axboe@kernel.dk>
Cc:     linux-kernel@vger.kernel.org, netdev@vger.kernel.org
Subject: Re: [PATCHSET v3 0/5] Add support for epoll min_wait
Message-ID: <Y2qQuiZvuML14wX0@fedora>
References: <Y2lw4Qc1uI+Ep+2C@fedora>
 <4281b354-d67d-2883-d966-a7816ed4f811@kernel.dk>
 <Y2phEZKYuSmPL5B5@fedora>
 <93fa2da5-c81a-d7f8-115c-511ed14dcdbb@kernel.dk>
 <Y2p/YcUFhFDUnLGq@fedora>
 <75c8f5fe-6d5f-32a9-1417-818246126789@kernel.dk>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
        protocol="application/pgp-signature"; boundary="O0PSbzeHbyNIUTpg"
Content-Disposition: inline
In-Reply-To: <75c8f5fe-6d5f-32a9-1417-818246126789@kernel.dk>
X-Scanned-By: MIMEDefang 3.1 on 10.11.54.6
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


--O0PSbzeHbyNIUTpg
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Tue, Nov 08, 2022 at 09:15:23AM -0700, Jens Axboe wrote:
> On 11/8/22 9:10 AM, Stefan Hajnoczi wrote:
> > On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
> >> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
> >>> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
> >>>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> >>>>> Hi Jens,
> >>>>> NICs and storage controllers have interrupt mitigation/coalescing
> >>>>> mechanisms that are similar.
> >>>>
> >>>> Yep
> >>>>
> >>>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> >>>>> (counter) value. When a completion occurs, the device waits until t=
he
> >>>>> timeout or until the completion counter value is reached.
> >>>>>
> >>>>> If I've read the code correctly, min_wait is computed at the beginn=
ing
> >>>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> >>>>> completion.
> >>>>>
> >>>>> It makes me wonder which approach is more useful for applications. =
With
> >>>>> the Aggregation Time approach applications can control how much ext=
ra
> >>>>> latency is added. What do you think about that approach?
> >>>>
> >>>> We only tested the current approach, which is time noted from entry,=
 not
> >>>> from when the first event arrives. I suspect the nvme approach is be=
tter
> >>>> suited to the hw side, the epoll timeout helps ensure that we batch
> >>>> within xx usec rather than xx usec + whatever the delay until the fi=
rst
> >>>> one arrives. Which is why it's handled that way currently. That gives
> >>>> you a fixed batch latency.
> >>>
> >>> min_wait is fine when the goal is just maximizing throughput without =
any
> >>> latency targets.
> >>
> >> That's not true at all, I think you're in different time scales than
> >> this would be used for.
> >>
> >>> The min_wait approach makes it hard to set a useful upper bound on
> >>> latency because unlucky requests that complete early experience much
> >>> more latency than requests that complete later.
> >>
> >> As mentioned in the cover letter or the main patch, this is most useful
> >> for the medium load kind of scenarios. For high load, the min_wait time
> >> ends up not mattering because you will hit maxevents first anyway. For
> >> the testing that we did, the target was 2-300 usec, and 200 usec was
> >> used for the actual test. Depending on what the kind of traffic the
> >> server is serving, that's usually not much of a concern. From your
> >> reply, I'm guessing you're thinking of much higher min_wait numbers. I
> >> don't think those would make sense. If your rate of arrival is low
> >> enough that min_wait needs to be high to make a difference, then the
> >> load is low enough anyway that it doesn't matter. Hence I'd argue that
> >> it is indeed NOT hard to set a useful upper bound on latency, because
> >> that is very much what min_wait is.
> >>
> >> I'm happy to argue merits of one approach over another, but keep in mi=
nd
> >> that this particular approach was not pulled out of thin air AND it has
> >> actually been tested and verified successfully on a production workloa=
d.
> >> This isn't a hypothetical benchmark kind of setup.
> >=20
> > Fair enough. I just wanted to make sure the syscall interface that gets
> > merged is as useful as possible.
>=20
> That is indeed the main discussion as far as I'm concerned - syscall,
> ctl, or both? At this point I'm inclined to just push forward with the
> ctl addition. A new syscall can always be added, and if we do, then it'd
> be nice to make one that will work going forward so we don't have to
> keep adding epoll_wait variants...

epoll_wait3() would be consistent with how maxevents and timeout work.
It does not suffer from extra ctl syscall overhead when applications
need to change min_wait.

The way the current patches add min_wait into epoll_ctl() seems hacky to
me. struct epoll_event was meant for file descriptor event entries. It
won't necessarily be large enough for future extensions (luckily
min_wait only needs a uint64_t value). It's turning epoll_ctl() into an
ioctl()/setsockopt()-style interface, which is bad for anything that
needs to understand syscalls, like seccomp. A properly typed
epoll_wait3() seems cleaner to me.

Stefan

--O0PSbzeHbyNIUTpg
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEhpWov9P5fNqsNXdanKSrs4Grc8gFAmNqkLoACgkQnKSrs4Gr
c8jMTwgAh6tsZz93MQq2mnRSH6XAQo+Ph8jToGrOwVvszEAkUWVuU43QLvwenksK
1qKC6u6XF67qTJFEuv0GranpsTrkrthQblxDd+MZjFd9XwWg3/JlmEqsqPM7BnJs
zKsO3vAf7FH6kn5EN2lW3CVZPQm/9M5aZjpkYZR9RGJInqLgG5yf686ZV1gXQx+F
AId8I4UVY2iQIpbtOewVDs92y6kZCU5GbTv5eZffU+r0a+nS/heGghbTY0BfNcix
ZBPffReBZOIWnXyC5gPMH0tRGkc8exm8ZIMPvm21eXqaCo2vwT5EVPkYup19OyEk
27EdGvpWh6p8WHDZydntmVkLqcD87w==
=/P5y
-----END PGP SIGNATURE-----

--O0PSbzeHbyNIUTpg--