On Tue, Nov 08, 2022 at 09:15:23AM -0700, Jens Axboe wrote:
> On 11/8/22 9:10 AM, Stefan Hajnoczi wrote:
> > On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
> >> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
> >>> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
> >>>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> >>>>> Hi Jens,
> >>>>> NICs and storage controllers have interrupt mitigation/coalescing
> >>>>> mechanisms that are similar.
> >>>>
> >>>> Yep
> >>>>
> >>>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> >>>>> (counter) value. When a completion occurs, the device waits until the
> >>>>> timeout or until the completion counter value is reached.
> >>>>>
> >>>>> If I've read the code correctly, min_wait is computed at the beginning
> >>>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> >>>>> completion.
> >>>>>
> >>>>> It makes me wonder which approach is more useful for applications. With
> >>>>> the Aggregation Time approach, applications can control how much extra
> >>>>> latency is added. What do you think about that approach?
> >>>>
> >>>> We only tested the current approach, which is time noted from entry, not
> >>>> from when the first event arrives. I suspect the nvme approach is better
> >>>> suited to the hw side; the epoll timeout helps ensure that we batch
> >>>> within xx usec rather than xx usec + whatever the delay until the first
> >>>> one arrives. Which is why it's handled that way currently. That gives
> >>>> you a fixed batch latency.
> >>>
> >>> min_wait is fine when the goal is just maximizing throughput without any
> >>> latency targets.
> >>
> >> That's not true at all; I think you're in different time scales than
> >> this would be used for.
> >>
> >>> The min_wait approach makes it hard to set a useful upper bound on
> >>> latency because unlucky requests that complete early experience much
> >>> more latency than requests that complete later.
> >>
> >> As mentioned in the cover letter or the main patch, this is most useful
> >> for the medium load kind of scenarios. For high load, the min_wait time
> >> ends up not mattering because you will hit maxevents first anyway. For
> >> the testing that we did, the target was 200-300 usec, and 200 usec was
> >> used for the actual test. Depending on the kind of traffic the server is
> >> serving, that's usually not much of a concern. From your reply, I'm
> >> guessing you're thinking of much higher min_wait numbers. I don't think
> >> those would make sense. If your rate of arrival is low enough that
> >> min_wait needs to be high to make a difference, then the load is low
> >> enough anyway that it doesn't matter. Hence I'd argue that it is indeed
> >> NOT hard to set a useful upper bound on latency, because that is very
> >> much what min_wait is.
> >>
> >> I'm happy to argue the merits of one approach over another, but keep in
> >> mind that this particular approach was not pulled out of thin air AND it
> >> has actually been tested and verified successfully on a production
> >> workload. This isn't a hypothetical benchmark kind of setup.
> >
> > Fair enough. I just wanted to make sure the syscall interface that gets
> > merged is as useful as possible.
>
> That is indeed the main discussion as far as I'm concerned - syscall,
> ctl, or both? At this point I'm inclined to just push forward with the
> ctl addition.
> A new syscall can always be added, and if we do, then it'd
> be nice to make one that will work going forward so we don't have to
> keep adding epoll_wait variants...

epoll_wait3() would be consistent with how maxevents and timeout work. It
does not suffer from extra ctl syscall overhead when applications need to
change min_wait.

The way the current patches add min_wait into epoll_ctl() seems hacky to
me. struct epoll_event was meant for file descriptor event entries. It
won't necessarily be large enough for future extensions (luckily min_wait
only needs a uint64_t value). It's turning epoll_ctl() into an
ioctl()/setsockopt()-style interface, which is bad for anything that needs
to understand syscalls, like seccomp.

A properly typed epoll_wait3() seems cleaner to me.
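Just to illustrate what I mean, here is a rough sketch of what a typed
interface could look like. The struct, field names, and prototype below
are placeholders of my own, not the interface from the posted patches:

/*
 * Hypothetical typed epoll_wait3(): min_wait is passed per call instead
 * of being stored in the epoll context via epoll_ctl().
 */
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>

struct epoll_wait_params {
        uint64_t min_wait_ns;   /* batch events for at least this long */
        uint64_t timeout_ns;    /* return no later than this */
        uint64_t flags;         /* reserved, must be zero */
};

int epoll_wait3(int epfd, struct epoll_event *events, int maxevents,
                struct epoll_wait_params *params, size_t size);

/*
 * Example call site:
 *
 *      struct epoll_wait_params p = {
 *              .min_wait_ns = 200 * 1000,          // 200 usec batching window
 *              .timeout_ns  = 10ULL * 1000 * 1000, // 10 msec upper bound
 *      };
 *      int n = epoll_wait3(epfd, events, maxevents, &p, sizeof(p));
 */

Passing a struct plus its size (the same pattern as openat2(2) and
clone3(2)) would leave room for growing the interface later without yet
another epoll_wait variant, and it keeps the operation visible at the
syscall boundary rather than multiplexed through epoll_ctl().

Stefan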