From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jann Horn
Date: Fri, 21 Feb 2020 20:24:34 +0100
Subject: Re: [PATCH 7/9] io_uring: add per-task callback handler
To: Jens Axboe
Cc: io-uring, Glauber Costa, Peter Zijlstra, Pavel Begunkov
References: <20200220203151.18709-1-axboe@kernel.dk>
 <20200220203151.18709-8-axboe@kernel.dk>
 <4caec29c-469d-7448-f779-af3ba9c6c6a9@kernel.dk>
X-Mailing-List: io-uring@vger.kernel.org

On Fri, Feb 21, 2020 at 6:32 PM Jens Axboe wrote:
> On 2/20/20 6:29 PM, Jann Horn wrote:
> > On Fri, Feb 21, 2020 at 12:22 AM Jens Axboe wrote:
> >> On 2/20/20 4:12 PM, Jann Horn wrote:
> >>> On Fri, Feb 21, 2020 at 12:00 AM Jens Axboe wrote:
> >>>> On 2/20/20 3:23 PM, Jann Horn wrote:
> >>>>> On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe wrote:
> >>>>>> On 2/20/20 3:02 PM, Jann Horn wrote:
> >>>>>>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe wrote:
> >>>>>>>> For poll requests, it's not uncommon to link a read (or write) after
> >>>>>>>> the poll to execute immediately after the file is marked as ready.
> >>>>>>>> Since the poll completion is called inside the waitqueue wake up handler,
> >>>>>>>> we have to punt that linked request to async context. This slows down
> >>>>>>>> the processing, and actually means it's faster to not use a link for this
> >>>>>>>> use case.
> >>> [...]
> >>>>>>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> >>>>>>>> +static void io_poll_task_func(struct callback_head *cb)
> >>>>>>>>  {
> >>>>>>>> -	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> >>>>>>>> +	struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> >>>>>>>> +	struct io_kiocb *nxt = NULL;
> >>>>>>>>
> >>>>>>> [...]
> >>>>>>>> +	io_poll_task_handler(req, &nxt);
> >>>>>>>> +	if (nxt)
> >>>>>>>> +		__io_queue_sqe(nxt, NULL);
> >>>>>>>
> >>>>>>> This can now get here from anywhere that calls schedule(), right?
> >>>>>>> Which means that this might almost double the required kernel stack
> >>>>>>> size, if one codepath exists that calls schedule() while near the
> >>>>>>> bottom of the stack and another codepath exists that goes from here
> >>>>>>> through the VFS and again uses a big amount of stack space? This is a
> >>>>>>> somewhat ugly suggestion, but I wonder whether it'd make sense to
> >>>>>>> check whether we've consumed over 25% of stack space, or something
> >>>>>>> like that, and if so, directly punt the request.
> >>> [...]
> >>>>>>> Also, can we recursively hit this point? Even if __io_queue_sqe()
> >>>>>>> doesn't *want* to block, the code it calls into might still block on a
> >>>>>>> mutex or something like that, at which point the mutex code would call
> >>>>>>> into schedule(), which would then again hit sched_out_update() and get
> >>>>>>> here, right? As far as I can tell, this could cause unbounded
> >>>>>>> recursion.
> >>>>>>
> >>>>>> The sched_work items are pruned before being run, so that can't happen.
> >>>>>
> >>>>> And is it impossible for new ones to be added in the meantime if a
> >>>>> second poll operation completes in the background just when we're
> >>>>> entering __io_queue_sqe()?
> >>>>
> >>>> True, that can happen.
> >>>>
> >>>> I wonder if we just prevent the recursion whether we can ignore most
> >>>> of it. Eg never process the sched_work list if we're not at the top
> >>>> level, so to speak.
> >>>>
> >>>> This should also prevent the deadlock that you mentioned with FUSE
> >>>> in the next email that just rolled in.
> >>>
> >>> But there the first ->read_iter could be from outside io_uring. So you
> >>> don't just have to worry about nesting inside an already-running uring
> >>> work; you also have to worry about nesting inside more or less
> >>> anything else that might be holding mutexes. So I think you'd pretty
> >>> much have to whitelist known-safe schedule() callers, or something
> >>> like that.
> >>
> >> I'll see if I can come up with something for that. Ideally any issue
> >> with IOCB_NOWAIT set should be honored, and trylock etc should be used.
> >
> > Are you sure? For example, an IO operation typically copies data to
> > userspace, which can take pagefaults.
> > And those should be handled
> > synchronously even with IOCB_NOWAIT set, right? And the page fault
> > code can block on mutexes (like the mmap_sem) or even wait for a
> > blocking filesystem operation (via file mappings) or for userspace
> > (via userfaultfd or FUSE mappings).
>
> Yeah that's a good point. The more I think about it, the less I think
> the scheduler-invoked callback is going to work. We need to be able to
> manage the context of when we are called, see later messages on the
> task_work usage instead.
>
> >> But I don't think we can fully rely on that, we need something a bit
> >> more solid...
> >>
> >>> Taking a step back: Do you know why this whole approach brings the
> >>> kind of performance benefit you mentioned in the cover letter? 4x is a
> >>> lot... Is it that expensive to take a trip through the scheduler?
> >>> I wonder whether the performance numbers for the echo test would
> >>> change if you commented out io_worker_spin_for_work()...
> >>
> >> If anything, I expect the spin removal to make it worse. There's really
> >> no magic there on why it's faster, if you offload work to a thread that
> >> is essentially sync, then you're going to take a huge hit in
> >> performance. It's the difference between:
> >>
> >> 1) Queue work with thread, wake up thread
> >> 2) Thread wakes, starts work, goes to sleep.
> >
> > If we go to sleep here, then the other side hasn't yet sent us
> > anything, so up to this point, it shouldn't have any impact on the
> > measured throughput, right?
> >
> >> 3) Data available, thread is woken, does work
> >
> > This is the same in the other case: Data is available, the
> > application's thread is woken and does the work.
> >
> >> 4) Thread signals completion of work
> >
> > And this is also basically the same, except that in the worker-thread
> > case, we have to go through the scheduler to reach userspace, while
> > with this patch series, we can signal "work is completed" and return
> > to userspace without an extra trip through the scheduler.
>
> There's a big difference between:
>
> - Task needs to do work, task goes to sleep on it, task is woken
>
> and
>
> - Task needs to do work, task passes work to thread. Task goes to sleep.
>   Thread wakes up, tries to do work, goes to sleep. Thread is woken,
>   does work, notifies task. Task is woken up.
>
> If you've ever done any sort of thread pool (userspace or otherwise),
> this is painful, and particularly so when you're only keeping one
> work item in flight. That kind of pipeline is rife with bubbles. If we
> can have multiple items in flight, then we start to gain ground due to
> the parallelism.
>
> > I could imagine this optimization having some performance benefit, but
> > I'm still sceptical about it buying a 4x benefit without some more
> > complicated reason behind it.
>
> I just re-ran the testing, this time on top of the current tree, where
> instead of doing the task/sched_work_add() we simply queue for async.
> This should be an even better case than before, since hopefully the
> thread will not need to go to sleep to process the work, it'll complete
> without blocking. For an echo test setup over a socket, this approach
> yields about 45-48K requests per second. This, btw, is with the io-wq
> spin removed. Using the callback method where the task itself does the
> work, 175K-180K requests per second.

Huh.
So that's like, what, somewhere on the order of 7.6 microseconds or
somewhere around 15000 cycles overhead for shoving a request completion
event from worker context over to a task, assuming that you're running
at something around 2GHz?

Well, I guess that's a little more than twice as much time as it takes
to switch from one blocked thread to another via eventfd (including
overhead from syscall and CPU mitigations and stuff), so I guess it's
not completely unreasonable...

Anyway, I'll stop nagging about this since it sounds like you're going
to implement this in a less unorthodox way now. ^^
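
For reference, a back-of-the-envelope that lands on roughly those
figures. The 45-48K and 175-180K rates come from the thread; the rest
is guesswork, namely that each counted echo round trip completes two
io_uring requests (a recv plus a send) and that the machine runs at
about 2 GHz:

    async offload:  1 / 48,000  ~= 20.8 us per round trip
    task callback:  1 / 180,000 ~=  5.6 us per round trip
    difference:                 ~= 15.3 us per round trip
    per request (recv + send):  ~=  7.6 us
    at ~2 GHz:                  ~= 15,000 cycles per request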
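
On the earlier "punt if we've consumed over 25% of stack space" idea, a
minimal, untested sketch of what such a check might look like, assuming
a downward-growing stack; current_stack_pointer is arch-provided (e.g.
on x86), and the helper name is made up:

#include <linux/sched.h>
#include <linux/sched/task_stack.h>

/*
 * Return true if the current task has already used more than a quarter
 * of its kernel stack; a caller could then punt the linked request to
 * async context instead of issuing it inline from the callback.
 * Assumes the stack grows down, so end_of_stack() is the lowest usable
 * address of the stack.
 */
static bool io_stack_too_deep(void)
{
	unsigned long free = current_stack_pointer -
			     (unsigned long)end_of_stack(current);

	return free < 3 * THREAD_SIZE / 4;
}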
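
And on the task_work direction the thread converged on, a rough sketch
of queueing the poll completion to the submitting task with the
v5.6-era task_work_add() (which still takes a bool). io_poll_task_func()
and io_queue_async_work() exist in the patch/fs/io_uring.c; the
task_work field on io_kiocb and the helper name are assumptions for
illustration only:

#include <linux/task_work.h>

/*
 * Sketch: instead of running the linked request from a scheduler hook,
 * hand it to the submitting task. The task then runs
 * io_poll_task_func() on its way back to userspace, i.e. in a context
 * where it is safe to block and where its own mm is active.
 */
static void io_poll_queue_task_work(struct io_kiocb *req,
				    struct task_struct *task)
{
	init_task_work(&req->task_work, io_poll_task_func);

	/* notify == true kicks the task so the callback runs promptly */
	if (task_work_add(task, &req->task_work, true)) {
		/* the task is exiting; fall back to the io-wq worker pool */
		io_queue_async_work(req);
	}
}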