From: Jann Horn <jannh@google.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: io-uring <io-uring@vger.kernel.org>,
	Glauber Costa <glauber@scylladb.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Pavel Begunkov <asml.silence@gmail.com>
Subject: Re: [PATCH 7/9] io_uring: add per-task callback handler
Date: Fri, 21 Feb 2020 20:24:34 +0100
Message-ID: <CAG48ez0wObFg58RyZW4eOiJP39xSxeU9VedOD2OJQqSFV7dAwg@mail.gmail.com>
In-Reply-To: <ee96f96d-ff69-1ca8-25d8-a9b5b25512cd@kernel.dk>

On Fri, Feb 21, 2020 at 6:32 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 2/20/20 6:29 PM, Jann Horn wrote:
> > On Fri, Feb 21, 2020 at 12:22 AM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 2/20/20 4:12 PM, Jann Horn wrote:
> >>> On Fri, Feb 21, 2020 at 12:00 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>> On 2/20/20 3:23 PM, Jann Horn wrote:
> >>>>> On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>> On 2/20/20 3:02 PM, Jann Horn wrote:
> >>>>>>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>>> For poll requests, it's not uncommon to link a read (or write) after
> >>>>>>>> the poll to execute immediately after the file is marked as ready.
> >>>>>>>> Since the poll completion is called inside the waitqueue wake up handler,
> >>>>>>>> we have to punt that linked request to async context. This slows down
> >>>>>>>> the processing, and actually means it's faster to not use a link for this
> >>>>>>>> use case.
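
(For context, the linked poll->read pattern being described looks
roughly like this from userspace with liburing's public API; the
ring/fd/iov setup and error handling are assumed, not shown:)

    #include <liburing.h>
    #include <poll.h>

    /* Queue a poll, link a read behind it, submit both in one go. */
    static int queue_poll_then_read(struct io_uring *ring, int fd,
                                    struct iovec *iov)
    {
            struct io_uring_sqe *sqe;

            sqe = io_uring_get_sqe(ring);
            io_uring_prep_poll_add(sqe, fd, POLLIN);
            sqe->flags |= IOSQE_IO_LINK;  /* read runs once poll completes */

            sqe = io_uring_get_sqe(ring);
            io_uring_prep_readv(sqe, fd, iov, 1, 0);

            return io_uring_submit(ring);
    }
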
> >>> [...]
> >>>>>>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> >>>>>>>> +static void io_poll_task_func(struct callback_head *cb)
> >>>>>>>>  {
> >>>>>>>> -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> >>>>>>>> +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> >>>>>>>> +       struct io_kiocb *nxt = NULL;
> >>>>>>>>
> >>>>>>> [...]
> >>>>>>>> +       io_poll_task_handler(req, &nxt);
> >>>>>>>> +       if (nxt)
> >>>>>>>> +               __io_queue_sqe(nxt, NULL);
> >>>>>>>
> >>>>>>> This can now get here from anywhere that calls schedule(), right?
> >>>>>>> Which means that this might almost double the required kernel stack
> >>>>>>> size, if one codepath exists that calls schedule() while near the
> >>>>>>> bottom of the stack and another codepath exists that goes from here
> >>>>>>> through the VFS and again uses a big amount of stack space? This is a
> >>>>>>> somewhat ugly suggestion, but I wonder whether it'd make sense to
> >>>>>>> check whether we've consumed over 25% of stack space, or something
> >>>>>>> like that, and if so, directly punt the request.
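
Something along these lines, purely as a sketch (the helper is
hypothetical, and it assumes an arch with down-growing stacks that
provides current_stack_pointer; nothing like this is in the patches):

    #include <linux/sched/task_stack.h>

    static bool io_stack_mostly_used(void)
    {
            unsigned long base = (unsigned long)task_stack_page(current);
            unsigned long used = base + THREAD_SIZE - current_stack_pointer;

            /* Punt to async context once >25% of the stack is gone. */
            return used > THREAD_SIZE / 4;
    }
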
> >>> [...]
> >>>>>>> Also, can we recursively hit this point? Even if __io_queue_sqe()
> >>>>>>> doesn't *want* to block, the code it calls into might still block on a
> >>>>>>> mutex or something like that, at which point the mutex code would call
> >>>>>>> into schedule(), which would then again hit sched_out_update() and get
> >>>>>>> here, right? As far as I can tell, this could cause unbounded
> >>>>>>> recursion.
> >>>>>>
> >>>>>> The sched_work items are pruned before being run, so that can't happen.
> >>>>>
> >>>>> And is it impossible for new ones to be added in the meantime if a
> >>>>> second poll operation completes in the background just when we're
> >>>>> entering __io_queue_sqe()?
> >>>>
> >>>> True, that can happen.
> >>>>
> >>>> I wonder if, provided we prevent the recursion, we can ignore most
> >>>> of it. E.g. never process the sched_work list if we're not at the top
> >>>> level, so to speak.
> >>>>
> >>>> This should also prevent the deadlock that you mentioned with FUSE
> >>>> in the next email that just rolled in.
> >>>
> >>> But there the first ->read_iter could be from outside io_uring. So you
> >>> don't just have to worry about nesting inside an already-running uring
> >>> work; you also have to worry about nesting inside more or less
> >>> anything else that might be holding mutexes. So I think you'd pretty
> >>> much have to whitelist known-safe schedule() callers, or something
> >>> like that.
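
To make the "top level only" idea concrete, a rough sketch with a
per-task nesting guard (the flag and helper names are made up). Note
that it only stops recursion through sched_work itself; it does nothing
for the case above of entering from arbitrary schedule() callers that
hold locks, which is what the whitelisting would have to address:

    static void run_sched_work(struct task_struct *tsk)
    {
            if (tsk->in_sched_work)         /* hypothetical task flag */
                    return;                 /* nested schedule(): skip */

            tsk->in_sched_work = 1;
            /* ... splice off and run the pruned sched_work list ... */
            tsk->in_sched_work = 0;
    }
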
> >>
> >> I'll see if I can come up with something for that. Ideally any issue
> >> with IOCB_NOWAIT set should be honored, and trylock etc should be used.
> >
> > Are you sure? For example, an IO operation typically copies data to
> > userspace, which can take pagefaults. And those should be handled
> > synchronously even with IOCB_NOWAIT set, right? And the page fault
> > code can block on mutexes (like the mmap_sem) or even wait for a
> > blocking filesystem operation (via file mappings) or for userspace
> > (via userfaultfd or FUSE mappings).
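
As a toy illustration of that point (not real driver code):

    #include <linux/fs.h>
    #include <linux/uio.h>

    static ssize_t toy_read_iter(struct kiocb *iocb, struct iov_iter *to)
    {
            static const char data[] = "hello";

            /*
             * copy_to_iter() can fault on the destination pages; the
             * fault handler may take mmap_sem, wait on filesystem I/O
             * through a file mapping, or block on userfaultfd/FUSE,
             * no matter what IOCB_NOWAIT in iocb->ki_flags says.
             */
            return copy_to_iter(data, sizeof(data) - 1, to);
    }
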
>
> Yeah that's a good point. The more I think about it, the less I think
> the scheduler invoked callback is going to work. We need to be able to
> manage the context of when we are called, see later messages on the
> task_work usage instead.
>
> >> But I don't think we can fully rely on that; we need something a bit
> >> more solid...
> >>
> >>> Taking a step back: Do you know why this whole approach brings the
> >>> kind of performance benefit you mentioned in the cover letter? 4x is a
> >>> lot... Is it that expensive to take a trip through the scheduler?
> >>> I wonder whether the performance numbers for the echo test would
> >>> change if you commented out io_worker_spin_for_work()...
> >>
> >> If anything, I expect the spin removal to make it worse. There's really
> >> no magic there on why it's faster, if you offload work to a thread that
> >> is essentially sync, then you're going to take a huge hit in
> >> performance. It's the difference between:
> >>
> >> 1) Queue work with thread, wake up thread
> >> 2) Thread wakes, starts work, goes to sleep.
> >
> > If we go to sleep here, then the other side hasn't yet sent us
> > anything, so up to this point, it shouldn't have any impact on the
> > measured throughput, right?
> >
> >> 3) Data available, thread is woken, does work
> >
> > This is the same in the other case: Data is available, the
> > application's thread is woken and does the work.
> >
> >> 4) Thread signals completion of work
> >
> > And this is also basically the same, except that in the worker-thread
> > case, we have to go through the scheduler to reach userspace, while
> > with this patch series, we can signal "work is completed" and return
> > to userspace without an extra trip through the scheduler.
>
> There's a big difference between:
>
> - Task needs to do work, task goes to sleep on it, task is woken
>
> and
>
> - Task needs to do work, task passes work to thread. Task goes to sleep.
>   Thread wakes up, tries to do work, goes to sleep. Thread is woken,
>   does work, notifies task. Task is woken up.
>
> If you've ever done any sort of thread pool (userspace or otherwise),
> this is painful, and particularly so when you're only keeping one
> work item in flight. That kind of pipeline is rife with bubbles. If we
> can have multiple items in flight, then we start to gain ground due to
> the parallelism.
>
> > I could imagine this optimization having some performance benefit, but
> > I'm still sceptical about it buying a 4x benefit without some more
> > complicated reason behind it.
>
> I just re-ran the testing, this time on top of the current tree, where
> instead of doing the task/sched_work_add() we simply queue for async.
> This should be an even better case than before, since hopefully the
> thread will not need to go to sleep to process the work; it'll complete
> without blocking. For an echo test setup over a socket, this approach
> yields about 45-48K requests per second. This, btw, is with the io-wq
> spin removed. Using the callback method where the task itself does the
> work, 175K-180K requests per second.

Huh. So that's like, what, somewhere on the order of 7.6 microseconds
or somewhere around 15000 cycles overhead for shoving a request
completion event from worker context over to a task, assuming that
you're running at something around 2GHz? Well, I guess that's a little
more than twice as much time as it takes to switch from one blocked
thread to another via eventfd (including overhead from syscall and CPU
mitigations and stuff), so I guess it's not completely unreasonable...
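
(Spelling out the arithmetic behind that, on the assumption that each
echo round trip involves two offloaded completions: 1/48K ~= 20.8 us
per request vs. 1/180K ~= 5.6 us, a difference of ~15.2 us per request,
i.e. ~7.6 us per completion, or ~15000 cycles at 2 GHz.)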

Anyway, I'll stop nagging about this since it sounds like you're going
to implement this in a less unorthodox way now. ^^
