Re: [PATCH 1/3] io_uring: add support for async work inheriting files table

From: Jann Horn <jannh@google.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org,
	"David S. Miller" <davem@davemloft.net>,
	Network Development <netdev@vger.kernel.org>
Subject: Re: [PATCH 1/3] io_uring: add support for async work inheriting files table
Date: Fri, 25 Oct 2019 01:13:20 +0200	[thread overview]
Message-ID: <CAG48ez00zr2P1WCznnXmTvq+FQ4Ji8kDnuNqbeeMvOh_MhXeTg@mail.gmail.com> (raw)
In-Reply-To: <c252182a-4d09-5e9b-112b-2dad9ef123b5@kernel.dk>

On Fri, Oct 25, 2019 at 12:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 10/24/19 2:31 PM, Jann Horn wrote:
> > On Thu, Oct 24, 2019 at 9:41 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 10/18/19 12:50 PM, Jann Horn wrote:
> >>> On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>> On 10/18/19 12:06 PM, Jann Horn wrote:
> >>>>> But actually, by the way: Is this whole files_struct thing creating a
> >>>>> reference loop? The files_struct has a reference to the uring file,
> >>>>> and the uring file has ACCEPT work that has a reference to the
> >>>>> files_struct. If the task gets killed and the accept work blocks, the
> >>>>> entire files_struct will stay alive, right?
> >>>>
> >>>> Yes, for the lifetime of the request, it does create a loop. So if the
> >>>> application goes away, I think you're right, the files_struct will stay.
> >>>> And so will the io_uring, for that matter, as we depend on the closing
> >>>> of the files to do the final reap.
> >>>>
> >>>> Hmm, not sure how best to handle that, to be honest. We need some way to
> >>>> break the loop, if the request never finishes.
> >>>
> >>> A wacky and dubious approach would be to, instead of taking a
> >>> reference to the files_struct, abuse f_op->flush() to synchronously
> >>> flush out pending requests with references to the files_struct... But
> >>> it's probably a bad idea, given that in f_op->flush(), you can't
> >>> easily tell which files_struct the close is coming from. I suppose you
> >>> could keep a list of (fdtable, fd) pairs through which ACCEPT requests
> >>> have come in and then let f_op->flush() probe whether the file
> >>> pointers are gone from them...
> >>
> >> Got back to this after finishing the io-wq stuff, which we need for the
> >> cancel.
> >>
> >> Here's an updated patch:
> >>
> >> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570
> >>
> >> that seems to work for me (lightly tested), we correctly find and cancel
> >> work that is holding on to the file table.
> >>
> >> The full series sits on top of my for-5.5/io_uring-wq branch, and can be
> >> viewed here:
> >>
> >> http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test
> >>
> >> Let me know what you think!
> >
> > Ah, I didn't realize that the second argument to f_op->flush is a
> > pointer to the files_struct. That's neat.
> >
> >
> > Security: There is no guarantee that ->flush() will run after the last
> > io_uring_enter() finishes. You can race like this, with threads A and
> > B in one process and C in another one:
> >
> > A: sends uring fd to C via unix domain socket
> > A: starts syscall io_uring_enter(fd, ...)
> > A: calls fdget(fd), takes reference to file
> > B: starts syscall close(fd)
> > B: fd table entry is removed
> > B: f_op->flush is invoked and finds no pending transactions
> > B: syscall close() returns
> > A: continues io_uring_enter(), grabbing current->files
> > A: io_uring_enter() returns
> > A and B: exit
> > worker: use-after-free access to files_struct
> >
> > I think the solution to this would be (unless you're fine with adding
> > some broad global read-write mutex) something like this in
> > __io_queue_sqe(), where "fd" and "f" are the variables from
> > io_uring_enter(), plumbed through the stack somehow:
> >
> > if (req->flags & REQ_F_NEED_FILES) {
> >    rcu_read_lock();
> >    spin_lock_irq(&ctx->inflight_lock);
> >    if (fcheck(fd) == f) {
> >      list_add(&req->inflight_list,
> >        &ctx->inflight_list);
> >      req->work.files = current->files;
> >      ret = 0;
> >    } else {
> >      ret = -EBADF;
> >    }
> >    spin_unlock_irq(&ctx->inflight_lock);
> >    rcu_read_unlock();
> >    if (ret)
> >      goto put_req;
> > }
>
> First of all, thanks for the thorough look at this! We already have f
> available here, it's req->file. And we just made a copy of the sqe, so
> we have sqe->fd available as well. I fixed this up.

sqe->fd is the file descriptor we're doing I/O on, not the file
descriptor of the uring file, right? Same thing for req->file. This
check only detects whether the fd we're doing I/O on was closed, which
is irrelevant.

> > Security + Correctness: If there is more than one io_wqe, it seems to
> > me that io_uring_flush() calls io_wq_cancel_work(), which calls
> > io_wqe_cancel_work(), which may return IO_WQ_CANCEL_OK if the first
> > request it looks at is pending. In that case, io_wq_cancel_work() will
> > immediately return, and io_uring_flush() will also immediately return.
> > It looks like any other requests will continue running?
>
> Ah good point, I missed that. We need to keep looping until we get
> NOTFOUND returned. Fixed as well.
>
> Also added cancellation if the task is going away. Here's the
> incremental patch, I'll resend with the full version.
[...]
> +static int io_uring_flush(struct file *file, void *data)
> +{
> +       struct io_ring_ctx *ctx = file->private_data;
> +
> +       if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
> +               io_wq_cancel_all(ctx->io_wq);

Looking at io_wq_cancel_all(), this will just send a signal to the
task without waiting for anything, right? Isn't that unsafe?

> +       else
> +               io_uring_cancel_files(ctx, data);
>         return 0;
>  }