All of lore.kernel.org
 help / color / mirror / Atom feed
* [GIT PULL] Support for the io_uring IO interface
@ 2019-03-06 16:13 Jens Axboe
  2019-03-06 20:05 ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2019-03-06 16:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-block, linux-aio

Hi Linus,

2nd attempt at adding the io_uring interface. Since the first one,
we've added basic unit testing of the three system calls, that
resides in liburing like the other unit tests that we have so far.
It'll take a while to get full coverage of it, but we're working
towards it. I've also added two basic test programs to tools/io_uring.
One uses the raw interface and has support for all the various
features that io_uring supports outside of standard IO, like fixed
files, fixed IO buffers, and polled IO. The other uses the liburing
API, and is a simplified version of cp(1).

This pull request adds support for a new IO interface, io_uring.
io_uring allows an application to communicate with the kernel through
two rings, the submission queue (SQ) and completion queue (CQ) ring.
This allows for very efficient handling of IOs, see the v5 posting for
some basic numbers:

https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/

Outside of just efficiency, the interface is also flexible and
extendable, and allows for future use cases like the upcoming NVMe
key-value store API, networked IO, and so on. It also supports async
buffered IO, something that we've always failed to support in the
kernel.

Outside of basic IO features, it supports async polled IO as well. This
particular feature has already been tested at Facebook months ago for
flash storage boxes, with 25-33% improvements. It makes polled IO
actually useful for real world use cases, where even basic flash sees a
nice win in terms of efficiency, latency, and performance. These boxes
were IOPS bound before, now they are not.

This series adds three new system calls. One for setting up an io_uring
instance (io_uring_setup(2)), one for submitting/completing IO
(io_uring_enter(2)), and one for aux functions like registrating file
sets, buffers, etc (io_uring_register(2)). Through the help of Arnd,
I've coordinated the syscall numbers so merge on that front should be
painless.

Jon did a writeup of the interface a while back, which (except for minor
details that have been tweaked) is still accurate. Find that here:

https://lwn.net/Articles/776703/

Huge thanks to Al Viro for helping getting the reference cycle code
correct, and to Jann Horn for his extensive reviews focused on both
security and bugs in general.

There's a userspace library that provides basic functionality for
applications that don't need or want to care about how to fiddle with
the rings directly. It has helpers to allow applications to easily set
up an io_uring instance, and submit/complete IO through it without
knowing about the intricacies of the rings. It also includes man pages
(thanks to Jeff Moyer), and will continue to grow support helper
functions and features as time progresses. Find it here:

git://git.kernel.dk/liburing

Fio has full support for the raw interface, both in the form of an IO
engine (io_uring), but also with a small test application (t/io_uring)
that can exercise and benchmark the interface.

Note that this branch sits on top of my for-5.1/block branch, since the
multi-page bvec changes caused a few conflicts with the pre-mapped
buffer support. I also moved a few prep patches to that branch today,
which is why it appears recently rebased (moved the 4 bottom patches
from io_uring to for-5.1/block).

Please consider this feature for 5.1, so we can finally have something
that's both fast, efficient, and feature rich for IO instead of the sad
niche case that is aio/libaio.


  git://git.kernel.dk/linux-block.git tags/io_uring-2019-03-06


----------------------------------------------------------------
Christoph Hellwig (1):
      io_uring: add fsync support

Jens Axboe (14):
      Add io_uring IO interface
      io_uring: support for IO polling
      fs: add fget_many() and fput_many()
      io_uring: use fget/fput_many() for file references
      io_uring: batch io_kiocb allocation
      block: implement bio helper to add iter bvec pages to bio
      io_uring: add support for pre-mapped user IO buffers
      net: split out functions related to registering inflight socket files
      io_uring: add file set registration
      io_uring: add submission polling
      io_uring: add io_kiocb ref count
      io_uring: add support for IORING_OP_POLL
      io_uring: allow workqueue item to handle multiple buffered requests
      io_uring: add a few test tools

 arch/x86/entry/syscalls/syscall_32.tbl |    3 +
 arch/x86/entry/syscalls/syscall_64.tbl |    3 +
 block/bio.c                            |   62 +-
 fs/Makefile                            |    1 +
 fs/aio.c                               |    6 +-
 fs/file.c                              |   15 +-
 fs/file_table.c                        |    9 +-
 fs/io_uring.c                          | 2969 ++++++++++++++++++++++++++++++++
 include/linux/file.h                   |    2 +
 include/linux/fs.h                     |   13 +-
 include/linux/sched/user.h             |    2 +-
 include/linux/syscalls.h               |    8 +
 include/net/af_unix.h                  |    1 +
 include/uapi/asm-generic/unistd.h      |    8 +-
 include/uapi/linux/io_uring.h          |  137 ++
 init/Kconfig                           |    9 +
 kernel/sys_ni.c                        |    3 +
 net/Makefile                           |    2 +-
 net/unix/Kconfig                       |    5 +
 net/unix/Makefile                      |    2 +
 net/unix/af_unix.c                     |   63 +-
 net/unix/garbage.c                     |   68 +-
 net/unix/scm.c                         |  151 ++
 net/unix/scm.h                         |   10 +
 tools/io_uring/Makefile                |   18 +
 tools/io_uring/README                  |   29 +
 tools/io_uring/barrier.h               |   16 +
 tools/io_uring/io_uring-bench.c        |  616 +++++++
 tools/io_uring/io_uring-cp.c           |  251 +++
 tools/io_uring/liburing.h              |  143 ++
 tools/io_uring/queue.c                 |  164 ++
 tools/io_uring/setup.c                 |  103 ++
 tools/io_uring/syscall.c               |   40 +
 33 files changed, 4784 insertions(+), 148 deletions(-)
 create mode 100644 fs/io_uring.c
 create mode 100644 include/uapi/linux/io_uring.h
 create mode 100644 net/unix/scm.c
 create mode 100644 net/unix/scm.h
 create mode 100644 tools/io_uring/Makefile
 create mode 100644 tools/io_uring/README
 create mode 100644 tools/io_uring/barrier.h
 create mode 100644 tools/io_uring/io_uring-bench.c
 create mode 100644 tools/io_uring/io_uring-cp.c
 create mode 100644 tools/io_uring/liburing.h
 create mode 100644 tools/io_uring/queue.c
 create mode 100644 tools/io_uring/setup.c
 create mode 100644 tools/io_uring/syscall.c

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Support for the io_uring IO interface
  2019-03-06 16:13 [GIT PULL] Support for the io_uring IO interface Jens Axboe
@ 2019-03-06 20:05 ` Jens Axboe
  2019-03-08 22:55   ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2019-03-06 20:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-block, linux-aio

On 3/6/19 9:13 AM, Jens Axboe wrote:
> Hi Linus,
> 
> 2nd attempt at adding the io_uring interface. Since the first one,
> we've added basic unit testing of the three system calls, that
> resides in liburing like the other unit tests that we have so far.
> It'll take a while to get full coverage of it, but we're working
> towards it. I've also added two basic test programs to tools/io_uring.
> One uses the raw interface and has support for all the various
> features that io_uring supports outside of standard IO, like fixed
> files, fixed IO buffers, and polled IO. The other uses the liburing
> API, and is a simplified version of cp(1).
> 
> This pull request adds support for a new IO interface, io_uring.
> io_uring allows an application to communicate with the kernel through
> two rings, the submission queue (SQ) and completion queue (CQ) ring.
> This allows for very efficient handling of IOs, see the v5 posting for
> some basic numbers:
> 
> https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/
> 
> Outside of just efficiency, the interface is also flexible and
> extendable, and allows for future use cases like the upcoming NVMe
> key-value store API, networked IO, and so on. It also supports async
> buffered IO, something that we've always failed to support in the
> kernel.
> 
> Outside of basic IO features, it supports async polled IO as well. This
> particular feature has already been tested at Facebook months ago for
> flash storage boxes, with 25-33% improvements. It makes polled IO
> actually useful for real world use cases, where even basic flash sees a
> nice win in terms of efficiency, latency, and performance. These boxes
> were IOPS bound before, now they are not.
> 
> This series adds three new system calls. One for setting up an io_uring
> instance (io_uring_setup(2)), one for submitting/completing IO
> (io_uring_enter(2)), and one for aux functions like registrating file
> sets, buffers, etc (io_uring_register(2)). Through the help of Arnd,
> I've coordinated the syscall numbers so merge on that front should be
> painless.
> 
> Jon did a writeup of the interface a while back, which (except for minor
> details that have been tweaked) is still accurate. Find that here:
> 
> https://lwn.net/Articles/776703/
> 
> Huge thanks to Al Viro for helping getting the reference cycle code
> correct, and to Jann Horn for his extensive reviews focused on both
> security and bugs in general.
> 
> There's a userspace library that provides basic functionality for
> applications that don't need or want to care about how to fiddle with
> the rings directly. It has helpers to allow applications to easily set
> up an io_uring instance, and submit/complete IO through it without
> knowing about the intricacies of the rings. It also includes man pages
> (thanks to Jeff Moyer), and will continue to grow support helper
> functions and features as time progresses. Find it here:
> 
> git://git.kernel.dk/liburing
> 
> Fio has full support for the raw interface, both in the form of an IO
> engine (io_uring), but also with a small test application (t/io_uring)
> that can exercise and benchmark the interface.
> 
> Note that this branch sits on top of my for-5.1/block branch, since the
> multi-page bvec changes caused a few conflicts with the pre-mapped
> buffer support. I also moved a few prep patches to that branch today,
> which is why it appears recently rebased (moved the 4 bottom patches
> from io_uring to for-5.1/block).
> 
> Please consider this feature for 5.1, so we can finally have something
> that's both fast, efficient, and feature rich for IO instead of the sad
> niche case that is aio/libaio.
> 
> 
>   git://git.kernel.dk/linux-block.git tags/io_uring-2019-03-06

Slight mess up in the stats, here's the correct one... Note that this
also throws a few more merge conflicts now, due to the syscall merges.
All trivial, though, and the branch was prepared for it in terms of
numbering.

----------------------------------------------------------------
Christoph Hellwig (1):
      io_uring: add fsync support

Jens Axboe (14):
      Add io_uring IO interface
      io_uring: support for IO polling
      fs: add fget_many() and fput_many()
      io_uring: use fget/fput_many() for file references
      io_uring: batch io_kiocb allocation
      block: implement bio helper to add iter bvec pages to bio
      io_uring: add support for pre-mapped user IO buffers
      net: split out functions related to registering inflight socket files
      io_uring: add file set registration
      io_uring: add submission polling
      io_uring: add io_kiocb ref count
      io_uring: add support for IORING_OP_POLL
      io_uring: allow workqueue item to handle multiple buffered requests
      io_uring: add a few test tools

 arch/x86/entry/syscalls/syscall_32.tbl |    3 +
 arch/x86/entry/syscalls/syscall_64.tbl |    3 +
 block/bio.c                            |   62 +-
 fs/Makefile                            |    1 +
 fs/file.c                              |   15 +-
 fs/file_table.c                        |    9 +-
 fs/io_uring.c                          | 2971 ++++++++++++++++++++++++++++++++
 include/linux/file.h                   |    2 +
 include/linux/fs.h                     |   13 +-
 include/linux/sched/user.h             |    2 +-
 include/linux/syscalls.h               |    8 +
 include/net/af_unix.h                  |    1 +
 include/uapi/asm-generic/unistd.h      |    8 +-
 include/uapi/linux/io_uring.h          |  137 ++
 init/Kconfig                           |    9 +
 kernel/sys_ni.c                        |    3 +
 net/Makefile                           |    2 +-
 net/unix/Kconfig                       |    5 +
 net/unix/Makefile                      |    2 +
 net/unix/af_unix.c                     |   63 +-
 net/unix/garbage.c                     |   68 +-
 net/unix/scm.c                         |  151 ++
 net/unix/scm.h                         |   10 +
 tools/io_uring/Makefile                |   18 +
 tools/io_uring/README                  |   29 +
 tools/io_uring/barrier.h               |   16 +
 tools/io_uring/io_uring-bench.c        |  616 +++++++
 tools/io_uring/io_uring-cp.c           |  251 +++
 tools/io_uring/liburing.h              |  143 ++
 tools/io_uring/queue.c                 |  164 ++
 tools/io_uring/setup.c                 |  103 ++
 tools/io_uring/syscall.c               |   40 +
 32 files changed, 4782 insertions(+), 146 deletions(-)
 create mode 100644 fs/io_uring.c
 create mode 100644 include/uapi/linux/io_uring.h
 create mode 100644 net/unix/scm.c
 create mode 100644 net/unix/scm.h
 create mode 100644 tools/io_uring/Makefile
 create mode 100644 tools/io_uring/README
 create mode 100644 tools/io_uring/barrier.h
 create mode 100644 tools/io_uring/io_uring-bench.c
 create mode 100644 tools/io_uring/io_uring-cp.c
 create mode 100644 tools/io_uring/liburing.h
 create mode 100644 tools/io_uring/queue.c
 create mode 100644 tools/io_uring/setup.c
 create mode 100644 tools/io_uring/syscall.c

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Support for the io_uring IO interface
  2019-03-06 20:05 ` Jens Axboe
@ 2019-03-08 22:55   ` Linus Torvalds
  2019-03-08 23:36     ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2019-03-08 22:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-aio

On Wed, Mar 6, 2019 at 12:05 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Slight mess up in the stats, here's the correct one... Note that this
> also throws a few more merge conflicts now, due to the syscall merges.
> All trivial, though, and the branch was prepared for it in terms of
> numbering.

Ok, merged in my tree, but not pushed out yet.

I'm going to run the usual build tests, but also look at the basic
sanity tests and boot and run them just to be careful before actually
doing that final "ok pushed out".

                  Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Support for the io_uring IO interface
  2019-03-08 22:55   ` Linus Torvalds
@ 2019-03-08 23:36     ` Linus Torvalds
  2019-03-08 23:52       ` Linus Torvalds
  2019-03-09  3:03       ` Al Viro
  0 siblings, 2 replies; 15+ messages in thread
From: Linus Torvalds @ 2019-03-08 23:36 UTC (permalink / raw)
  To: Jens Axboe, Al Viro; +Cc: linux-block, linux-aio

On Fri, Mar 8, 2019 at 2:55 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'm going to run the usual build tests, but also look at the basic
> sanity tests and boot and run them just to be careful before actually
> doing that final "ok pushed out".

While waiting for that, I'm looking at the file pointer refs, because
that was completely buggered in fs/aio.c.

What protects somebody from:

 - io_uring_register(IORING_REGISTER_FILES);

 - start async IO

 - io_uring_register(IORING_UNREGISTER_FILES);

and now it had better synchronize everything.

It looks like it migth work due to

 (a) the mutex_lock(&ctx->uring_lock) around registration

 (b) the wait_for_completion(&ctx->ctx_done) in __io_uring_register
presumably waits for each outstanding request.

HOWEVER.

The io_kiocb reference counting seems to be the *exact* same bogus
reference counting that fs/aio.c had, with the magical "zero means it
was never initialized and counts as one" handling, which was buggy in
fs/aio.c too, and caused serious problems with races between request
creation (the "synchronous" part) and requests actually being finished
asynchronously (the "completion" part).

IOW, I can already tell that the reference counting looks suspicious with

  static void io_free_req(struct io_kiocb *req)
  {
        if (!refcount_read(&req->refs) || refcount_dec_and_test(&req->refs)) {

where that whole first "oh, a zero refcount is magic" handling looks
*very* suspicious. It basically says "some ops don't do references
properly at all".

This is *exactly* the same bogosity that fds/aio.c had, and it has
*exactly* the same pattern, with the "poll" code doing

        /* one for removal from waitqueue, one for this function */
        refcount_set(&req->refs, 2);

and then everybody else seems to initialize req->refs to zero at
allocation time, because they magically have no lifetime rules for the
req.

It was bogus garbage in fs/aio.c, and honestly, looking at how much of
the logic looks suspiciously very similar jhere, I suspect it's bogus
garbage here too.

In fact, all the setup code looks suspiciously similar in general. It
has the same broken "prep_rw" and "complete_rw" thing which was racy
with files opened with O_DIRECT "competing" their IO while still
holding i_rwsem on the inode, and then possibly things getting
dropped.

I'm getting the very strong signal that this code had all the logic
taken from fs/aio.c, bugs and all.

Al has patches to fix the aio.c cases. I very much suspect some very
similar patches are needed in fs/io_ring.c.

                        Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Support for the io_uring IO interface
  2019-03-08 23:36     ` Linus Torvalds
@ 2019-03-08 23:52       ` Linus Torvalds
  2019-03-09  0:08         ` Linus Torvalds
  2019-03-09  3:03       ` Al Viro
  1 sibling, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2019-03-08 23:52 UTC (permalink / raw)
  To: Jens Axboe, Al Viro; +Cc: linux-block, linux-aio

On Fri, Mar 8, 2019 at 3:36 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It was bogus garbage in fs/aio.c, and honestly, looking at how much of
> the logic looks suspiciously very similar here, I suspect it's bogus
> garbage here too.

Apart from that "this really looks suspicious" I actually like what
the refcounting code does. The code looks like it took refcounting
seriously, and the context refcounting and setup synchronization looks
like it should work with that whole registration logic etc.

But if the low-level request refcount is off, all bets are off. And I
do think it looks like the basic io_kiocb refcounting is wrong, and
that

        refcount_set(&req->refs, 0);

in io_get_req() really is a big big red flag for "Oh, I'm not doing my
refcount properly". It seems to be an optimization for "I only have a
single refcount, and I don't want to have that expensive atomic
decrement".

But that automatically means that the whole "req" pointer is now
really really dangerous, because if IO finishes early, it will be
free'd immediately, possibly even before the synchronous part of
submission is all done.

And, as mentioned, it can free the file (and thus the inode) before
the synchronous part is even done, which can cause really subtle
problems.

I think the fix is the same that fs/aio.c is getting: just always
initialize the request refcount to 2: one for the submission side, and
one for the completion side.

Initializing a refcount to 0 is basically always a bug. You can't
return a structure that has a zero refcount - by definition you'd
better have your own reference on it.

                   Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Support for the io_uring IO interface
  2019-03-08 23:52       ` Linus Torvalds
@ 2019-03-09  0:08         ` Linus Torvalds
  2019-03-09  4:25           ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2019-03-09  0:08 UTC (permalink / raw)
  To: Jens Axboe, Al Viro; +Cc: linux-block, linux-aio

Oh well. I'm obviously not entirely happy about it, but the refcount
issues look fixable.

So I pushed it out. Al, mind taking a look? Jens claimed in his pull
request that you've already looked at the refcounting, so I think the
only issue is whether it shares the bug with fs/aio.c that I think it
does, or whether it might be avoiding it because of doing things
slightly differently.

                    Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Support for the io_uring IO interface
  2019-03-08 23:36     ` Linus Torvalds
  2019-03-08 23:52       ` Linus Torvalds
@ 2019-03-09  3:03       ` Al Viro
  1 sibling, 0 replies; 15+ messages in thread
From: Al Viro @ 2019-03-09  3:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jens Axboe, linux-block, linux-aio

On Fri, Mar 08, 2019 at 03:36:46PM -0800, Linus Torvalds wrote:

> I'm getting the very strong signal that this code had all the logic
> taken from fs/aio.c, bugs and all.

Definitely.

> Al has patches to fix the aio.c cases. I very much suspect some very
> similar patches are needed in fs/io_ring.c.

Very likely so.  BTW, another lovely problem is that read from
a registered AF_UNIX socket really needs an extra reference to
struct file that would *NOT* be covered by ->inflight.  Otherwise
we are asking for serious trouble with garbage collector...

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Support for the io_uring IO interface
  2019-03-09  0:08         ` Linus Torvalds
@ 2019-03-09  4:25           ` Jens Axboe
  2019-03-09  4:33             ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2019-03-09  4:25 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro; +Cc: linux-block, linux-aio

On 3/8/19 5:08 PM, Linus Torvalds wrote:
> Oh well. I'm obviously not entirely happy about it, but the refcount
> issues look fixable.
> 
> So I pushed it out. Al, mind taking a look? Jens claimed in his pull
> request that you've already looked at the refcounting, so I think the
> only issue is whether it shares the bug with fs/aio.c that I think it
> does, or whether it might be avoiding it because of doing things
> slightly differently.

Sorry for the late replies, was away.

The refcount req business all rests on the poll command implementation,
which is basically a straight port of the aio version. So I guess not
surprising if it suffers from the same reference issue. I'll take a look
at the patches that went into aio and apply the same fix(es) to
io_uring.

I'll let Al expand on the AF_UNIX socket issue.

Thanks for getting this merged! I'll be on top of squashing the above
two cases asap.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Support for the io_uring IO interface
  2019-03-09  4:25           ` Jens Axboe
@ 2019-03-09  4:33             ` Linus Torvalds
  2019-03-09  4:39               ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2019-03-09  4:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Al Viro, linux-block, linux-aio

On Fri, Mar 8, 2019 at 8:25 PM Jens Axboe <axboe@kernel.dk> wrote:
>
>              I'll take a look
> at the patches that went into aio and apply the same fix(es) to
> io_uring.

Note that they are not upstream yet, there's a work-in-progress branch
in Al's tree at

   git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git work.aio

so you can see what the fixes are, but they are not in their final form yet.

              Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Support for the io_uring IO interface
  2019-03-09  4:33             ` Linus Torvalds
@ 2019-03-09  4:39               ` Jens Axboe
  2019-03-09  5:41                 ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2019-03-09  4:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-block, linux-aio

On 3/8/19 9:33 PM, Linus Torvalds wrote:
> On Fri, Mar 8, 2019 at 8:25 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>>              I'll take a look
>> at the patches that went into aio and apply the same fix(es) to
>> io_uring.
> 
> Note that they are not upstream yet, there's a work-in-progress branch
> in Al's tree at
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git work.aio
> 
> so you can see what the fixes are, but they are not in their final form yet.

That explains it, was looking at mainline and saw just the fget/fput
related parts so far.

If all else fails, it's also trivial to yank the poll command support for
now, which would kill the io_req refs. But I'll look at Al's pending
fixes first, thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Support for the io_uring IO interface
  2019-03-09  4:39               ` Jens Axboe
@ 2019-03-09  5:41                 ` Linus Torvalds
  2019-03-12 21:52                   ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2019-03-09  5:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Al Viro, linux-block, linux-aio

On Fri, Mar 8, 2019 at 8:39 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> If all else fails, it's also trivial to yank the poll command support for
> now, which would kill the io_req refs.

No, it really wouldn't.

The poll() code thinks it needs the refs, and the poll code is right -
but the refs are just badly done and there's a ref leak in a special
case and just generally ugly code.

The read/write code thinks it _doesn't_ need the refs, and the
read-write code is *WRONG*. It actually does need them.

So what the fs/aio.c and now fs/io_ring.c code *thinks* it can do is:

 - I'm submitting IO

 - I'm not touching the iocb after submission

 - so I'm done with the iocb, and all I need to do is to say that if
the IO submission succeeded, then the completion of the IO will free
the iocb.

That sounds simple, but it is *WRONG*.

Here's roughly what happens for the write code on the submission side:

  call_write_iter(file, req, &iter)
    file->f_op->write_iter(kio, iter);
      filesystem_file_write_iter()
        inode_lock(inode);
        __generic_file_write_iter(iocb, from);
        inode_unlock(inode);
        if (ret > 0)
                ret = generic_write_sync(iocb, ret);

and guess what? The IO completion might for example happen *while*
__generic_file_write_iter() is running, which might be calling
generic_file_direct_write(), and the IO might just effectively finish
immediately.

What happens then? The iocb is released as part of IO completion, and
now that whole *submission* side that is still accessing the inode,
still accessing the file, and still even accessing the "iocb" (for
that "write_sync" case) is all touching something that may not exist
any more, because it's all been released!

See what's going on?

The thing is, the *submission* path absolutely has to keep a reference
to the iocb until the submission is fully and entirely done. So read
and write need to have that ref on the iocb too, and one ref is given
to the actual IO part, while another ref is held by the submission
side.

So it's really not "poll does ref handling wrong, nobody else needs it".

No, poll at least _tries_ to do ref handling, and admittedly gets it
slightly wrong, but at least it gets a B for _effort_.

The read/write paths don't even try. They get an F. They think they
don't need a ref at all, and they are very much mistaken.

So this is why the io_get_req() function simply should never return
that "ref 0" io_kiocb. It should always be initialized with a
ref-count of 2: one for the submission side (which should do a final
io_put() after it has finished submission, so that the iocb and the
file and the inode are guaranteed to stay around for the _whole_ IO
callchain), and one for the actual IO side (which should do it as part
of io_complete).

This is why Al's branch has that whole series that starts off with
that "pin iocb through aio" commit that does that

-   refcount_set(&req->ki_refcnt, 0);
+   refcount_set(&req->ki_refcnt, 2);

in aio_get_req(), and then works at keeping the iocb around until it's
been fully submitted. And then Al fixes all the small details that
I've glossed over above because aio_poll() had various ugly cases, and
cleans things up a bit.

                  Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Support for the io_uring IO interface
  2019-03-09  5:41                 ` Linus Torvalds
@ 2019-03-12 21:52                   ` Jens Axboe
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2019-03-12 21:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-block, linux-aio

On 3/8/19 10:41 PM, Linus Torvalds wrote:
> On Fri, Mar 8, 2019 at 8:39 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> If all else fails, it's also trivial to yank the poll command support for
>> now, which would kill the io_req refs.
> 
> No, it really wouldn't.
> 
> The poll() code thinks it needs the refs, and the poll code is right -
> but the refs are just badly done and there's a ref leak in a special
> case and just generally ugly code.
> 
> The read/write code thinks it _doesn't_ need the refs, and the
> read-write code is *WRONG*. It actually does need them.
> 
> So what the fs/aio.c and now fs/io_ring.c code *thinks* it can do is:
> 
>  - I'm submitting IO
> 
>  - I'm not touching the iocb after submission
> 
>  - so I'm done with the iocb, and all I need to do is to say that if
> the IO submission succeeded, then the completion of the IO will free
> the iocb.
> 
> That sounds simple, but it is *WRONG*.
> 
> Here's roughly what happens for the write code on the submission side:
> 
>   call_write_iter(file, req, &iter)
>     file->f_op->write_iter(kio, iter);
>       filesystem_file_write_iter()
>         inode_lock(inode);
>         __generic_file_write_iter(iocb, from);
>         inode_unlock(inode);
>         if (ret > 0)
>                 ret = generic_write_sync(iocb, ret);
> 
> and guess what? The IO completion might for example happen *while*
> __generic_file_write_iter() is running, which might be calling
> generic_file_direct_write(), and the IO might just effectively finish
> immediately.
> 
> What happens then? The iocb is released as part of IO completion, and
> now that whole *submission* side that is still accessing the inode,
> still accessing the file, and still even accessing the "iocb" (for
> that "write_sync" case) is all touching something that may not exist
> any more, because it's all been released!
> 
> See what's going on?
> 
> The thing is, the *submission* path absolutely has to keep a reference
> to the iocb until the submission is fully and entirely done. So read
> and write need to have that ref on the iocb too, and one ref is given
> to the actual IO part, while another ref is held by the submission
> side.
> 
> So it's really not "poll does ref handling wrong, nobody else needs it".
> 
> No, poll at least _tries_ to do ref handling, and admittedly gets it
> slightly wrong, but at least it gets a B for _effort_.
> 
> The read/write paths don't even try. They get an F. They think they
> don't need a ref at all, and they are very much mistaken.
> 
> So this is why the io_get_req() function simply should never return
> that "ref 0" io_kiocb. It should always be initialized with a
> ref-count of 2: one for the submission side (which should do a final
> io_put() after it has finished submission, so that the iocb and the
> file and the inode are guaranteed to stay around for the _whole_ IO
> callchain), and one for the actual IO side (which should do it as part
> of io_complete).
> 
> This is why Al's branch has that whole series that starts off with
> that "pin iocb through aio" commit that does that
> 
> -   refcount_set(&req->ki_refcnt, 0);
> +   refcount_set(&req->ki_refcnt, 2);
> 
> in aio_get_req(), and then works at keeping the iocb around until it's
> been fully submitted. And then Al fixes all the small details that
> I've glossed over above because aio_poll() had various ugly cases, and
> cleans things up a bit.

The bit I was missing was sync completion combined with the iocb
being used in the not after submission, but during that part.

I've reworked the references to follow the above proposed scheme, one
for submission and one for completion.

I've also ported Al's race fix for the poll implementation. Will throw
it all into some testing.

Thanks Linus!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Support for the io_uring IO interface
  2019-03-04  2:34 ` Linus Torvalds
@ 2019-03-04  3:23   ` Jens Axboe
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2019-03-04  3:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-block, linux-aio

On 3/3/19 7:34 PM, Linus Torvalds wrote:
> On Sun, Mar 3, 2019 at 5:53 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> This pull request adds support for a new IO interface, io_uring.
>> io_uring allows an application to communicate with the kernel through
>> two rings, the submission queue (SQ) and completion queue (CQ) ring.
>> This allows for very efficient handling of IOs, see the v5 posting for
>> some basic numbers:
> 
> Having just looked at the bugs in fs/aio.c and not finding a lot of
> users of *that*, and absolutely zero actual test-cases, I'm actually
> less than thrilled to see a new interface with the same behavior.
> 
> So I would want to see some test programs that actually use this.
> Perhaps as part of tools/testing/?

I think there's a few things to take into account here. Nothing
(basically) uses fs/aio.c. Obviously that's true for io_uring as well
right now, but the hope is that that is NOT going to be true going
forward as it works for any type of IO, instead of being just O_DIRECT.

That said, there are test programs. As mentioned in the pull, there's a
fio IO engine that uses it. In the fio repo, there's also a standalone
test program, t/io_uring.c.

Outside of that, the liburing repo mentioned does have some actual test
cases, in the test/ directory. Some are unit tests, some are just random
test cases.

So it's not like there are NO test cases. Not a huge amount, but most of
them have been organically grown as bugs have shown up.

Now that you know there are some, how do you want to proceed? I can
write some more unit like tests and migrate the ones I do have to
tools/testing/?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] Support for the io_uring IO interface
  2019-03-04  1:53 Jens Axboe
@ 2019-03-04  2:34 ` Linus Torvalds
  2019-03-04  3:23   ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2019-03-04  2:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-aio

On Sun, Mar 3, 2019 at 5:53 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> This pull request adds support for a new IO interface, io_uring.
> io_uring allows an application to communicate with the kernel through
> two rings, the submission queue (SQ) and completion queue (CQ) ring.
> This allows for very efficient handling of IOs, see the v5 posting for
> some basic numbers:

Having just looked at the bugs in fs/aio.c and not finding a lot of
users of *that*, and absolutely zero actual test-cases, I'm actually
less than thrilled to see a new interface with the same behavior.

So I would want to see some test programs that actually use this.
Perhaps as part of tools/testing/?

                     Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [GIT PULL] Support for the io_uring IO interface
@ 2019-03-04  1:53 Jens Axboe
  2019-03-04  2:34 ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2019-03-04  1:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-block, linux-aio

Hi Linus,

This pull request adds support for a new IO interface, io_uring.
io_uring allows an application to communicate with the kernel through
two rings, the submission queue (SQ) and completion queue (CQ) ring.
This allows for very efficient handling of IOs, see the v5 posting for
some basic numbers:

https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/

Outside of just efficiency, the interface is also flexible and
extendable, and allows for future use cases like the upcoming NVMe
key-value store API, networked IO, and so on. It also supports async
buffered IO, something that we've always failed to support in the
kernel.

Outside of basic IO features, it supports async polled IO as well. This
particular feature has already been tested at Facebook months ago for
flash storage boxes, with 25-33% improvements. It makes polled IO
actually useful for real world use cases, where even basic flash sees a
nice win in terms of efficiency, latency, and performance. These boxes
were IOPS bound before, now they are not.

This series adds three new system calls. One for setting up an io_uring
instance (io_uring_setup(2)), one for submitting/completing IO
(io_uring_enter(2)), and one for aux functions like registrating file
sets, buffers, etc (io_uring_register(2)). Through the help of Arnd,
I've coordinated the syscall numbers so merge on that front should be
painless.

Jon did a writeup of the interface a while back, which (except for minor
details that have been tweaked) is still accurate. Find that here:

https://lwn.net/Articles/776703/

Huge thanks to Al Viro for helping getting the reference cycle code
correct, and to Jann Horn for his extensive reviews focused on both
security and bugs in general.

There's a userspace library that provides basic functionality for
applications that don't need or want to care about how to fiddle with
the rings directly. It has helpers to allow applications to easily set
up an io_uring instance, and submit/complete IO through it without
knowing about the intricacies of the rings. It also includes man pages
(thanks to Jeff Moyer), and will continue to grow support helper
functions and features as time progresses. Find it here:

git://git.kernel.dk/liburing

Fio has full support for the raw interface, both in the form of an IO
engine (io_uring), but also with a small test application (t/io_uring)
that can exercise and benchmark the interface.

Note that this branch sits on top of my for-5.1/block branch, since the
multi-page bvec changes caused a few conflicts with the pre-mapped
buffer support. I also moved a few prep patches to that branch today,
which is why it appears recently rebased (moved the 4 bottom patches
from io_uring to for-5.1/block).

Please consider this feature for 5.1, so we can finally have something
that's both fast, efficient, and feature rich for IO instead of the sad
niche case that is aio/libaio.


  git://git.kernel.dk/linux-block.git tags/io_uring-20190301


----------------------------------------------------------------
Christoph Hellwig (1):
      io_uring: add fsync support

Jens Axboe (13):
      Add io_uring IO interface
      io_uring: support for IO polling
      fs: add fget_many() and fput_many()
      io_uring: use fget/fput_many() for file references
      io_uring: batch io_kiocb allocation
      block: implement bio helper to add iter bvec pages to bio
      io_uring: add support for pre-mapped user IO buffers
      net: split out functions related to registering inflight socket files
      io_uring: add file set registration
      io_uring: add submission polling
      io_uring: add io_kiocb ref count
      io_uring: add support for IORING_OP_POLL
      io_uring: allow workqueue item to handle multiple buffered requests

 arch/x86/entry/syscalls/syscall_32.tbl |    3 +
 arch/x86/entry/syscalls/syscall_64.tbl |    3 +
 block/bio.c                            |   62 +-
 fs/Makefile                            |    1 +
 fs/file.c                              |   15 +-
 fs/file_table.c                        |    9 +-
 fs/io_uring.c                          | 2969 ++++++++++++++++++++++++++++++++
 include/linux/file.h                   |    2 +
 include/linux/fs.h                     |   13 +-
 include/linux/sched/user.h             |    2 +-
 include/linux/syscalls.h               |    8 +
 include/net/af_unix.h                  |    1 +
 include/uapi/asm-generic/unistd.h      |    8 +-
 include/uapi/linux/io_uring.h          |  137 ++
 init/Kconfig                           |    9 +
 kernel/sys_ni.c                        |    3 +
 net/Makefile                           |    2 +-
 net/unix/Kconfig                       |    5 +
 net/unix/Makefile                      |    2 +
 net/unix/af_unix.c                     |   63 +-
 net/unix/garbage.c                     |   68 +-
 net/unix/scm.c                         |  151 ++
 net/unix/scm.h                         |   10 +
 23 files changed, 3400 insertions(+), 146 deletions(-)
 create mode 100644 fs/io_uring.c
 create mode 100644 include/uapi/linux/io_uring.h
 create mode 100644 net/unix/scm.c
 create mode 100644 net/unix/scm.h

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-03-12 21:52 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-06 16:13 [GIT PULL] Support for the io_uring IO interface Jens Axboe
2019-03-06 20:05 ` Jens Axboe
2019-03-08 22:55   ` Linus Torvalds
2019-03-08 23:36     ` Linus Torvalds
2019-03-08 23:52       ` Linus Torvalds
2019-03-09  0:08         ` Linus Torvalds
2019-03-09  4:25           ` Jens Axboe
2019-03-09  4:33             ` Linus Torvalds
2019-03-09  4:39               ` Jens Axboe
2019-03-09  5:41                 ` Linus Torvalds
2019-03-12 21:52                   ` Jens Axboe
2019-03-09  3:03       ` Al Viro
  -- strict thread matches above, loose matches on Subject: below --
2019-03-04  1:53 Jens Axboe
2019-03-04  2:34 ` Linus Torvalds
2019-03-04  3:23   ` Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.