All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: Al Viro <viro@zeniv.linux.org.uk>, Jann Horn <jannh@google.com>
Cc: linux-aio@kvack.org, linux-block@vger.kernel.org,
	Linux API <linux-api@vger.kernel.org>,
	hch@lst.de, jmoyer@redhat.com, avi@scylladb.com,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 13/18] io_uring: add file set registration
Date: Mon, 4 Feb 2019 19:19:34 -0700	[thread overview]
Message-ID: <785c6db4-095e-65b0-ded5-72b41af5174e@kernel.dk> (raw)
In-Reply-To: <20190204025612.GR2217@ZenIV.linux.org.uk>

On 2/3/19 7:56 PM, Al Viro wrote:
> On Wed, Jan 30, 2019 at 02:29:05AM +0100, Jann Horn wrote:
>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> We normally have to fget/fput for each IO we do on a file. Even with
>>> the batching we do, the cost of the atomic inc/dec of the file usage
>>> count adds up.
>>>
>>> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
>>> for the io_uring_register(2) system call. The arguments passed in must
>>> be an array of __s32 holding file descriptors, and nr_args should hold
>>> the number of file descriptors the application wishes to pin for the
>>> duration of the io_uring context (or until IORING_UNREGISTER_FILES is
>>> called).
>>>
>>> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
>>> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
>>> to the index in the array passed in to IORING_REGISTER_FILES.
>>>
>>> Files are automatically unregistered when the io_uring context is
>>> torn down. An application need only unregister if it wishes to
>>> register a new set of fds.
>>
>> Crazy idea:
>>
>> Taking a step back, at a high level, basically this patch creates sort
>> of the same difference that you get when you compare the following
>> scenarios for normal multithreaded I/O in userspace:
> 
>> This kinda makes me wonder whether this is really something that
>> should be implemented specifically for the io_uring API, or whether it
>> would make sense to somehow handle part of this in the generic VFS
>> code and give the user the ability to prepare a new files_struct that
>> can then be transferred to the worker thread, or something like
>> that... I'm not sure whether there's a particularly clean way to do
>> that though.
> 
> Using files_struct for that opens a can of worms you really don't
> want to touch.
> 
> Consider the following scenario with any variant of this interface:
> 	* create io_uring fd.
> 	* send an SCM_RIGHTS with that fd to AF_UNIX socket.
> 	* add the descriptor of that AF_UNIX socket to your fd
> 	* close AF_UNIX fd, close io_uring fd.
> Voila - you've got a shiny leak.  No ->release() is called for
> anyone (and you really don't want to do that on ->flush(), because
> otherwise a library helper doing e.g. system("/bin/date") will tear
> down all the io_uring in your process).  The socket is held by
> the reference you've stashed into io_uring (whichever way you do
> that).  io_uring is held by the reference you've stashed into
> SCM_RIGHTS datagram in queue of the socket.
> 
> No matter what, you need net/unix/garbage.c to be aware of that stuff.
> And getting files_struct lifetime mixed into that would be beyond
> any reason.
> 
> The only reason for doing that as a descriptor table would be
> avoiding the cost of fget() in whatever uses it, right?  Since

Right, the only purpose of this patch is to avoid doing fget/fput for
each IO.

> those are *not* the normal syscalls (and fdget() really should not
> be used anywhere other than the very top of syscall's call chain -
> that's another reason why tossing file_struct around like that
> is insane) and since the benefit is all due to the fact that it's
> *NOT* shared, *NOT* modified in parallel, etc., allowing us to
> treat file references as stable... why the hell use the descriptor
> tables at all?

This one is not a regular system call, since we don't do fget, then IO,
then fput. We hang on to it. But for the non-registered case, it's very
much just like a regular read/write system call, where we fget to do IO
on it, then fput when we are done.

> All you need is an array of struct file *, explicitly populated.
> With net/unix/garbage.c aware of such beasts.  Guess what?  We
> do have such an object already.  The one net/unix/garbage.c is
> working with.  SCM_RIGHTS datagrams, that is.
> 
> IOW, can't we give those io_uring descriptors associated struct
> unix_sock?  No socket descriptors, no struct socket (probably),
> just the AF_UNIX-specific part thereof.  Then teach
> unix_inflight()/unix_notinflight() about getting unix_sock out
> of these guys (incidentally, both would seem to benefit from
> _not_ touching unix_gc_lock in case when there's no unix_sock
> attached to file we are dealing with - I might be missing
> something very subtle about barriers there, but it doesn't
> look likely).

That might be workable, though I'm not sure we currently have helpers to
just explicitly create a unix_sock by itself. Not familiar with the
networking bits at all, I'll take a look.

> And make that (i.e. registering the descriptors) mandatory.

I don't want to make it mandatory, that's very inflexible for managing
tons of files. The registration is useful for specific cases where we
have high frequency of operations on a set of files. Besides, it'd make
the use of the API cumbersome as well for the basic case of just wanting
to do async IO.

> Hell, combine that with creating io_uring fd, if we really
> care about the syscall count.  Benefits:

We don't care about syscall count for setup as much. If you're doing
registration of a file set, you're expected to do a LOT of IO to those
files. Hence having an extra one for setup is not a concern. My concern
is just making it mandatory to do registration, I don't think that's a
workable alternative.

> 	* no file_struct refcount wanking
> 	* no fget()/fput() (conditional, at that) from kernel
> threads
> 	* no CLOEXEC-dependent anything; just the teardown
> on the final fput(), whichever way it comes.
> 	* no fun with duelling garbage collectors.

The fget/fput from a kernel thread can be solved by just hanging on to
the struct file * when we punt the IO. Right now we don't, which is a
little silly, that should be changed.

Getting rid of the files_struct{} is doable.

-- 
Jens Axboe


WARNING: multiple messages have this Message-ID (diff)
From: Jens Axboe <axboe@kernel.dk>
To: Al Viro <viro@zeniv.linux.org.uk>, Jann Horn <jannh@google.com>
Cc: linux-aio@kvack.org, linux-block@vger.kernel.org,
	Linux API <linux-api@vger.kernel.org>,
	hch@lst.de, jmoyer@redhat.com, avi@scylladb.com,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 13/18] io_uring: add file set registration
Date: Mon, 4 Feb 2019 19:19:34 -0700	[thread overview]
Message-ID: <785c6db4-095e-65b0-ded5-72b41af5174e@kernel.dk> (raw)
In-Reply-To: <20190204025612.GR2217@ZenIV.linux.org.uk>

On 2/3/19 7:56 PM, Al Viro wrote:
> On Wed, Jan 30, 2019 at 02:29:05AM +0100, Jann Horn wrote:
>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> We normally have to fget/fput for each IO we do on a file. Even with
>>> the batching we do, the cost of the atomic inc/dec of the file usage
>>> count adds up.
>>>
>>> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
>>> for the io_uring_register(2) system call. The arguments passed in must
>>> be an array of __s32 holding file descriptors, and nr_args should hold
>>> the number of file descriptors the application wishes to pin for the
>>> duration of the io_uring context (or until IORING_UNREGISTER_FILES is
>>> called).
>>>
>>> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
>>> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
>>> to the index in the array passed in to IORING_REGISTER_FILES.
>>>
>>> Files are automatically unregistered when the io_uring context is
>>> torn down. An application need only unregister if it wishes to
>>> register a new set of fds.
>>
>> Crazy idea:
>>
>> Taking a step back, at a high level, basically this patch creates sort
>> of the same difference that you get when you compare the following
>> scenarios for normal multithreaded I/O in userspace:
> 
>> This kinda makes me wonder whether this is really something that
>> should be implemented specifically for the io_uring API, or whether it
>> would make sense to somehow handle part of this in the generic VFS
>> code and give the user the ability to prepare a new files_struct that
>> can then be transferred to the worker thread, or something like
>> that... I'm not sure whether there's a particularly clean way to do
>> that though.
> 
> Using files_struct for that opens a can of worms you really don't
> want to touch.
> 
> Consider the following scenario with any variant of this interface:
> 	* create io_uring fd.
> 	* send an SCM_RIGHTS with that fd to AF_UNIX socket.
> 	* add the descriptor of that AF_UNIX socket to your fd
> 	* close AF_UNIX fd, close io_uring fd.
> Voila - you've got a shiny leak.  No ->release() is called for
> anyone (and you really don't want to do that on ->flush(), because
> otherwise a library helper doing e.g. system("/bin/date") will tear
> down all the io_uring in your process).  The socket is held by
> the reference you've stashed into io_uring (whichever way you do
> that).  io_uring is held by the reference you've stashed into
> SCM_RIGHTS datagram in queue of the socket.
> 
> No matter what, you need net/unix/garbage.c to be aware of that stuff.
> And getting files_struct lifetime mixed into that would be beyond
> any reason.
> 
> The only reason for doing that as a descriptor table would be
> avoiding the cost of fget() in whatever uses it, right?  Since

Right, the only purpose of this patch is to avoid doing fget/fput for
each IO.

> those are *not* the normal syscalls (and fdget() really should not
> be used anywhere other than the very top of syscall's call chain -
> that's another reason why tossing file_struct around like that
> is insane) and since the benefit is all due to the fact that it's
> *NOT* shared, *NOT* modified in parallel, etc., allowing us to
> treat file references as stable... why the hell use the descriptor
> tables at all?

This one is not a regular system call, since we don't do fget, then IO,
then fput. We hang on to it. But for the non-registered case, it's very
much just like a regular read/write system call, where we fget to do IO
on it, then fput when we are done.

> All you need is an array of struct file *, explicitly populated.
> With net/unix/garbage.c aware of such beasts.  Guess what?  We
> do have such an object already.  The one net/unix/garbage.c is
> working with.  SCM_RIGHTS datagrams, that is.
> 
> IOW, can't we give those io_uring descriptors associated struct
> unix_sock?  No socket descriptors, no struct socket (probably),
> just the AF_UNIX-specific part thereof.  Then teach
> unix_inflight()/unix_notinflight() about getting unix_sock out
> of these guys (incidentally, both would seem to benefit from
> _not_ touching unix_gc_lock in case when there's no unix_sock
> attached to file we are dealing with - I might be missing
> something very subtle about barriers there, but it doesn't
> look likely).

That might be workable, though I'm not sure we currently have helpers to
just explicitly create a unix_sock by itself. Not familiar with the
networking bits at all, I'll take a look.

> And make that (i.e. registering the descriptors) mandatory.

I don't want to make it mandatory, that's very inflexible for managing
tons of files. The registration is useful for specific cases where we
have high frequency of operations on a set of files. Besides, it'd make
the use of the API cumbersome as well for the basic case of just wanting
to do async IO.

> Hell, combine that with creating io_uring fd, if we really
> care about the syscall count.  Benefits:

We don't care about syscall count for setup as much. If you're doing
registration of a file set, you're expected to do a LOT of IO to those
files. Hence having an extra one for setup is not a concern. My concern
is just making it mandatory to do registration, I don't think that's a
workable alternative.

> 	* no file_struct refcount wanking
> 	* no fget()/fput() (conditional, at that) from kernel
> threads
> 	* no CLOEXEC-dependent anything; just the teardown
> on the final fput(), whichever way it comes.
> 	* no fun with duelling garbage collectors.

The fget/fput from a kernel thread can be solved by just hanging on to
the struct file * when we punt the IO. Right now we don't, which is a
little silly, that should be changed.

Getting rid of the files_struct{} is doable.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

  reply	other threads:[~2019-02-05  2:27 UTC|newest]

Thread overview: 158+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
2019-01-29 19:26 ` Jens Axboe
2019-01-29 19:26 ` [PATCH 01/18] fs: add an iopoll method to struct file_operations Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 19:26 ` [PATCH 02/18] block: wire up block device iopoll method Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 19:26 ` [PATCH 03/18] block: add bio_set_polled() helper Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 19:26 ` [PATCH 04/18] iomap: wire up the iopoll method Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 19:26 ` [PATCH 05/18] Add io_uring IO interface Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 19:26 ` [PATCH 06/18] io_uring: add fsync support Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 19:26 ` [PATCH 07/18] io_uring: support for IO polling Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 20:47   ` Jann Horn
2019-01-29 20:47     ` Jann Horn
2019-01-29 20:56     ` Jens Axboe
2019-01-29 20:56       ` Jens Axboe
2019-01-29 21:10       ` Jann Horn
2019-01-29 21:10         ` Jann Horn
2019-01-29 21:33         ` Jens Axboe
2019-01-29 21:33           ` Jens Axboe
2019-01-29 19:26 ` [PATCH 08/18] fs: add fget_many() and fput_many() Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 19:26 ` [PATCH 09/18] io_uring: use fget/fput_many() for file references Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 23:31   ` Jann Horn
2019-01-29 23:31     ` Jann Horn
2019-01-29 23:44     ` Jens Axboe
2019-01-29 23:44       ` Jens Axboe
2019-01-30 15:33       ` Jens Axboe
2019-01-30 15:33         ` Jens Axboe
2019-01-29 19:26 ` [PATCH 10/18] io_uring: batch io_kiocb allocation Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 19:26 ` [PATCH 11/18] block: implement bio helper to add iter bvec pages to bio Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 19:26 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 22:44   ` Jann Horn
2019-01-29 22:44     ` Jann Horn
2019-01-29 22:56     ` Jens Axboe
2019-01-29 22:56       ` Jens Axboe
2019-01-29 23:03       ` Jann Horn
2019-01-29 23:03         ` Jann Horn
2019-01-29 23:06         ` Jens Axboe
2019-01-29 23:06           ` Jens Axboe
2019-01-29 23:08           ` Jann Horn
2019-01-29 23:08             ` Jann Horn
2019-01-29 23:14             ` Jens Axboe
2019-01-29 23:14               ` Jens Axboe
2019-01-29 23:42               ` Jann Horn
2019-01-29 23:42                 ` Jann Horn
2019-01-29 23:51                 ` Jens Axboe
2019-01-29 23:51                   ` Jens Axboe
2019-01-29 19:26 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-30  1:29   ` Jann Horn
2019-01-30  1:29     ` Jann Horn
2019-01-30 15:35     ` Jens Axboe
2019-01-30 15:35       ` Jens Axboe
2019-02-04  2:56     ` Al Viro
2019-02-04  2:56       ` Al Viro
2019-02-05  2:19       ` Jens Axboe [this message]
2019-02-05  2:19         ` Jens Axboe
2019-02-05 17:57         ` Jens Axboe
2019-02-05 17:57           ` Jens Axboe
2019-02-05 19:08           ` Jens Axboe
2019-02-05 19:08             ` Jens Axboe
2019-02-06  0:27             ` Jens Axboe
2019-02-06  0:27               ` Jens Axboe
2019-02-06  1:01               ` Al Viro
2019-02-06  1:01                 ` Al Viro
2019-02-06 17:56                 ` Jens Axboe
2019-02-06 17:56                   ` Jens Axboe
2019-02-07  4:05                   ` Al Viro
2019-02-07  4:05                     ` Al Viro
2019-02-07 16:14                     ` Jens Axboe
2019-02-07 16:30                       ` Al Viro
2019-02-07 16:30                         ` Al Viro
2019-02-07 16:35                         ` Jens Axboe
2019-02-07 16:35                           ` Jens Axboe
2019-02-07 16:51                         ` Al Viro
2019-02-07 16:51                           ` Al Viro
2019-02-06  0:56             ` Al Viro
2019-02-06  0:56               ` Al Viro
2019-02-06 13:41               ` Jens Axboe
2019-02-06 13:41                 ` Jens Axboe
2019-02-07  4:00                 ` Al Viro
2019-02-07  4:00                   ` Al Viro
2019-02-07  9:22                   ` Miklos Szeredi
2019-02-07  9:22                     ` Miklos Szeredi
2019-02-07 13:31                     ` Al Viro
2019-02-07 13:31                       ` Al Viro
2019-02-07 14:20                       ` Miklos Szeredi
2019-02-07 14:20                         ` Miklos Szeredi
2019-02-07 15:20                         ` Al Viro
2019-02-07 15:20                           ` Al Viro
2019-02-07 15:27                           ` Miklos Szeredi
2019-02-07 15:27                             ` Miklos Szeredi
2019-02-07 16:26                             ` Al Viro
2019-02-07 16:26                               ` Al Viro
2019-02-07 19:08                               ` Miklos Szeredi
2019-02-07 19:08                                 ` Miklos Szeredi
2019-02-07 18:45                   ` Jens Axboe
2019-02-07 18:45                     ` Jens Axboe
2019-02-07 18:58                     ` Jens Axboe
2019-02-07 18:58                       ` Jens Axboe
2019-02-11 15:55                     ` Jonathan Corbet
2019-02-11 15:55                       ` Jonathan Corbet
2019-02-11 17:35                       ` Al Viro
2019-02-11 17:35                         ` Al Viro
2019-02-11 20:33                         ` Jonathan Corbet
2019-02-11 20:33                           ` Jonathan Corbet
2019-01-29 19:26 ` [PATCH 14/18] io_uring: add submission polling Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 19:26 ` [PATCH 15/18] io_uring: add io_kiocb ref count Jens Axboe
2019-01-29 19:26   ` Jens Axboe
2019-01-29 19:27 ` [PATCH 16/18] io_uring: add support for IORING_OP_POLL Jens Axboe
2019-01-29 19:27   ` Jens Axboe
2019-01-29 19:27 ` [PATCH 17/18] io_uring: allow workqueue item to handle multiple buffered requests Jens Axboe
2019-01-29 19:27   ` Jens Axboe
2019-01-29 19:27 ` [PATCH 18/18] io_uring: add io_uring_event cache hit information Jens Axboe
2019-01-29 19:27   ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2019-02-07 19:55 [PATCHSET v12] io_uring IO interface Jens Axboe
2019-02-07 19:55 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
2019-02-07 19:55   ` Jens Axboe
2019-02-08 12:17   ` Alan Jenkins
2019-02-08 12:17     ` Alan Jenkins
2019-02-08 12:57     ` Jens Axboe
2019-02-08 12:57       ` Jens Axboe
2019-02-08 14:02       ` Alan Jenkins
2019-02-08 14:02         ` Alan Jenkins
2019-02-08 15:13         ` Jens Axboe
2019-02-08 15:13           ` Jens Axboe
2019-02-12 12:29           ` Alan Jenkins
2019-02-12 12:29             ` Alan Jenkins
2019-02-12 15:17             ` Jens Axboe
2019-02-12 15:17               ` Jens Axboe
2019-02-12 17:21               ` Alan Jenkins
2019-02-12 17:21                 ` Alan Jenkins
2019-02-12 17:33                 ` Jens Axboe
2019-02-12 17:33                   ` Jens Axboe
2019-02-12 20:23                   ` Alan Jenkins
2019-02-12 20:23                     ` Alan Jenkins
2019-02-12 21:10                     ` Jens Axboe
2019-02-12 21:10                       ` Jens Axboe
2019-02-01 15:23 [PATCHSET v11] io_uring IO interface Jens Axboe
2019-02-01 15:24 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
2019-02-01 15:24   ` Jens Axboe
2019-01-30 21:55 [PATCHSET v10] io_uring IO interface Jens Axboe
2019-01-30 21:55 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
2019-01-30 21:55   ` Jens Axboe
2019-01-28 21:35 [PATCHSET v8] io_uring IO interface Jens Axboe
2019-01-28 21:35 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
2019-01-28 21:35   ` Jens Axboe
2019-01-29 16:36   ` Jann Horn
2019-01-29 16:36     ` Jann Horn
2019-01-29 18:13     ` Jens Axboe
2019-01-29 18:13       ` Jens Axboe
2019-01-23 15:35 [PATCHSET v7] io_uring IO interface Jens Axboe
2019-01-23 15:35 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=785c6db4-095e-65b0-ded5-72b41af5174e@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=avi@scylladb.com \
    --cc=hch@lst.de \
    --cc=jannh@google.com \
    --cc=jmoyer@redhat.com \
    --cc=linux-aio@kvack.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.