linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Pavel Begunkov <asml.silence@gmail.com>
To: Josh Triplett <josh@joshtriplett.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	io-uring@vger.kernel.org,
	"David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	Stefan Metzmacher <metze@samba.org>
Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table
Date: Sun, 15 Aug 2021 11:48:48 +0100	[thread overview]
Message-ID: <c6c0a1ee-2417-6e9d-4206-77f9498a4401@gmail.com> (raw)
In-Reply-To: <YRiKg7tV+8oMtXtg@localhost>

On 8/15/21 4:31 AM, Josh Triplett wrote:
> On Sat, Aug 14, 2021 at 01:50:24PM +0100, Pavel Begunkov wrote:
>> On 8/13/21 8:00 PM, Josh Triplett wrote:
>>> Rather than using sqe->file_index - 1, which feels like an error-prone
>>> interface, I think it makes sense to use a dedicated flag for this, like
>>> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
>>> including open, accept, and in the future many other operations such as
>>> memfd_create. (Imagine using a single ring submission to open a memfd,
>>> write a buffer into it, seal it, send it over a UNIX socket, and then
>>> close it.)
>>>
>>> The only downside is that you'll need to reject that flag in all
>>> non-open operations. One way to unify that code might be to add a flag
>>> in io_op_def for open-like operations, and then check in common code for
>>> the case of non-open-like operations passing IOSQE_OPEN_FIXED.
>>
>> io_uring is really thin, and so I absolutely don't want any extra
>> overhead in the generic path, IOW anything affecting
>> reads/writes/sends/recvs.
> 
> There are already several checks for valid flags in io_init_req. For
> instance:

Yes, it's horrible and I don't want to make it any worse.

>         if ((sqe_flags & IOSQE_BUFFER_SELECT) &&
>             !io_op_defs[req->opcode].buffer_select)
>                 return -EOPNOTSUPP;
> It'd be trivial to make io_op_defs have a "valid flags" byte, and one
> bitwise op tells you if any invalid flags were passed. *Zero* additional
> overhead for other operations.

Good point

> Alternatively, since there are so few operations that open a file
> descriptor, you could just add a separate opcode for those few
> operations. That still seems preferable to overloading a 16-bit index
> field for this.

I don't think so

> With this new mechanism, I think we're going to want to support more
> than 65535 fixed-file entries. I can easily imagine wanting to handle
> hundreds of thousands of files or sockets this way.

May be. What I'm curious about is that the feature doesn't really
change anything in this regard, but seems I haven't heard people
asking for larger tables.

>> The other reason is that there are only 2 bits left in sqe->flags,
>> and we may use them for something better, considering that it's
>> only open/accept and not much as this.
> 
> pipe, dup3, socket, socketpair, pidfds (via either pidfd_open or a
> ring-based spawn mechanism), epoll_create, inotify, fanotify, signalfd,
> timerfd, eventfd, memfd_create, userfaultfd, open_tree, fsopen, fsmount,
> memfd_secret.

We could argue for many of those whether they should be in io_uring,
and whether there are many benefits having them async and so. It would
have another story if all the ecosystem was io_uring centric, but
that's speculations.

> Of those, I personally would *love* to have at least pipe, socket,
> pidfd, memfd_create, and fsopen/fsmount/open_tree, plus some manner of
> dup-like operation for moving things between the fixed-file table and
> file descriptors.
> 
> I think this is valuable and versatile enough to merit a flag. It would
> also be entirely reasonable to create separate operations for these. But
> either way, I don't think this should just be determined by whether a
> 16-bit index is non-zero.
> 
>> I agree that it feels error-prone, but at least it can be wrapped
>> nicely enough in liburing, e.g.
>>
>> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
>> 				 const char *path, int flags,
>> 				 mode_t mode, int slot_idx);
> 
> That wrapper wouldn't be able to handle more than a 16-bit slot index
> though.

It would. Note, the index is "int" there, so if doesn't fit
into u16, we can fail it. And do conversion if required.

>>> Also, rather than using a 16-bit index for the fixed file table and
>>> potentially requiring expansion into a different field in the future,
>>> what about overlapping it with the nofile field in the open and accept
>>> requests? If they're not opening a normal file descriptor, they don't
>>> need nofile. And in the original sqe, you can then overlap it with a
>>> 32-bit field like splice_fd_in.
>>
>> There is no nofile in SQEs, though
>>
>> req->open.nofile = rlimit(RLIMIT_NOFILE);
> 
> nofile isn't needed for opening into the fixed-file table, so it could
> be omitted in that case, and another field unioned with it.

There is no problem to place it internally. Moreover, it's at the
moment uniformly placed inside io_kiocb, but with nofile we'd need
to find the place on per-op basis.

Not like any matters, it's just bike shedding.

> allow passing a 32-bit fixed-file index into open and accept without
> growing the size of their structures. I think, with this new capability,
> we're going to want a large number of fixed files available.
> 
> In the SQE, you could overlap it with the splice_fd_in field, which
> isn't needed by any calls other than splice.

But it doesn't mean it won't be used, as happened with pretty every
other field in SQE. So, it rather depends on what packing is wanted.
And reusing almost never used ->buf_index (and potentially ->ioprio),
sounds reasonable.

-- 
Pavel Begunkov

  reply	other threads:[~2021-08-15 10:49 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-13 16:43 [PATCH v2 0/4] open/accept directly into io_uring fixed file table Pavel Begunkov
2021-08-13 16:43 ` [PATCH v2 1/4] net: add accept helper not installing fd Pavel Begunkov
2021-08-13 16:43 ` [PATCH v2 2/4] io_uring: openat directly into fixed fd table Pavel Begunkov
2021-08-13 16:43 ` [PATCH v2 3/4] io_uring: hand code io_accept() fd installing Pavel Begunkov
2021-08-13 16:43 ` [PATCH v2 4/4] io_uring: accept directly into fixed file table Pavel Begunkov
2021-08-13 19:00 ` [PATCH v2 0/4] open/accept directly into io_uring " Josh Triplett
2021-08-14 12:50   ` Pavel Begunkov
2021-08-14 23:03     ` Jens Axboe
2021-08-15  3:42       ` Josh Triplett
2021-08-15 15:05         ` Jens Axboe
2021-08-15 15:12           ` Jens Axboe
2021-08-15 13:00       ` Pavel Begunkov
2021-08-15  3:31     ` Josh Triplett
2021-08-15 10:48       ` Pavel Begunkov [this message]
2021-08-15 14:23         ` Pavel Begunkov
2021-08-16 15:45 ` Stefan Metzmacher
2021-08-17  9:33   ` Pavel Begunkov
2021-08-17 14:57     ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c6c0a1ee-2417-6e9d-4206-77f9498a4401@gmail.com \
    --to=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=davem@davemloft.net \
    --cc=io-uring@vger.kernel.org \
    --cc=josh@joshtriplett.org \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=metze@samba.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).