Re: [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD

From: Jens Axboe <axboe@kernel.dk>
To: Stefan Metzmacher <metze@samba.org>,
	"Darrick J. Wong" <djwong@kernel.org>
Cc: io-uring@vger.kernel.org
Subject: Re: [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD
Date: Mon, 22 Feb 2021 13:14:35 -0700	[thread overview]
Message-ID: <6f3bcf67-15e3-f113-486e-b34c6c0df5e3@kernel.dk> (raw)
In-Reply-To: <bcce9dd6-8222-6dc5-ad4f-5a68ac3ca902@samba.org>

On 2/22/21 1:04 PM, Stefan Metzmacher wrote:
> Hi Jens,
> 
>>> I've been thinking along the same lines, because having a sparse sqe layout
>>> for the uring cmd is a pain. I do think 'personality' is a bit too specific
>>> to be part of the shared space, that should probably belong in the pdu
>>> instead if the user needs it. One thing they all have in common is that they'd
>>> need a sub-command, so why not make that u16 that?
>>>
>>> There's also the option of simply saying that the uring_cmd sqe is just
>>> a different type, ala:
>>>
>>> struct io_uring_cmd_sqe {
>>> 	__u8	opcode;		/* IO_OP_URING_CMD */
>>> 	__u8	flags;
>>> 	__u16	target_op;
>>> 	__s32	fd;
>>> 	__u64	user_data;
>>> 	strut io_uring_cmd_pdu cmd_pdu;
>>> };
>>>
>>> which is essentially the same as your suggestion in terms of layout
>>> (because that is the one that makes the most sense), we just don't try
>>> and shoe-horn it into the existing sqe. As long as we overlap
>>> opcode/flags, then init is fine. And past init, sqe is already consumed.
>>>
>>> Haven't tried and wire that up yet, and it may just be that the simple
>>> layout change you did is just easier to deal with. The important part
>>> here is the layout, and I certainly think we should do that. There's
>>> effectively 54 bytes of data there, if you include the target op and fd
>>> as part of that space. 48 fully usable for whatever.
>>
>> OK, folded in some of your stuff, and pushed out a new branch. Find it
>> here:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops.v3
>>
>> I did notice while doing so that you put the issue flags in the cmd,
>> I've made them external again. Just seems cleaner to me, otherwise
>> you'd have to modify the command for reissue rather than just
>> pass in the flags directly.
> 
> I think the first two commits need more verbose comments, which clearly
> document the uring_cmd() API.

Oh for sure, I just haven't gotten around to it yet :-)

> Event before uring_cmd(), it's really not clear to me why we have
> 'enum io_uring_cmd_flags', as 'enum'.
> As it seems to be use it as 'flags' (IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER).

They could be unsigned in too, but not really a big deal imho.

> With uring_cmd() it's not clear what the backend is supposed to do
> with these flags.

IO_URING_F_NONBLOCK tells the lower layers that the operation should be
non-blocking, and if that isn't possible, then it must return -EAGAIN.
If that happens, then the operation will be retried from a context where
IO_URING_F_NONBLOCK isn't set.

IO_URING_F_COMPLETE_DEFER is just part of the flags that should be
passed to the completion side, the handler need not do anything else.
It's only used internally, but allows fast processing if the completion
occurs off the IO_URING_F_NONBLOCK path.

It'll get documented... But the above is also why it should get passed
in, rather than stuffed in the command itself.

> I'd assume that uring_cmd() would per definition never block and go
> async itself, by returning -EIOCBQUEUED. And a single &req->uring_cmd
> is only ever passed once to uring_cmd() without any retry.

No, -EIOCBQUEUED would mean "operation is queued, I'll call the
completion callback for it later". For example, you start the IO
operation and you'll get a notification (eg IRQ) later on which allows
you to complete it.

> It's also not clear if IOSQE_ASYNC should have any impact.

Handler doesn't need to care about that, it'll just mean that the
initial queue attempt will not have IO_URING_F_NONBLOCK set.

> I think we also need a way to pass IORING_OP_ASYNC_CANCEL down.

Cancelation indeed needs some thought. There are a few options:

1) Request completes sync, obviously no cancelation needed here as the
   request is never stuck in a state that requires cancelation.

2) Request needs blocking context, and hence an async handler is doing
   it. The regular cancelation already works for that, nothing is needed
   here. Would probably be better handled with a cancel handler.

3) uring_cmd handler returns -EIOCBQUEUED. This is the only case that
   needs active cancelation support. Only case where that would
   currently happen are things like block IO, where we don't support
   cancelation to begin with (insert long rant on inadequate hw
   support).

So tldr here is that 1+2 is already there, and 3 not being fixed leaves
us no different than the existing support for cancelation. IOW, I don't
think this is an initial requirement, support can always be expanded
later.

>> Since we just need that one branch in req init, I do think that your
>> suggestion of just modifying io_uring_sqe is the way to go. So that's
>> what the above branch does.
> 
> Thanks! I think it's much easier to handle the personality logic in
> the core only.
> 
> For fixed files or fixed buffers I think helper functions like this:
> 
> struct file *io_uring_cmd_get_file(struct io_uring_cmd *cmd, int fd, bool fixed);
> 
> And similar functions for io_buffer_select or io_import_fixed.

I did end up retaining that, at least in its current state it's like you
proposed. Only change is some packing on that very union, which should
not be necessary, but due to fun arm reasons it is.

>> I tested the block side, and it works for getting the bs of the
>> device. That's all the testing that has been done so far :-)
> 
> I've added EXPORT_SYMBOL(io_uring_cmd_done); and split your net patch,
> similar to the two block patches. So we can better isolate the core
> from the first consumers.
> 
> See https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/io_uring-fops.v3

Great thanks, I'll take a look and fold back. I'll also expand those
commit messages :-)

-- 
Jens Axboe