On Tue, Jul 07, 2020 at 04:52:37PM +0100, Matthew Wilcox wrote: >On Tue, Jul 07, 2020 at 08:41:05PM +0530, Kanchan Joshi wrote: >> On Mon, Jul 06, 2020 at 03:32:08PM +0100, Matthew Wilcox wrote: >> > On Mon, Jul 06, 2020 at 08:27:17AM -0600, Jens Axboe wrote: >> > > On 7/6/20 8:10 AM, Matthew Wilcox wrote: >> > > > On Sun, Jul 05, 2020 at 03:12:50PM -0600, Jens Axboe wrote: >> > > >> On 7/5/20 3:09 PM, Matthew Wilcox wrote: >> > > >>> On Sun, Jul 05, 2020 at 03:00:47PM -0600, Jens Axboe wrote: >> > > >>>> On 7/5/20 12:47 PM, Kanchan Joshi wrote: >> > > >>>>> From: Selvakumar S >> > > >>>>> >> > > >>>>> For zone-append, block-layer will return zone-relative offset via ret2 >> > > >>>>> of ki_complete interface. Make changes to collect it, and send to >> > > >>>>> user-space using cqe->flags. >> > > > >> > > >>> I'm surprised you aren't more upset by the abuse of cqe->flags for the >> > > >>> address. >> >> Documentation (https://protect2.fireeye.com/url?k=297dbcbf-74aee030-297c37f0-0cc47a31ce52-632d3561909b91fc&q=1&u=https%3A%2F%2Fkernel.dk%2Fio_uring.pdf) mentioned cqe->flags can carry >> the metadata for the operation. I wonder if this should be called abuse. >> >> > > >> Yeah, it's not great either, but we have less leeway there in terms of >> > > >> how much space is available to pass back extra data. >> > > >> >> > > >>> What do you think to my idea of interpreting the user_data as being a >> > > >>> pointer to somewhere to store the address? Obviously other things >> > > >>> can be stored after the address in the user_data. >> > > >> >> > > >> I don't like that at all, as all other commands just pass user_data >> > > >> through. This means the application would have to treat this very >> > > >> differently, and potentially not have a way to store any data for >> > > >> locating the original command on the user side. >> > > > >> > > > I think you misunderstood me. You seem to have thought I meant >> > > > "use the user_data field to return the address" when I actually meant >> > > > "interpret the user_data field as a pointer to where userspace >> > > > wants the address stored". >> > > >> > > It's still somewhat weird to have user_data have special meaning, you're >> > > now having the kernel interpret it while every other command it's just >> > > an opaque that is passed through. >> > > >> > > But it could of course work, and the app could embed the necessary >> > > u32/u64 in some other structure that's persistent across IO. If it >> > > doesn't have that, then it'd need to now have one allocated and freed >> > > across the lifetime of the IO. >> > > >> > > If we're going that route, it'd be better to define the write such that >> > > you're passing in the necessary information upfront. In syscall terms, >> > > then that'd be something ala: >> > > >> > > ssize_t my_append_write(int fd, const struct iovec *iov, int iovcnt, >> > > off_t *offset, int flags); >> > > >> > > where *offset is copied out when the write completes. That removes the >> > > need to abuse user_data, with just providing the storage pointer for the >> > > offset upfront. >> > >> > That works for me! In io_uring terms, would you like to see that done >> > as adding: >> > >> > union { >> > __u64 off; /* offset into file */ >> > + __u64 *offp; /* appending writes */ >> > __u64 addr2; >> > }; >> But there are peformance implications of this approach? >> If I got it right, the workflow is: - Application allocates 64bit of space, >> writes "off" into it and pass it >> in the sqe->addr2 >> - Kernel first reads sqe->addr2, reads the value to know the intended >> write-location, and stores the address somewhere (?) to be used during >> completion. Storing this address seems tricky as this may add one more >> cacheline (in io_kiocb->rw)? > >io_kiocb is: > /* size: 232, cachelines: 4, members: 19 */ > /* forced alignments: 1 */ > /* last cacheline: 40 bytes */ >so we have another 24 bytes before io_kiocb takes up another cacheline. >If that's a serious problem, I have an idea about how to shrink struct >kiocb by 8 bytes so struct io_rw would have space to store another >pointer. Yes, io_kiocb has room. Cache-locality wise whether that is fine or it must be placed within io_rw - I'll come to know once I get to implement this. Please share the idea you have, it can come handy. >> - During completion cqe res/flags are written as before, but extra step >> to copy the append-completion-result into that user-space address. >> Extra steps are due to the pointer indirection. > >... we've just done an I/O. Concern about an extra pointer access >seems misplaced? I was thinking about both read-from (submission) and write-to (completion) from user-space pointer, and all those checks APIs (get_user, copy_from_user) perform.....but when seen against I/O (that too direct), it does look small. Down the line it may matter for cached-IO but I get your point. >> And it seems application needs to be careful about managing this 64bit of >> space for a cluster of writes, especially if it wants to reuse the sqe >> before the completion. >> New one can handle 64bit result cleanly, but seems slower than current >> one. > >But userspace has to _do_ something with that information anyway. So >it must already have somewhere to put that information. Right. But it is part of SQE/CQE currently, and in new scheme it gets decoupled and needs to be managed separately. >I do think that interpretation of that field should be a separate flag >from WRITE_APPEND so apps which genuinely don't care about where the I/O >ended up don't have to allocate some temporary storage. eg a logging >application which just needs to know that it managed to append to the >end of the log and doesn't need to do anything if it's successful. Would you want that new flag to do what RWF_APPEND does as well? In v2, we had a separate flag RWF_ZONE_APPEND and did not use file-append infra at all. Thought was: apps using the new flag will always look at where write landed. In v3, I've been looking at this as: zone-append = file-append + something-extra-to-know-where-write-landed. We see to it that something-extra does not get executed for regular files/block-dev append (ref: FMODE_ZONE_APPEND patch1)....and existing behavior (the one you described for logging application) is retained. But on a zoned-device/file, file-append becomes zone-append, all the time.