On Mon, Jun 17, 2019 at 03:26:50PM +0300, Maxim Levitsky wrote:
> On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> > +        if (!cqes) {
> > +            break;
> > +        }
> > +        LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
> > +        ret = cqes->res;
> > +
> > +        if (ret == luringcb->qiov->size) {
> > +            ret = 0;
> > +        } else if (ret >= 0) {
>
> You should very carefully check the allowed return values here.
>
> It looks like you can get '-EINTR' here, which would ask you to rerun the
> read operation, and otherwise you will get the number of bytes read, which
> might be less than what was asked for, which implies that you need to
> retry the read operation with the remainder of the buffer rather than
> zero the end of the buffer IMHO.
>
> (0 is returned on EOF according to 'read' semantics, which I think are
> used here, thus a short read might not be an EOF)
>
> Looking at linux-aio.c though I do see that it just passes through the
> returned value with no special treatment, including lack of a check for
> -EINTR.
>
> I assume that since aio is Linux specific, and it only supports direct IO,
> it happens to have an assumption of no short reads/-EINTR (but since
> libaio has very sparse documentation I can't verify this).
>
> On the other hand the aio=threads implementation actually does everything
> as specified on the 'write' manpage, retrying the reads on -EINTR, and
> doing additional reads if less than the required number of bytes were
> read.
>
> Looking at the io_uring implementation in the kernel I see that it does
> support synchronous (non-O_DIRECT) mode, and in this case it goes through
> the same ->read_iter, which is pretty much the same path that a regular
> read() takes, and so it might return short reads and/or -EINTR.

Interesting point.  Investigating EINTR should at least be a TODO comment
and needs to be resolved before io_uring lands in a QEMU release.

> > +static int ioq_submit(LuringState *s)
> > +{
> > +    int ret = 0;
> > +    LuringAIOCB *luringcb, *luringcb_next;
> > +
> > +    while (s->io_q.in_queue > 0) {
> > +        QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.sq_overflow, next,
> > +                              luringcb_next) {
>
> I am torn about the 'sq_overflow' name.  It seems to me that it's not
> immediately clear that these are the requests that are waiting because the
> io_uring got full, but I can't think of a better name right now.
>
> Maybe add a comment here to explain what is going on here?

Hmm...I suggested this name because I thought it was clear.  But the fact
that it puzzled you proves it wasn't clear :-).  Can anyone think of a
better name?  It's the queue we keep in QEMU to hold requests while the
io_uring sq ring is full.

> Also maybe we could somehow utilize the plug/unplug facility to avoid
> reaching that state in the first place?  Maybe the block layer has some
> kind of 'max outstanding requests' limit that could be used?
>
> In my nvme-mdev I opted to not process the input queues when such a
> condition is detected, but here you can't, as the block layer pretty much
> calls you to process the requests.

Block layer callers are allowed to submit as many I/O requests as they like
and there is no feedback mechanism.  It's up to linux-aio.c and io_uring.c
to handle the case where host kernel I/O submission resources are
exhausted.

Plug/unplug is a batching performance optimization to reduce the number of
io_uring_enter() calls, but it does not stop callers from submitting more
I/O requests.  So plug/unplug isn't directly applicable here.
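To make that a bit more concrete, here is a rough sketch of how the
sq_overflow queue could be drained once SQEs become available again.  This
is illustrative only and not the patch code: the helper name
luring_drain_overflow() is made up, and it assumes the SQE was already
prepared into luringcb->sqeq when the ring was full (as the
luring_do_submit() hunk quoted below does):

/*
 * Rough sketch only, not the patch code.  Requests whose SQE could not be
 * allocated keep their prepared SQE in luringcb->sqeq and wait on the
 * sq_overflow queue; once io_uring_get_sqe() succeeds again they are
 * copied into the ring and submitted.
 */
static void luring_drain_overflow(LuringState *s)
{
    LuringAIOCB *luringcb, *luringcb_next;

    QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.sq_overflow, next,
                          luringcb_next) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&s->ring);
        if (!sqe) {
            break; /* sq ring is full again, leave the rest queued */
        }
        *sqe = luringcb->sqeq; /* hand the prepared SQE to the ring */
        QSIMPLEQ_REMOVE_HEAD(&s->io_q.sq_overflow, next);
    }

    io_uring_submit(&s->ring); /* io_uring_enter(2) under the hood */
}

That is my reading of what ioq_submit() is aiming for; a comment along
these lines next to the queue definition would probably answer the naming
question too.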
> > +static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
> > +                            uint64_t offset, int type)
> > +{
> > +    struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
> > +    if (!sqes) {
> > +        sqes = &luringcb->sqeq;
> > +        QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
> > +    }
> > +
> > +    switch (type) {
> > +    case QEMU_AIO_WRITE:
> > +        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
> > +                             luringcb->qiov->niov, offset);
> > +        break;
> > +    case QEMU_AIO_READ:
> > +        io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
> > +                            luringcb->qiov->niov, offset);
> > +        break;
> > +    case QEMU_AIO_FLUSH:
> > +        io_uring_prep_fsync(sqes, fd, 0);
> > +        break;
> > +    default:
> > +        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
> > +                __func__, type);
>
> Nitpick: Don't we use some kind of error printing function like
> 'error_setg' rather than fprintf?

Here we're not in a context where an Error object can be returned (e.g.
printed by the QMP monitor).  We only have an errno return value that the
emulated storage controller may squash down further to a single EIO-type
error code.

'type' is a QEMU-internal value, so the default case is basically:

  assert(false); /* we should never get here */

For these reasons the fprintf() seems okay here.

> I plan to do some benchmarks of the code this week or next, and I will
> share the results as soon as I have them.

Excellent, Aarushi has been benchmarking too.  Perhaps you can share the
QEMU command-line and fio configuration so that the results can be
compared.

Stefan
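P.S. Regarding the short read/-EINTR question at the top: for reference,
the retry semantics the aio=threads path follows look roughly like this
when written as a plain read() loop (illustrative sketch only, not QEMU
code):

#include <errno.h>
#include <unistd.h>

/*
 * Retry on EINTR, keep reading after short reads, and treat 0 as EOF
 * rather than an error.  Returns the number of bytes read or -errno.
 */
static ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t done = 0;

    while (done < count) {
        ssize_t ret = read(fd, (char *)buf + done, count - done);

        if (ret < 0) {
            if (errno == EINTR) {
                continue;      /* interrupted, just retry */
            }
            return -errno;     /* real error */
        }
        if (ret == 0) {
            break;             /* EOF: a short read, not an error */
        }
        done += ret;
    }
    return done;
}

Whether the io_uring completion path needs the equivalent (resubmitting
the remainder of the qiov instead of treating a short read as success) is
exactly the TODO mentioned above.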