Let fast poll support multishot mode; currently only accept is added as
its first consumer.

theoretical analysis:
  1) when connections come in fast
    - singleshot:
              add accept sqe(userspace) --> accept inline
                              ^                 |
                              |-----------------|
    - multishot:
              add accept sqe(userspace) --> accept inline
                                ^                |
                                |--*--|

      we accept repeatedly at * until we get EAGAIN

  2) when connections come in at a low pressure
     similar to 1), we save a lot of userspace-kernel context switches
     and useless vfs_poll() calls

tests:
Did some tests, which go like this:

      server    client(multiple)
      accept    connect
      read      write
      write     read
      close     close

Basically, raise up a number of clients (on the same machine as the
server) to connect to the server and write some data to it; the server
writes that data back to the client after receiving it, then closes the
connection once the write returns. The client then reads the data and
closes the connection. Here I test 10000 clients connecting to one
server, data size 128 bytes. Each client runs in its own goroutine, so
they all reach the server within a short time.

Ran the test 20 times before/after this patchset; time spent (unit:
cycles, the return value of clock()):
before:
(1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
+1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
+1934226+1914385)/20.0 = 1927633.75
after:
(1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
+1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
+1871324+1940803)/20.0 = 1894750.45

(1927633.75 - 1894750.45) / 1927633.75 = 1.65%

Higher pressure like 10^5 connections causes problems since ports run
out. I'll test the cancellation path and try to push the pressure much
higher to see if the numbers improve. Sending this out early for
comments and suggestions.
Hao Xu (6):
  io_uring: enhance flush completion logic
  io_uring: add IORING_ACCEPT_MULTISHOT for accept
  io_uring: add REQ_F_APOLL_MULTISHOT for requests
  io_uring: let fast poll support multishot
  io_uring: implement multishot mode for accept
  io_uring: enable multishot mode for accept

 fs/io_uring.c                 | 72 +++++++++++++++++++++++++++++------
 include/uapi/linux/io_uring.h |  4 ++
 2 files changed, 64 insertions(+), 12 deletions(-)

--
2.24.4
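[Editorial note] The averages quoted in the cover letter can be re-derived from the raw clock() samples; a standalone check in plain C (values copied verbatim from the lists above):

```c
#include <assert.h>
#include <math.h>

/* Raw clock() samples copied from the cover letter above. */
static const double before_runs[20] = {
	1930136, 1940725, 1907981, 1947601, 1923812, 1928226, 1911087,
	1905897, 1941075, 1934374, 1906614, 1912504, 1949110, 1908790,
	1909951, 1941672, 1969525, 1934984, 1934226, 1914385,
};
static const double after_runs[20] = {
	1858905, 1917104, 1895455, 1963963, 1892706, 1889208, 1874175,
	1904753, 1874112, 1874985, 1882706, 1884642, 1864694, 1906508,
	1916150, 1924250, 1869060, 1889506, 1871324, 1940803,
};

static double mean(const double *v, int n)
{
	double sum = 0.0;

	for (int i = 0; i < n; i++)
		sum += v[i];
	return sum / n;
}

/* Relative saving of the patched runs over the baseline. */
static double improvement(void)
{
	double mb = mean(before_runs, 20), ma = mean(after_runs, 20);

	return (mb - ma) / mb;
}
```

Both quoted means check out exactly (1927633.75 and 1894750.45); the final ratio, however, evaluates to roughly 1.7%, slightly above the 1.65% stated.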
Though currently the refcount of a req is always one when we flush inline
completions, there is still a chance of exceptions in the future. Enhance
the flush logic to make sure we maintain compl_nr correctly.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---

we need to either remove the if check, to claim clearly that the req's
refcount is 1, or add this patch's logic.

 fs/io_uring.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2bde732a1183..c48d43207f57 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2291,7 +2291,7 @@ static void io_submit_flush_completions(struct io_ring_ctx *ctx)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_submit_state *state = &ctx->submit_state;
-	int i, nr = state->compl_nr;
+	int i, nr = state->compl_nr, remain = 0;
 	struct req_batch rb;
 
 	spin_lock(&ctx->completion_lock);
@@ -2311,10 +2311,12 @@ static void io_submit_flush_completions(struct io_ring_ctx *ctx)
 
 		if (req_ref_put_and_test(req))
 			io_req_free_batch(&rb, req, &ctx->submit_state);
+		else
+			state->compl_reqs[remain++] = state->compl_reqs[i];
 	}
 
 	io_req_free_batch_finish(ctx, &rb);
-	state->compl_nr = 0;
+	state->compl_nr = remain;
 }
 
 /*
--
2.24.4
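[Editorial note] The intent of the hunk above can be modeled outside the kernel; a toy sketch (hypothetical names, not kernel code) that puts one reference per queued request and packs the still-referenced ones back to the front. Note that the review later in this thread finds this re-queuing unsafe in the kernel, since a surviving reference belongs to another owner who is responsible for the final free:

```c
#include <assert.h>

struct toy_req { int refs; };

/*
 * Model of the proposed flush: put one reference on each queued request;
 * requests that hit zero are "freed" (just counted here), the rest are
 * kept compacted at the front so the queue count stays consistent.
 * Returns the new number of queued entries (the new compl_nr).
 */
static int flush_completions(struct toy_req **reqs, int nr, int *freed)
{
	int i, remain = 0;

	for (i = 0; i < nr; i++) {
		struct toy_req *req = reqs[i];

		if (--req->refs == 0)
			(*freed)++;		/* io_req_free_batch() in the patch */
		else
			reqs[remain++] = req;	/* still referenced elsewhere */
	}
	return remain;
}

/* Three requests, one of them with an extra reference: two get freed,
 * one survives at slot 0 with a single reference left. */
static int demo(void)
{
	struct toy_req a = { 1 }, b = { 2 }, c = { 1 };
	struct toy_req *q[3] = { &a, &b, &c };
	int freed = 0;
	int remain = flush_completions(q, 3, &freed);

	return freed == 2 && remain == 1 && q[0] == &b && b.refs == 1;
}
```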
add an accept_flag IORING_ACCEPT_MULTISHOT for accept, which is used to
request multishot mode.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 include/uapi/linux/io_uring.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 3caec9199658..faa3f2e70e46 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -177,6 +177,10 @@ enum {
 #define IORING_POLL_UPDATE_EVENTS	(1U << 1)
 #define IORING_POLL_UPDATE_USER_DATA	(1U << 2)
 
+/*
+ * accept flags stored in accept_flags
+ */
+#define IORING_ACCEPT_MULTISHOT	(1U << 0)
 /*
  * IO completion data structure (Completion Queue Entry)
  */
--
2.24.4
add a flag to indicate multishot mode for fast poll. Currently only
accept uses it, but there may be more operations leveraging it in the
future.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index c48d43207f57..d6df60c4cdb9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -720,6 +720,7 @@ enum {
 	REQ_F_NOWAIT_READ_BIT,
 	REQ_F_NOWAIT_WRITE_BIT,
 	REQ_F_ISREG_BIT,
+	REQ_F_APOLL_MULTISHOT_BIT,
 
 	/* not a real bit, just to check we're not overflowing the space */
 	__REQ_F_LAST_BIT,
@@ -773,6 +774,8 @@ enum {
 	REQ_F_REFCOUNT		= BIT(REQ_F_REFCOUNT_BIT),
 	/* there is a linked timeout that has to be armed */
 	REQ_F_ARM_LTIMEOUT	= BIT(REQ_F_ARM_LTIMEOUT_BIT),
+	/* fast poll multishot mode */
+	REQ_F_APOLL_MULTISHOT	= BIT(REQ_F_APOLL_MULTISHOT_BIT),
 };
 
 struct async_poll {
--
2.24.4
For operations like accept, multishot is a useful feature: it saves us
from re-issuing an accept sqe for every incoming connection. Let's
integrate it into fast poll; it may benefit other operations in the
future.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index d6df60c4cdb9..dae7044e0c24 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5277,8 +5277,15 @@ static void io_async_task_func(struct io_kiocb *req, bool *locked)
 		return;
 	}
 
-	hash_del(&req->hash_node);
-	io_poll_remove_double(req);
+	if (READ_ONCE(apoll->poll.canceled))
+		apoll->poll.events |= EPOLLONESHOT;
+	if (apoll->poll.events & EPOLLONESHOT) {
+		hash_del(&req->hash_node);
+		io_poll_remove_double(req);
+	} else {
+		add_wait_queue(apoll->poll.head, &apoll->poll.wait);
+	}
+
 	spin_unlock(&ctx->completion_lock);
 
 	if (!READ_ONCE(apoll->poll.canceled))
@@ -5366,7 +5373,7 @@ static int io_arm_poll_handler(struct io_kiocb *req)
 	struct io_ring_ctx *ctx = req->ctx;
 	struct async_poll *apoll;
 	struct io_poll_table ipt;
-	__poll_t ret, mask = EPOLLONESHOT | POLLERR | POLLPRI;
+	__poll_t ret, mask = POLLERR | POLLPRI;
 	int rw;
 
 	if (!req->file || !file_can_poll(req->file))
@@ -5388,6 +5395,8 @@ static int io_arm_poll_handler(struct io_kiocb *req)
 		rw = WRITE;
 		mask |= POLLOUT | POLLWRNORM;
 	}
+	if (!(req->flags & REQ_F_APOLL_MULTISHOT))
+		mask |= EPOLLONESHOT;
 
 	/* if we can't nonblock try, then no point in arming a poll handler */
 	if (!io_file_supports_nowait(req, rw))
--
2.24.4
Refactor io_accept() to support multishot mode.

theoretical analysis:
  1) when connections come in fast
    - singleshot:
              add accept sqe(userspace) --> accept inline
                              ^                 |
                              |-----------------|
    - multishot:
              add accept sqe(userspace) --> accept inline
                                ^                |
                                |--*--|

      we accept repeatedly at * until we get EAGAIN

  2) when connections come in at a low pressure
     similar to 1), we save a lot of userspace-kernel context switches
     and useless vfs_poll() calls

tests:
Did some tests, which go like this:

      server    client(multiple)
      accept    connect
      read      write
      write     read
      close     close

Basically, raise up a number of clients (on the same machine as the
server) to connect to the server and write some data to it; the server
writes that data back to the client after receiving it, then closes the
connection once the write returns. The client then reads the data and
closes the connection. Here I test 10000 clients connecting to one
server, data size 128 bytes. Each client runs in its own goroutine, so
they all reach the server within a short time.
Ran the test 20 times before/after this patchset; time spent (unit:
cycles, the return value of clock()):
before:
(1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
+1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
+1934226+1914385)/20.0 = 1927633.75
after:
(1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
+1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
+1871324+1940803)/20.0 = 1894750.45

(1927633.75 - 1894750.45) / 1927633.75 = 1.65%

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---

not sure if we should cancel it when io_cqring_fill_event() returns false

 fs/io_uring.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index dae7044e0c24..eb81d37dce78 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4885,16 +4885,18 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 static int io_accept(struct io_kiocb *req, unsigned int issue_flags)
 {
+	struct io_ring_ctx *ctx = req->ctx;
 	struct io_accept *accept = &req->accept;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	unsigned int file_flags = force_nonblock ? O_NONBLOCK : 0;
 	bool fixed = !!accept->file_slot;
 	struct file *file;
-	int ret, fd;
+	int ret, ret2 = 0, fd;
 
 	if (req->file->f_flags & O_NONBLOCK)
 		req->flags |= REQ_F_NOWAIT;
 
+retry:
 	if (!fixed) {
 		fd = __get_unused_fd_flags(accept->flags, accept->nofile);
 		if (unlikely(fd < 0))
@@ -4906,20 +4908,42 @@ static int io_accept(struct io_kiocb *req, unsigned int issue_flags)
 		if (!fixed)
 			put_unused_fd(fd);
 		ret = PTR_ERR(file);
-		if (ret == -EAGAIN && force_nonblock)
-			return -EAGAIN;
+		if (ret == -EAGAIN && force_nonblock) {
+			if ((req->flags & (REQ_F_APOLL_MULTISHOT | REQ_F_POLLED)) ==
+			    (REQ_F_APOLL_MULTISHOT | REQ_F_POLLED))
+				ret = 0;
+			return ret;
+		}
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
 		req_set_fail(req);
 	} else if (!fixed) {
 		fd_install(fd, file);
 		ret = fd;
+		/*
+		 * if it's in multishot mode, let's return -EAGAIN to make it go
+		 * into fast poll path
+		 */
+		if ((req->flags & REQ_F_APOLL_MULTISHOT) && force_nonblock &&
+		    !(req->flags & REQ_F_POLLED))
+			ret2 = -EAGAIN;
 	} else {
 		ret = io_install_fixed_file(req, file, issue_flags,
 					    accept->file_slot - 1);
 	}
-	__io_req_complete(req, issue_flags, ret, 0);
-	return 0;
+
+	if (req->flags & REQ_F_APOLL_MULTISHOT) {
+		spin_lock(&ctx->completion_lock);
+		if (io_cqring_fill_event(ctx, req->user_data, ret, 0)) {
+			io_commit_cqring(ctx);
+			ctx->cq_extra++;
+		}
+		spin_unlock(&ctx->completion_lock);
+		goto retry;
+	} else {
+		__io_req_complete(req, issue_flags, ret, 0);
+	}
+	return ret2;
 }
 
 static int io_connect_prep_async(struct io_kiocb *req)
--
2.24.4
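[Editorial note] The retry loop this patch adds keeps accepting until -EAGAIN. The same drain pattern can be demonstrated from userspace with plain nonblocking sockets (not io_uring, so it runs anywhere; Linux-specific `accept4()` and `SOCK_NONBLOCK` assumed):

```c
#define _GNU_SOURCE		/* accept4() */
#include <assert.h>
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Accept everything queued on the listener, as the multishot path does
 * between poll wakeups: loop until accept4() reports EAGAIN. Each
 * accepted fd would post one CQE in the io_uring version. */
static int drain_accept(int lfd)
{
	int n = 0;

	for (;;) {
		int fd = accept4(lfd, NULL, NULL, SOCK_NONBLOCK);

		if (fd < 0) {
			assert(errno == EAGAIN || errno == EWOULDBLOCK);
			return n;
		}
		close(fd);
		n++;
	}
}

/* Set up a loopback listener on an ephemeral port, connect three
 * clients, then drain them all in one pass. */
static int demo(void)
{
	struct sockaddr_in addr = { .sin_family = AF_INET };
	socklen_t len = sizeof(addr);
	int i, got, lfd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);

	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (lfd < 0 ||
	    bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(lfd, 16) < 0 ||
	    getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
		return -1;

	for (i = 0; i < 3; i++) {
		int c = socket(AF_INET, SOCK_STREAM, 0);

		if (connect(c, (struct sockaddr *)&addr, sizeof(addr)) < 0)
			return -1;
		close(c);	/* handshake done; still in accept queue */
	}
	got = drain_accept(lfd);
	close(lfd);
	return got;
}
```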
Update io_accept_prep() to enable multishot mode for accept operation.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index eb81d37dce78..34612646ae3c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4861,6 +4861,7 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_accept *accept = &req->accept;
+	bool is_multishot;
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
@@ -4872,14 +4873,23 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	accept->flags = READ_ONCE(sqe->accept_flags);
 	accept->nofile = rlimit(RLIMIT_NOFILE);
 
+	is_multishot = accept->flags & IORING_ACCEPT_MULTISHOT;
+	if (is_multishot && (req->flags & REQ_F_FORCE_ASYNC))
+		return -EINVAL;
+
 	accept->file_slot = READ_ONCE(sqe->file_index);
 	if (accept->file_slot && ((req->open.how.flags & O_CLOEXEC) ||
-				  (accept->flags & SOCK_CLOEXEC)))
+				  (accept->flags & SOCK_CLOEXEC) || is_multishot))
 		return -EINVAL;
-	if (accept->flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (accept->flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | IORING_ACCEPT_MULTISHOT))
 		return -EINVAL;
 	if (SOCK_NONBLOCK != O_NONBLOCK && (accept->flags & SOCK_NONBLOCK))
 		accept->flags = (accept->flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
+	if (is_multishot) {
+		req->flags |= REQ_F_APOLL_MULTISHOT;
+		accept->flags &= ~IORING_ACCEPT_MULTISHOT;
+	}
+
 	return 0;
 }
--
2.24.4
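[Editorial note] The flag policy io_accept_prep() enforces after this patch can be summarized as a userspace model. IORING_ACCEPT_MULTISHOT's value is taken from patch 2 of this series; the `req->open.how.flags & O_CLOEXEC` check is omitted for brevity:

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>

#define IORING_ACCEPT_MULTISHOT (1u << 0)	/* from patch 2 */

/*
 * Model of the checks this patch adds to io_accept_prep():
 *  - multishot cannot be forced async (it relies on fast poll)
 *  - multishot cannot use a fixed file slot
 *  - only known accept4(2) flags plus the multishot bit are allowed
 * Returns 0 if the combination would be accepted.
 */
static int accept_prep_check(unsigned int flags, unsigned int file_slot,
			     bool force_async)
{
	bool is_multishot = flags & IORING_ACCEPT_MULTISHOT;

	if (is_multishot && force_async)
		return -EINVAL;
	if (file_slot && ((flags & SOCK_CLOEXEC) || is_multishot))
		return -EINVAL;
	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | IORING_ACCEPT_MULTISHOT))
		return -EINVAL;
	return 0;
}
```

The prep code also strips IORING_ACCEPT_MULTISHOT out of `accept->flags` before issue, so the bit never reaches `__sys_accept4_file()`.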
On 2021/9/3 7:00 PM, Hao Xu wrote:
> Let multishot support multishot mode, currently only add accept as its
        ^ fast poll
> first comsumer.
> theoretical analysis:
>   1) when connections come in fast
>     - singleshot:
>               add accept sqe(userpsace) --> accept inline
>                               ^                 |
>                               |-----------------|
>     - multishot:
>               add accept sqe(userspace) --> accept inline
>                                 ^                |
>                                 |--*--|
>
>       we do accept repeatedly in * place until get EAGAIN
>
>   2) when connections come in at a low pressure
>      similar thing like 1), we reduce a lot of userspace-kernel context
>      switch and useless vfs_poll()
>
>
> tests:
> Did some tests, which goes in this way:
>
>       server    client(multiple)
>       accept    connect
>       read      write
>       write     read
>       close     close
>
> Basically, raise up a number of clients(on same machine with server) to
> connect to the server, and then write some data to it, the server will
> write those data back to the client after it receives them, and then
> close the connection after write return. Then the client will read the
> data and then close the connection. Here I test 10000 clients connect
> one server, data size 128 bytes. And each client has a go routine for
> it, so they come to the server in short time.
> test 20 times before/after this patchset, time spent:(unit cycle, which
> is the return value of clock())
> before:
> 1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
> +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
> +1934226+1914385)/20.0 = 1927633.75
> after:
> 1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
> +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
> +1871324+1940803)/20.0 = 1894750.45
>
> (1927633.75 - 1894750.45) / 1927633.75 = 1.65%
>
> Higher pressure like 10^5 connections causes problems since ports are
> in use.
> I'll test the cancellation path and try to lift pressure to much high
> level to see if numbers get better. Sent this early for comments and
> suggestions.
>
> Hao Xu (6):
>    io_uring: enhance flush completion logic
>    io_uring: add IORING_ACCEPT_MULTISHOT for accept
>    io_uring: add REQ_F_APOLL_MULTISHOT for requests
>    io_uring: let fast poll support multishot
>    io_uring: implement multishot mode for accept
>    io_uring: enable multishot mode for accept
>
>   fs/io_uring.c                 | 72 +++++++++++++++++++++++++++++------
>   include/uapi/linux/io_uring.h |  4 ++
>   2 files changed, 64 insertions(+), 12 deletions(-)
>
On 9/3/21 12:00 PM, Hao Xu wrote:
> Though currently refcount of a req is always one when we flush inline

It can be refcounted and != 1. E.g. poll requests, or consider
that tw also flushes, and you may have a read that goes to apoll
and then get tw resubmitted from io_async_task_func(). And other
cases.

> completions, but still a chance there will be exception in the future.
> Enhance the flush logic to make sure we maintain compl_nr correctly.

See below

>
> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> ---
>
> we need to either removing the if check to claim clearly that the req's
> refcount is 1 or adding this patch's logic.
>
> fs/io_uring.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 2bde732a1183..c48d43207f57 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2291,7 +2291,7 @@ static void io_submit_flush_completions(struct io_ring_ctx *ctx)
>  	__must_hold(&ctx->uring_lock)
>  {
>  	struct io_submit_state *state = &ctx->submit_state;
> -	int i, nr = state->compl_nr;
> +	int i, nr = state->compl_nr, remain = 0;
>  	struct req_batch rb;
>
>  	spin_lock(&ctx->completion_lock);
> @@ -2311,10 +2311,12 @@ static void io_submit_flush_completions(struct io_ring_ctx *ctx)
>
>  		if (req_ref_put_and_test(req))
>  			io_req_free_batch(&rb, req, &ctx->submit_state);
> +		else
> +			state->compl_reqs[remain++] = state->compl_reqs[i];

Our ref is dropped, and something else holds another reference. That
"something else" is responsible to free the request once it put the last
reference. This chunk would make the following io_submit_flush_completions()
to underflow refcount and double free.

>  	}
>
>  	io_req_free_batch_finish(ctx, &rb);
> -	state->compl_nr = 0;
> +	state->compl_nr = remain;
>  }
>
>  /*

--
Pavel Begunkov
On 2021/9/3 7:42 PM, Pavel Begunkov wrote:
> On 9/3/21 12:00 PM, Hao Xu wrote:
>> Though currently refcount of a req is always one when we flush inline
>
> It can be refcounted and != 1. E.g. poll requests, or consider
It seems poll requests don't leverage comp cache, do I miss anything?
> that tw also flushes, and you may have a read that goes to apoll
> and then get tw resubmitted from io_async_task_func(). And other
when it goes to apoll, (say no double wait entry) ref is 1, then read
completes inline and then the only ref is dropped by flush.
> cases.
>
>> completions, but still a chance there will be exception in the future.
>> Enhance the flush logic to make sure we maintain compl_nr correctly.
>
> See below
>
>>
>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>> ---
>>
>> we need to either removing the if check to claim clearly that the req's
>> refcount is 1 or adding this patch's logic.
>>
>> fs/io_uring.c | 6 ++++--
>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 2bde732a1183..c48d43207f57 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -2291,7 +2291,7 @@ static void io_submit_flush_completions(struct io_ring_ctx *ctx)
>>  	__must_hold(&ctx->uring_lock)
>>  {
>>  	struct io_submit_state *state = &ctx->submit_state;
>> -	int i, nr = state->compl_nr;
>> +	int i, nr = state->compl_nr, remain = 0;
>>  	struct req_batch rb;
>>
>>  	spin_lock(&ctx->completion_lock);
>> @@ -2311,10 +2311,12 @@ static void io_submit_flush_completions(struct io_ring_ctx *ctx)
>>
>>  		if (req_ref_put_and_test(req))
>>  			io_req_free_batch(&rb, req, &ctx->submit_state);
>> +		else
>> +			state->compl_reqs[remain++] = state->compl_reqs[i];
>
> Our ref is dropped, and something else holds another reference. That
> "something else" is responsible to free the request once it put the last
> reference. This chunk would make the following io_submit_flush_completions()
> to underflow refcount and double free.
True, I see. Thanks Pavel.
>
>>  	}
>>
>>  	io_req_free_batch_finish(ctx, &rb);
>> -	state->compl_nr = 0;
>> +	state->compl_nr = remain;
>>  }
>>
>>  /*
>>
>
On 9/3/21 1:08 PM, Hao Xu wrote:
> On 2021/9/3 7:42 PM, Pavel Begunkov wrote:
>> On 9/3/21 12:00 PM, Hao Xu wrote:
>>> Though currently refcount of a req is always one when we flush inline
>>
>> It can be refcounted and != 1. E.g. poll requests, or consider
> It seems poll requests don't leverage comp cache, do I miss anything?

Hmm, looks so. Not great that it doesn't, but probably it's because
of trying to submit next reqs right in io_poll_task_func().

I'll be pushing for some changes around tw, with it should be easy
to hook poll completion batching with no drawbacks. Would be great
if you will be willing to take a shot on it.

>> that tw also flushes, and you may have a read that goes to apoll
>> and then get tw resubmitted from io_async_task_func(). And other
> when it goes to apoll, (say no double wait entry) ref is 1, then read
> completes inline and then the only ref is droped by flush.

Yep, but some might have double entry. It also will have elevated
refs if there was a linked timeout. Another case is io-wq, which
takes a reference, and if iowq doesn't put it for a while (e.g. got
rescheduled) but requests goes to tw (resubmit, poll, whatever)
you have the same picture.

>> cases.
>>
>>> completions, but still a chance there will be exception in the future.
>>> Enhance the flush logic to make sure we maintain compl_nr correctly.
>>
>> See below
>>
>>>
>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>> ---
>>>
>>> we need to either removing the if check to claim clearly that the req's
>>> refcount is 1 or adding this patch's logic.
>>>
>>> fs/io_uring.c | 6 ++++--
>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 2bde732a1183..c48d43207f57 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -2291,7 +2291,7 @@ static void io_submit_flush_completions(struct io_ring_ctx *ctx)
>>>  	__must_hold(&ctx->uring_lock)
>>>  {
>>>  	struct io_submit_state *state = &ctx->submit_state;
>>> -	int i, nr = state->compl_nr;
>>> +	int i, nr = state->compl_nr, remain = 0;
>>>  	struct req_batch rb;
>>>  	spin_lock(&ctx->completion_lock);
>>> @@ -2311,10 +2311,12 @@ static void io_submit_flush_completions(struct io_ring_ctx *ctx)
>>>  	if (req_ref_put_and_test(req))
>>>  		io_req_free_batch(&rb, req, &ctx->submit_state);
>>> +	else
>>> +		state->compl_reqs[remain++] = state->compl_reqs[i];
>>
>> Our ref is dropped, and something else holds another reference. That
>> "something else" is responsible to free the request once it put the last
>> reference. This chunk would make the following io_submit_flush_completions()
>> to underflow refcount and double free.
> True, I see. Thanks Pavel.
>>
>>>  	}
>>>  	io_req_free_batch_finish(ctx, &rb);
>>> -	state->compl_nr = 0;
>>> +	state->compl_nr = remain;
>>>  }
>>>  /*
>>>
>>
>

--
Pavel Begunkov
On 2021/9/3 8:27 PM, Pavel Begunkov wrote:
> On 9/3/21 1:08 PM, Hao Xu wrote:
>> On 2021/9/3 7:42 PM, Pavel Begunkov wrote:
>>> On 9/3/21 12:00 PM, Hao Xu wrote:
>>>> Though currently refcount of a req is always one when we flush inline
>>>
>>> It can be refcounted and != 1. E.g. poll requests, or consider
>> It seems poll requests don't leverage comp cache, do I miss anything?
>
> Hmm, looks so. Not great that it doesn't, but probably it's because
> of trying to submit next reqs right in io_poll_task_func().
>
> I'll be pushing for some changes around tw, with it should be easy
> to hook poll completion batching with no drawbacks. Would be great
> if you will be willing to take a shot on it.
Sure, I'll take a look.
>
>>> that tw also flushes, and you may have a read that goes to apoll
>>> and then get tw resubmitted from io_async_task_func(). And other
>> when it goes to apoll, (say no double wait entry) ref is 1, then read
>> completes inline and then the only ref is droped by flush.
>
> Yep, but some might have double entry. It also will have elevated
double entry should be fine.
> refs if there was a linked timeout. Another case is io-wq, which
Haven't dug into linked timeout yet.
> takes a reference, and if iowq doesn't put it for a while (e.g. got
> rescheduled) but requests goes to tw (resubmit, poll, whatever)
> you have the same picture.
Gotcha, thanks.
>
>>> cases.
>>>
>>>> completions, but still a chance there will be exception in the future.
>>>> Enhance the flush logic to make sure we maintain compl_nr correctly.
>>>
>>> See below
>>>
>>>>
>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>>> ---
>>>>
>>>> we need to either removing the if check to claim clearly that the req's
>>>> refcount is 1 or adding this patch's logic.
>>>>
>>>> fs/io_uring.c | 6 ++++--
>>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index 2bde732a1183..c48d43207f57 100644
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -2291,7 +2291,7 @@ static void io_submit_flush_completions(struct io_ring_ctx *ctx)
>>>>  	__must_hold(&ctx->uring_lock)
>>>>  {
>>>>  	struct io_submit_state *state = &ctx->submit_state;
>>>> -	int i, nr = state->compl_nr;
>>>> +	int i, nr = state->compl_nr, remain = 0;
>>>>  	struct req_batch rb;
>>>>  	spin_lock(&ctx->completion_lock);
>>>> @@ -2311,10 +2311,12 @@ static void io_submit_flush_completions(struct io_ring_ctx *ctx)
>>>>  	if (req_ref_put_and_test(req))
>>>>  		io_req_free_batch(&rb, req, &ctx->submit_state);
>>>> +	else
>>>> +		state->compl_reqs[remain++] = state->compl_reqs[i];
>>>
>>> Our ref is dropped, and something else holds another reference. That
>>> "something else" is responsible to free the request once it put the last
>>> reference. This chunk would make the following io_submit_flush_completions()
>>> to underflow refcount and double free.
>> True, I see. Thanks Pavel.
>>>
>>>>  	}
>>>>  	io_req_free_batch_finish(ctx, &rb);
>>>> -	state->compl_nr = 0;
>>>> +	state->compl_nr = remain;
>>>>  }
>>>>  /*
>>>>
>>>
>>
>
On 9/3/21 5:00 AM, Hao Xu wrote:
> Update io_accept_prep() to enable multishot mode for accept operation.
>
> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> ---
> fs/io_uring.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index eb81d37dce78..34612646ae3c 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -4861,6 +4861,7 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
> static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> {
> struct io_accept *accept = &req->accept;
> + bool is_multishot;
>
> if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
> return -EINVAL;
> @@ -4872,14 +4873,23 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> accept->flags = READ_ONCE(sqe->accept_flags);
> accept->nofile = rlimit(RLIMIT_NOFILE);
>
> + is_multishot = accept->flags & IORING_ACCEPT_MULTISHOT;
> + if (is_multishot && (req->flags & REQ_F_FORCE_ASYNC))
> + return -EINVAL;
I like the idea itself as I think it makes a lot of sense to just have
an accept sitting there and generating multiple CQEs, but I'm a bit
puzzled by how you pass it in. accept->flags is the accept4(2) flags,
which can currently be:
SOCK_NONBLOCK
SOCK_CLOEXEC
While there's not any overlap here, that is mostly by chance I think. A
cleaner separation is needed here, what happens if some other accept4(2)
flag is enabled and it just happens to be the same as
IORING_ACCEPT_MULTISHOT?
--
Jens Axboe
On 2021/9/4 12:29 AM, Jens Axboe wrote:
> On 9/3/21 5:00 AM, Hao Xu wrote:
>> Update io_accept_prep() to enable multishot mode for accept operation.
>>
>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>> ---
>>   fs/io_uring.c | 14 ++++++++++++--
>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index eb81d37dce78..34612646ae3c 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -4861,6 +4861,7 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
>>   static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>   {
>>   	struct io_accept *accept = &req->accept;
>> +	bool is_multishot;
>>
>>   	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
>>   		return -EINVAL;
>> @@ -4872,14 +4873,23 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>   	accept->flags = READ_ONCE(sqe->accept_flags);
>>   	accept->nofile = rlimit(RLIMIT_NOFILE);
>>
>> +	is_multishot = accept->flags & IORING_ACCEPT_MULTISHOT;
>> +	if (is_multishot && (req->flags & REQ_F_FORCE_ASYNC))
>> +		return -EINVAL;
>
> I like the idea itself as I think it makes a lot of sense to just have
> an accept sitting there and generating multiple CQEs, but I'm a bit
> puzzled by how you pass it in. accept->flags is the accept4(2) flags,
> which can currently be:
>
> SOCK_NONBLOCK
> SOCK_CLOEXEC
>
> While there's not any overlap here, that is mostly by chance I think. A
> cleaner separation is needed here, what happens if some other accept4(2)
> flag is enabled and it just happens to be the same as
> IORING_ACCEPT_MULTISHOT?
Makes sense; how about a new IOSQE flag? Though I saw there are not many
entries left there.
>
On 9/4/21 9:34 AM, Hao Xu wrote:
> 在 2021/9/4 上午12:29, Jens Axboe 写道:
>> On 9/3/21 5:00 AM, Hao Xu wrote:
>>> Update io_accept_prep() to enable multishot mode for accept operation.
>>>
>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>> ---
>>> fs/io_uring.c | 14 ++++++++++++--
>>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index eb81d37dce78..34612646ae3c 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -4861,6 +4861,7 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
>>> static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>> {
>>> struct io_accept *accept = &req->accept;
>>> + bool is_multishot;
>>>
>>> if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
>>> return -EINVAL;
>>> @@ -4872,14 +4873,23 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>> accept->flags = READ_ONCE(sqe->accept_flags);
>>> accept->nofile = rlimit(RLIMIT_NOFILE);
>>>
>>> + is_multishot = accept->flags & IORING_ACCEPT_MULTISHOT;
>>> + if (is_multishot && (req->flags & REQ_F_FORCE_ASYNC))
>>> + return -EINVAL;
>>
>> I like the idea itself as I think it makes a lot of sense to just have
>> an accept sitting there and generating multiple CQEs, but I'm a bit
>> puzzled by how you pass it in. accept->flags is the accept4(2) flags,
>> which can currently be:
>>
>> SOCK_NONBLOCK
>> SOCK_CLOEXEC
>>
>> While there's not any overlap here, that is mostly by chance I think. A
>> cleaner separation is needed here, what happens if some other accept4(2)
>> flag is enabled and it just happens to be the same as
>> IORING_ACCEPT_MULTISHOT?
> Make sense, how about a new IOSQE flag, I saw not many
> entries left there.
Not quite sure what the best approach would be... The mshot flag only
makes sense for a few request types, so a bit of a shame to have to
waste an IOSQE flag on it. Especially when the flags otherwise passed in
are so sparse, there's plenty of bits there.
Hence while it may not be the prettiest, perhaps using accept->flags is
ok and we just need some careful code to ensure that we never have any
overlap.
Probably best to solve that issue in include/linux/net.h, ala:
/* Flags for socket, socketpair, accept4 */
#define SOCK_CLOEXEC O_CLOEXEC
#ifndef SOCK_NONBLOCK
#define SOCK_NONBLOCK O_NONBLOCK
#endif
/*
* Only used for io_uring accept4, and deliberately chosen to overlap
* with the O_* file bits for read/write mode so we won't risk overlap
* other flags added for socket/socketpair/accept4 use in the future.
*/
#define SOCK_URING_MULTISHOT 00000001
which should be OK, as these overlap with the O_* filespace and the
read/write bits are at the start of that space.
Should be done as a prep patch and sent out to netdev as well, so we can
get their sign-off on this "hack". If we can get that done, then we have
our flag and we can just stuff it in accept->flags as long as we clear
it before calling into accept from io_uring.
--
Jens Axboe
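[Editorial note] The non-overlap property Jens relies on is easy to check mechanically: `SOCK_*` flags live in the `O_*` flag space, while the low access-mode bits (`O_ACCMODE`) are never used by socket flags. Under the glibc values (`SOCK_URING_MULTISHOT` here is the flag proposed above, not an existing uapi constant):

```c
#include <fcntl.h>
#include <sys/socket.h>

/*
 * Proposed above for io_uring's accept4 use only; deliberately placed
 * in the O_ACCMODE read/write-mode bits, which socket/socketpair/
 * accept4 flags will never occupy.
 */
#define SOCK_URING_MULTISHOT 00000001

/* Would the new bit collide with the existing accept4(2) flags? */
static int overlaps_existing_flags(void)
{
	return (SOCK_URING_MULTISHOT & (SOCK_CLOEXEC | SOCK_NONBLOCK)) != 0;
}
```

Since `SOCK_CLOEXEC == O_CLOEXEC` and `SOCK_NONBLOCK == O_NONBLOCK`, and neither touches `O_ACCMODE`, the check comes out clean.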
On 9/3/21 12:00 PM, Hao Xu wrote:
> Refactor io_accept() to support multishot mode.

Multishot with the fixed/direct mode sounds weird (considering that
the slot index is specified by userspace), let's forbid them.

io_accept_prep() {
	if (accept->file_slot && (flags & MULTISHOT))
		return -EINVAL;
	...
}

> theoretical analysis:
> 1) when connections come in fast
>    - singleshot:
>              add accept sqe(userpsace) --> accept inline
>                              ^                 |
>                              |-----------------|
>    - multishot:
>              add accept sqe(userspace) --> accept inline
>                                ^                |
>                                |--*--|
>
> we do accept repeatedly in * place until get EAGAIN
>
> 2) when connections come in at a low pressure
> similar thing like 1), we reduce a lot of userspace-kernel context
> switch and useless vfs_poll()
>
> tests:
> Did some tests, which goes in this way:
>
>       server    client(multiple)
>       accept    connect
>       read      write
>       write     read
>       close     close
>
> Basically, raise up a number of clients(on same machine with server) to
> connect to the server, and then write some data to it, the server will
> write those data back to the client after it receives them, and then
> close the connection after write return. Then the client will read the
> data and then close the connection. Here I test 10000 clients connect
> one server, data size 128 bytes. And each client has a go routine for
> it, so they come to the server in short time.
> test 20 times before/after this patchset, time spent:(unit cycle, which
> is the return value of clock())
> before:
> 1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
> +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
> +1934226+1914385)/20.0 = 1927633.75
> after:
> 1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
> +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
> +1871324+1940803)/20.0 = 1894750.45
>
> (1927633.75 - 1894750.45) / 1927633.75 = 1.65%
>
> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> ---
>
> not sure if we should cancel it when io_cqring_fill_event() reurn false
>
> fs/io_uring.c | 34 +++++++++++++++++++++++++++++-----
> 1 file changed, 29 insertions(+), 5 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index dae7044e0c24..eb81d37dce78 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -4885,16 +4885,18 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>
>  static int io_accept(struct io_kiocb *req, unsigned int issue_flags)
>  {
> +	struct io_ring_ctx *ctx = req->ctx;
>  	struct io_accept *accept = &req->accept;
>  	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>  	unsigned int file_flags = force_nonblock ? O_NONBLOCK : 0;
>  	bool fixed = !!accept->file_slot;
>  	struct file *file;
> -	int ret, fd;
> +	int ret, ret2 = 0, fd;
>
>  	if (req->file->f_flags & O_NONBLOCK)
>  		req->flags |= REQ_F_NOWAIT;
>
> +retry:
>  	if (!fixed) {
>  		fd = __get_unused_fd_flags(accept->flags, accept->nofile);
>  		if (unlikely(fd < 0))
> @@ -4906,20 +4908,42 @@ static int io_accept(struct io_kiocb *req, unsigned int issue_flags)
>  		if (!fixed)
>  			put_unused_fd(fd);
>  		ret = PTR_ERR(file);
> -		if (ret == -EAGAIN && force_nonblock)
> -			return -EAGAIN;
> +		if (ret == -EAGAIN && force_nonblock) {
> +			if ((req->flags & (REQ_F_APOLL_MULTISHOT | REQ_F_POLLED)) ==
> +			    (REQ_F_APOLL_MULTISHOT | REQ_F_POLLED))
> +				ret = 0;
> +			return ret;
> +		}
>  		if (ret == -ERESTARTSYS)
>  			ret = -EINTR;
>  		req_set_fail(req);
>  	} else if (!fixed) {
>  		fd_install(fd, file);
>  		ret = fd;
> +		/*
> +		 * if it's in multishot mode, let's return -EAGAIN to make it go
> +		 * into fast poll path
> +		 */
> +		if ((req->flags & REQ_F_APOLL_MULTISHOT) && force_nonblock &&
> +		    !(req->flags & REQ_F_POLLED))
> +			ret2 = -EAGAIN;
>  	} else {
>  		ret = io_install_fixed_file(req, file, issue_flags,
>  					    accept->file_slot - 1);
>  	}
> -	__io_req_complete(req, issue_flags, ret, 0);
> -	return 0;
> +
> +	if (req->flags & REQ_F_APOLL_MULTISHOT) {
> +		spin_lock(&ctx->completion_lock);
> +		if (io_cqring_fill_event(ctx, req->user_data, ret, 0)) {
> +			io_commit_cqring(ctx);
> +			ctx->cq_extra++;
> +		}
> +		spin_unlock(&ctx->completion_lock);
> +		goto retry;
> +	} else {
> +		__io_req_complete(req, issue_flags, ret, 0);
> +	}
> +	return ret2;
>  }
>
>  static int io_connect_prep_async(struct io_kiocb *req)
>

--
Pavel Begunkov
On 9/4/21 11:39 PM, Pavel Begunkov wrote:
> On 9/3/21 12:00 PM, Hao Xu wrote:
>> Refactor io_accept() to support multishot mode.
> 
> Multishot with the fixed/direct mode sounds weird (considering that
> the slot index is specified by userspace), let's forbid them.
> 
> io_accept_prep() {
> 	if (accept->file_slot && (flags & MULTISHOT))
> 		return -EINVAL;
> 	...
> }

Ah, never mind, it's already there in 6/6

[...]

-- 
Pavel Begunkov
On 9/3/21 12:00 PM, Hao Xu wrote:
> Update io_accept_prep() to enable multishot mode for accept operation.
> 
> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> ---
>  fs/io_uring.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index eb81d37dce78..34612646ae3c 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -4861,6 +4861,7 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
>  static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  {
>  	struct io_accept *accept = &req->accept;
> +	bool is_multishot;
>  
>  	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
>  		return -EINVAL;
> @@ -4872,14 +4873,23 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  	accept->flags = READ_ONCE(sqe->accept_flags);
>  	accept->nofile = rlimit(RLIMIT_NOFILE);
>  
> +	is_multishot = accept->flags & IORING_ACCEPT_MULTISHOT;
> +	if (is_multishot && (req->flags & REQ_F_FORCE_ASYNC))
> +		return -EINVAL;

Why is REQ_F_FORCE_ASYNC not allowed? It doesn't sound like there
should be any problem; it would just eventually go looping
poll_wait + tw

> +
>  	accept->file_slot = READ_ONCE(sqe->file_index);
>  	if (accept->file_slot && ((req->open.how.flags & O_CLOEXEC) ||
> -				  (accept->flags & SOCK_CLOEXEC)))
> +				  (accept->flags & SOCK_CLOEXEC) || is_multishot))
>  		return -EINVAL;
> -	if (accept->flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
> +	if (accept->flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | IORING_ACCEPT_MULTISHOT))
>  		return -EINVAL;
>  	if (SOCK_NONBLOCK != O_NONBLOCK && (accept->flags & SOCK_NONBLOCK))
>  		accept->flags = (accept->flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
> +	if (is_multishot) {
> +		req->flags |= REQ_F_APOLL_MULTISHOT;
> +		accept->flags &= ~IORING_ACCEPT_MULTISHOT;
> +	}
> +
>  	return 0;
>  }

-- 
Pavel Begunkov
On 9/4/21 7:40 PM, Jens Axboe wrote:
> On 9/4/21 9:34 AM, Hao Xu wrote:
>> On 9/4/21 12:29 AM, Jens Axboe wrote:
>>> On 9/3/21 5:00 AM, Hao Xu wrote:
>>>> Update io_accept_prep() to enable multishot mode for accept operation.
[...]
>>> I like the idea itself as I think it makes a lot of sense to just have
>>> an accept sitting there and generating multiple CQEs, but I'm a bit
>>> puzzled by how you pass it in. accept->flags is the accept4(2) flags,
>>> which can currently be:
>>>
>>> SOCK_NONBLOCK
>>> SOCK_CLOEXEC
>>>
>>> While there's not any overlap here, that is mostly by chance I think. A
>>> cleaner separation is needed here; what happens if some other accept4(2)
>>> flag is enabled and it just happens to be the same as
>>> IORING_ACCEPT_MULTISHOT?
>> Makes sense; how about a new IOSQE flag? I saw not many
>> entries left there.
> 
> Not quite sure what the best approach would be... The mshot flag only
> makes sense for a few request types, so it's a bit of a shame to have to
> waste an IOSQE flag on it. Especially when the flags otherwise passed in
> are so sparse, there's plenty of bits there.
> 
> Hence while it may not be the prettiest, perhaps using accept->flags is
> ok and we just need some careful code to ensure that we never have any
> overlap.

Or we can alias with some of the almost-never-used fields like
->ioprio or ->buf_index.

> Probably best to solve that issue in include/linux/net.h, ala:
> 
> /* Flags for socket, socketpair, accept4 */
> #define SOCK_CLOEXEC	O_CLOEXEC
> #ifndef SOCK_NONBLOCK
> #define SOCK_NONBLOCK	O_NONBLOCK
> #endif
> 
> /*
>  * Only used for io_uring accept4, and deliberately chosen to overlap
>  * with the O_* file bits for read/write mode so we won't risk overlapping
>  * other flags added for socket/socketpair/accept4 use in the future.
>  */
> #define SOCK_URING_MULTISHOT	00000001
> 
> which should be OK, as these overlap with the O_* filespace and the
> read/write bits are at the start of that space.
> 
> Should be done as a prep patch and sent out to netdev as well, so we can
> get their sign-off on this "hack". If we can get that done, then we have
> our flag and we can just stuff it in accept->flags as long as we clear
> it before calling into accept from io_uring.

-- 
Pavel Begunkov
On 9/5/21 6:43 AM, Pavel Begunkov wrote:
> On 9/3/21 12:00 PM, Hao Xu wrote:
>> Update io_accept_prep() to enable multishot mode for accept operation.
[...]
>> +	is_multishot = accept->flags & IORING_ACCEPT_MULTISHOT;
>> +	if (is_multishot && (req->flags & REQ_F_FORCE_ASYNC))
>> +		return -EINVAL;
> 
> Why is REQ_F_FORCE_ASYNC not allowed? It doesn't sound like there
> should be any problem; it would just eventually go looping
> poll_wait + tw
Hmm.. the arm_poll facility is only on the io_submit_sqes() path. If
FORCE_ASYNC is set, the req goes to io_wq_submit_work --> io_issue_sqe.
Moreover, I guess theoretically poll-based retry and async io-wq are
two ways for users to handle their sqes; it may not be sane to go down
the poll-based retry path when a user has already forcibly chosen io-wq
as their preferred way.
[...]
On 9/5/21 6:46 AM, Pavel Begunkov wrote:
> On 9/4/21 7:40 PM, Jens Axboe wrote:
>> On 9/4/21 9:34 AM, Hao Xu wrote:
[...]
>> Hence while it may not be the prettiest, perhaps using accept->flags is
>> ok and we just need some careful code to ensure that we never have any
>> overlap.
> 
> Or we can alias with some of the almost-never-used fields like
> ->ioprio or ->buf_index.
I think leveraging the highest bit of ioprio may be a good idea?
[...]
On 9/5/21 7:25 AM, Hao Xu wrote:
> On 9/5/21 6:43 AM, Pavel Begunkov wrote:
>> On 9/3/21 12:00 PM, Hao Xu wrote:
>>> Update io_accept_prep() to enable multishot mode for accept operation.
[...]
>> Why is REQ_F_FORCE_ASYNC not allowed? It doesn't sound like there
>> should be any problem; it would just eventually go looping
>> poll_wait + tw
> Hmm.. the arm_poll facility is only on the io_submit_sqes() path. If
> FORCE_ASYNC is set, the req goes to io_wq_submit_work --> io_issue_sqe.
> Moreover, I guess theoretically poll-based retry and async io-wq are
> two ways for users to handle their sqes; it may not be sane to go down
> the poll-based retry path when a user has already forcibly chosen io-wq
> as their preferred way.

The flag is just a hint, and there are already plenty of ways to trick
requests into tw. Forbidding it would be inconsistent.

-- 
Pavel Begunkov
On 9/4/21 4:46 PM, Pavel Begunkov wrote:
> On 9/4/21 7:40 PM, Jens Axboe wrote:
>> On 9/4/21 9:34 AM, Hao Xu wrote:
>>> 在 2021/9/4 上午12:29, Jens Axboe 写道:
>>>> On 9/3/21 5:00 AM, Hao Xu wrote:
>>>>> Update io_accept_prep() to enable multishot mode for accept operation.
>>>>>
>>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>>>> ---
>>>>> fs/io_uring.c | 14 ++++++++++++--
>>>>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>> index eb81d37dce78..34612646ae3c 100644
>>>>> --- a/fs/io_uring.c
>>>>> +++ b/fs/io_uring.c
>>>>> @@ -4861,6 +4861,7 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
>>>>> static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>> {
>>>>> struct io_accept *accept = &req->accept;
>>>>> + bool is_multishot;
>>>>>
>>>>> if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
>>>>> return -EINVAL;
>>>>> @@ -4872,14 +4873,23 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>> accept->flags = READ_ONCE(sqe->accept_flags);
>>>>> accept->nofile = rlimit(RLIMIT_NOFILE);
>>>>>
>>>>> + is_multishot = accept->flags & IORING_ACCEPT_MULTISHOT;
>>>>> + if (is_multishot && (req->flags & REQ_F_FORCE_ASYNC))
>>>>> + return -EINVAL;
>>>>
>>>> I like the idea itself as I think it makes a lot of sense to just have
>>>> an accept sitting there and generating multiple CQEs, but I'm a bit
>>>> puzzled by how you pass it in. accept->flags is the accept4(2) flags,
>>>> which can currently be:
>>>>
>>>> SOCK_NONBLOCK
>>>> SOCK_CLOEXEC
>>>>
>>>> While there's not any overlap here, that is mostly by chance I think. A
>>>> cleaner separation is needed here, what happens if some other accept4(2)
>>>> flag is enabled and it just happens to be the same as
>>>> IORING_ACCEPT_MULTISHOT?
>>> Makes sense; how about a new IOSQE flag? I saw not many
>>> entries left there.
>>
>> Not quite sure what the best approach would be... The mshot flag only
>> makes sense for a few request types, so a bit of a shame to have to
>> waste an IOSQE flag on it. Especially when the flags otherwise passed in
>> are so sparse, there's plenty of bits there.
>>
>> Hence while it may not be the prettiest, perhaps using accept->flags is
>> ok and we just need some careful code to ensure that we never have any
>> overlap.
>
> Or we can alias with some of the almost-never-used fields like
> ->ioprio or ->buf_index.
It's not a bad idea, as long as we can safely use flags from e.g. ioprio
for cases where ioprio would never be used. In that sense it's probably
safer than using buf_index.
The alternative is, as has been brought up before, adding a flags2 and
reserving the last flag in ->flags to say "there are flags in flags2".
Not exactly super pretty either, but we'll need to extend them at some
point.
--
Jens Axboe
On 9/6/21 3:44 AM, Jens Axboe wrote:
> On 9/4/21 4:46 PM, Pavel Begunkov wrote:
>> On 9/4/21 7:40 PM, Jens Axboe wrote:
>>> On 9/4/21 9:34 AM, Hao Xu wrote:
[...]
>>> Hence while it may not be the prettiest, perhaps using accept->flags is
>>> ok and we just need some careful code to ensure that we never have any
>>> overlap.
>>
>> Or we can alias with some of the almost-never-used fields like
>> ->ioprio or ->buf_index.
> 
> It's not a bad idea, as long as we can safely use flags from e.g. ioprio
> for cases where ioprio would never be used. In that sense it's probably
> safer than using buf_index.
> 
> The alternative is, as has been brought up before, adding a flags2 and
> reserving the last flag in ->flags to say "there are flags in flags2".
May I ask when do we have to use a bit in ->flags to do this? My
understanding is to just leverage a __u8 in pad2[2] in io_uring_sqe to
deliver ext_flags.
> Not exactly super pretty either, but we'll need to extend them at some
> point.
On 9/6/21 4:26 PM, Hao Xu wrote:
> On 9/6/21 3:44 AM, Jens Axboe wrote:
>> On 9/4/21 4:46 PM, Pavel Begunkov wrote:
[...]
>> The alternative is, as has been brought up before, adding a flags2 and
>> reserving the last flag in ->flags to say "there are flags in flags2".
> May I ask when do we have to use a bit in ->flags to do this? My
          ^why
> understanding is to just leverage a __u8 in pad2[2] in io_uring_sqe to
> deliver ext_flags.
>> Not exactly super pretty either, but we'll need to extend them at some
>> point.
On 9/6/21 3:44 AM, Jens Axboe wrote:
> On 9/4/21 4:46 PM, Pavel Begunkov wrote:
>> On 9/4/21 7:40 PM, Jens Axboe wrote:
[...]
>> Or we can alias with some of the almost-never-used fields like
>> ->ioprio or ->buf_index.
> 
> It's not a bad idea, as long as we can safely use flags from e.g. ioprio
> for cases where ioprio would never be used. In that sense it's probably
> safer than using buf_index.
> 
> The alternative is, as has been brought up before, adding a flags2 and
> reserving the last flag in ->flags to say "there are flags in flags2".
> Not exactly super pretty either, but we'll need to extend them at some
> point.

I'm going to do it this way. There is one more thing we have to do:
extend req->flags too, since the number of REQ_F_* bits in use will
exceed 32 once we add sqe->ext_flags.
On 9/6/21 2:26 AM, Hao Xu wrote:
> 在 2021/9/6 上午3:44, Jens Axboe 写道:
>> On 9/4/21 4:46 PM, Pavel Begunkov wrote:
>>> On 9/4/21 7:40 PM, Jens Axboe wrote:
>>>> On 9/4/21 9:34 AM, Hao Xu wrote:
>>>>> On 2021/9/4 12:29 AM, Jens Axboe wrote:
>>>>>> On 9/3/21 5:00 AM, Hao Xu wrote:
>>>>>>> Update io_accept_prep() to enable multishot mode for accept operation.
>>>>>>>
>>>>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>>>>>> ---
>>>>>>> fs/io_uring.c | 14 ++++++++++++--
>>>>>>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>> index eb81d37dce78..34612646ae3c 100644
>>>>>>> --- a/fs/io_uring.c
>>>>>>> +++ b/fs/io_uring.c
>>>>>>> @@ -4861,6 +4861,7 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>> static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>> {
>>>>>>> struct io_accept *accept = &req->accept;
>>>>>>> + bool is_multishot;
>>>>>>>
>>>>>>> if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
>>>>>>> return -EINVAL;
>>>>>>> @@ -4872,14 +4873,23 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>> accept->flags = READ_ONCE(sqe->accept_flags);
>>>>>>> accept->nofile = rlimit(RLIMIT_NOFILE);
>>>>>>>
>>>>>>> + is_multishot = accept->flags & IORING_ACCEPT_MULTISHOT;
>>>>>>> + if (is_multishot && (req->flags & REQ_F_FORCE_ASYNC))
>>>>>>> + return -EINVAL;
>>>>>>
>>>>>> I like the idea itself as I think it makes a lot of sense to just have
>>>>>> an accept sitting there and generating multiple CQEs, but I'm a bit
>>>>>> puzzled by how you pass it in. accept->flags is the accept4(2) flags,
>>>>>> which can currently be:
>>>>>>
>>>>>> SOCK_NONBLOCK
>>>>>> SOCK_CLOEXEC
>>>>>>
>>>>>> While there's not any overlap here, that is mostly by chance I think. A
>>>>>> cleaner separation is needed here, what happens if some other accept4(2)
>>>>>> flag is enabled and it just happens to be the same as
>>>>>> IORING_ACCEPT_MULTISHOT?
>>>>> Make sense, how about a new IOSQE flag, I saw not many
>>>>> entries left there.
>>>>
>>>> Not quite sure what the best approach would be... The mshot flag only
>>>> makes sense for a few request types, so a bit of a shame to have to
>>>> waste an IOSQE flag on it. Especially when the flags otherwise passed in
>>>> are so sparse, there's plenty of bits there.
>>>>
>>>> Hence while it may not be the prettiest, perhaps using accept->flags is
>>>> ok and we just need some careful code to ensure that we never have any
>>>> overlap.
>>>
>>> Or we can alias with some of the almost-never-used fields like
>>> ->ioprio or ->buf_index.
>>
>> It's not a bad idea, as long as we can safely use flags from eg ioprio
>> for cases where ioprio would never be used. In that sense it's probably
>> safer than using buf_index.
>>
>> The alternative is, as has been brought up before, adding a flags2 and
>> reserving the last flag in ->flags to say "there are flags in flags2".
> May I ask when we would have to use a bit in ->flags to do this? My
> understanding is that we can just leverage a __u8 in pad2[2] of
> io_uring_sqe to deliver ext_flags.
We may not need to; it depends solely on whether or not we currently
check that pad for zero and return -EINVAL if it isn't. Otherwise you
can end up with the unfortunate side effect that applications will set
some extended flag on an older kernel and think that it works, but the
old kernel just ignores it as it isn't checked.
--
Jens Axboe
On 9/6/21 6:35 AM, Hao Xu wrote:
> On 2021/9/6 3:44 AM, Jens Axboe wrote:
>> On 9/4/21 4:46 PM, Pavel Begunkov wrote:
>>> On 9/4/21 7:40 PM, Jens Axboe wrote:
>>>> On 9/4/21 9:34 AM, Hao Xu wrote:
>>>>> On 2021/9/4 12:29 AM, Jens Axboe wrote:
>>>>>> On 9/3/21 5:00 AM, Hao Xu wrote:
>>>>>>> Update io_accept_prep() to enable multishot mode for accept operation.
>>>>>>>
>>>>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>>>>>> ---
>>>>>>> fs/io_uring.c | 14 ++++++++++++--
>>>>>>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>> index eb81d37dce78..34612646ae3c 100644
>>>>>>> --- a/fs/io_uring.c
>>>>>>> +++ b/fs/io_uring.c
>>>>>>> @@ -4861,6 +4861,7 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>> static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>> {
>>>>>>> struct io_accept *accept = &req->accept;
>>>>>>> + bool is_multishot;
>>>>>>>
>>>>>>> if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
>>>>>>> return -EINVAL;
>>>>>>> @@ -4872,14 +4873,23 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>> accept->flags = READ_ONCE(sqe->accept_flags);
>>>>>>> accept->nofile = rlimit(RLIMIT_NOFILE);
>>>>>>>
>>>>>>> + is_multishot = accept->flags & IORING_ACCEPT_MULTISHOT;
>>>>>>> + if (is_multishot && (req->flags & REQ_F_FORCE_ASYNC))
>>>>>>> + return -EINVAL;
>>>>>>
>>>>>> I like the idea itself as I think it makes a lot of sense to just have
>>>>>> an accept sitting there and generating multiple CQEs, but I'm a bit
>>>>>> puzzled by how you pass it in. accept->flags is the accept4(2) flags,
>>>>>> which can currently be:
>>>>>>
>>>>>> SOCK_NONBLOCK
>>>>>> SOCK_CLOEXEC
>>>>>>
>>>>>> While there's not any overlap here, that is mostly by chance I think. A
>>>>>> cleaner separation is needed here, what happens if some other accept4(2)
>>>>>> flag is enabled and it just happens to be the same as
>>>>>> IORING_ACCEPT_MULTISHOT?
>>>>> Make sense, how about a new IOSQE flag, I saw not many
>>>>> entries left there.
>>>>
>>>> Not quite sure what the best approach would be... The mshot flag only
>>>> makes sense for a few request types, so a bit of a shame to have to
>>>> waste an IOSQE flag on it. Especially when the flags otherwise passed in
>>>> are so sparse, there's plenty of bits there.
>>>>
>>>> Hence while it may not be the prettiest, perhaps using accept->flags is
>>>> ok and we just need some careful code to ensure that we never have any
>>>> overlap.
>>>
>>> Or we can alias with some of the almost-never-used fields like
>>> ->ioprio or ->buf_index.
>>
>> It's not a bad idea, as long as we can safely use flags from eg ioprio
>> for cases where ioprio would never be used. In that sense it's probably
>> safer than using buf_index.
>>
>> The alternative is, as has been brought up before, adding a flags2 and
>> reserving the last flag in ->flags to say "there are flags in flags2".
>> Not exactly super pretty either, but we'll need to extend them at some
>> point.
> I'm going to do it this way. There is another thing we have to do:
> extend req->flags too, since the flags already in use would exceed 32
> bits if we add sqe->ext_flags
As far as I can tell from a quick look, there's still plenty of flags
left for REQ_F additions, about 8 of them. Don't expand req->flags if we
can avoid it, just add some safeguards to ensure we don't mess up.
Since we haven't really been tight on req->flags before, there's also
some low hanging fruit there that will allow us to reclaim some of them
if we need to.
--
Jens Axboe
On 2021/9/6 9:31 PM, Jens Axboe wrote:
> On 9/6/21 6:35 AM, Hao Xu wrote:
>> [...]
>> I'm going to do it this way. There is another thing we have to do:
>> extend req->flags too, since the flags already in use would exceed 32
>> bits if we add sqe->ext_flags
>
> As far as I can tell from a quick look, there's still plenty of flags
> left for REQ_F additions, about 8 of them. Don't expand req->flags if we
> can avoid it, just add some safeguards to ensure we don't mess up.

Ah, sorry, I realised that I had reserved the 8 ext_flags bits right
after the first 8 bits of sqe->flags for convenience, and that's why I
thought there was not enough space in REQ_F. Thanks.

> Since we haven't really been tight on req->flags before, there's also
> some low hanging fruit there that will allow us to reclaim some of them
> if we need to.
On 9/6/21 1:35 PM, Hao Xu wrote:
> On 2021/9/6 3:44 AM, Jens Axboe wrote:
>> On 9/4/21 4:46 PM, Pavel Begunkov wrote:
>>> On 9/4/21 7:40 PM, Jens Axboe wrote:
>>>> On 9/4/21 9:34 AM, Hao Xu wrote:
>>>>> On 2021/9/4 12:29 AM, Jens Axboe wrote:
>>>>>> On 9/3/21 5:00 AM, Hao Xu wrote:
>>>>>>> Update io_accept_prep() to enable multishot mode for accept operation.
>>>>>>>
>>>>>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>>>>>> ---
>>>>>>> fs/io_uring.c | 14 ++++++++++++--
>>>>>>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>> index eb81d37dce78..34612646ae3c 100644
>>>>>>> --- a/fs/io_uring.c
>>>>>>> +++ b/fs/io_uring.c
>>>>>>> @@ -4861,6 +4861,7 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
>>>>>>> static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>> {
>>>>>>> struct io_accept *accept = &req->accept;
>>>>>>> + bool is_multishot;
>>>>>>> if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
>>>>>>> return -EINVAL;
>>>>>>> @@ -4872,14 +4873,23 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>> accept->flags = READ_ONCE(sqe->accept_flags);
>>>>>>> accept->nofile = rlimit(RLIMIT_NOFILE);
>>>>>>> + is_multishot = accept->flags & IORING_ACCEPT_MULTISHOT;
>>>>>>> + if (is_multishot && (req->flags & REQ_F_FORCE_ASYNC))
>>>>>>> + return -EINVAL;
>>>>>>
>>>>>> I like the idea itself as I think it makes a lot of sense to just have
>>>>>> an accept sitting there and generating multiple CQEs, but I'm a bit
>>>>>> puzzled by how you pass it in. accept->flags is the accept4(2) flags,
>>>>>> which can currently be:
>>>>>>
>>>>>> SOCK_NONBLOCK
>>>>>> SOCK_CLOEXEC
>>>>>>
>>>>>> While there's not any overlap here, that is mostly by chance I think. A
>>>>>> cleaner separation is needed here, what happens if some other accept4(2)
>>>>>> flag is enabled and it just happens to be the same as
>>>>>> IORING_ACCEPT_MULTISHOT?
>>>>> Make sense, how about a new IOSQE flag, I saw not many
>>>>> entries left there.
>>>>
>>>> Not quite sure what the best approach would be... The mshot flag only
>>>> makes sense for a few request types, so a bit of a shame to have to
>>>> waste an IOSQE flag on it. Especially when the flags otherwise passed in
>>>> are so sparse, there's plenty of bits there.
>>>>
>>>> Hence while it may not be the prettiest, perhaps using accept->flags is
>>>> ok and we just need some careful code to ensure that we never have any
>>>> overlap.
>>>
>>> Or we can alias with some of the almost-never-used fields like
>>> ->ioprio or ->buf_index.
>>
>> It's not a bad idea, as long as we can safely use flags from eg ioprio
>> for cases where ioprio would never be used. In that sense it's probably
>> safer than using buf_index.
>>
>> The alternative is, as has been brought up before, adding a flags2 and
>> reserving the last flag in ->flags to say "there are flags in flags2".
>> Not exactly super pretty either, but we'll need to extend them at some
>> point.
> I'm going to do it this way. There is another thing we have to do:
> extend req->flags too, since the flags already in use would exceed 32
> bits if we add sqe->ext_flags
We still have 2 bits left, and IIRC you wanted to take only 1 of them.
We don't need to extend it at the moment; it sounded to me like a plan
for the future. No extra trouble for now.
Anyway, I can't think of many requests working in this mode, and I think
sqe_flags should be taken only for features applicable to all (~most)
requests. Maybe we'd better fit it individually into accept in the end?
Sounds more plausible tbh.
p.s. yes, there is IOSQE_BUFFER_SELECT, but I don't think that was the
best solution, but in any case it's history.
--
Pavel Begunkov
On 9/4/21 11:40 PM, Pavel Begunkov wrote:
> On 9/4/21 11:39 PM, Pavel Begunkov wrote:
>> On 9/3/21 12:00 PM, Hao Xu wrote:
>>> Refactor io_accept() to support multishot mode.
>>
>> Multishot with the fixed/direct mode sounds weird (considering that
>> the slot index is specified by userspace), let's forbid them.
>>
>> io_accept_prep() {
>>         if (accept->file_slot && (flags & MULTISHOT))
>>                 return -EINVAL;
>>         ...
>> }
>
> Ah, never mind, it's already there in 6/6

btw, I think it would be less confusing if 5/6 and 6/6 were combined
into a single patch. But it's a matter of taste, I guess.

>>
>>> theoretical analysis:
>>> [...]
>>>
>>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>>> ---
>>>
>>> not sure if we should cancel it when io_cqring_fill_event() returns false
>>>
>>>  fs/io_uring.c | 34 +++++++++++++++++++++++++++++-----
>>>  1 file changed, 29 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index dae7044e0c24..eb81d37dce78 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -4885,16 +4885,18 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>  
>>>  static int io_accept(struct io_kiocb *req, unsigned int issue_flags)
>>>  {
>>> +	struct io_ring_ctx *ctx = req->ctx;
>>>  	struct io_accept *accept = &req->accept;
>>>  	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>>>  	unsigned int file_flags = force_nonblock ? O_NONBLOCK : 0;
>>>  	bool fixed = !!accept->file_slot;
>>>  	struct file *file;
>>> -	int ret, fd;
>>> +	int ret, ret2 = 0, fd;
>>>  
>>>  	if (req->file->f_flags & O_NONBLOCK)
>>>  		req->flags |= REQ_F_NOWAIT;
>>>  
>>> +retry:
>>>  	if (!fixed) {
>>>  		fd = __get_unused_fd_flags(accept->flags, accept->nofile);
>>>  		if (unlikely(fd < 0))
>>> @@ -4906,20 +4908,42 @@ static int io_accept(struct io_kiocb *req, unsigned int issue_flags)
>>>  		if (!fixed)
>>>  			put_unused_fd(fd);
>>>  		ret = PTR_ERR(file);
>>> -		if (ret == -EAGAIN && force_nonblock)
>>> -			return -EAGAIN;
>>> +		if (ret == -EAGAIN && force_nonblock) {
>>> +			if ((req->flags & (REQ_F_APOLL_MULTISHOT | REQ_F_POLLED)) ==
>>> +			    (REQ_F_APOLL_MULTISHOT | REQ_F_POLLED))
>>> +				ret = 0;
>>> +			return ret;
>>> +		}
>>>  		if (ret == -ERESTARTSYS)
>>>  			ret = -EINTR;
>>>  		req_set_fail(req);
>>>  	} else if (!fixed) {
>>>  		fd_install(fd, file);
>>>  		ret = fd;
>>> +		/*
>>> +		 * if it's in multishot mode, let's return -EAGAIN to make it go
>>> +		 * into fast poll path
>>> +		 */
>>> +		if ((req->flags & REQ_F_APOLL_MULTISHOT) && force_nonblock &&
>>> +		    !(req->flags & REQ_F_POLLED))
>>> +			ret2 = -EAGAIN;
>>>  	} else {
>>>  		ret = io_install_fixed_file(req, file, issue_flags,
>>>  					    accept->file_slot - 1);
>>>  	}
>>> -	__io_req_complete(req, issue_flags, ret, 0);
>>> -	return 0;
>>> +
>>> +	if (req->flags & REQ_F_APOLL_MULTISHOT) {
>>> +		spin_lock(&ctx->completion_lock);
>>> +		if (io_cqring_fill_event(ctx, req->user_data, ret, 0)) {
>>> +			io_commit_cqring(ctx);
>>> +			ctx->cq_extra++;
>>> +		}
>>> +		spin_unlock(&ctx->completion_lock);
>>> +		goto retry;
>>> +	} else {
>>> +		__io_req_complete(req, issue_flags, ret, 0);
>>> +	}
>>> +	return ret2;
>>>  }
>>>  
>>>  static int io_connect_prep_async(struct io_kiocb *req)
>>>
>>

-- 
Pavel Begunkov
On 9/3/21 12:00 PM, Hao Xu wrote:
> For operations like accept, multishot is a useful feature, since we can
> reduce a number of accept sqe. Let's integrate it to fast poll, it may
> be good for other operations in the future.

__io_arm_poll_handler()           |
  -> vfs_poll()                   |
                                  | io_async_task_func() // post CQE
                                  | ...
                                  | do_apoll_rewait();
  -> continues after vfs_poll(),  |
     removing poll->head of       |
     the second poll attempt.     |

One of the reasons for forbidding multiple apoll's is that it
might be racy. I haven't looked into this implementation, but
we should check if there will be problems from that.

FWIW, putting aside this patchset, the poll/apoll is not in
the best shape and can use some refactoring.

> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> ---
>  fs/io_uring.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index d6df60c4cdb9..dae7044e0c24 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -5277,8 +5277,15 @@ static void io_async_task_func(struct io_kiocb *req, bool *locked)
>  		return;
>  	}
>  
> -	hash_del(&req->hash_node);
> -	io_poll_remove_double(req);
> +	if (READ_ONCE(apoll->poll.canceled))
> +		apoll->poll.events |= EPOLLONESHOT;
> +	if (apoll->poll.events & EPOLLONESHOT) {
> +		hash_del(&req->hash_node);
> +		io_poll_remove_double(req);
> +	} else {
> +		add_wait_queue(apoll->poll.head, &apoll->poll.wait);
> +	}
> +
>  	spin_unlock(&ctx->completion_lock);
>  
>  	if (!READ_ONCE(apoll->poll.canceled))
> @@ -5366,7 +5373,7 @@ static int io_arm_poll_handler(struct io_kiocb *req)
>  	struct io_ring_ctx *ctx = req->ctx;
>  	struct async_poll *apoll;
>  	struct io_poll_table ipt;
> -	__poll_t ret, mask = EPOLLONESHOT | POLLERR | POLLPRI;
> +	__poll_t ret, mask = POLLERR | POLLPRI;
>  	int rw;
>  
>  	if (!req->file || !file_can_poll(req->file))
> @@ -5388,6 +5395,8 @@ static int io_arm_poll_handler(struct io_kiocb *req)
>  		rw = WRITE;
>  		mask |= POLLOUT | POLLWRNORM;
>  	}
> +	if (!(req->flags & REQ_F_APOLL_MULTISHOT))
> +		mask |= EPOLLONESHOT;
>  
>  	/* if we can't nonblock try, then no point in arming a poll handler */
>  	if (!io_file_supports_nowait(req, rw))
>

-- 
Pavel Begunkov
On 9/6/21 9:32 AM, Pavel Begunkov wrote:
> On 9/6/21 1:35 PM, Hao Xu wrote:
>> On 2021/9/6 3:44 AM, Jens Axboe wrote:
>>> [...]
>>> The alternative is, as has been brought up before, adding a flags2 and
>>> reserving the last flag in ->flags to say "there are flags in flags2".
>>> Not exactly super pretty either, but we'll need to extend them at some
>>> point.
>> I'm going to do it this way. There is another thing we have to do:
>> extend req->flags too, since the flags already in use would exceed 32
>> bits if we add sqe->ext_flags
>
> We still have 2 bits left, and IIRC you wanted to take only 1 of them.
> We don't need to extend it at the moment; it sounded to me like a plan
> for the future. No extra trouble for now.

Right, and it should be a separate thing anyway. But as you say, there's
still 2 bits left, this is more about longer term planning than this
particular patchset.

> Anyway, I can't think of many requests working in this mode, and I think
> sqe_flags should be taken only for features applicable to all (~most)
> requests. Maybe we'd better fit it individually into accept in the end?
> Sounds more plausible tbh.

That's why I suggested making it op private instead, I don't
particularly like having io_uring wide flags that are only applicable to
a (very) small subset of requests. And there's also precedent here
already in terms of poll supporting mshot with a private flag.

-- 
Jens Axboe
On 2021/9/6 11:56 PM, Pavel Begunkov wrote:
> On 9/3/21 12:00 PM, Hao Xu wrote:
>> For operations like accept, multishot is a useful feature, since we can
>> reduce a number of accept sqe. Let's integrate it to fast poll, it may
>> be good for other operations in the future.
>
> __io_arm_poll_handler()           |
>   -> vfs_poll()                   |
>                                   | io_async_task_func() // post CQE
>                                   | ...
>                                   | do_apoll_rewait();
>   -> continues after vfs_poll(),  |
>      removing poll->head of       |
>      the second poll attempt.     |
>
Sorry.. a little bit confused by this case, would you mind explaining a
bit more? Is the right part a system-workqueue context? And is
do_apoll_rewait() the io_poll_rewait() function?

> One of the reasons for forbidding multiple apoll's is that it
> might be racy. I haven't looked into this implementation, but
> we should check if there will be problems from that.
>
> FWIW, putting aside this patchset, the poll/apoll is not in
> the best shape and can use some refactoring.
>
>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>> [...]
On 9/3/21 12:00 PM, Hao Xu wrote:
> For operations like accept, multishot is a useful feature, since we can
> reduce a number of accept sqe. Let's integrate it to fast poll, it may
> be good for other operations in the future.
>
> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> ---
>  fs/io_uring.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index d6df60c4cdb9..dae7044e0c24 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -5277,8 +5277,15 @@ static void io_async_task_func(struct io_kiocb *req, bool *locked)
>  		return;
>  	}
>  
> -	hash_del(&req->hash_node);
> -	io_poll_remove_double(req);
> +	if (READ_ONCE(apoll->poll.canceled))
> +		apoll->poll.events |= EPOLLONESHOT;
> +	if (apoll->poll.events & EPOLLONESHOT) {
> +		hash_del(&req->hash_node);
> +		io_poll_remove_double(req);
> +	} else {
> +		add_wait_queue(apoll->poll.head, &apoll->poll.wait);

It looks like it does both io_req_task_submit() and adding back
to the wq, so io_issue_sqe() may be called in parallel with
io_async_task_func(). If so, there will be tons of all kinds of
races.

> +	}
> +
>  	spin_unlock(&ctx->completion_lock);
>  
>  	if (!READ_ONCE(apoll->poll.canceled))
> [...]

-- 
Pavel Begunkov
On 9/6/21 6:40 PM, Hao Xu wrote:
> On 2021/9/6 11:56 PM, Pavel Begunkov wrote:
>> On 9/3/21 12:00 PM, Hao Xu wrote:
>>> For operations like accept, multishot is a useful feature, since we can
>>> reduce a number of accept sqe. Let's integrate it to fast poll, it may
>>> be good for other operations in the future.
>>
>> __io_arm_poll_handler()           |
>>   -> vfs_poll()                   |
>>                                   | io_async_task_func() // post CQE
>>                                   | ...
>>                                   | do_apoll_rewait();
>>   -> continues after vfs_poll(),  |
>>      removing poll->head of       |
>>      the second poll attempt.     |
>>
> Sorry.. a little bit confused by this case, would you mind explain a bit
> more..is the right part a system-workqueue context? and is
> do_apoll_rewait() io_poll_rewait() function?

I meant in a broad sense. If your patches make the lifetime of an accept
request look like:

accept() -> arm_apoll() -> apoll_func() -> accept() -> ...
    -> ... (repeat many times)

then do_apoll_rewait() is the second accept in the scheme.

If not, and it's

accept() -> arm_poll() -> apoll_func() -> apoll_func() -> ... -> ?

then that "do_apoll_rewait()" should have been the second and other
apoll_func()s.

So, it's rather a thing to look after, not a particular bug.

>> One of the reasons for forbidding multiple apoll's is that it
>> might be racy. I haven't looked into this implementation, but
>> we should check if there will be problems from that.
>>
>> FWIW, putting aside this patchset, the poll/apoll is not in
>> the best shape and can use some refactoring.
>>
> [...]

-- 
Pavel Begunkov
On 2021/9/7 3:09 AM, Pavel Begunkov wrote:
> On 9/6/21 6:40 PM, Hao Xu wrote:
>> On 2021/9/6 11:56 PM, Pavel Begunkov wrote:
>>> On 9/3/21 12:00 PM, Hao Xu wrote:
>>>> For operations like accept, multishot is a useful feature, since we can
>>>> reduce a number of accept sqe. Let's integrate it to fast poll, it may
>>>> be good for other operations in the future.
>>>
>>> __io_arm_poll_handler()           |
>>>   -> vfs_poll()                   |
>>>                                   | io_async_task_func() // post CQE
>>>                                   | ...
>>>                                   | do_apoll_rewait();
>>>   -> continues after vfs_poll(),  |
>>>      removing poll->head of       |
>>>      the second poll attempt.     |

This (removal) only happens when there is an error or it is EPOLLONESHOT.

>>>
>> Sorry.. a little bit confused by this case, would you mind explain a bit
>> more..is the right part a system-workqueue context? and is
>> do_apoll_rewait() io_poll_rewait() function?
>
> I meant in a broad sense. If your patches make the lifetime of an accept
> request look like:
>
> accept() -> arm_apoll() -> apoll_func() -> accept() -> ...
>     -> ... (repeat many times)
>
> then do_apoll_rewait() is the second accept in the scheme.
>
> [...]
On 9/7/21 3:04 AM, Pavel Begunkov wrote:
> On 9/3/21 12:00 PM, Hao Xu wrote:
>> For operations like accept, multishot is a useful feature, since we can
>> reduce a number of accept sqe. Let's integrate it to fast poll, it may
>> be good for other operations in the future.
>>
>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>> ---
>>  fs/io_uring.c | 15 ++++++++++++---
>>  1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index d6df60c4cdb9..dae7044e0c24 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -5277,8 +5277,15 @@ static void io_async_task_func(struct io_kiocb *req, bool *locked)
>>                  return;
>>          }
>>
>> -        hash_del(&req->hash_node);
>> -        io_poll_remove_double(req);
>> +        if (READ_ONCE(apoll->poll.canceled))
>> +                apoll->poll.events |= EPOLLONESHOT;
>> +        if (apoll->poll.events & EPOLLONESHOT) {
>> +                hash_del(&req->hash_node);
>> +                io_poll_remove_double(req);
>> +        } else {
>> +                add_wait_queue(apoll->poll.head, &apoll->poll.wait);
>
> It looks like it does both io_req_task_submit() and adding back
> to the wq, so io_issue_sqe() may be called in parallel with
> io_async_task_func(). If so, there will be tons of all kinds of
> races.

IMHO, io_async_task_func() is called in the original context one by
one (except when PF_EXITING is set, in which case it is called in the
system-wq), so there shouldn't be a parallel case there.

>> +        }
>> +
>>          spin_unlock(&ctx->completion_lock);
>>
>>          if (!READ_ONCE(apoll->poll.canceled))
>> @@ -5366,7 +5373,7 @@ static int io_arm_poll_handler(struct io_kiocb *req)
>>          struct io_ring_ctx *ctx = req->ctx;
>>          struct async_poll *apoll;
>>          struct io_poll_table ipt;
>> -        __poll_t ret, mask = EPOLLONESHOT | POLLERR | POLLPRI;
>> +        __poll_t ret, mask = POLLERR | POLLPRI;
>>          int rw;
>>
>>          if (!req->file || !file_can_poll(req->file))
>> @@ -5388,6 +5395,8 @@ static int io_arm_poll_handler(struct io_kiocb *req)
>>                  rw = WRITE;
>>                  mask |= POLLOUT | POLLWRNORM;
>>          }
>> +        if (!(req->flags & REQ_F_APOLL_MULTISHOT))
>> +                mask |= EPOLLONESHOT;
>>
>>          /* if we can't nonblock try, then no point in arming a poll handler */
>>          if (!io_file_supports_nowait(req, rw))
On 9/7/21 2:48 PM, Hao Xu wrote:
> On 9/7/21 3:04 AM, Pavel Begunkov wrote:
>> [...]
>>
>> It looks like it does both io_req_task_submit() and adding back
>> to the wq, so io_issue_sqe() may be called in parallel with
>> io_async_task_func(). If so, there will be tons of all kinds of
>> races.
> IMHO, io_async_task_func() is called in the original context one by
> one (except when PF_EXITING is set, in which case it is called in the
> system-wq), so there shouldn't be a parallel case there.

ping...
On 9/8/21 12:21 PM, Hao Xu wrote:
> On 9/7/21 2:48 PM, Hao Xu wrote:
>> On 9/7/21 3:04 AM, Pavel Begunkov wrote:
>>> [...]
>>>
>>> It looks like it does both io_req_task_submit() and adding back
>>> to the wq, so io_issue_sqe() may be called in parallel with
>>> io_async_task_func(). If so, there will be tons of all kinds of
>>> races.
>> IMHO, io_async_task_func() is called in the original context one by
>> one (except when PF_EXITING is set, in which case it is called in the
>> system-wq), so there shouldn't be a parallel case there.
> ping...

fwiw, the case we're talking about:

CPU0                            | CPU1
io_async_task_func()            |
 -> add_wait_queue();           |
 -> io_req_task_submit();       |
/* no tw run happened in between */
                                | io_async_task_func()
                                | --> io_req_task_submit()

We called io_req_task_submit() twice without running tw in between.
Both of the calls use the same req->io_task_work.node field in the
request for accounting, so the second call will screw up
tctx->task_list, and not only by failing to consider that
req->io_task_work.node is already taken/enqueued.

io_req_task_work_add() {
        wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
}

[...]

--
Pavel Begunkov
On 9/8/21 1:03 PM, Pavel Begunkov wrote:
> On 9/8/21 12:21 PM, Hao Xu wrote:
>> [...]
>
> fwiw, the case we're talking about:
>
> CPU0                            | CPU1
> io_async_task_func()            |
>  -> add_wait_queue();           |
>  -> io_req_task_submit();       |
> /* no tw run happened in between */
>                                 | io_async_task_func()
>                                 | --> io_req_task_submit()
>
> We called io_req_task_submit() twice without running tw in between.
> Both of the calls use the same req->io_task_work.node field in the
> request for accounting, so the second call will screw up
> tctx->task_list, and not only by failing to consider that
> req->io_task_work.node is already taken/enqueued.
>
> io_req_task_work_add() {
>         wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
> }

fwiw, it can be just one CPU:

io_req_task_work_add();
io_req_task_work_add();
task_work_run();
// the first one is buggy under the current constraints.

[...]

--
Pavel Begunkov
On 9/8/21 8:03 PM, Pavel Begunkov wrote:
> On 9/8/21 12:21 PM, Hao Xu wrote:
>> [...]
>
> fwiw, the case we're talking about:
> [...]
>
> We called io_req_task_submit() twice without running tw in between.
> Both of the calls use the same req->io_task_work.node field in the
> request for accounting, so the second call will screw up
> tctx->task_list, and not only by failing to consider that
> req->io_task_work.node is already taken/enqueued.
>
> io_req_task_work_add() {
>         wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
> }

I guess you mean io_req_task_work_add() called by async_wake() twice:

io_async_task_func()
 -> add_wait_queue()
async_wake()
 -> io_req_task_work_add()  // this one messes up the running task_work
                            // list, since req->io_task_work.node is in use

It seems the current poll_add + multishot logic has this issue too; I'll
give it a shot (simply clearing req->io_task_work.node before running
req->io_task_work.func should work).
On 9/9/21 3:01 PM, Hao Xu wrote:
> On 9/8/21 8:03 PM, Pavel Begunkov wrote:
>> [...]
>>
>> io_req_task_work_add() {
>>         wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>> }
>>
> I guess you mean io_req_task_work_add() called by async_wake() twice:
>
> io_async_task_func()
>  -> add_wait_queue()
> async_wake()
>  -> io_req_task_work_add()  // messes up the running task_work list,
>                             // since req->io_task_work.node is in use
>
> It seems the current poll_add + multishot logic has this issue too; I'll
> give it a shot (simply clearing req->io_task_work.node before running
> req->io_task_work.func should work).

There is a similar issue for the double wait entry, since we don't remove
the double entry in the interrupt handler:

async_wake() --> io_req_task_work_add()
io_poll_double_wake() --> async_wake() --> io_req_task_work_add()

[...]
On 9/9/21 9:29 AM, Hao Xu wrote:
> On 9/9/21 3:01 PM, Hao Xu wrote:
>> [...]
>>
>> It seems the current poll_add + multishot logic has this issue too; I'll
>> give it a shot (simply clearing req->io_task_work.node before running
>> req->io_task_work.func should work).
> There is a similar issue for the double wait entry, since we don't remove
> the double entry in the interrupt handler:
>
> async_wake() --> io_req_task_work_add()
> io_poll_double_wake() --> async_wake() --> io_req_task_work_add()

Yep, sounds like that. Polling needs reworking, and not only
because of this one.

[...]

--
Pavel Begunkov
On 9/11/21 6:49 PM, Pavel Begunkov wrote:
> On 9/9/21 9:29 AM, Hao Xu wrote:
>> [...]
>>
>> There is a similar issue for the double wait entry, since we don't remove
>> the double entry in the interrupt handler:
>>
>> async_wake() --> io_req_task_work_add()
>> io_poll_double_wake() --> async_wake() --> io_req_task_work_add()
>
> Yep, sounds like that. Polling needs reworking, and not only
> because of this one.

Currently I'm working on these types of poll issues; I will send
something soon.
On 9/3/21 9:38 PM, Hao Xu wrote:
> On 9/3/21 8:27 PM, Pavel Begunkov wrote:
>> On 9/3/21 1:08 PM, Hao Xu wrote:
>>> On 9/3/21 7:42 PM, Pavel Begunkov wrote:
>>>> On 9/3/21 12:00 PM, Hao Xu wrote:
>>>>> Though currently refcount of a req is always one when we flush inline
>>>>
>>>> It can be refcounted and != 1. E.g. poll requests, or consider
>>> It seems poll requests don't leverage the comp cache; am I missing anything?
>>
>> Hmm, looks so. Not great that it doesn't, but probably it's because
>> of trying to submit next reqs right in io_poll_task_func().
>>
>> I'll be pushing for some changes around tw, with it should be easy
>> to hook poll completion batching with no drawbacks. Would be great
>> if you will be willing to take a shot on it.
> Sure, I'll take a look.
>>
I tried integrating the poll logic into the completion cache and tested it
with an echo-server program, in both multishot and singleshot mode; there
was no difference in req/s.