* aio poll and a new in-kernel poll API V20 (aka 2.0)
@ 2018-07-26  8:28 Christoph Hellwig
  2018-07-26  8:29 ` [PATCH 1/4] timerfd: add support for keyed wakeups Christoph Hellwig
  ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Christoph Hellwig @ 2018-07-26  8:28 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, linux-kernel

Hi all,

this series adds support for the IOCB_CMD_POLL operation to poll for the
readiness of file descriptors using the aio subsystem.  The API is based
on patches that existed in RHAS2.1 and RHEL3, which means it is already
supported by libaio.

As our dear leader didn't like the ->poll_mask method, this tries to
implement the behavior using plain old ->poll, which is rather painful.

For one, we only support ->poll instances with a single wait queue behind
them and reject the request otherwise, which isn't really different from
the previous ->poll_mask requirement, just implemented in a rather
awkward way.

Second, we had to implement a refcount on struct aio_kiocb (although it
is kept as a no-op for non-poll commands) so that we can safely handle
the case of ->poll returning a mask after it got a wakeup.  This also
means there is a lot of open-coded magic for the waitqueue removals and
dealing with ki_list to handle these cases.

Last but not least, to avoid a guaranteed context switch on every wakeup
we trust keyed wakeups, which from an audit of the users seems to be
safe.  The only thing this loses is batching of multiple wakeups in a
short time period into a single result.

The changes were sponsored by ScyllaDB.
git://git.infradead.org/users/hch/vfs.git aio-poll.20

Gitweb:

    http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.20

Libaio changes:  https://pagure.io/libaio.git io-poll
Seastar changes: https://github.com/avikivity/seastar/commits/aio

Changes since v13:
 - rewritten to use ->poll

Changes since v12:
 - remove the iocb from ki_list only after ki_cancel has completed
 - fix __poll_t annotations
 - turn __poll_t sparse checking on by default
 - call fput after aio_complete
 - only add the iocb to active_reqs if we wait for it

Changes since v11:
 - simplify cancellation by completing poll requests from a workqueue
   if we can't take the ctx_lock

Changes since v10:
 - fixed a mismerge that let a sock_rps_record_flow sneak into
   tcp_poll_mask
 - remove the now unused struct proto_ops get_poll_head method

Changes since v9:
 - add to the delayed_cancel_reqs earlier to avoid a race
 - get rid of the POLL_TO_PTR magic

Changes since v8:
 - make delayed cancellation conditional again
 - add a cancel_kiocb file operation to split delayed vs normal cancel

Changes since v7:
 - make delayed cancellation safe and unconditional

Changes since v6:
 - reworked cancellation

Changes since v5:
 - small changelog updates
 - rebased on top of the aio-fsync changes

Changes since v4:
 - rebased on top of Linux 4.16-rc4

Changes since v3:
 - remove the pre-sleep ->poll_mask call in vfs_poll, allow
   ->get_poll_head to return POLL* values.
Changes since v2:
 - removed a double initialization
 - new vfs_get_poll_head helper
 - document that ->get_poll_head can return NULL
 - call ->poll_mask before sleeping
 - various ACKs
 - add conversion of random to ->poll_mask
 - add conversion of af_alg to ->poll_mask
 - lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL
 - reshuffled the series so that prep patches and everything not
   requiring the new in-kernel poll API is in the beginning

Changes since v1:
 - handle the NULL ->poll case in vfs_poll
 - dropped the file argument to the ->poll_mask socket operation
 - replace the ->pre_poll socket operation with ->get_poll_head as in
   the file operations

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCH 1/4] timerfd: add support for keyed wakeups
  2018-07-26  8:28 aio poll and a new in-kernel poll API V20 (aka 2.0) Christoph Hellwig
@ 2018-07-26  8:29 ` Christoph Hellwig
  2018-07-26  8:29 ` [PATCH 2/4] aio: add a iocb refcount Christoph Hellwig
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2018-07-26  8:29 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, linux-kernel

This prepares timerfd for use with aio poll.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/timerfd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/timerfd.c b/fs/timerfd.c
index cdad49da3ff7..f6c54fd56645 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -66,7 +66,7 @@ static void timerfd_triggered(struct timerfd_ctx *ctx)
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
 	ctx->expired = 1;
 	ctx->ticks++;
-	wake_up_locked(&ctx->wqh);
+	wake_up_locked_poll(&ctx->wqh, EPOLLIN);
 	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 }
 
@@ -107,7 +107,7 @@ void timerfd_clock_was_set(void)
 		if (ctx->moffs != moffs) {
 			ctx->moffs = KTIME_MAX;
 			ctx->ticks++;
-			wake_up_locked(&ctx->wqh);
+			wake_up_locked_poll(&ctx->wqh, EPOLLIN);
 		}
 		spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 	}
@@ -345,7 +345,7 @@ static long timerfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg
 	spin_lock_irq(&ctx->wqh.lock);
 	if (!timerfd_canceled(ctx)) {
 		ctx->ticks = ticks;
-		wake_up_locked(&ctx->wqh);
+		wake_up_locked_poll(&ctx->wqh, EPOLLIN);
 	} else
 		ret = -ECANCELED;
 	spin_unlock_irq(&ctx->wqh.lock);
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* [PATCH 2/4] aio: add a iocb refcount
  2018-07-26  8:28 aio poll and a new in-kernel poll API V20 (aka 2.0) Christoph Hellwig
  2018-07-26  8:29 ` [PATCH 1/4] timerfd: add support for keyed wakeups Christoph Hellwig
@ 2018-07-26  8:29 ` Christoph Hellwig
  2018-07-26 11:22   ` Matthew Wilcox
  2018-07-26  8:29 ` [PATCH 3/4] aio: implement IOCB_CMD_POLL Christoph Hellwig
  2018-07-26  8:29 ` [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups Christoph Hellwig
  3 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2018-07-26  8:29 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, linux-kernel

This is needed to prevent races caused by the way the ->poll API works.

To avoid introducing overhead for other users of the iocbs we initialize
it to zero and only do refcount operations if it is non-zero in the
completion path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/aio.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 27454594e37a..7f3c159b3e2e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -178,6 +178,7 @@ struct aio_kiocb {
 
 	struct list_head	ki_list;	/* the aio core uses this
 						 * for cancellation */
+	atomic_t		ki_refcnt;
 
 	/*
 	 * If the aio_resfd field of the userspace iocb is not zero,
@@ -1015,6 +1016,7 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
 	percpu_ref_get(&ctx->reqs);
 	INIT_LIST_HEAD(&req->ki_list);
+	atomic_set(&req->ki_refcnt, 0);
 	req->ki_ctx = ctx;
 	return req;
 out_put:
@@ -1049,6 +1051,15 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
 	return ret;
 }
 
+static inline void iocb_put(struct aio_kiocb *iocb)
+{
+	if (atomic_read(&iocb->ki_refcnt) == 0 ||
+	    atomic_dec_and_test(&iocb->ki_refcnt)) {
+		percpu_ref_put(&iocb->ki_ctx->reqs);
+		kmem_cache_free(kiocb_cachep, iocb);
+	}
+}
+
 /* aio_complete
  *	Called when the io request on the given iocb is complete.
  */
@@ -1118,8 +1129,6 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2)
 		eventfd_ctx_put(iocb->ki_eventfd);
 	}
 
-	kmem_cache_free(kiocb_cachep, iocb);
-
 	/*
 	 * We have to order our ring_info tail store above and test
 	 * of the wait list below outside the wait lock.  This is
@@ -1130,8 +1139,7 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2)
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
-
-	percpu_ref_put(&iocb->ki_ctx->reqs);
+	iocb_put(iocb);
 }
 
 /* aio_read_events_ring
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCH 2/4] aio: add a iocb refcount
  2018-07-26  8:29 ` [PATCH 2/4] aio: add a iocb refcount Christoph Hellwig
@ 2018-07-26 11:22 ` Matthew Wilcox
  2018-07-26 11:57   ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2018-07-26 11:22 UTC (permalink / raw)
To: Christoph Hellwig
Cc: viro, Avi Kivity, linux-aio, linux-fsdevel, linux-kernel

On Thu, Jul 26, 2018 at 10:29:01AM +0200, Christoph Hellwig wrote:
> +	atomic_t		ki_refcnt;

Should this be a refcount_t instead?  At first glance your usage seems
compatible with refcount_t.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 2/4] aio: add a iocb refcount
  2018-07-26 11:22 ` Matthew Wilcox
@ 2018-07-26 11:57 ` Christoph Hellwig
  2018-07-27  8:31   ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2018-07-26 11:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, viro, Avi Kivity, linux-aio, linux-fsdevel,
	linux-kernel

On Thu, Jul 26, 2018 at 04:22:27AM -0700, Matthew Wilcox wrote:
> On Thu, Jul 26, 2018 at 10:29:01AM +0200, Christoph Hellwig wrote:
> > +	atomic_t		ki_refcnt;
>
> Should this be a refcount_t instead?  At first glance your usage seems
> compatible with refcount_t.

I thought the magic 0 meaning would be incompatible with a refcount_t.
I'll investigate and respin if it ends up being ok.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 2/4] aio: add a iocb refcount
  2018-07-26 11:57 ` Christoph Hellwig
@ 2018-07-27  8:31 ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2018-07-27  8:31 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, viro, Avi Kivity, linux-aio, linux-fsdevel,
	linux-kernel

On Thu, Jul 26, 2018 at 01:57:05PM +0200, Christoph Hellwig wrote:
> On Thu, Jul 26, 2018 at 04:22:27AM -0700, Matthew Wilcox wrote:
> > On Thu, Jul 26, 2018 at 10:29:01AM +0200, Christoph Hellwig wrote:
> > > +	atomic_t		ki_refcnt;
> >
> > Should this be a refcount_t instead?  At first glance your usage seems
> > compatible with refcount_t.
>
> I thought the magic 0 meaning would be incompatible with a refcount_t.
> I'll investigate and respin if it ends up being ok.

Seems like refcount_t works fine, even with CONFIG_REFCOUNT_FULL, so
I'll switch it over for the next version.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCH 3/4] aio: implement IOCB_CMD_POLL
  2018-07-26  8:28 aio poll and a new in-kernel poll API V20 (aka 2.0) Christoph Hellwig
  2018-07-26  8:29 ` [PATCH 1/4] timerfd: add support for keyed wakeups Christoph Hellwig
  2018-07-26  8:29 ` [PATCH 2/4] aio: add a iocb refcount Christoph Hellwig
@ 2018-07-26  8:29 ` Christoph Hellwig
  2018-07-26  8:29 ` [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups Christoph Hellwig
  3 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2018-07-26  8:29 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, linux-kernel

Simple one-shot poll through the io_submit() interface.  To poll for a
file descriptor the application should submit an iocb of type
IOCB_CMD_POLL.  It will poll the fd for the events specified in the
first 32 bits of the aio_buf field of the iocb.

Unlike poll or epoll without EPOLLONESHOT, this interface always works
in one-shot mode: once the iocb is completed, it has to be resubmitted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/aio.c                     | 178 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/aio_abi.h |   6 +-
 2 files changed, 180 insertions(+), 4 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 7f3c159b3e2e..cf364d75abe9 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -5,6 +5,7 @@
  *	Implements an efficient asynchronous io interface.
  *
  *	Copyright 2000, 2001, 2002 Red Hat, Inc.  All Rights Reserved.
+ *	Copyright 2018 Christoph Hellwig.
  *
  *	See ../COPYING for licensing terms.
  */
@@ -164,10 +165,21 @@ struct fsync_iocb {
 	bool			datasync;
 };
 
+struct poll_iocb {
+	struct file		*file;
+	struct wait_queue_head	*head;
+	__poll_t		events;
+	bool			cancelled;
+	bool			done;
+	struct wait_queue_entry	wait;
+	struct work_struct	work;
+};
+
 struct aio_kiocb {
 	union {
 		struct kiocb		rw;
 		struct fsync_iocb	fsync;
+		struct poll_iocb	poll;
 	};
 
 	struct kioctx		*ki_ctx;
@@ -1600,6 +1612,169 @@ static int aio_fsync(struct fsync_iocb *req, struct iocb *iocb, bool datasync)
 	return 0;
 }
 
+static inline void aio_poll_complete(struct aio_kiocb *iocb, __poll_t mask)
+{
+	struct file *file = iocb->poll.file;
+
+	aio_complete(iocb, mangle_poll(mask), 0);
+	fput(file);
+}
+
+static void aio_poll_complete_work(struct work_struct *work)
+{
+	struct poll_iocb *req = container_of(work, struct poll_iocb, work);
+	struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
+	struct poll_table_struct pt = { ._key = req->events };
+	struct kioctx *ctx = iocb->ki_ctx;
+	__poll_t mask;
+
+	if (READ_ONCE(req->cancelled)) {
+		/* synchronize with ki_list removal in the callers: */
+		spin_lock_irq(&ctx->ctx_lock);
+		WARN_ON_ONCE(!list_empty(&iocb->ki_list));
+		spin_unlock_irq(&ctx->ctx_lock);
+
+		aio_poll_complete(iocb, 0);
+		return;
+	}
+
+	mask = vfs_poll(req->file, &pt) & req->events;
+	if (!mask) {
+		add_wait_queue(req->head, &req->wait);
+		return;
+	}
+
+	spin_lock_irq(&ctx->ctx_lock);
+	req->done = true;
+	list_del(&iocb->ki_list);
+	spin_unlock_irq(&ctx->ctx_lock);
+
+	aio_poll_complete(iocb, mask);
+}
+
+/* assumes we are called with irqs disabled */
+static int aio_poll_cancel(struct kiocb *iocb)
+{
+	struct aio_kiocb *aiocb = container_of(iocb, struct aio_kiocb, rw);
+	struct poll_iocb *req = &aiocb->poll;
+
+	spin_lock(&req->head->lock);
+	if (!list_empty(&req->wait.entry)) {
+		WRITE_ONCE(req->cancelled, true);
+		list_del_init(&req->wait.entry);
+		schedule_work(&aiocb->poll.work);
+	}
+	spin_unlock(&req->head->lock);
+
+	return 0;
+}
+
+static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
+		void *key)
+{
+	struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);
+	__poll_t mask = key_to_poll(key);
+
+	/* for instances that support it check for an event match first: */
+	if (mask && !(mask & req->events))
+		return 0;
+
+	list_del_init(&req->wait.entry);
+	schedule_work(&req->work);
+	return 1;
+}
+
+struct aio_poll_table {
+	struct poll_table_struct	pt;
+	struct aio_kiocb		*iocb;
+	int				error;
+};
+
+static void
+aio_poll_queue_proc(struct file *file, struct wait_queue_head *head,
+		struct poll_table_struct *p)
+{
+	struct aio_poll_table *pt = container_of(p, struct aio_poll_table, pt);
+
+	/* multiple wait queues per file are not supported */
+	if (unlikely(pt->iocb->poll.head)) {
+		pt->error = -EINVAL;
+		return;
+	}
+
+	pt->error = 0;
+	pt->iocb->poll.head = head;
+	add_wait_queue(head, &pt->iocb->poll.wait);
+}
+
+static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb)
+{
+	struct kioctx *ctx = aiocb->ki_ctx;
+	struct poll_iocb *req = &aiocb->poll;
+	struct aio_poll_table apt;
+	__poll_t mask;
+
+	/* reject any unknown events outside the normal event mask. */
+	if ((u16)iocb->aio_buf != iocb->aio_buf)
+		return -EINVAL;
+	/* reject fields that are not defined for poll */
+	if (iocb->aio_offset || iocb->aio_nbytes || iocb->aio_rw_flags)
+		return -EINVAL;
+
+	INIT_WORK(&req->work, aio_poll_complete_work);
+	req->events = demangle_poll(iocb->aio_buf) | EPOLLERR | EPOLLHUP;
+	req->file = fget(iocb->aio_fildes);
+	if (unlikely(!req->file))
+		return -EBADF;
+
+	apt.pt._qproc = aio_poll_queue_proc;
+	apt.pt._key = req->events;
+	apt.iocb = aiocb;
+	apt.error = -EINVAL; /* same as no support for IOCB_CMD_POLL */
+
+	/* initialized the list so that we can do list_empty checks */
+	INIT_LIST_HEAD(&req->wait.entry);
+	init_waitqueue_func_entry(&req->wait, aio_poll_wake);
+
+	/* one for removal from waitqueue, one for this function */
+	atomic_set(&aiocb->ki_refcnt, 2);
+
+	mask = vfs_poll(req->file, &apt.pt) & req->events;
+	if (mask || apt.error) {
+		bool removed = false;
+
+		/* we did not manage to set up a waitqueue, done */
+		if (unlikely(!req->head))
+			goto out_fput;
+
+		spin_lock_irq(&req->head->lock);
+		if (!list_empty(&req->wait.entry)) {
+			list_del_init(&req->wait.entry);
+			removed = true;
+		}
+		spin_unlock_irq(&req->head->lock);
+
+		if (removed) {
+			if (apt.error)
+				goto out_fput;
+			aio_poll_complete(aiocb, mask);
+		}
+	} else {
+		spin_lock_irq(&ctx->ctx_lock);
+		if (!req->done) {
+			list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
+			aiocb->ki_cancel = aio_poll_cancel;
+		}
+		spin_unlock_irq(&ctx->ctx_lock);
+	}
+
+	iocb_put(aiocb);
+	return 0;
+out_fput:
+	fput(req->file);
+	return apt.error;
+}
+
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 bool compat)
 {
@@ -1673,6 +1848,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	case IOCB_CMD_FDSYNC:
 		ret = aio_fsync(&req->fsync, &iocb, true);
 		break;
+	case IOCB_CMD_POLL:
+		ret = aio_poll(req, &iocb);
+		break;
 	default:
 		pr_debug("invalid aio operation %d\n", iocb.aio_lio_opcode);
 		ret = -EINVAL;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index d4593a6062ef..ce43d340f010 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -38,10 +38,8 @@ enum {
 	IOCB_CMD_PWRITE = 1,
 	IOCB_CMD_FSYNC = 2,
 	IOCB_CMD_FDSYNC = 3,
-	/* These two are experimental.
-	 * IOCB_CMD_PREADX = 4,
-	 * IOCB_CMD_POLL = 5,
-	 */
+	/* 4 was the experimental IOCB_CMD_PREADX */
+	IOCB_CMD_POLL = 5,
 	IOCB_CMD_NOOP = 6,
 	IOCB_CMD_PREADV = 7,
 	IOCB_CMD_PWRITEV = 8,
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups
  2018-07-26  8:28 aio poll and a new in-kernel poll API V20 (aka 2.0) Christoph Hellwig
  ` (2 preceding siblings ...)
  2018-07-26  8:29 ` [PATCH 3/4] aio: implement IOCB_CMD_POLL Christoph Hellwig
@ 2018-07-26  8:29 ` Christoph Hellwig
  3 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2018-07-26  8:29 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, linux-kernel

If we get a keyed wakeup for a aio poll waitqueue and wake can acquire the
ctx_lock without spinning we can just complete the iocb straight from the
wakeup callback to avoid a context switch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/aio.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index cf364d75abe9..ea744092387d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1673,11 +1673,25 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 		void *key)
 {
 	struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);
+	struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
 	__poll_t mask = key_to_poll(key);
 
 	/* for instances that support it check for an event match first: */
-	if (mask && !(mask & req->events))
-		return 0;
+	if (mask) {
+		if (!(mask & req->events))
+			return 0;
+
+		/* try to complete the iocb inline if we can: */
+		if (spin_trylock(&iocb->ki_ctx->ctx_lock)) {
+			req->done = true;
+			list_del(&iocb->ki_list);
+			spin_unlock(&iocb->ki_ctx->ctx_lock);
+
+			list_del_init(&req->wait.entry);
+			aio_poll_complete(iocb, mask);
+			return 1;
+		}
+	}
 
 	list_del_init(&req->wait.entry);
 	schedule_work(&req->work);
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* aio poll and a new in-kernel poll API V21 (aka 2.0)
@ 2018-07-30  7:15 Christoph Hellwig
  2018-07-30  7:15 ` [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2018-07-30  7:15 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, linux-kernel

Hi all,

this series adds support for the IOCB_CMD_POLL operation to poll for the
readiness of file descriptors using the aio subsystem.  The API is based
on patches that existed in RHAS2.1 and RHEL3, which means it is already
supported by libaio.

As our dear leader didn't like the ->poll_mask method, this tries to
implement the behavior using plain old ->poll, which is rather painful.

For one, we only support ->poll instances with a single wait queue behind
them and reject the request otherwise, which isn't really different from
the previous ->poll_mask requirement, just implemented in a rather
awkward way.

Second, we had to implement a refcount on struct aio_kiocb (although it
is kept as a no-op for non-poll commands) so that we can safely handle
the case of ->poll returning a mask after it got a wakeup.  This also
means there is a lot of open-coded magic for the waitqueue removals and
dealing with ki_list to handle these cases.

Last but not least, to avoid a guaranteed context switch on every wakeup
we trust keyed wakeups, which from an audit of the users seems to be
safe.  The only thing this loses is batching of multiple wakeups in a
short time period into a single result.

The changes were sponsored by ScyllaDB.
git://git.infradead.org/users/hch/vfs.git aio-poll.21

Gitweb:

    http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.21

Libaio changes:  https://pagure.io/libaio.git io-poll
Seastar changes: https://github.com/avikivity/seastar/commits/aio

Changes since v20:
 - use a refcount_t instead of an atomic_t for ki_refcnt

Changes since v13:
 - rewritten to use ->poll

Changes since v12:
 - remove the iocb from ki_list only after ki_cancel has completed
 - fix __poll_t annotations
 - turn __poll_t sparse checking on by default
 - call fput after aio_complete
 - only add the iocb to active_reqs if we wait for it

Changes since v11:
 - simplify cancellation by completing poll requests from a workqueue
   if we can't take the ctx_lock

Changes since v10:
 - fixed a mismerge that let a sock_rps_record_flow sneak into
   tcp_poll_mask
 - remove the now unused struct proto_ops get_poll_head method

Changes since v9:
 - add to the delayed_cancel_reqs earlier to avoid a race
 - get rid of the POLL_TO_PTR magic

Changes since v8:
 - make delayed cancellation conditional again
 - add a cancel_kiocb file operation to split delayed vs normal cancel

Changes since v7:
 - make delayed cancellation safe and unconditional

Changes since v6:
 - reworked cancellation

Changes since v5:
 - small changelog updates
 - rebased on top of the aio-fsync changes

Changes since v4:
 - rebased on top of Linux 4.16-rc4

Changes since v3:
 - remove the pre-sleep ->poll_mask call in vfs_poll, allow
   ->get_poll_head to return POLL* values.
Changes since v2:
 - removed a double initialization
 - new vfs_get_poll_head helper
 - document that ->get_poll_head can return NULL
 - call ->poll_mask before sleeping
 - various ACKs
 - add conversion of random to ->poll_mask
 - add conversion of af_alg to ->poll_mask
 - lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL
 - reshuffled the series so that prep patches and everything not
   requiring the new in-kernel poll API is in the beginning

Changes since v1:
 - handle the NULL ->poll case in vfs_poll
 - dropped the file argument to the ->poll_mask socket operation
 - replace the ->pre_poll socket operation with ->get_poll_head as in
   the file operations

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups
  2018-07-30  7:15 aio poll and a new in-kernel poll API V21 (aka 2.0) Christoph Hellwig
@ 2018-07-30  7:15 ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2018-07-30  7:15 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, linux-kernel

If we get a keyed wakeup for a aio poll waitqueue and wake can acquire the
ctx_lock without spinning we can just complete the iocb straight from the
wakeup callback to avoid a context switch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/aio.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 6993684d0665..158c5e41b17c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1674,11 +1674,25 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 		void *key)
 {
 	struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);
+	struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
 	__poll_t mask = key_to_poll(key);
 
 	/* for instances that support it check for an event match first: */
-	if (mask && !(mask & req->events))
-		return 0;
+	if (mask) {
+		if (!(mask & req->events))
+			return 0;
+
+		/* try to complete the iocb inline if we can: */
+		if (spin_trylock(&iocb->ki_ctx->ctx_lock)) {
+			req->done = true;
+			list_del(&iocb->ki_list);
+			spin_unlock(&iocb->ki_ctx->ctx_lock);
+
+			list_del_init(&req->wait.entry);
+			aio_poll_complete(iocb, mask);
+			return 1;
+		}
+	}
 
 	list_del_init(&req->wait.entry);
 	schedule_work(&req->work);
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* aio poll V22 (aka 2.0)
@ 2018-08-06  8:30 Christoph Hellwig
  2018-08-06  8:30 ` [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2018-08-06  8:30 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, Linus Torvalds, linux-aio, linux-fsdevel, linux-kernel

Hi all,

this series adds support for the IOCB_CMD_POLL operation to poll for the
readiness of file descriptors using the aio subsystem.  The API is based
on patches that existed in RHAS2.1 and RHEL3, which means it is already
supported by libaio.

As our dear leader didn't like the ->poll_mask method, this tries to
implement the behavior using plain old ->poll, which is rather painful.

For one, we only support ->poll instances with a single wait queue behind
them and reject the request otherwise, which isn't really different from
the previous ->poll_mask requirement, just implemented in a rather
awkward way.

Second, we had to implement a refcount on struct aio_kiocb (although it
is kept as a no-op for non-poll commands) so that we can safely handle
the case of ->poll returning a mask after it got a wakeup.  This also
means there is a lot of open-coded magic for the waitqueue removals and
dealing with ki_list to handle these cases.

Last but not least, to avoid a guaranteed context switch on every wakeup
we trust keyed wakeups, which from an audit of the users seems to be
safe.  The only thing this loses is batching of multiple wakeups in a
short time period into a single result.

The changes were sponsored by ScyllaDB.
git://git.infradead.org/users/hch/vfs.git aio-poll.22

Gitweb:

    http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.22

Libaio changes:  https://pagure.io/libaio.git io-poll
Seastar changes: https://github.com/avikivity/seastar/commits/aio

Changes since v21:
 - rework the cancellation and early completion logic based on feedback
   from Al

Changes since v20:
 - use a refcount_t instead of an atomic_t for ki_refcnt

Changes since v13:
 - rewritten to use ->poll

Changes since v12:
 - remove the iocb from ki_list only after ki_cancel has completed
 - fix __poll_t annotations
 - turn __poll_t sparse checking on by default
 - call fput after aio_complete
 - only add the iocb to active_reqs if we wait for it

Changes since v11:
 - simplify cancellation by completing poll requests from a workqueue
   if we can't take the ctx_lock

Changes since v10:
 - fixed a mismerge that let a sock_rps_record_flow sneak into
   tcp_poll_mask
 - remove the now unused struct proto_ops get_poll_head method

Changes since v9:
 - add to the delayed_cancel_reqs earlier to avoid a race
 - get rid of the POLL_TO_PTR magic

Changes since v8:
 - make delayed cancellation conditional again
 - add a cancel_kiocb file operation to split delayed vs normal cancel

Changes since v7:
 - make delayed cancellation safe and unconditional

Changes since v6:
 - reworked cancellation

Changes since v5:
 - small changelog updates
 - rebased on top of the aio-fsync changes

Changes since v4:
 - rebased on top of Linux 4.16-rc4

Changes since v3:
 - remove the pre-sleep ->poll_mask call in vfs_poll, allow
   ->get_poll_head to return POLL* values.
Changes since v2:
 - removed a double initialization
 - new vfs_get_poll_head helper
 - document that ->get_poll_head can return NULL
 - call ->poll_mask before sleeping
 - various ACKs
 - add conversion of random to ->poll_mask
 - add conversion of af_alg to ->poll_mask
 - lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL
 - reshuffled the series so that prep patches and everything not
   requiring the new in-kernel poll API is in the beginning

Changes since v1:
 - handle the NULL ->poll case in vfs_poll
 - dropped the file argument to the ->poll_mask socket operation
 - replace the ->pre_poll socket operation with ->get_poll_head as in
   the file operations

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups
  2018-08-06  8:30 aio poll V22 (aka 2.0) Christoph Hellwig
@ 2018-08-06  8:30 ` Christoph Hellwig
  2018-08-06 22:27   ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2018-08-06  8:30 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, Linus Torvalds, linux-aio, linux-fsdevel, linux-kernel

If we get a keyed wakeup for a aio poll waitqueue and wake can acquire the
ctx_lock without spinning we can just complete the iocb straight from the
wakeup callback to avoid a context switch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>
---
 fs/aio.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 2fd19521d8a8..29f2b5b57d32 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1672,13 +1672,26 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 		void *key)
 {
 	struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);
+	struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
 	__poll_t mask = key_to_poll(key);
 
 	req->woken = true;
 
 	/* for instances that support it check for an event match first: */
-	if (mask && !(mask & req->events))
-		return 0;
+	if (mask) {
+		if (!(mask & req->events))
+			return 0;
+
+		/* try to complete the iocb inline if we can: */
+		if (spin_trylock(&iocb->ki_ctx->ctx_lock)) {
+			list_del(&iocb->ki_list);
+			spin_unlock(&iocb->ki_ctx->ctx_lock);
+
+			list_del_init(&req->wait.entry);
+			aio_poll_complete(iocb, mask);
+			return 1;
+		}
+	}
 
 	list_del_init(&req->wait.entry);
 	schedule_work(&req->work);
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups
  2018-08-06  8:30 ` [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups Christoph Hellwig
@ 2018-08-06 22:27 ` Andrew Morton
  2018-08-07  7:25   ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2018-08-06 22:27 UTC (permalink / raw)
To: Christoph Hellwig
Cc: viro, Avi Kivity, Linus Torvalds, linux-aio, linux-fsdevel,
	linux-kernel

On Mon, 6 Aug 2018 10:30:58 +0200 Christoph Hellwig <hch@lst.de> wrote:

> If we get a keyed wakeup for a aio poll waitqueue and wake can acquire the
> ctx_lock without spinning we can just complete the iocb straight from the
> wakeup callback to avoid a context switch.

Why do we try to avoid spinning on the lock?

> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1672,13 +1672,26 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>  		void *key)
>  {
>  	struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);
> +	struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
>  	__poll_t mask = key_to_poll(key);
>  
>  	req->woken = true;
>  
>  	/* for instances that support it check for an event match first: */
> -	if (mask && !(mask & req->events))
> -		return 0;
> +	if (mask) {
> +		if (!(mask & req->events))
> +			return 0;
> +
> +		/* try to complete the iocb inline if we can: */

ie, this comment explains 'what" but not "why".

(There's a typo in Subject:, btw)

> +		if (spin_trylock(&iocb->ki_ctx->ctx_lock)) {
> +			list_del(&iocb->ki_list);
> +			spin_unlock(&iocb->ki_ctx->ctx_lock);
> +
> +			list_del_init(&req->wait.entry);
> +			aio_poll_complete(iocb, mask);
> +			return 1;
> +		}
> +	}
> 
>  	list_del_init(&req->wait.entry);
>  	schedule_work(&req->work);

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups
  2018-08-06 22:27   ` Andrew Morton
@ 2018-08-07  7:25     ` Christoph Hellwig
  2018-08-07 16:04       ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2018-08-07  7:25 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Hellwig, viro, Avi Kivity, Linus Torvalds, linux-aio, linux-fsdevel, linux-kernel

On Mon, Aug 06, 2018 at 03:27:05PM -0700, Andrew Morton wrote:
> On Mon, 6 Aug 2018 10:30:58 +0200 Christoph Hellwig <hch@lst.de> wrote:
>
> > If we get a keyed wakeup for a aio poll waitqueue and wake can acquire the
> > ctx_lock without spinning we can just complete the iocb straight from the
> > wakeup callback to avoid a context switch.
>
> Why do we try to avoid spinning on the lock?

Because we are called with the lock on the waitqueue held, which
nests inside it.

> > +		/* try to complete the iocb inline if we can: */
>
> ie, this comment explains 'what" but not "why".
>
> (There's a typo in Subject:, btw)

Because it is faster obviously.  I can update the comment.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups
  2018-08-07  7:25     ` Christoph Hellwig
@ 2018-08-07 16:04       ` Andrew Morton
  2018-08-08  9:57         ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2018-08-07 16:04 UTC (permalink / raw)
To: Christoph Hellwig
Cc: viro, Avi Kivity, Linus Torvalds, linux-aio, linux-fsdevel, linux-kernel

On Tue, 7 Aug 2018 09:25:55 +0200 Christoph Hellwig <hch@lst.de> wrote:

> On Mon, Aug 06, 2018 at 03:27:05PM -0700, Andrew Morton wrote:
> > On Mon, 6 Aug 2018 10:30:58 +0200 Christoph Hellwig <hch@lst.de> wrote:
> >
> > > If we get a keyed wakeup for a aio poll waitqueue and wake can acquire the
> > > ctx_lock without spinning we can just complete the iocb straight from the
> > > wakeup callback to avoid a context switch.
> >
> > Why do we try to avoid spinning on the lock?
>
> Because we are called with the lock on the waitqueue held, which
> nests inside it.

Ah.

> > > +		/* try to complete the iocb inline if we can: */
> >
> > ie, this comment explains 'what" but not "why".
> >
> > (There's a typo in Subject:, btw)
>
> Because it is faster obviously.  I can update the comment.

I meant the comment could explain why it's a trylock instead of a
spin_lock().

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups
  2018-08-07 16:04       ` Andrew Morton
@ 2018-08-08  9:57         ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2018-08-08  9:57 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Hellwig, viro, Avi Kivity, Linus Torvalds, linux-aio, linux-fsdevel, linux-kernel

On Tue, Aug 07, 2018 at 09:04:41AM -0700, Andrew Morton wrote:
> > Because it is faster obviously.  I can update the comment.
>
> I meant the comment could explain why it's a trylock instead of a
> spin_lock().

We could do something like the patch below.  Al, do you want me to resend
or can you just fold it in?

diff --git a/fs/aio.c b/fs/aio.c
index 5943098a87c6..84df2c2bf80b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1684,7 +1684,8 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,

 	/*
 	 * Try to complete the iocb inline if we can to avoid a costly
-	 * context switch.
+	 * context switch.  As the waitqueue lock nests inside the ctx
+	 * lock we can only do that if we can get it without waiting.
 	 */
 	if (spin_trylock(&iocb->ki_ctx->ctx_lock)) {
 		list_del(&iocb->ki_list);

^ permalink raw reply related	[flat|nested] 14+ messages in thread
end of thread, other threads:[~2018-08-08 9:51 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-26  8:28 aio poll and a new in-kernel poll API V20 (aka 2.0) Christoph Hellwig
2018-07-26  8:29 ` [PATCH 1/4] timerfd: add support for keyed wakeups Christoph Hellwig
2018-07-26  8:29 ` [PATCH 2/4] aio: add a iocb refcount Christoph Hellwig
2018-07-26 11:22   ` Matthew Wilcox
2018-07-26 11:57     ` Christoph Hellwig
2018-07-27  8:31       ` Christoph Hellwig
2018-07-26  8:29 ` [PATCH 3/4] aio: implement IOCB_CMD_POLL Christoph Hellwig
2018-07-26  8:29 ` [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups Christoph Hellwig
2018-07-30  7:15 aio poll and a new in-kernel poll API V21 (aka 2.0) Christoph Hellwig
2018-07-30  7:15 ` [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups Christoph Hellwig
2018-08-06  8:30 aio poll V22 (aka 2.0) Christoph Hellwig
2018-08-06  8:30 ` [PATCH 4/4] aio: allow direct aio poll comletions for keyed wakeups Christoph Hellwig
2018-08-06 22:27   ` Andrew Morton
2018-08-07  7:25     ` Christoph Hellwig
2018-08-07 16:04       ` Andrew Morton
2018-08-08  9:57         ` Christoph Hellwig