Formerly known as non-blocking AIO. This series adds nonblocking feature to asynchronous I/O writes. io_submit() can be delayed because of a number of reason: - Block allocation for files - Data writebacks for direct I/O - Sleeping because of waiting to acquire i_rwsem - Congested block device The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if any of these conditions are met. This way userspace can push most of the write()s to the kernel to the best of its ability to complete and if it returns -EAGAIN, can defer it to another thread. In order to enable this, IOCB_RW_FLAG_NOWAIT is introduced in uapi/linux/aio_abi.h. If set for aio_rw_flags, it translates to IOCB_NOWAIT for struct iocb, REQ_NOWAIT for bio.bi_opf and IOMAP_NOWAIT for iomap. aio_rw_flags is a new flag replacing aio_reserved1. We could not use aio_flags because it is not currently checked for invalidity in the kernel. This feature is provided for direct I/O of asynchronous I/O only. I have tested it against xfs, ext4, and btrfs while I intend to add more filesystems. The nowait feature is for request based devices. In the future, I intend to add support to stacked devices such as md. Applications will have to check supportability by sending a async direct write and any other error besides -EAGAIN would mean it is not supported. First two patches are prep patches into nowait I/O. Changes since v1: + changed name from _NONBLOCKING to *_NOWAIT + filemap_range_has_page call moved to closer to (just before) calling filemap_write_and_wait_range(). + BIO_NOWAIT limited to get_request() + XFS fixes - included reflink - use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag - Translate the flag through IOMAP_NOWAIT (iomap) to check for block allocation for the file. + ext4 coding style Changes since v2: + Using aio_reserved1 as aio_rw_flags instead of aio_flags + blk-mq support + xfs uptodate with kernel and reflink changes Changes since v3: + Added FS_NOWAIT, which is set if the filesystem supports NOWAIT feature. + Checks in generic_make_request() to make sure BIO_NOWAIT comes in for async direct writes only. + Added QUEUE_FLAG_NOWAIT, which is set if the device supports BIO_NOWAIT. This is added (rather not set) to block devices such as dm/md currently. Changes since v4: + Ported AIO code to use RWF_* flags. Check for RWF_* flags in generic_file_write_iter(). + Changed IOCB_RW_FLAGS_NOWAIT to RWF_NOWAIT. Changes since v5: + BIO_NOWAIT to REQ_NOWAIT + Common helper for RWF flags. Changes since v6: + REQ_NOWAIT will be ignored for request based devices since they cannot block. So, removed QUEUE_FLAG_NOWAIT since it is not required in the current implementation. It will be resurrected when we program for stacked devices. + changed kiocb_rw_flags() to kiocb_set_rw_flags() in order to accomodate for errors. Moved checks in the function. Changes since v7: + split patches into prep so the main patches are smaller and easier to understand + All patches are reviewed or acked! -- Goldwyn
From: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/read_write.c | 12 +++--------- include/linux/fs.h | 14 ++++++++++++++ 2 files changed, 17 insertions(+), 9 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index c4f88afbc67f..362f91cd8d66 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -678,16 +678,10 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter, struct kiocb kiocb; ssize_t ret; - if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC)) - return -EOPNOTSUPP; - init_sync_kiocb(&kiocb, filp); - if (flags & RWF_HIPRI) - kiocb.ki_flags |= IOCB_HIPRI; - if (flags & RWF_DSYNC) - kiocb.ki_flags |= IOCB_DSYNC; - if (flags & RWF_SYNC) - kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC); + ret = kiocb_set_rw_flags(&kiocb, flags); + if (ret) + return ret; kiocb.ki_pos = *ppos; if (type == READ) diff --git a/include/linux/fs.h b/include/linux/fs.h index 7251f7bb45e8..869c9a6fe58d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3049,6 +3049,20 @@ static inline int iocb_flags(struct file *file) return res; } +static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags) +{ + if (unlikely(flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))) + return -EOPNOTSUPP; + + if (flags & RWF_HIPRI) + ki->ki_flags |= IOCB_HIPRI; + if (flags & RWF_DSYNC) + ki->ki_flags |= IOCB_DSYNC; + if (flags & RWF_SYNC) + ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC); + return 0; +} + static inline ino_t parent_ino(struct dentry *dentry) { ino_t res; -- 2.12.0
From: Goldwyn Rodrigues <rgoldwyn@suse.com> filemap_range_has_page() return true if the file's mapping has a page within the range mentioned. This function will be used to check if a write() call will cause a writeback of previous writes. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> --- include/linux/fs.h | 2 ++ mm/filemap.c | 33 +++++++++++++++++++++++++++++++++ 2 files changed, 35 insertions(+) diff --git a/include/linux/fs.h b/include/linux/fs.h index 869c9a6fe58d..2e6fc6a23f91 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2513,6 +2513,8 @@ extern int filemap_fdatawait(struct address_space *); extern void filemap_fdatawait_keep_errors(struct address_space *); extern int filemap_fdatawait_range(struct address_space *, loff_t lstart, loff_t lend); +extern int filemap_range_has_page(struct address_space *, loff_t lstart, + loff_t lend); extern int filemap_write_and_wait(struct address_space *mapping); extern int filemap_write_and_wait_range(struct address_space *mapping, loff_t lstart, loff_t lend); diff --git a/mm/filemap.c b/mm/filemap.c index 1694623a6289..fae5a361befb 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -376,6 +376,39 @@ int filemap_flush(struct address_space *mapping) } EXPORT_SYMBOL(filemap_flush); +/** + * filemap_range_has_page - check if a page exists in range. + * @mapping: address space structure to wait for + * @start_byte: offset in bytes where the range starts + * @end_byte: offset in bytes where the range ends (inclusive) + * + * Find at least one page in the range supplied, usually used to check if + * direct writing in this range will trigger a writeback. + */ +int filemap_range_has_page(struct address_space *mapping, + loff_t start_byte, loff_t end_byte) +{ + pgoff_t index = start_byte >> PAGE_SHIFT; + pgoff_t end = end_byte >> PAGE_SHIFT; + struct pagevec pvec; + int ret; + + if (end_byte < start_byte) + return 0; + + if (mapping->nrpages == 0) + return 0; + + pagevec_init(&pvec, 0); + ret = pagevec_lookup(&pvec, mapping, index, 1); + if (!ret) + return 0; + ret = (pvec.pages[0]->index <= end); + pagevec_release(&pvec); + return ret; +} +EXPORT_SYMBOL(filemap_range_has_page); + static int __filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, loff_t end_byte) { -- 2.12.0
From: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org> aio_rw_flags is introduced in struct iocb (using aio_reserved1) which will carry the RWF_* flags. We cannot use aio_flags because they are not checked for validity which may break existing applications. Note, the only place RWF_HIPRI comes in effect is dio_await_one(). All the rest of the locations, aio code return -EIOCBQUEUED before the checks for RWF_HIPRI. Signed-off-by: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org> Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> --- fs/aio.c | 8 +++++++- include/uapi/linux/aio_abi.h | 2 +- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index f52d925ee259..020fa0045e3c 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1541,7 +1541,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, ssize_t ret; /* enforce forwards compatibility on users */ - if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) { + if (unlikely(iocb->aio_reserved2)) { pr_debug("EINVAL: reserve field set\n"); return -EINVAL; } @@ -1586,6 +1586,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, req->common.ki_flags |= IOCB_EVENTFD; } + ret = kiocb_set_rw_flags(&req->common, iocb->aio_rw_flags); + if (unlikely(ret)) { + pr_debug("EINVAL: aio_rw_flags\n"); + goto out_put_req; + } + ret = put_user(KIOCB_KEY, &user_iocb->aio_key); if (unlikely(ret)) { pr_debug("EFAULT: aio_key\n"); diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h index bb2554f7fbd1..a2d4a8ac94ca 100644 --- a/include/uapi/linux/aio_abi.h +++ b/include/uapi/linux/aio_abi.h @@ -79,7 +79,7 @@ struct io_event { struct iocb { /* these are internal to the kernel/libc. */ __u64 aio_data; /* data to be returned in event's data */ - __u32 PADDED(aio_key, aio_reserved1); + __u32 PADDED(aio_key, aio_rw_flags); /* the kernel sets aio_key to the req # */ /* common fields */ -- 2.12.0
From: Goldwyn Rodrigues <rgoldwyn@suse.com> RWF_NOWAIT informs kernel to bail out if an AIO request will block for reasons such as file allocations, or a writeback triggered, or would block while allocating requests while performing direct I/O. RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags. The check for -EOPNOTSUPP is placed in generic_file_write_iter(). This is called by most filesystems, either through fsops.write_iter() or through the function defined by write_iter(). If not, we perform the check defined by .write_iter() which is called for direct IO specifically. Filesystems xfs, btrfs and ext4 would be supported in the following patches. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/9p/vfs_file.c | 3 +++ fs/aio.c | 6 ++++++ fs/ceph/file.c | 3 +++ fs/cifs/file.c | 3 +++ fs/fuse/file.c | 3 +++ fs/nfs/direct.c | 3 +++ fs/ocfs2/file.c | 3 +++ include/linux/fs.h | 5 ++++- include/uapi/linux/fs.h | 1 + mm/filemap.c | 3 +++ 10 files changed, 32 insertions(+), 1 deletion(-) diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c index 3de3b4a89d89..403681db7723 100644 --- a/fs/9p/vfs_file.c +++ b/fs/9p/vfs_file.c @@ -411,6 +411,9 @@ v9fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from) loff_t origin; int err = 0; + if (iocb->ki_flags & IOCB_NOWAIT) + return -EOPNOTSUPP; + retval = generic_write_checks(iocb, from); if (retval <= 0) return retval; diff --git a/fs/aio.c b/fs/aio.c index 020fa0045e3c..34027b67e2f4 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1592,6 +1592,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, goto out_put_req; } + if ((req->common.ki_flags & IOCB_NOWAIT) && + !(req->common.ki_flags & IOCB_DIRECT)) { + ret = -EOPNOTSUPP; + goto out_put_req; + } + ret = put_user(KIOCB_KEY, &user_iocb->aio_key); if (unlikely(ret)) { pr_debug("EFAULT: aio_key\n"); diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 26cc95421cca..af28419b1731 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -1267,6 +1267,9 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from) int err, want, got; loff_t pos; + if (iocb->ki_flags & IOCB_NOWAIT) + return -EOPNOTSUPP; + if (ceph_snap(inode) != CEPH_NOSNAP) return -EROFS; diff --git a/fs/cifs/file.c b/fs/cifs/file.c index 21d404535739..f8858a06e119 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -2638,6 +2638,9 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from) * write request. */ + if (iocb->ki_flags & IOCB_NOWAIT) + return -EOPNOTSUPP; + rc = generic_write_checks(iocb, from); if (rc <= 0) return rc; diff --git a/fs/fuse/file.c b/fs/fuse/file.c index ec238fb5a584..72786e798319 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1425,6 +1425,9 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from) struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(file); ssize_t res; + if (iocb->ki_flags & IOCB_NOWAIT) + return -EOPNOTSUPP; + if (is_bad_inode(inode)) return -EIO; diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c index c1b5fed7c863..dcea0caa5cb5 100644 --- a/fs/nfs/direct.c +++ b/fs/nfs/direct.c @@ -996,6 +996,9 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter) dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n", file, iov_iter_count(iter), (long long) iocb->ki_pos); + if (iocb->ki_flags & IOCB_NOWAIT) + return -EOPNOTSUPP; + result = generic_write_checks(iocb, iter); if (result <= 0) return result; diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index bfeb647459d9..e7f8ba890305 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2235,6 +2235,9 @@ static ssize_t ocfs2_file_write_iter(struct kiocb *iocb, if (count == 0) return 0; + if (iocb->ki_flags & IOCB_NOWAIT) + return -EOPNOTSUPP; + direct_io = iocb->ki_flags & IOCB_DIRECT ? 1 : 0; inode_lock(inode); diff --git a/include/linux/fs.h b/include/linux/fs.h index 2e6fc6a23f91..7e39b510b7a4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -270,6 +270,7 @@ struct writeback_control; #define IOCB_DSYNC (1 << 4) #define IOCB_SYNC (1 << 5) #define IOCB_WRITE (1 << 6) +#define IOCB_NOWAIT (1 << 7) struct kiocb { struct file *ki_filp; @@ -3053,7 +3054,7 @@ static inline int iocb_flags(struct file *file) static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags) { - if (unlikely(flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))) + if (unlikely(flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT))) return -EOPNOTSUPP; if (flags & RWF_HIPRI) @@ -3062,6 +3063,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags) ki->ki_flags |= IOCB_DSYNC; if (flags & RWF_SYNC) ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC); + if (flags & RWF_NOWAIT) + ki->ki_flags |= IOCB_NOWAIT; return 0; } diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 048a85e9f017..7bcaef101876 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -347,5 +347,6 @@ struct fscrypt_policy { #define RWF_HIPRI 0x00000001 /* high priority request, poll if possible */ #define RWF_DSYNC 0x00000002 /* per-IO O_DSYNC */ #define RWF_SYNC 0x00000004 /* per-IO O_SYNC */ +#define RWF_NOWAIT 0x00000008 /* per-IO, return -EAGAIN if operation would block */ #endif /* _UAPI_LINUX_FS_H */ diff --git a/mm/filemap.c b/mm/filemap.c index fae5a361befb..ca3031f505f2 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3015,6 +3015,9 @@ ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from) struct inode *inode = file->f_mapping->host; ssize_t ret; + if (iocb->ki_flags & IOCB_NOWAIT) + return -EOPNOTSUPP; + inode_lock(inode); ret = generic_write_checks(iocb, from); if (ret > 0) -- 2.12.0
From: Goldwyn Rodrigues <rgoldwyn@suse.com> Find out if the write will trigger a wait due to writeback. If yes, return -EAGAIN. Return -EINVAL for buffered AIO: there are multiple causes of delay such as page locks, dirty throttling logic, page loading from disk etc. which cannot be taken care of. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> --- mm/filemap.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index ca3031f505f2..fd7d175b3dee 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2673,6 +2673,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) pos = iocb->ki_pos; + if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT)) + return -EINVAL; + if (limit != RLIM_INFINITY) { if (iocb->ki_pos >= limit) { send_sig(SIGXFSZ, current, 0); @@ -2742,9 +2745,17 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from) write_len = iov_iter_count(from); end = (pos + write_len - 1) >> PAGE_SHIFT; - written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1); - if (written) - goto out; + if (iocb->ki_flags & IOCB_NOWAIT) { + /* If there are pages to writeback, return */ + if (filemap_range_has_page(inode->i_mapping, pos, + pos + iov_iter_count(from))) + return -EAGAIN; + } else { + written = filemap_write_and_wait_range(mapping, pos, + pos + write_len - 1); + if (written) + goto out; + } /* * After a write we want buffered reads to be sure to go to disk to get -- 2.12.0
From: Goldwyn Rodrigues <rgoldwyn@suse.com> IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps. This is used by XFS in the XFS patch. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/iomap.c | 2 ++ include/linux/iomap.h | 1 + 2 files changed, 3 insertions(+) diff --git a/fs/iomap.c b/fs/iomap.c index 141c3cd55a8b..d1c81753d411 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -885,6 +885,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, } else { dio->flags |= IOMAP_DIO_WRITE; flags |= IOMAP_WRITE; + if (iocb->ki_flags & IOCB_NOWAIT) + flags |= IOMAP_NOWAIT; } if (mapping->nrpages) { diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 7291810067eb..53f6af89c625 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -51,6 +51,7 @@ struct iomap { #define IOMAP_REPORT (1 << 2) /* report extent status, e.g. FIEMAP */ #define IOMAP_FAULT (1 << 3) /* mapping for page fault */ #define IOMAP_DIRECT (1 << 4) /* direct I/O */ +#define IOMAP_NOWAIT (1 << 5) /* Don't wait for writeback */ struct iomap_ops { /* -- 2.12.0
From: Goldwyn Rodrigues <rgoldwyn@suse.com> A new bio operation flag REQ_NOWAIT is introduced to identify bio's orignating from iocb with IOCB_NOWAIT. This flag indicates to return immediately if a request cannot be made instead of retrying. Stacked devices such as md (the ones with make_request_fn hooks) currently are not supported because it may block for housekeeping. For example, an md can have a part of the device suspended. For this reason, only request based devices are supported. In the future, this feature will be expanded to stacked devices by teaching them how to handle the REQ_NOWAIT flags. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> --- block/blk-core.c | 24 ++++++++++++++++++++++-- block/blk-mq-sched.c | 3 +++ block/blk-mq.c | 4 ++++ fs/direct-io.c | 10 ++++++++-- include/linux/bio.h | 6 ++++++ include/linux/blk_types.h | 2 ++ 6 files changed, 45 insertions(+), 4 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index d772c221cc17..effe934b806b 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1232,6 +1232,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op, if (!IS_ERR(rq)) return rq; + if (op & REQ_NOWAIT) { + blk_put_rl(rl); + return ERR_PTR(-EAGAIN); + } + if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) { blk_put_rl(rl); return rq; @@ -1870,6 +1875,17 @@ generic_make_request_checks(struct bio *bio) goto end_io; } + /* + * For a REQ_NOWAIT based request, return -EOPNOTSUPP + * if queue does not have QUEUE_FLAG_NOWAIT_SUPPORT set + * and if it is not a request based queue. + */ + + if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q)) { + err = -EOPNOTSUPP; + goto end_io; + } + part = bio->bi_bdev->bd_part; if (should_fail_request(part, bio->bi_iter.bi_size) || should_fail_request(&part_to_disk(part)->part0, @@ -2021,7 +2037,7 @@ blk_qc_t generic_make_request(struct bio *bio) do { struct request_queue *q = bdev_get_queue(bio->bi_bdev); - if (likely(blk_queue_enter(q, false) == 0)) { + if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) { struct bio_list lower, same; /* Create a fresh bio_list for all subordinate requests */ @@ -2046,7 +2062,11 @@ blk_qc_t generic_make_request(struct bio *bio) bio_list_merge(&bio_list_on_stack[0], &same); bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); } else { - bio_io_error(bio); + if (unlikely(!blk_queue_dying(q) && + (bio->bi_opf & REQ_NOWAIT))) + bio_wouldblock_error(bio); + else + bio_io_error(bio); } bio = bio_list_pop(&bio_list_on_stack[0]); } while (bio); diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index c974a1bbf4cb..019d881d62b7 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q, if (likely(!data->hctx)) data->hctx = blk_mq_map_queue(q, data->ctx->cpu); + if (op & REQ_NOWAIT) + data->flags |= BLK_MQ_REQ_NOWAIT; + if (e) { data->flags |= BLK_MQ_REQ_INTERNAL; diff --git a/block/blk-mq.c b/block/blk-mq.c index c7836a1ded97..d7613ae6a269 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1538,6 +1538,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data); if (unlikely(!rq)) { __wbt_done(q->rq_wb, wb_acct); + if (bio->bi_opf & REQ_NOWAIT) + bio_wouldblock_error(bio); return BLK_QC_T_NONE; } @@ -1662,6 +1664,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data); if (unlikely(!rq)) { __wbt_done(q->rq_wb, wb_acct); + if (bio->bi_opf & REQ_NOWAIT) + bio_wouldblock_error(bio); return BLK_QC_T_NONE; } diff --git a/fs/direct-io.c b/fs/direct-io.c index a04ebea77de8..139ebd5ae1c7 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -480,8 +480,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio) unsigned i; int err; - if (bio->bi_error) - dio->io_error = -EIO; + if (bio->bi_error) { + if (bio->bi_error == -EAGAIN && (bio->bi_opf & REQ_NOWAIT)) + dio->io_error = -EAGAIN; + else + dio->io_error = -EIO; + } if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) { err = bio->bi_error; @@ -1197,6 +1201,8 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, if (iov_iter_rw(iter) == WRITE) { dio->op = REQ_OP_WRITE; dio->op_flags = REQ_SYNC | REQ_IDLE; + if (iocb->ki_flags & IOCB_NOWAIT) + dio->op_flags |= REQ_NOWAIT; } else { dio->op = REQ_OP_READ; } diff --git a/include/linux/bio.h b/include/linux/bio.h index 8e521194f6fc..1a9270744b1e 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -425,6 +425,12 @@ static inline void bio_io_error(struct bio *bio) bio_endio(bio); } +static inline void bio_wouldblock_error(struct bio *bio) +{ + bio->bi_error = -EAGAIN; + bio_endio(bio); +} + struct request_queue; extern int bio_phys_segments(struct request_queue *, struct bio *); diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index d703acb55d0f..5ce4da30ba43 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -187,6 +187,7 @@ enum req_flag_bits { __REQ_PREFLUSH, /* request for cache flush */ __REQ_RAHEAD, /* read ahead, can fail anytime */ __REQ_BACKGROUND, /* background IO */ + __REQ_NOWAIT, /* Don't wait if request will block */ __REQ_NR_BITS, /* stops here */ }; @@ -203,6 +204,7 @@ enum req_flag_bits { #define REQ_PREFLUSH (1ULL << __REQ_PREFLUSH) #define REQ_RAHEAD (1ULL << __REQ_RAHEAD) #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND) +#define REQ_NOWAIT (1ULL << __REQ_NOWAIT) #define REQ_FAILFAST_MASK \ (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER) -- 2.12.0
From: Goldwyn Rodrigues <rgoldwyn@suse.com> Return EAGAIN if any of the following checks fail for direct I/O: + i_rwsem is lockable + Writing beyond end of file (will trigger allocation) + Blocks are not allocated at the write location Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> --- fs/ext4/file.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index cefa9835f275..2efdc6d4d3e8 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -216,7 +216,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) return ext4_dax_write_iter(iocb, from); #endif - inode_lock(inode); + if (iocb->ki_flags & IOCB_NOWAIT) { + if (!inode_trylock(inode)) + return -EAGAIN; + } else { + inode_lock(inode); + } + ret = ext4_write_checks(iocb, from); if (ret <= 0) goto out; @@ -235,9 +241,15 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) iocb->private = &overwrite; /* Check whether we do a DIO overwrite or not */ - if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio && - ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) - overwrite = 1; + if (o_direct && !unaligned_aio) { + if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) { + if (ext4_should_dioread_nolock(inode)) + overwrite = 1; + } else if (iocb->ki_flags & IOCB_NOWAIT) { + ret = -EAGAIN; + goto out; + } + } ret = __generic_file_write_iter(iocb, from); inode_unlock(inode); -- 2.12.0
From: Goldwyn Rodrigues <rgoldwyn@suse.com> If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable immediately. IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin if it needs allocation either due to file extension, writing to a hole, or COW or waiting for other DIOs to finish. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_file.c | 19 ++++++++++++++----- fs/xfs/xfs_iomap.c | 17 +++++++++++++++++ 2 files changed, 31 insertions(+), 5 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 35703a801372..b307940e7d56 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -541,8 +541,11 @@ xfs_file_dio_aio_write( iolock = XFS_IOLOCK_SHARED; } - xfs_ilock(ip, iolock); - + if (!xfs_ilock_nowait(ip, iolock)) { + if (iocb->ki_flags & IOCB_NOWAIT) + return -EAGAIN; + xfs_ilock(ip, iolock); + } ret = xfs_file_aio_write_checks(iocb, from, &iolock); if (ret) goto out; @@ -553,9 +556,15 @@ xfs_file_dio_aio_write( * otherwise demote the lock if we had to take the exclusive lock * for other reasons in xfs_file_aio_write_checks. */ - if (unaligned_io) - inode_dio_wait(inode); - else if (iolock == XFS_IOLOCK_EXCL) { + if (unaligned_io) { + /* If we are going to wait for other DIO to finish, bail */ + if (iocb->ki_flags & IOCB_NOWAIT) { + if (atomic_read(&inode->i_dio_count)) + return -EAGAIN; + } else { + inode_dio_wait(inode); + } + } else if (iolock == XFS_IOLOCK_EXCL) { xfs_ilock_demote(ip, XFS_IOLOCK_EXCL); iolock = XFS_IOLOCK_SHARED; } diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 288ee5b840d7..9baa65eeae9e 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -1015,6 +1015,15 @@ xfs_file_iomap_begin( if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) { if (flags & IOMAP_DIRECT) { + /* + * A reflinked inode will result in CoW alloc. + * FIXME: It could still overwrite on unshared extents + * and not need allocation. + */ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } /* may drop and re-acquire the ilock */ error = xfs_reflink_allocate_cow(ip, &imap, &shared, &lockmode); @@ -1032,6 +1041,14 @@ xfs_file_iomap_begin( if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) { /* + * If nowait is set bail since we are going to make + * allocations. + */ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } + /* * We cap the maximum length we map here to MAX_WRITEBACK_PAGES * pages to keep the chunks of work done where somewhat symmetric * with the work writeback does. This is a completely arbitrary -- 2.12.0
From: Goldwyn Rodrigues <rgoldwyn@suse.com> Return EAGAIN if any of the following checks fail + i_rwsem is not lockable + NODATACOW or PREALLOC is not set + Cannot nocow at the desired location + Writing beyond end of file which is not allocated Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Acked-by: David Sterba <dsterba@suse.com> --- fs/btrfs/file.c | 25 ++++++++++++++++++++----- fs/btrfs/inode.c | 3 +++ 2 files changed, 23 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 520cb7230b2d..a870e5dd2b4d 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1823,12 +1823,29 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, ssize_t num_written = 0; bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host); ssize_t err; - loff_t pos; - size_t count; + loff_t pos = iocb->ki_pos; + size_t count = iov_iter_count(from); loff_t oldsize; int clean_page = 0; - inode_lock(inode); + if ((iocb->ki_flags & IOCB_NOWAIT) && + (iocb->ki_flags & IOCB_DIRECT)) { + /* Don't sleep on inode rwsem */ + if (!inode_trylock(inode)) + return -EAGAIN; + /* + * We will allocate space in case nodatacow is not set, + * so bail + */ + if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | + BTRFS_INODE_PREALLOC)) || + check_can_nocow(BTRFS_I(inode), pos, &count) <= 0) { + inode_unlock(inode); + return -EAGAIN; + } + } else + inode_lock(inode); + err = generic_write_checks(iocb, from); if (err <= 0) { inode_unlock(inode); @@ -1862,8 +1879,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, */ update_time_for_write(inode); - pos = iocb->ki_pos; - count = iov_iter_count(from); start_pos = round_down(pos, fs_info->sectorsize); oldsize = i_size_read(inode); if (start_pos > oldsize) { diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 5e71f1ea3391..47d3fcd86979 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8625,6 +8625,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter) dio_data.overwrite = 1; inode_unlock(inode); relock = true; + } else if (iocb->ki_flags & IOCB_NOWAIT) { + ret = -EAGAIN; + goto out; } ret = btrfs_delalloc_reserve_space(inode, offset, count); if (ret) -- 2.12.0
On Thu 11-05-17 14:17:04, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org> > > RWF_NOWAIT informs kernel to bail out if an AIO request will block > for reasons such as file allocations, or a writeback triggered, > or would block while allocating requests while performing > direct I/O. > > RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags. > > The check for -EOPNOTSUPP is placed in generic_file_write_iter(). This > is called by most filesystems, either through fsops.write_iter() or through > the function defined by write_iter(). If not, we perform the check defined > by .write_iter() which is called for direct IO specifically. > > Filesystems xfs, btrfs and ext4 would be supported in the following patches. ... > diff --git a/fs/aio.c b/fs/aio.c > index 020fa0045e3c..34027b67e2f4 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1592,6 +1592,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, > goto out_put_req; > } > > + if ((req->common.ki_flags & IOCB_NOWAIT) && > + !(req->common.ki_flags & IOCB_DIRECT)) { > + ret = -EOPNOTSUPP; > + goto out_put_req; > + } > + > ret = put_user(KIOCB_KEY, &user_iocb->aio_key); > if (unlikely(ret)) { > pr_debug("EFAULT: aio_key\n"); I think you need to also check here that the IO is write. So that NOWAIT reads don't silently pass. Honza -- Jan Kara <jack-IBi9RG/b67k@public.gmane.org> SUSE Labs, CR
From: Goldwyn Rodrigues <rgoldwyn@suse.com> If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable immediately. IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin if it needs allocation either due to file extension, writing to a hole, or COW or waiting for other DIOs to finish. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_file.c | 19 ++++++++++++++----- fs/xfs/xfs_iomap.c | 17 +++++++++++++++++ 2 files changed, 31 insertions(+), 5 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 35703a801372..b307940e7d56 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -541,8 +541,11 @@ xfs_file_dio_aio_write( iolock = XFS_IOLOCK_SHARED; } - xfs_ilock(ip, iolock); - + if (!xfs_ilock_nowait(ip, iolock)) { + if (iocb->ki_flags & IOCB_NOWAIT) + return -EAGAIN; + xfs_ilock(ip, iolock); + } ret = xfs_file_aio_write_checks(iocb, from, &iolock); if (ret) goto out; @@ -553,9 +556,15 @@ xfs_file_dio_aio_write( * otherwise demote the lock if we had to take the exclusive lock * for other reasons in xfs_file_aio_write_checks. */ - if (unaligned_io) - inode_dio_wait(inode); - else if (iolock == XFS_IOLOCK_EXCL) { + if (unaligned_io) { + /* If we are going to wait for other DIO to finish, bail */ + if (iocb->ki_flags & IOCB_NOWAIT) { + if (atomic_read(&inode->i_dio_count)) + return -EAGAIN; + } else { + inode_dio_wait(inode); + } + } else if (iolock == XFS_IOLOCK_EXCL) { xfs_ilock_demote(ip, XFS_IOLOCK_EXCL); iolock = XFS_IOLOCK_SHARED; } diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 94e5bdf7304c..8b0e3c1e086d 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -1016,6 +1016,15 @@ xfs_file_iomap_begin( if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) { if (flags & IOMAP_DIRECT) { + /* + * A reflinked inode will result in CoW alloc. + * FIXME: It could still overwrite on unshared extents + * and not need allocation. + */ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } /* may drop and re-acquire the ilock */ error = xfs_reflink_allocate_cow(ip, &imap, &shared, &lockmode); @@ -1033,6 +1042,14 @@ xfs_file_iomap_begin( if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) { /* + * If nowait is set bail since we are going to make + * allocations. + */ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } + /* * We cap the maximum length we map here to MAX_WRITEBACK_PAGES * pages to keep the chunks of work done where somewhat symmetric * with the work writeback does. This is a completely arbitrary -- 2.12.0
On Wed, May 24, 2017 at 11:41:49AM -0500, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable > immediately. > > IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin > if it needs allocation either due to file extension, writing to a hole, > or COW or waiting for other DIOs to finish. > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> > Reviewed-by: Christoph Hellwig <hch@lst.de> Looks good, Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --D > --- > fs/xfs/xfs_file.c | 19 ++++++++++++++----- > fs/xfs/xfs_iomap.c | 17 +++++++++++++++++ > 2 files changed, 31 insertions(+), 5 deletions(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 35703a801372..b307940e7d56 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -541,8 +541,11 @@ xfs_file_dio_aio_write( > iolock = XFS_IOLOCK_SHARED; > } > > - xfs_ilock(ip, iolock); > - > + if (!xfs_ilock_nowait(ip, iolock)) { > + if (iocb->ki_flags & IOCB_NOWAIT) > + return -EAGAIN; > + xfs_ilock(ip, iolock); > + } > ret = xfs_file_aio_write_checks(iocb, from, &iolock); > if (ret) > goto out; > @@ -553,9 +556,15 @@ xfs_file_dio_aio_write( > * otherwise demote the lock if we had to take the exclusive lock > * for other reasons in xfs_file_aio_write_checks. > */ > - if (unaligned_io) > - inode_dio_wait(inode); > - else if (iolock == XFS_IOLOCK_EXCL) { > + if (unaligned_io) { > + /* If we are going to wait for other DIO to finish, bail */ > + if (iocb->ki_flags & IOCB_NOWAIT) { > + if (atomic_read(&inode->i_dio_count)) > + return -EAGAIN; > + } else { > + inode_dio_wait(inode); > + } > + } else if (iolock == XFS_IOLOCK_EXCL) { > xfs_ilock_demote(ip, XFS_IOLOCK_EXCL); > iolock = XFS_IOLOCK_SHARED; > } > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 94e5bdf7304c..8b0e3c1e086d 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -1016,6 +1016,15 @@ xfs_file_iomap_begin( > > if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) { > if (flags & IOMAP_DIRECT) { > + /* > + * A reflinked inode will result in CoW alloc. > + * FIXME: It could still overwrite on unshared extents > + * and not need allocation. > + */ > + if (flags & IOMAP_NOWAIT) { > + error = -EAGAIN; > + goto out_unlock; > + } > /* may drop and re-acquire the ilock */ > error = xfs_reflink_allocate_cow(ip, &imap, &shared, > &lockmode); > @@ -1033,6 +1042,14 @@ xfs_file_iomap_begin( > > if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) { > /* > + * If nowait is set bail since we are going to make > + * allocations. > + */ > + if (flags & IOMAP_NOWAIT) { > + error = -EAGAIN; > + goto out_unlock; > + } > + /* > * We cap the maximum length we map here to MAX_WRITEBACK_PAGES > * pages to keep the chunks of work done where somewhat symmetric > * with the work writeback does. This is a completely arbitrary > -- > 2.12.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
Despite my previous reviewed-by tag this will need another fix: xfs_file_iomap_begin needs to return EAGAIN if we don't have the extent list in memoery already. E.g. something like this: if ((flags & IOMAP_NOWAIT) && !(ip->i_d.if_flags & XFS_IFEXTENTS)) { error = -EAGAIN; goto out_unlock; } right after locking the ilock.
On 05/28/2017 04:31 AM, Christoph Hellwig wrote:
> Despite my previous reviewed-by tag this will need another fix:
>
> xfs_file_iomap_begin needs to return EAGAIN if we don't have the extent
> list in memoery already. E.g. something like this:
>
> if ((flags & IOMAP_NOWAIT) && !(ip->i_d.if_flags & XFS_IFEXTENTS)) {
> error = -EAGAIN;
> goto out_unlock;
> }
>
> right after locking the ilock.
>
I am not sure if it is right to penalize the application to write to
file which has been freshly opened (and is the first one to open). It
basically means extent maps needs to be read from disk. Do you see a
reason it would have a non-deterministic wait if it is the only user? I
understand the block layer can block if it has too many requests though.
--
Goldwyn
On Sun 28-05-17 21:38:26, Goldwyn Rodrigues wrote:
> On 05/28/2017 04:31 AM, Christoph Hellwig wrote:
> > Despite my previous reviewed-by tag this will need another fix:
> >
> > xfs_file_iomap_begin needs to return EAGAIN if we don't have the extent
> > list in memoery already. E.g. something like this:
> >
> > if ((flags & IOMAP_NOWAIT) && !(ip->i_d.if_flags & XFS_IFEXTENTS)) {
> > error = -EAGAIN;
> > goto out_unlock;
> > }
> >
> > right after locking the ilock.
>
> I am not sure if it is right to penalize the application to write to
> file which has been freshly opened (and is the first one to open). It
> basically means extent maps needs to be read from disk. Do you see a
> reason it would have a non-deterministic wait if it is the only user? I
> understand the block layer can block if it has too many requests though.
Well, submitting such write will have to wait for read of metadata from
disk. That is certainly considered blocking so Christoph is right that we
must return EAGAIN in such case.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Sun, May 28, 2017 at 09:38:26PM -0500, Goldwyn Rodrigues wrote:
>
>
> On 05/28/2017 04:31 AM, Christoph Hellwig wrote:
> > Despite my previous reviewed-by tag this will need another fix:
> >
> > xfs_file_iomap_begin needs to return EAGAIN if we don't have the extent
> > list in memoery already. E.g. something like this:
> >
> > if ((flags & IOMAP_NOWAIT) && !(ip->i_d.if_flags & XFS_IFEXTENTS)) {
> > error = -EAGAIN;
> > goto out_unlock;
> > }
> >
> > right after locking the ilock.
> >
>
> I am not sure if it is right to penalize the application to write to
> file which has been freshly opened (and is the first one to open). It
> basically means extent maps needs to be read from disk. Do you see a
> reason it would have a non-deterministic wait if it is the only user? I
> understand the block layer can block if it has too many requests though.
For either a read or a write we might have to read in the extent list
(note that for few enough extents they are stored in the inode and
we won't have to), in which case the call will block and by the
semantics you define we'll need to return -EAGAIN.
Btw, can you write a small blurb up for the man page to document these
ѕemantics in man-page like language?
On 05/29/2017 03:33 AM, Christoph Hellwig wrote: > On Sun, May 28, 2017 at 09:38:26PM -0500, Goldwyn Rodrigues wrote: >> >> >> On 05/28/2017 04:31 AM, Christoph Hellwig wrote: >>> Despite my previous reviewed-by tag this will need another fix: >>> >>> xfs_file_iomap_begin needs to return EAGAIN if we don't have the extent >>> list in memoery already. E.g. something like this: >>> >>> if ((flags & IOMAP_NOWAIT) && !(ip->i_d.if_flags & XFS_IFEXTENTS)) { >>> error = -EAGAIN; >>> goto out_unlock; >>> } >>> >>> right after locking the ilock. >>> >> >> I am not sure if it is right to penalize the application to write to >> file which has been freshly opened (and is the first one to open). It >> basically means extent maps needs to be read from disk. Do you see a >> reason it would have a non-deterministic wait if it is the only user? I >> understand the block layer can block if it has too many requests though. > > For either a read or a write we might have to read in the extent list > (note that for few enough extents they are stored in the inode and > we won't have to), in which case the call will block and by the > semantics you define we'll need to return -EAGAIN. Yes, that is right. I will include it in. > > Btw, can you write a small blurb up for the man page to document these > ѕemantics in man-page like language? > Yes, but which man page would it belong to? Should it be a subsection of errors in io_getevents/io_submit. We don't want to add ERRORS to io_getevents() because it would be the return value of the io_getevents call, and not the ones in the iocb structure. Should it be a new man page, say for iocb(7/8)? -- Goldwyn
On Tue 30-05-17 11:13:29, Goldwyn Rodrigues wrote:
> > Btw, can you write a small blurb up for the man page to document these
> > ѕemantics in man-page like language?
> >
>
> Yes, but which man page would it belong to?
> Should it be a subsection of errors in io_getevents/io_submit. We don't
> want to add ERRORS to io_getevents() because it would be the return
> value of the io_getevents call, and not the ones in the iocb structure.
> Should it be a new man page, say for iocb(7/8)?
I think you should extend the manpage for io_submit(8). There you can add
definition of struct iocb in 'DESCRIPTION' section explaining at least the
most common fields. You can also explain there which flags can be passed
and what are they intended to do.
You can also expand EAGAIN error description to specifically mention that
in case of NOWAIT aio EAGAIN can be returned if io submission would block.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
From: Goldwyn Rodrigues <rgoldwyn@suse.com> If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable immediately. IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin if it needs allocation either due to file extension, writing to a hole, or COW or waiting for other DIOs to finish. Return -EAGAIN if we don't have extent list in memory. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/xfs/xfs_file.c | 19 ++++++++++++++----- fs/xfs/xfs_iomap.c | 22 ++++++++++++++++++++++ 2 files changed, 36 insertions(+), 5 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 5fb5a0958a14..f87a8a66e6f7 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -541,8 +541,11 @@ xfs_file_dio_aio_write( iolock = XFS_IOLOCK_SHARED; } - xfs_ilock(ip, iolock); - + if (!xfs_ilock_nowait(ip, iolock)) { + if (iocb->ki_flags & IOCB_NOWAIT) + return -EAGAIN; + xfs_ilock(ip, iolock); + } ret = xfs_file_aio_write_checks(iocb, from, &iolock); if (ret) goto out; @@ -553,9 +556,15 @@ xfs_file_dio_aio_write( * otherwise demote the lock if we had to take the exclusive lock * for other reasons in xfs_file_aio_write_checks. */ - if (unaligned_io) - inode_dio_wait(inode); - else if (iolock == XFS_IOLOCK_EXCL) { + if (unaligned_io) { + /* If we are going to wait for other DIO to finish, bail */ + if (iocb->ki_flags & IOCB_NOWAIT) { + if (atomic_read(&inode->i_dio_count)) + return -EAGAIN; + } else { + inode_dio_wait(inode); + } + } else if (iolock == XFS_IOLOCK_EXCL) { xfs_ilock_demote(ip, XFS_IOLOCK_EXCL); iolock = XFS_IOLOCK_SHARED; } diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 94e5bdf7304c..05dc87e8c1f5 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -995,6 +995,11 @@ xfs_file_iomap_begin( lockmode = xfs_ilock_data_map_shared(ip); } + if ((flags & IOMAP_NOWAIT) && !(ip->i_df.if_flags & XFS_IFEXTENTS)) { + error = -EAGAIN; + goto out_unlock; + } + ASSERT(offset <= mp->m_super->s_maxbytes); if ((xfs_fsize_t)offset + length > mp->m_super->s_maxbytes) length = mp->m_super->s_maxbytes - offset; @@ -1016,6 +1021,15 @@ xfs_file_iomap_begin( if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) { if (flags & IOMAP_DIRECT) { + /* + * A reflinked inode will result in CoW alloc. + * FIXME: It could still overwrite on unshared extents + * and not need allocation. + */ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } /* may drop and re-acquire the ilock */ error = xfs_reflink_allocate_cow(ip, &imap, &shared, &lockmode); @@ -1033,6 +1047,14 @@ xfs_file_iomap_begin( if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) { /* + * If nowait is set bail since we are going to make + * allocations. + */ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } + /* * We cap the maximum length we map here to MAX_WRITEBACK_PAGES * pages to keep the chunks of work done where somewhat symmetric * with the work writeback does. This is a completely arbitrary -- 2.12.0
From: Goldwyn Rodrigues <rgoldwyn@suse.com> If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable immediately. IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin if it needs allocation either due to file extension, writing to a hole, or COW or waiting for other DIOs to finish. Return -EAGAIN if we don't have extent list in memory. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/xfs/xfs_file.c | 19 ++++++++++++++----- fs/xfs/xfs_iomap.c | 22 ++++++++++++++++++++++ 2 files changed, 36 insertions(+), 5 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 5fb5a0958a14..f87a8a66e6f7 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -541,8 +541,11 @@ xfs_file_dio_aio_write( iolock = XFS_IOLOCK_SHARED; } - xfs_ilock(ip, iolock); - + if (!xfs_ilock_nowait(ip, iolock)) { + if (iocb->ki_flags & IOCB_NOWAIT) + return -EAGAIN; + xfs_ilock(ip, iolock); + } ret = xfs_file_aio_write_checks(iocb, from, &iolock); if (ret) goto out; @@ -553,9 +556,15 @@ xfs_file_dio_aio_write( * otherwise demote the lock if we had to take the exclusive lock * for other reasons in xfs_file_aio_write_checks. */ - if (unaligned_io) - inode_dio_wait(inode); - else if (iolock == XFS_IOLOCK_EXCL) { + if (unaligned_io) { + /* If we are going to wait for other DIO to finish, bail */ + if (iocb->ki_flags & IOCB_NOWAIT) { + if (atomic_read(&inode->i_dio_count)) + return -EAGAIN; + } else { + inode_dio_wait(inode); + } + } else if (iolock == XFS_IOLOCK_EXCL) { xfs_ilock_demote(ip, XFS_IOLOCK_EXCL); iolock = XFS_IOLOCK_SHARED; } diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 94e5bdf7304c..05dc87e8c1f5 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -995,6 +995,11 @@ xfs_file_iomap_begin( lockmode = xfs_ilock_data_map_shared(ip); } + if ((flags & IOMAP_NOWAIT) && !(ip->i_df.if_flags & XFS_IFEXTENTS)) { + error = -EAGAIN; + goto out_unlock; + } + ASSERT(offset <= mp->m_super->s_maxbytes); if ((xfs_fsize_t)offset + length > mp->m_super->s_maxbytes) length = mp->m_super->s_maxbytes - offset; @@ -1016,6 +1021,15 @@ xfs_file_iomap_begin( if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) { if (flags & IOMAP_DIRECT) { + /* + * A reflinked inode will result in CoW alloc. + * FIXME: It could still overwrite on unshared extents + * and not need allocation. + */ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } /* may drop and re-acquire the ilock */ error = xfs_reflink_allocate_cow(ip, &imap, &shared, &lockmode); @@ -1033,6 +1047,14 @@ xfs_file_iomap_begin( if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) { /* + * If nowait is set bail since we are going to make + * allocations. + */ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } + /* * We cap the maximum length we map here to MAX_WRITEBACK_PAGES * pages to keep the chunks of work done where somewhat symmetric * with the work writeback does. This is a completely arbitrary -- 2.12.0
From: Goldwyn Rodrigues <rgoldwyn@suse.com> If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable immediately. IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin if it needs allocation either due to file extension, writing to a hole, or COW or waiting for other DIOs to finish. Return -EAGAIN if we don't have extent list in memory. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/xfs/xfs_file.c | 20 +++++++++++++++----- fs/xfs/xfs_iomap.c | 22 ++++++++++++++++++++++ 2 files changed, 37 insertions(+), 5 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 5fb5a0958a14..e159eb381d9f 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -541,8 +541,11 @@ xfs_file_dio_aio_write( iolock = XFS_IOLOCK_SHARED; } - xfs_ilock(ip, iolock); - + if (!xfs_ilock_nowait(ip, iolock)) { + if (iocb->ki_flags & IOCB_NOWAIT) + return -EAGAIN; + xfs_ilock(ip, iolock); + } ret = xfs_file_aio_write_checks(iocb, from, &iolock); if (ret) goto out; @@ -553,9 +556,15 @@ xfs_file_dio_aio_write( * otherwise demote the lock if we had to take the exclusive lock * for other reasons in xfs_file_aio_write_checks. */ - if (unaligned_io) - inode_dio_wait(inode); - else if (iolock == XFS_IOLOCK_EXCL) { + if (unaligned_io) { + /* If we are going to wait for other DIO to finish, bail */ + if (iocb->ki_flags & IOCB_NOWAIT) { + if (atomic_read(&inode->i_dio_count)) + return -EAGAIN; + } else { + inode_dio_wait(inode); + } + } else if (iolock == XFS_IOLOCK_EXCL) { xfs_ilock_demote(ip, XFS_IOLOCK_EXCL); iolock = XFS_IOLOCK_SHARED; } @@ -892,6 +901,7 @@ xfs_file_open( return -EFBIG; if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb))) return -EIO; + file->f_mode |= FMODE_AIO_NOWAIT; return 0; } diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 94e5bdf7304c..05dc87e8c1f5 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -995,6 +995,11 @@ xfs_file_iomap_begin( lockmode = xfs_ilock_data_map_shared(ip); } + if ((flags & IOMAP_NOWAIT) && !(ip->i_df.if_flags & XFS_IFEXTENTS)) { + error = -EAGAIN; + goto out_unlock; + } + ASSERT(offset <= mp->m_super->s_maxbytes); if ((xfs_fsize_t)offset + length > mp->m_super->s_maxbytes) length = mp->m_super->s_maxbytes - offset; @@ -1016,6 +1021,15 @@ xfs_file_iomap_begin( if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) { if (flags & IOMAP_DIRECT) { + /* + * A reflinked inode will result in CoW alloc. + * FIXME: It could still overwrite on unshared extents + * and not need allocation. + */ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } /* may drop and re-acquire the ilock */ error = xfs_reflink_allocate_cow(ip, &imap, &shared, &lockmode); @@ -1033,6 +1047,14 @@ xfs_file_iomap_begin( if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) { /* + * If nowait is set bail since we are going to make + * allocations. + */ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } + /* * We cap the maximum length we map here to MAX_WRITEBACK_PAGES * pages to keep the chunks of work done where somewhat symmetric * with the work writeback does. This is a completely arbitrary -- 2.12.0