All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/8 v2] Non-blocking AIO
@ 2017-02-28 23:36 Goldwyn Rodrigues
  2017-02-28 23:36 ` [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT Goldwyn Rodrigues
                   ` (8 more replies)
  0 siblings, 9 replies; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-02-28 23:36 UTC (permalink / raw)
  To: jack; +Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4, linux-xfs

This series adds nonblocking feature to asynchronous I/O writes.
io_submit() can be delayed because of a number of reason:
 - Block allocation for files
 - Data writebacks for direct I/O
 - Sleeping because of waiting to acquire i_rwsem
 - Congested block device

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.

In order to enable this, IOCB_FLAG_NOWAIT is introduced in
uapi/linux/aio_abi.h which translates to IOCB_NOWAIT for struct iocb,
BIO_NOWAIT for bio and IOMAP_NOWAIT for iomap.

This feature is provided for direct I/O of asynchronous I/O only. I have
tested it against xfs, ext4, and btrfs.

Changes since v1:
 + Forwardported from 4.9.10
 + changed name from _NONBLOCKING to *_NOWAIT
 + filemap_range_has_page call moved to closer to (just before) calling filemap_write_and_wait_range().
 + BIO_NOWAIT limited to get_request()
 + XFS fixes 
	- included reflink 
	- use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
	- Translate the flag through IOMAP_NOWAIT (iomap) to check for
	  block allocation for the file.
 + ext4 coding style
-- 
Goldwyn

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT
  2017-02-28 23:36 [PATCH 0/8 v2] Non-blocking AIO Goldwyn Rodrigues
@ 2017-02-28 23:36 ` Goldwyn Rodrigues
  2017-03-01 15:36   ` Christoph Hellwig
  2017-02-28 23:36 ` [PATCH 2/8] nowait aio: Return if cannot get hold of i_rwsem Goldwyn Rodrigues
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-02-28 23:36 UTC (permalink / raw)
  To: jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This flag informs kernel to bail out if an AIO request will block
for reasons such as file allocations, or a writeback triggered,
or would block while allocating requests while performing
direct I/O.

IOCB_FLAG_NOWAIT is translated to IOCB_NOWAIT for
iocb->ki_flags.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/aio.c                     | 3 +++
 include/linux/fs.h           | 1 +
 include/uapi/linux/aio_abi.h | 3 +++
 3 files changed, 7 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 873b4ca..5ae19ba 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1586,6 +1586,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		req->common.ki_flags |= IOCB_EVENTFD;
 	}
 
+	if (iocb->aio_flags & IOCB_FLAG_NOWAIT)
+		req->common.ki_flags |= IOCB_NOWAIT;
+
 	ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
 	if (unlikely(ret)) {
 		pr_debug("EFAULT: aio_key\n");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2ba0743..ab2f556 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -270,6 +270,7 @@ struct writeback_control;
 #define IOCB_DSYNC		(1 << 4)
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
+#define IOCB_NOWAIT		(1 << 7)
 
 struct kiocb {
 	struct file		*ki_filp;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f..82d1d94 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -51,8 +51,11 @@ enum {
  *
  * IOCB_FLAG_RESFD - Set if the "aio_resfd" member of the "struct iocb"
  *                   is valid.
+ * IOCB_FLAG_NOWAIT - Set if the user wants the iocb to fail if it would block
+ *			for operations such as disk allocation.
  */
 #define IOCB_FLAG_RESFD		(1 << 0)
+#define IOCB_FLAG_NOWAIT	(1 << 1)
 
 /* read() from /dev/aio returns these structures. */
 struct io_event {
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 2/8] nowait aio: Return if cannot get hold of i_rwsem
  2017-02-28 23:36 [PATCH 0/8 v2] Non-blocking AIO Goldwyn Rodrigues
  2017-02-28 23:36 ` [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT Goldwyn Rodrigues
@ 2017-02-28 23:36 ` Goldwyn Rodrigues
  2017-03-01 15:37   ` Christoph Hellwig
  2017-02-28 23:36 ` [PATCH 3/8] nowait aio: return if direct write will trigger writeback Goldwyn Rodrigues
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-02-28 23:36 UTC (permalink / raw)
  To: jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

A failure to lock i_rwsem would mean there is I/O being performed
by another thread. So, let's bail.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 mm/filemap.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 3f9afde..78dd50e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2973,7 +2973,12 @@ ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct inode *inode = file->f_mapping->host;
 	ssize_t ret;
 
-	inode_lock(inode);
+	if (!inode_trylock(inode)) {
+		/* Don't sleep on inode rwsem */
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		inode_lock(inode);
+	}
 	ret = generic_write_checks(iocb, from);
 	if (ret > 0)
 		ret = __generic_file_write_iter(iocb, from);
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 3/8] nowait aio: return if direct write will trigger writeback
  2017-02-28 23:36 [PATCH 0/8 v2] Non-blocking AIO Goldwyn Rodrigues
  2017-02-28 23:36 ` [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT Goldwyn Rodrigues
  2017-02-28 23:36 ` [PATCH 2/8] nowait aio: Return if cannot get hold of i_rwsem Goldwyn Rodrigues
@ 2017-02-28 23:36 ` Goldwyn Rodrigues
  2017-03-01  3:46   ` Matthew Wilcox
  2017-02-28 23:36 ` [PATCH 4/8] nowait aio: Introduce IOMAP_NOWAIT Goldwyn Rodrigues
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-02-28 23:36 UTC (permalink / raw)
  To: jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Find out if the write will trigger a wait due to writeback. If yes,
return -EAGAIN.

This introduces a new function filemap_range_has_page() which
returns true if the file's mapping has a page within the range
mentioned.

Return -EINVAL for buffered AIO: there are multiple causes of
delay such as page locks, dirty throttling logic, page loading
from disk etc. which cannot be taken care of.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 50 +++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index ab2f556..527ef53 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2494,6 +2494,8 @@ extern int filemap_fdatawait(struct address_space *);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
 				   loff_t lend);
+extern int filemap_range_has_page(struct address_space *, loff_t lstart,
+				   loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
 				        loff_t lstart, loff_t lend);
diff --git a/mm/filemap.c b/mm/filemap.c
index 78dd50e..82335f4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -375,6 +375,39 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
+/**
+ * filemap_range_has_page - check if a page exists in range.
+ * @mapping:		address space structure to wait for
+ * @start_byte:		offset in bytes where the range starts
+ * @end_byte:		offset in bytes where the range ends (inclusive)
+ *
+ * Find at least one page in the range supplied, usually used to check if
+ * direct writing in this range will trigger a writeback.
+ */
+int filemap_range_has_page(struct address_space *mapping,
+				loff_t start_byte, loff_t end_byte)
+{
+	pgoff_t index = start_byte >> PAGE_SHIFT;
+	pgoff_t end = end_byte >> PAGE_SHIFT;
+	struct pagevec pvec;
+	int ret;
+
+	if (end_byte < start_byte)
+		return 0;
+
+	if (mapping->nrpages == 0)
+		return 0;
+
+	pagevec_init(&pvec, 0);
+	ret = pagevec_lookup(&pvec, mapping, index, 1);
+	if (!ret)
+		return 0;
+	ret = (pvec.pages[0]->index <= end);
+	pagevec_release(&pvec);
+	return ret;
+}
+EXPORT_SYMBOL(filemap_range_has_page);
+
 static int __filemap_fdatawait_range(struct address_space *mapping,
 				     loff_t start_byte, loff_t end_byte)
 {
@@ -2631,6 +2664,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 
 	pos = iocb->ki_pos;
 
+	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+		return -EINVAL;
+
 	if (limit != RLIM_INFINITY) {
 		if (iocb->ki_pos >= limit) {
 			send_sig(SIGXFSZ, current, 0);
@@ -2700,9 +2736,17 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	write_len = iov_iter_count(from);
 	end = (pos + write_len - 1) >> PAGE_SHIFT;
 
-	written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
-	if (written)
-		goto out;
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		/* If there are pages to writeback, return */
+		if (filemap_range_has_page(inode->i_mapping, pos,
+					   pos + iov_iter_count(from)))
+			return -EAGAIN;
+	} else {
+		written = filemap_write_and_wait_range(mapping, pos,
+							pos + write_len - 1);
+		if (written)
+			goto out;
+	}
 
 	/*
 	 * After a write we want buffered reads to be sure to go to disk to get
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 4/8] nowait aio: Introduce IOMAP_NOWAIT
  2017-02-28 23:36 [PATCH 0/8 v2] Non-blocking AIO Goldwyn Rodrigues
                   ` (2 preceding siblings ...)
  2017-02-28 23:36 ` [PATCH 3/8] nowait aio: return if direct write will trigger writeback Goldwyn Rodrigues
@ 2017-02-28 23:36 ` Goldwyn Rodrigues
  2017-02-28 23:36 ` [PATCH 5/8] nowait aio: return on congested block device Goldwyn Rodrigues
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-02-28 23:36 UTC (permalink / raw)
  To: jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
This is used by XFS in the XFS patch.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/iomap.c            | 2 ++
 include/linux/iomap.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index a51cb4c..3fb68d2 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -883,6 +883,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, struct iomap_ops *ops,
 	} else {
 		dio->flags |= IOMAP_DIO_WRITE;
 		flags |= IOMAP_WRITE;
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			flags |= IOMAP_NOWAIT;
 	}
 
 	if (mapping->nrpages) {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index a4c94b8..d1c33ef 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -51,6 +51,7 @@ struct iomap {
 #define IOMAP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
+#define IOMAP_NOWAIT		(1 << 5) /* Don't wait for writeback */
 
 struct iomap_ops {
 	/*
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 5/8] nowait aio: return on congested block device
  2017-02-28 23:36 [PATCH 0/8 v2] Non-blocking AIO Goldwyn Rodrigues
                   ` (3 preceding siblings ...)
  2017-02-28 23:36 ` [PATCH 4/8] nowait aio: Introduce IOMAP_NOWAIT Goldwyn Rodrigues
@ 2017-02-28 23:36 ` Goldwyn Rodrigues
  2017-03-08  7:03   ` Sagi Grimberg
  2017-02-28 23:36 ` [PATCH 6/8] nowait aio: ext4 Goldwyn Rodrigues
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-02-28 23:36 UTC (permalink / raw)
  To: jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

A new flag BIO_NOWAIT is introduced to identify bio's
orignating from iocb with IOCB_NOWAIT. This flag indicates
to return immediately if a request cannot be made instead
of retrying.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 block/blk-core.c          | 13 +++++++++++--
 fs/direct-io.c            | 11 +++++++++--
 include/linux/blk_types.h |  1 +
 3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c..e5cfc50 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1258,6 +1258,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 	if (!IS_ERR(rq))
 		return rq;
 
+	if (bio_flagged(bio, BIO_NOWAIT)) {
+		blk_put_rl(rl);
+		return ERR_PTR(-EAGAIN);
+	}
+
 	if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
 		blk_put_rl(rl);
 		return rq;
@@ -2018,7 +2023,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-		if (likely(blk_queue_enter(q, false) == 0)) {
+		if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
@@ -2027,7 +2032,11 @@ blk_qc_t generic_make_request(struct bio *bio)
 		} else {
 			struct bio *bio_next = bio_list_pop(current->bio_list);
 
-			bio_io_error(bio);
+			if (unlikely(bio_flagged(bio, BIO_NOWAIT))) {
+				bio->bi_error = -EAGAIN;
+				bio_endio(bio);
+			} else
+				bio_io_error(bio);
 			bio = bio_next;
 		}
 	} while (bio);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index c87bae4..2973df0 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,9 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	else
 		bio->bi_end_io = dio_bio_end_io;
 
+	if (dio->iocb->ki_flags & IOCB_NOWAIT)
+		bio_set_flag(bio, BIO_NOWAIT);
+
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
@@ -480,8 +483,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
 	unsigned i;
 	int err;
 
-	if (bio->bi_error)
-		dio->io_error = -EIO;
+	if (bio->bi_error) {
+		if (bio_flagged(bio, BIO_NOWAIT))
+			dio->io_error = bio->bi_error;
+		else
+			dio->io_error = -EIO;
+	}
 
 	if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
 		err = bio->bi_error;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 519ea2c..1d77e9b 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -102,6 +102,7 @@ struct bio {
 #define BIO_REFFED	8	/* bio has elevated ->bi_cnt */
 #define BIO_THROTTLED	9	/* This bio has already been subjected to
 				 * throttling rules. Don't do it again. */
+#define BIO_NOWAIT	10	/* don't block over blk device congestion */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 6/8] nowait aio: ext4
  2017-02-28 23:36 [PATCH 0/8 v2] Non-blocking AIO Goldwyn Rodrigues
                   ` (4 preceding siblings ...)
  2017-02-28 23:36 ` [PATCH 5/8] nowait aio: return on congested block device Goldwyn Rodrigues
@ 2017-02-28 23:36 ` Goldwyn Rodrigues
  2017-02-28 23:36 ` [PATCH 7/8] nowait aio: xfs Goldwyn Rodrigues
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-02-28 23:36 UTC (permalink / raw)
  To: jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Return EAGAIN if any of the following checks fail for direct I/O:
 + i_rwsem is lockable
 + Writing beyond end of file (will trigger allocation)
 + Blocks are not allocated at the write location

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/ext4/file.c | 53 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 35 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d663d3d..391e03b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -124,27 +124,22 @@ ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
 	return 0;
 }
 
-/* Is IO overwriting allocated and initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
+/* Are IO blocks allocated */
+static bool ext4_blocks_mapped(struct inode *inode, loff_t pos, loff_t len,
+				struct ext4_map_blocks *map)
 {
-	struct ext4_map_blocks map;
 	unsigned int blkbits = inode->i_blkbits;
 	int err, blklen;
 
 	if (pos + len > i_size_read(inode))
 		return false;
 
-	map.m_lblk = pos >> blkbits;
-	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
-	blklen = map.m_len;
+	map->m_lblk = pos >> blkbits;
+	map->m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
+	blklen = map->m_len;
 
-	err = ext4_map_blocks(NULL, inode, &map, 0);
-	/*
-	 * 'err==len' means that all of the blocks have been preallocated,
-	 * regardless of whether they have been initialized or not. To exclude
-	 * unwritten extents, we need to check m_flags.
-	 */
-	return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
+	err = ext4_map_blocks(NULL, inode, map, 0);
+	return err == blklen;
 }
 
 static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
@@ -176,6 +171,7 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 	bool overwrite = false;
+	struct ext4_map_blocks map;
 
 	inode_lock(inode);
 	ret = ext4_write_checks(iocb, from);
@@ -188,7 +184,9 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret)
 		goto out;
 
-	if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
+	if (ext4_blocks_mapped(inode, iocb->ki_pos,
+				iov_iter_count(from), &map) &&
+	    (map.m_flags & EXT4_MAP_MAPPED)) {
 		overwrite = true;
 		downgrade_write(&inode->i_rwsem);
 	}
@@ -209,6 +207,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	int o_direct = iocb->ki_flags & IOCB_DIRECT;
+	int nowait = iocb->ki_flags & IOCB_NOWAIT;
 	int unaligned_aio = 0;
 	int overwrite = 0;
 	ssize_t ret;
@@ -218,7 +217,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		return ext4_dax_write_iter(iocb, from);
 #endif
 
-	inode_lock(inode);
+	if (o_direct && nowait) {
+		if (!inode_trylock(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock(inode);
+	}
+
 	ret = ext4_write_checks(iocb, from);
 	if (ret <= 0)
 		goto out;
@@ -237,9 +242,21 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	iocb->private = &overwrite;
 	/* Check whether we do a DIO overwrite or not */
-	if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
-	    ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
-		overwrite = 1;
+	if (o_direct && !unaligned_aio) {
+		struct ext4_map_blocks map;
+		if (ext4_blocks_mapped(inode, iocb->ki_pos,
+				      iov_iter_count(from), &map)) {
+	 		/* To exclude unwritten extents, we need to check
+			 * m_flags.
+			 */
+			if (ext4_should_dioread_nolock(inode) &&
+			    (map.m_flags & EXT4_MAP_MAPPED))
+				overwrite = 1;
+		} else if (iocb->ki_flags & IOCB_NOWAIT) {
+			ret = -EAGAIN;
+			goto out;
+		}
+	}
 
 	ret = __generic_file_write_iter(iocb, from);
 	inode_unlock(inode);
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 7/8] nowait aio: xfs
  2017-02-28 23:36 [PATCH 0/8 v2] Non-blocking AIO Goldwyn Rodrigues
                   ` (5 preceding siblings ...)
  2017-02-28 23:36 ` [PATCH 6/8] nowait aio: ext4 Goldwyn Rodrigues
@ 2017-02-28 23:36 ` Goldwyn Rodrigues
  2017-03-01 15:40   ` Christoph Hellwig
  2017-02-28 23:36 ` [PATCH 8/8] nowait aio: btrfs Goldwyn Rodrigues
  2017-03-05 14:56 ` [PATCH 0/8 v2] Non-blocking AIO Avi Kivity
  8 siblings, 1 reply; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-02-28 23:36 UTC (permalink / raw)
  To: jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable
immediately.

IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin
if it needs allocation either due to file extending, writing to a hole,
or COW.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/xfs/xfs_file.c  | 9 +++++++--
 fs/xfs/xfs_iomap.c | 9 +++++++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index bbb9eb6..7e16a83 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -528,12 +528,17 @@ xfs_file_dio_aio_write(
 	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
 		unaligned_io = 1;
 		iolock = XFS_IOLOCK_EXCL;
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
 	} else {
 		iolock = XFS_IOLOCK_SHARED;
 	}
 
-	xfs_ilock(ip, iolock);
-
+	if (!xfs_ilock_nowait(ip, iolock)) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		xfs_ilock(ip, iolock);
+	}
 	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
 	if (ret)
 		goto out;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 1aa3abd..84f981a 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1020,6 +1020,11 @@ xfs_file_iomap_begin(
 	if ((flags & IOMAP_REPORT) ||
 	    (xfs_is_reflink_inode(ip) &&
 	     (flags & IOMAP_WRITE) && (flags & IOMAP_DIRECT))) {
+		/* Allocations due to reflinks */
+		if ((flags & IOMAP_NOWAIT) && !(flags & IOMAP_REPORT)) {
+			error = -EAGAIN;
+			goto out_unlock;
+		}
 		/* Trim the mapping to the nearest shared extent boundary. */
 		error = xfs_reflink_trim_around_shared(ip, &imap, &shared,
 				&trimmed);
@@ -1049,6 +1054,10 @@ xfs_file_iomap_begin(
 	}
 
 	if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) {
+		if (flags & IOMAP_NOWAIT) {
+			error = -EAGAIN;
+			goto out_unlock;
+		}
 		/*
 		 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 		 * pages to keep the chunks of work done where somewhat symmetric
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 8/8] nowait aio: btrfs
  2017-02-28 23:36 [PATCH 0/8 v2] Non-blocking AIO Goldwyn Rodrigues
                   ` (6 preceding siblings ...)
  2017-02-28 23:36 ` [PATCH 7/8] nowait aio: xfs Goldwyn Rodrigues
@ 2017-02-28 23:36 ` Goldwyn Rodrigues
  2017-03-05 14:56 ` [PATCH 0/8 v2] Non-blocking AIO Avi Kivity
  8 siblings, 0 replies; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-02-28 23:36 UTC (permalink / raw)
  To: jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Return EAGAIN if any of the following checks fail
 + i_rwsem is not lockable
 + NODATACOW or PREALLOC is not set
 + Cannot nocow at the desired location
 + Writing beyond end of file which is not allocated

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/file.c  | 25 ++++++++++++++++++++-----
 fs/btrfs/inode.c |  3 +++
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b5c5da2..8640280 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1819,12 +1819,29 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	ssize_t num_written = 0;
 	bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
 	ssize_t err;
-	loff_t pos;
-	size_t count;
+	loff_t pos = iocb->ki_pos;
+	size_t count = iov_iter_count(from);
 	loff_t oldsize;
 	int clean_page = 0;
 
-	inode_lock(inode);
+	if ((iocb->ki_flags & IOCB_NOWAIT) &&
+			(iocb->ki_flags & IOCB_DIRECT)) {
+		/* Don't sleep on inode rwsem */
+		if (!inode_trylock(inode))
+			return -EAGAIN;
+		/*
+		 * We will allocate space in case nodatacow is not set,
+		 * so bail
+		 */
+		if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+					      BTRFS_INODE_PREALLOC)) ||
+		    check_can_nocow(inode, pos, &count) <= 0) {
+			inode_unlock(inode);
+			return -EAGAIN;
+		}
+	} else
+		inode_lock(inode);
+
 	err = generic_write_checks(iocb, from);
 	if (err <= 0) {
 		inode_unlock(inode);
@@ -1858,8 +1875,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	 */
 	update_time_for_write(inode);
 
-	pos = iocb->ki_pos;
-	count = iov_iter_count(from);
 	start_pos = round_down(pos, fs_info->sectorsize);
 	oldsize = i_size_read(inode);
 	if (start_pos > oldsize) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1e861a0..c5041ea 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8681,6 +8681,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		if (offset + count <= inode->i_size) {
 			inode_unlock(inode);
 			relock = true;
+		} else if (iocb->ki_flags & IOCB_NOWAIT) {
+			ret = -EAGAIN;
+			goto out;
 		}
 		ret = btrfs_delalloc_reserve_space(inode, offset, count);
 		if (ret)
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/8] nowait aio: return if direct write will trigger writeback
  2017-02-28 23:36 ` [PATCH 3/8] nowait aio: return if direct write will trigger writeback Goldwyn Rodrigues
@ 2017-03-01  3:46   ` Matthew Wilcox
  2017-03-01 15:38     ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Matthew Wilcox @ 2017-03-01  3:46 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: jack, hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

On Tue, Feb 28, 2017 at 05:36:05PM -0600, Goldwyn Rodrigues wrote:
> Find out if the write will trigger a wait due to writeback. If yes,
> return -EAGAIN.
> 
> This introduces a new function filemap_range_has_page() which
> returns true if the file's mapping has a page within the range
> mentioned.

Ugh, this is pretty inefficient.  If that's all you want to know, then
using the radix tree directly will be far more efficient than spinning
up all the pagevec machinery only to discard the pages found.

But what's going to kick these pages out of cache?  Shouldn't we rather
find the pages, kick them out if clean, start writeback if not, and *then*
return -EAGAIN?

So maybe we want to spin up the pagevec machinery after all so we can
do that extra work?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT
  2017-02-28 23:36 ` [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT Goldwyn Rodrigues
@ 2017-03-01 15:36   ` Christoph Hellwig
  2017-03-01 15:56     ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2017-03-01 15:36 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: jack, hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

On Tue, Feb 28, 2017 at 05:36:03PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> This flag informs kernel to bail out if an AIO request will block
> for reasons such as file allocations, or a writeback triggered,
> or would block while allocating requests while performing
> direct I/O.
> 
> IOCB_FLAG_NOWAIT is translated to IOCB_NOWAIT for
> iocb->ki_flags.

Given that we aren't validating aio_flags in older kernels we can't
just add this flag as it will be a no-op in older kernels.  I think
we will have to add IOCB_CMD_PREADV2/IOCB_CMD_WRITEV2 opcodes that
properly validate all reserved fields or flags first.

Once we do that I'd really prefer to use the same flags values
as preadv2/pwritev2 so that we'll only need one set of flags over
sync/async read/write ops.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 2/8] nowait aio: Return if cannot get hold of i_rwsem
  2017-02-28 23:36 ` [PATCH 2/8] nowait aio: Return if cannot get hold of i_rwsem Goldwyn Rodrigues
@ 2017-03-01 15:37   ` Christoph Hellwig
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2017-03-01 15:37 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: jack, hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/8] nowait aio: return if direct write will trigger writeback
  2017-03-01  3:46   ` Matthew Wilcox
@ 2017-03-01 15:38     ` Christoph Hellwig
  2017-03-02 10:38       ` Jan Kara
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2017-03-01 15:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs, Goldwyn Rodrigues

On Tue, Feb 28, 2017 at 07:46:06PM -0800, Matthew Wilcox wrote:
> Ugh, this is pretty inefficient.  If that's all you want to know, then
> using the radix tree directly will be far more efficient than spinning
> up all the pagevec machinery only to discard the pages found.
> 
> But what's going to kick these pages out of cache?  Shouldn't we rather
> find the pages, kick them out if clean, start writeback if not, and *then*
> return -EAGAIN?
> 
> So maybe we want to spin up the pagevec machinery after all so we can
> do that extra work?

As pointed out in the last round of these patches I think we really
need to pass a flags argument to filemap_write_and_wait_range to
communicate the non-blocking nature and only return -EAGAIN if we'd
block.  As a bonus that can indeed start to kick the pages out.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 7/8] nowait aio: xfs
  2017-02-28 23:36 ` [PATCH 7/8] nowait aio: xfs Goldwyn Rodrigues
@ 2017-03-01 15:40   ` Christoph Hellwig
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2017-03-01 15:40 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: jack, hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

> @@ -528,12 +528,17 @@ xfs_file_dio_aio_write(
>  	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
>  		unaligned_io = 1;
>  		iolock = XFS_IOLOCK_EXCL;
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EAGAIN;

So all unaligned I/O will return -EAGAIN?  Why?  Also please explain
that reason in a comment right here.

> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 1aa3abd..84f981a 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1020,6 +1020,11 @@ xfs_file_iomap_begin(
>  	if ((flags & IOMAP_REPORT) ||
>  	    (xfs_is_reflink_inode(ip) &&
>  	     (flags & IOMAP_WRITE) && (flags & IOMAP_DIRECT))) {
> +		/* Allocations due to reflinks */
> +		if ((flags & IOMAP_NOWAIT) && !(flags & IOMAP_REPORT)) {
> +			error = -EAGAIN;
> +			goto out_unlock;
> +		}

FYI, this code looks very different in current Linus' tree - I think
you're on some old kernel base.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT
  2017-03-01 15:36   ` Christoph Hellwig
@ 2017-03-01 15:56     ` Christoph Hellwig
  2017-03-01 16:57       ` Goldwyn Rodrigues
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2017-03-01 15:56 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: jack, hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

On Wed, Mar 01, 2017 at 07:36:48AM -0800, Christoph Hellwig wrote:
> Given that we aren't validating aio_flags in older kernels we can't
> just add this flag as it will be a no-op in older kernels.  I think
> we will have to add IOCB_CMD_PREADV2/IOCB_CMD_WRITEV2 opcodes that
> properly validate all reserved fields or flags first.
> 
> Once we do that I'd really prefer to use the same flags values
> as preadv2/pwritev2 so that we'll only need one set of flags over
> sync/async read/write ops.

I just took another look and we do verify that
aio_reserved1/aio_reserved2 must be zero.  So I think we can just
stick RWF_* into aio_reserved1 and fix that problem that way.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT
  2017-03-01 15:56     ` Christoph Hellwig
@ 2017-03-01 16:57       ` Goldwyn Rodrigues
  2017-03-01 22:44         ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-03-01 16:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jack, linux-fsdevel, linux-block, linux-btrfs, linux-ext4, linux-xfs



On 03/01/2017 09:56 AM, Christoph Hellwig wrote:
> On Wed, Mar 01, 2017 at 07:36:48AM -0800, Christoph Hellwig wrote:
>> Given that we aren't validating aio_flags in older kernels we can't
>> just add this flag as it will be a no-op in older kernels.  I think
>> we will have to add IOCB_CMD_PREADV2/IOCB_CMD_WRITEV2 opcodes that
>> properly validate all reserved fields or flags first.
>>
>> Once we do that I'd really prefer to use the same flags values
>> as preadv2/pwritev2 so that we'll only need one set of flags over
>> sync/async read/write ops.
> 
> I just took another look and we do verify that
> aio_reserved1/aio_reserved2 must be zero.  So I think we can just
> stick RWF_* into aio_reserved1 and fix that problem that way.
> 

RWF_* ? Isn't that kernel space flags? Or did you intend to say
IOCB_FLAG_*? If yes, we maintain two flag fields? aio_reserved1 (perhaps
renamed to aio_flags2) and aio_flags?

aio_reserved1 is also used to return key for the purpose of io_cancel,
but we should be able to fetch the flags before putting the key value
there. Still I am not comfortable using the same field for it because it
will be overwritten when io_submit returns.

Which brings me to the next question: What is the purpose of aio_key?
Why is aio_key set to KIOCB_KEY (which is zero) every time? You are not
differentiating the request by setting all the iocb's key to zero.


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT
  2017-03-01 16:57       ` Goldwyn Rodrigues
@ 2017-03-01 22:44         ` Christoph Hellwig
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2017-03-01 22:44 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Christoph Hellwig, jack, linux-fsdevel, linux-block, linux-btrfs,
	linux-ext4, linux-xfs

On Wed, Mar 01, 2017 at 10:57:17AM -0600, Goldwyn Rodrigues wrote:
> RWF_* ? Isn't that kernel space flags? Or did you intend to say
> IOCB_FLAG_*?

No, they are the flags for preadv2/pwritev2.

> If yes, we maintain two flag fields? aio_reserved1 (perhaps
> renamed to aio_flags2) and aio_flags?

Yes - I'd call it aio_rw_flags or similar.

> aio_reserved1 is also used to return key for the purpose of io_cancel,
> but we should be able to fetch the flags before putting the key value
> there. Still I am not comfortable using the same field for it because it
> will be overwritten when io_submit returns.

It's not - the key is a separate field.  It's just that the two are
defined using a very strange macro switching around their positions
based on the endiannes.

> Which brings me to the next question: What is the purpose of aio_key?
> Why is aio_key set to KIOCB_KEY (which is zero) every time? You are not
> differentiating the request by setting all the iocb's key to zero.

I don't know the history of this rather odd field.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/8] nowait aio: return if direct write will trigger writeback
  2017-03-01 15:38     ` Christoph Hellwig
@ 2017-03-02 10:38       ` Jan Kara
  2017-03-02 14:12         ` Matthew Wilcox
  0 siblings, 1 reply; 63+ messages in thread
From: Jan Kara @ 2017-03-02 10:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Goldwyn Rodrigues, jack, linux-fsdevel,
	linux-block, linux-btrfs, linux-ext4, linux-xfs,
	Goldwyn Rodrigues

On Wed 01-03-17 07:38:57, Christoph Hellwig wrote:
> On Tue, Feb 28, 2017 at 07:46:06PM -0800, Matthew Wilcox wrote:
> > Ugh, this is pretty inefficient.  If that's all you want to know, then
> > using the radix tree directly will be far more efficient than spinning
> > up all the pagevec machinery only to discard the pages found.
> > 
> > But what's going to kick these pages out of cache?  Shouldn't we rather
> > find the pages, kick them out if clean, start writeback if not, and *then*
> > return -EAGAIN?
> > 
> > So maybe we want to spin up the pagevec machinery after all so we can
> > do that extra work?
> 
> As pointed out in the last round of these patches I think we really
> need to pass a flags argument to filemap_write_and_wait_range to
> communicate the non-blocking nature and only return -EAGAIN if we'd
> block.  As a bonus that can indeed start to kick the pages out.

Aren't flags to filemap_write_and_wait_range() unnecessary complication?
Realistically, most users wanting performance from AIO DIO so badly that
they bother with this API won't have any pages to write / evict. If they do
by some bad accident, they can fall back to standard "blocking" AIO DIO.
So I don't see much value in teaching filemap_write_and_wait_range() about
a non-blocking mode...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/8] nowait aio: return if direct write will trigger writeback
  2017-03-02 10:38       ` Jan Kara
@ 2017-03-02 14:12         ` Matthew Wilcox
  2017-03-02 15:22           ` Jan Kara
  0 siblings, 1 reply; 63+ messages in thread
From: Matthew Wilcox @ 2017-03-02 14:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Goldwyn Rodrigues, jack, linux-fsdevel,
	linux-block, linux-btrfs, linux-ext4, linux-xfs,
	Goldwyn Rodrigues

On Thu, Mar 02, 2017 at 11:38:45AM +0100, Jan Kara wrote:
> On Wed 01-03-17 07:38:57, Christoph Hellwig wrote:
> > On Tue, Feb 28, 2017 at 07:46:06PM -0800, Matthew Wilcox wrote:
> > > But what's going to kick these pages out of cache?  Shouldn't we rather
> > > find the pages, kick them out if clean, start writeback if not, and *then*
> > > return -EAGAIN?
> > 
> > As pointed out in the last round of these patches I think we really
> > need to pass a flags argument to filemap_write_and_wait_range to
> > communicate the non-blocking nature and only return -EAGAIN if we'd
> > block.  As a bonus that can indeed start to kick the pages out.
> 
> Aren't flags to filemap_write_and_wait_range() unnecessary complication?
> Realistically, most users wanting performance from AIO DIO so badly that
> they bother with this API won't have any pages to write / evict. If they do
> by some bad accident, they can fall back to standard "blocking" AIO DIO.
> So I don't see much value in teaching filemap_write_and_wait_range() about
> a non-blocking mode...

That lets me execute a DoS against a user using this API.  All I have
to do is open the file they're using read-only and read a byte from it.
Page goes into page-cache, and they'll only get -EAGAIN from calling
this syscall until the page ages out.

Also, I don't understand why this is a flag.  Isn't the point of AIO to
be non-blocking?  Why isn't this just a change to how we do AIO?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/8] nowait aio: return if direct write will trigger writeback
  2017-03-02 14:12         ` Matthew Wilcox
@ 2017-03-02 15:22           ` Jan Kara
  0 siblings, 0 replies; 63+ messages in thread
From: Jan Kara @ 2017-03-02 15:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, Christoph Hellwig, Goldwyn Rodrigues, jack,
	linux-fsdevel, linux-block, linux-btrfs, linux-ext4, linux-xfs,
	Goldwyn Rodrigues

On Thu 02-03-17 06:12:45, Matthew Wilcox wrote:
> On Thu, Mar 02, 2017 at 11:38:45AM +0100, Jan Kara wrote:
> > On Wed 01-03-17 07:38:57, Christoph Hellwig wrote:
> > > On Tue, Feb 28, 2017 at 07:46:06PM -0800, Matthew Wilcox wrote:
> > > > But what's going to kick these pages out of cache?  Shouldn't we rather
> > > > find the pages, kick them out if clean, start writeback if not, and *then*
> > > > return -EAGAIN?
> > > 
> > > As pointed out in the last round of these patches I think we really
> > > need to pass a flags argument to filemap_write_and_wait_range to
> > > communicate the non-blocking nature and only return -EAGAIN if we'd
> > > block.  As a bonus that can indeed start to kick the pages out.
> > 
> > Aren't flags to filemap_write_and_wait_range() unnecessary complication?
> > Realistically, most users wanting performance from AIO DIO so badly that
> > they bother with this API won't have any pages to write / evict. If they do
> > by some bad accident, they can fall back to standard "blocking" AIO DIO.
> > So I don't see much value in teaching filemap_write_and_wait_range() about
> > a non-blocking mode...
> 
> That lets me execute a DoS against a user using this API.  All I have
> to do is open the file they're using read-only and read a byte from it.
> Page goes into page-cache, and they'll only get -EAGAIN from calling
> this syscall until the page ages out.

It will not be a DoS. This non-blocking AIO can always return EAGAIN when
it feels like it and the caller is required to fall back to a blocking
version in that case if he wants to guarantee forward progress. It is just
a performance optimization which allows user (database) to submit IO from a
computation thread instead of having to offload it to an IO thread...

> Also, I don't understand why this is a flag.  Isn't the point of AIO to
> be non-blocking?  Why isn't this just a change to how we do AIO?

Because this is an API change and the caller has to implement some handling
to guarantee a forward progress of non-blocking IO...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/8 v2] Non-blocking AIO
  2017-02-28 23:36 [PATCH 0/8 v2] Non-blocking AIO Goldwyn Rodrigues
                   ` (7 preceding siblings ...)
  2017-02-28 23:36 ` [PATCH 8/8] nowait aio: btrfs Goldwyn Rodrigues
@ 2017-03-05 14:56 ` Avi Kivity
  2017-03-06  8:25   ` Jan Kara
  8 siblings, 1 reply; 63+ messages in thread
From: Avi Kivity @ 2017-03-05 14:56 UTC (permalink / raw)
  To: Goldwyn Rodrigues, jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4, linux-xfs



On 03/01/2017 01:36 AM, Goldwyn Rodrigues wrote:
> This series adds nonblocking feature to asynchronous I/O writes.
> io_submit() can be delayed because of a number of reason:
>   - Block allocation for files
>   - Data writebacks for direct I/O
>   - Sleeping because of waiting to acquire i_rwsem
>   - Congested block device

We've been hit by a few of these so this change is very welcome.

>
> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
> any of these conditions are met. This way userspace can push most
> of the write()s to the kernel to the best of its ability to complete
> and if it returns -EAGAIN, can defer it to another thread.
>

Is it not possible to push the iocb to a workqueue?  This will allow 
existing userspace to work with the new functionality, unchanged. Any 
userspace implementation would have to do the same thing, so it's not 
like we're saving anything by pushing it there.

> In order to enable this, IOCB_FLAG_NOWAIT is introduced in
> uapi/linux/aio_abi.h which translates to IOCB_NOWAIT for struct iocb,
> BIO_NOWAIT for bio and IOMAP_NOWAIT for iomap.
>
> This feature is provided for direct I/O of asynchronous I/O only. I have
> tested it against xfs, ext4, and btrfs.
>
> Changes since v1:
>   + Forwardported from 4.9.10
>   + changed name from _NONBLOCKING to *_NOWAIT
>   + filemap_range_has_page call moved to closer to (just before) calling filemap_write_and_wait_range().
>   + BIO_NOWAIT limited to get_request()
>   + XFS fixes
> 	- included reflink
> 	- use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
> 	- Translate the flag through IOMAP_NOWAIT (iomap) to check for
> 	  block allocation for the file.
>   + ext4 coding style

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/8 v2] Non-blocking AIO
  2017-03-05 14:56 ` [PATCH 0/8 v2] Non-blocking AIO Avi Kivity
@ 2017-03-06  8:25   ` Jan Kara
  2017-03-06  8:40     ` Avi Kivity
  2017-03-06 15:19     ` Jens Axboe
  0 siblings, 2 replies; 63+ messages in thread
From: Jan Kara @ 2017-03-06  8:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs

On Sun 05-03-17 16:56:21, Avi Kivity wrote:
> >The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
> >any of these conditions are met. This way userspace can push most
> >of the write()s to the kernel to the best of its ability to complete
> >and if it returns -EAGAIN, can defer it to another thread.
> >
> 
> Is it not possible to push the iocb to a workqueue?  This will allow
> existing userspace to work with the new functionality, unchanged. Any
> userspace implementation would have to do the same thing, so it's not like
> we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/8 v2] Non-blocking AIO
  2017-03-06  8:25   ` Jan Kara
@ 2017-03-06  8:40     ` Avi Kivity
  2017-03-06 15:19     ` Jens Axboe
  1 sibling, 0 replies; 63+ messages in thread
From: Avi Kivity @ 2017-03-06  8:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs

On 03/06/2017 10:25 AM, Jan Kara wrote:
> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>> any of these conditions are met. This way userspace can push most
>>> of the write()s to the kernel to the best of its ability to complete
>>> and if it returns -EAGAIN, can defer it to another thread.
>>>
>> Is it not possible to push the iocb to a workqueue?  This will allow
>> existing userspace to work with the new functionality, unchanged. Any
>> userspace implementation would have to do the same thing, so it's not like
>> we're saving anything by pushing it there.
> That is not easy because until IO is fully submitted, you need some parts
> of the context of the process which submits the IO (e.g. memory mappings,
> but possibly also other credentials). So you would need to somehow transfer
> this information to the workqueue.
>


It's at least possible to pass the mm_struct to the workqueue, and I 
imagine other process attributes.  But I appreciate the difficulty.

It would be quite annoying to have to keep a large number of worker 
threads active, just in case aio is not working.  Modern NVMes have 
fairly deep queues, and at the worst case, you'll need one thread for 
each I/O to keep everything busy.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/8 v2] Non-blocking AIO
  2017-03-06  8:25   ` Jan Kara
  2017-03-06  8:40     ` Avi Kivity
@ 2017-03-06 15:19     ` Jens Axboe
  2017-03-06 15:29       ` Avi Kivity
  1 sibling, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2017-03-06 15:19 UTC (permalink / raw)
  To: Jan Kara, Avi Kivity
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs

On 03/06/2017 01:25 AM, Jan Kara wrote:
> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>> any of these conditions are met. This way userspace can push most
>>> of the write()s to the kernel to the best of its ability to complete
>>> and if it returns -EAGAIN, can defer it to another thread.
>>>
>>
>> Is it not possible to push the iocb to a workqueue?  This will allow
>> existing userspace to work with the new functionality, unchanged. Any
>> userspace implementation would have to do the same thing, so it's not like
>> we're saving anything by pushing it there.
> 
> That is not easy because until IO is fully submitted, you need some parts
> of the context of the process which submits the IO (e.g. memory mappings,
> but possibly also other credentials). So you would need to somehow transfer
> this information to the workqueue.

Outside of technical challenges, the API also needs to return EAGAIN or
start blocking at some point. We can't expose a direct connection to
queue work like that, and let any user potentially create millions of
pending work items (and IOs). That's why the current API is safe, even
though it does suck that it block seemingly randomly for users.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/8 v2] Non-blocking AIO
  2017-03-06 15:19     ` Jens Axboe
@ 2017-03-06 15:29       ` Avi Kivity
  2017-03-06 15:38         ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Avi Kivity @ 2017-03-06 15:29 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs



On 03/06/2017 05:19 PM, Jens Axboe wrote:
> On 03/06/2017 01:25 AM, Jan Kara wrote:
>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>> any of these conditions are met. This way userspace can push most
>>>> of the write()s to the kernel to the best of its ability to complete
>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>
>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>> existing userspace to work with the new functionality, unchanged. Any
>>> userspace implementation would have to do the same thing, so it's not like
>>> we're saving anything by pushing it there.
>> That is not easy because until IO is fully submitted, you need some parts
>> of the context of the process which submits the IO (e.g. memory mappings,
>> but possibly also other credentials). So you would need to somehow transfer
>> this information to the workqueue.
> Outside of technical challenges, the API also needs to return EAGAIN or
> start blocking at some point. We can't expose a direct connection to
> queue work like that, and let any user potentially create millions of
> pending work items (and IOs).

You wouldn't expect more concurrent events than the maxevents parameter 
that was supplied to io_setup syscall; it should have reserved any 
resources needed.

>   That's why the current API is safe, even
> though it does suck that it block seemingly randomly for users.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/8 v2] Non-blocking AIO
  2017-03-06 15:29       ` Avi Kivity
@ 2017-03-06 15:38         ` Jens Axboe
  2017-03-06 15:59           ` Avi Kivity
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2017-03-06 15:38 UTC (permalink / raw)
  To: Avi Kivity, Jan Kara
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs

On 03/06/2017 08:29 AM, Avi Kivity wrote:
> 
> 
> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>> any of these conditions are met. This way userspace can push most
>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>
>>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>>> existing userspace to work with the new functionality, unchanged. Any
>>>> userspace implementation would have to do the same thing, so it's not like
>>>> we're saving anything by pushing it there.
>>> That is not easy because until IO is fully submitted, you need some parts
>>> of the context of the process which submits the IO (e.g. memory mappings,
>>> but possibly also other credentials). So you would need to somehow transfer
>>> this information to the workqueue.
>> Outside of technical challenges, the API also needs to return EAGAIN or
>> start blocking at some point. We can't expose a direct connection to
>> queue work like that, and let any user potentially create millions of
>> pending work items (and IOs).
> 
> You wouldn't expect more concurrent events than the maxevents parameter 
> that was supplied to io_setup syscall; it should have reserved any 
> resources needed.

Doesn't matter what limit you apply, my point still stands - at some
point you have to return EAGAIN, or block. Returning EAGAIN without
the caller having flagged support for that change of behavior would
be problematic.

And for this to really work, aio would need some serious help in
how it applies limits. It looks like a hot mess.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/8 v2] Non-blocking AIO
  2017-03-06 15:38         ` Jens Axboe
@ 2017-03-06 15:59           ` Avi Kivity
  2017-03-06 16:08             ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Avi Kivity @ 2017-03-06 15:59 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs

On 03/06/2017 05:38 PM, Jens Axboe wrote:
> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>
>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>> any of these conditions are met. This way userspace can push most
>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>
>>>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>> we're saving anything by pushing it there.
>>>> That is not easy because until IO is fully submitted, you need some parts
>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>> but possibly also other credentials). So you would need to somehow transfer
>>>> this information to the workqueue.
>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>> start blocking at some point. We can't expose a direct connection to
>>> queue work like that, and let any user potentially create millions of
>>> pending work items (and IOs).
>> You wouldn't expect more concurrent events than the maxevents parameter
>> that was supplied to io_setup syscall; it should have reserved any
>> resources needed.
> Doesn't matter what limit you apply, my point still stands - at some
> point you have to return EAGAIN, or block. Returning EAGAIN without
> the caller having flagged support for that change of behavior would
> be problematic.

Doesn't it already return EAGAIN (or some other error) if you exceed 
maxevents?

> And for this to really work, aio would need some serious help in
> how it applies limits. It looks like a hot mess.

For sure.  I think it would be a shame to create more user-facing 
complexity.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/8 v2] Non-blocking AIO
  2017-03-06 15:59           ` Avi Kivity
@ 2017-03-06 16:08             ` Jens Axboe
  2017-03-06 16:59               ` Avi Kivity
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2017-03-06 16:08 UTC (permalink / raw)
  To: Avi Kivity, Jan Kara
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs

On 03/06/2017 08:59 AM, Avi Kivity wrote:
> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>
>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>
>>>>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>> we're saving anything by pushing it there.
>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>> this information to the workqueue.
>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>> start blocking at some point. We can't expose a direct connection to
>>>> queue work like that, and let any user potentially create millions of
>>>> pending work items (and IOs).
>>> You wouldn't expect more concurrent events than the maxevents parameter
>>> that was supplied to io_setup syscall; it should have reserved any
>>> resources needed.
>> Doesn't matter what limit you apply, my point still stands - at some
>> point you have to return EAGAIN, or block. Returning EAGAIN without
>> the caller having flagged support for that change of behavior would
>> be problematic.
> 
> Doesn't it already return EAGAIN (or some other error) if you exceed 
> maxevents?

It's a setup thing. We check these limits when someone creates an IO
context, and carve out the specified entries form our global pool. Then
we free those "resources" when the io context is freed.

Right now I can setup an IO context with 1000 entries on it, yet that
number has NO bearing on when io_submit() would potentially block or
return EAGAIN.

We can have a huge gap on the intent signaled by io context setup, and
the reality imposed by what actually happens on the IO submission side.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/8 v2] Non-blocking AIO
  2017-03-06 16:08             ` Jens Axboe
@ 2017-03-06 16:59               ` Avi Kivity
  2017-03-06 17:06                 ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Avi Kivity @ 2017-03-06 16:59 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs



On 03/06/2017 06:08 PM, Jens Axboe wrote:
> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>
>>>>>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>> we're saving anything by pushing it there.
>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>> this information to the workqueue.
>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>> start blocking at some point. We can't expose a direct connection to
>>>>> queue work like that, and let any user potentially create millions of
>>>>> pending work items (and IOs).
>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>> that was supplied to io_setup syscall; it should have reserved any
>>>> resources needed.
>>> Doesn't matter what limit you apply, my point still stands - at some
>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>> the caller having flagged support for that change of behavior would
>>> be problematic.
>> Doesn't it already return EAGAIN (or some other error) if you exceed
>> maxevents?
> It's a setup thing. We check these limits when someone creates an IO
> context, and carve out the specified entries form our global pool. Then
> we free those "resources" when the io context is freed.
>
> Right now I can setup an IO context with 1000 entries on it, yet that
> number has NO bearing on when io_submit() would potentially block or
> return EAGAIN.
>
> We can have a huge gap on the intent signaled by io context setup, and
> the reality imposed by what actually happens on the IO submission side.

Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return 
EAGAIN?

Just tested it, and maxevents is not respected for this:

io_setup(1, [0x7fc64537f000])           = 0
io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000, 
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, 
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, 
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, 
fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, 
buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, 
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, 
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, 
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10

which is unexpected, to me.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/8 v2] Non-blocking AIO
  2017-03-06 16:59               ` Avi Kivity
@ 2017-03-06 17:06                 ` Jens Axboe
  2017-03-06 18:17                   ` Avi Kivity
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2017-03-06 17:06 UTC (permalink / raw)
  To: Avi Kivity, Jan Kara
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs

On 03/06/2017 09:59 AM, Avi Kivity wrote:
> 
> 
> On 03/06/2017 06:08 PM, Jens Axboe wrote:
>> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>>
>>>>>>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>>> we're saving anything by pushing it there.
>>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>>> this information to the workqueue.
>>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>>> start blocking at some point. We can't expose a direct connection to
>>>>>> queue work like that, and let any user potentially create millions of
>>>>>> pending work items (and IOs).
>>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>>> that was supplied to io_setup syscall; it should have reserved any
>>>>> resources needed.
>>>> Doesn't matter what limit you apply, my point still stands - at some
>>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>>> the caller having flagged support for that change of behavior would
>>>> be problematic.
>>> Doesn't it already return EAGAIN (or some other error) if you exceed
>>> maxevents?
>> It's a setup thing. We check these limits when someone creates an IO
>> context, and carve out the specified entries form our global pool. Then
>> we free those "resources" when the io context is freed.
>>
>> Right now I can setup an IO context with 1000 entries on it, yet that
>> number has NO bearing on when io_submit() would potentially block or
>> return EAGAIN.
>>
>> We can have a huge gap on the intent signaled by io context setup, and
>> the reality imposed by what actually happens on the IO submission side.
> 
> Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return 
> EAGAIN?
> 
> Just tested it, and maxevents is not respected for this:
> 
> io_setup(1, [0x7fc64537f000])           = 0
> io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000, 
> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, 
> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, 
> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, 
> fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, 
> buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, 
> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, 
> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, 
> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
> 
> which is unexpected, to me.

ioctx_alloc()
{
        [...]

        /*                                                                      
         * We keep track of the number of available ringbuffer slots, to prevent
         * overflow (reqs_available), and we also use percpu counters for this. 
         *                                                                      
         * So since up to half the slots might be on other cpu's percpu counters
         * and unavailable, double nr_events so userspace sees what they        
         * expected: additionally, we move req_batch slots to/from percpu       
         * counters at a time, so make sure that isn't 0:                       
         */                                                                     
        nr_events = max(nr_events, num_possible_cpus() * 4);                    
        nr_events *= 2;                                    
}


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/8 v2] Non-blocking AIO
  2017-03-06 17:06                 ` Jens Axboe
@ 2017-03-06 18:17                   ` Avi Kivity
  2017-03-06 18:27                     ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Avi Kivity @ 2017-03-06 18:17 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs



On 03/06/2017 07:06 PM, Jens Axboe wrote:
> On 03/06/2017 09:59 AM, Avi Kivity wrote:
>>
>> On 03/06/2017 06:08 PM, Jens Axboe wrote:
>>> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>>>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>>>
>>>>>>>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>>>> we're saving anything by pushing it there.
>>>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>>>> this information to the workqueue.
>>>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>>>> start blocking at some point. We can't expose a direct connection to
>>>>>>> queue work like that, and let any user potentially create millions of
>>>>>>> pending work items (and IOs).
>>>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>>>> that was supplied to io_setup syscall; it should have reserved any
>>>>>> resources needed.
>>>>> Doesn't matter what limit you apply, my point still stands - at some
>>>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>>>> the caller having flagged support for that change of behavior would
>>>>> be problematic.
>>>> Doesn't it already return EAGAIN (or some other error) if you exceed
>>>> maxevents?
>>> It's a setup thing. We check these limits when someone creates an IO
>>> context, and carve out the specified entries form our global pool. Then
>>> we free those "resources" when the io context is freed.
>>>
>>> Right now I can setup an IO context with 1000 entries on it, yet that
>>> number has NO bearing on when io_submit() would potentially block or
>>> return EAGAIN.
>>>
>>> We can have a huge gap on the intent signaled by io context setup, and
>>> the reality imposed by what actually happens on the IO submission side.
>> Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return
>> EAGAIN?
>>
>> Just tested it, and maxevents is not respected for this:
>>
>> io_setup(1, [0x7fc64537f000])           = 0
>> io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
>> fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
>> buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
>>
>> which is unexpected, to me.
> ioctx_alloc()
> {
>          [...]
>
>          /*
>           * We keep track of the number of available ringbuffer slots, to prevent
>           * overflow (reqs_available), and we also use percpu counters for this.
>           *
>           * So since up to half the slots might be on other cpu's percpu counters
>           * and unavailable, double nr_events so userspace sees what they
>           * expected: additionally, we move req_batch slots to/from percpu
>           * counters at a time, so make sure that isn't 0:
>           */
>          nr_events = max(nr_events, num_possible_cpus() * 4);
>          nr_events *= 2;
> }

On a 4-lcore desktop:

io_setup(1, [0x7fc210041000])           = 0
io_submit(0x7fc210041000, 10000, [big array]) = 126
io_submit(0x7fc210041000, 10000, [big array]) = -1 EAGAIN (Resource 
temporarily unavailable)

so, the user should already expect EAGAIN from io_submit() due to 
resource limits.  I'm sure the check could be tightened so that if we do 
have to use a workqueue, we respect the user's limit rather than some 
inflated number.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/8 v2] Non-blocking AIO
  2017-03-06 18:17                   ` Avi Kivity
@ 2017-03-06 18:27                     ` Jens Axboe
  2017-03-06 18:50                       ` Avi Kivity
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2017-03-06 18:27 UTC (permalink / raw)
  To: Avi Kivity, Jan Kara
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs

On 03/06/2017 11:17 AM, Avi Kivity wrote:
> 
> 
> On 03/06/2017 07:06 PM, Jens Axboe wrote:
>> On 03/06/2017 09:59 AM, Avi Kivity wrote:
>>>
>>> On 03/06/2017 06:08 PM, Jens Axboe wrote:
>>>> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>>>>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>>>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>>>>
>>>>>>>>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>>>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>>>>> we're saving anything by pushing it there.
>>>>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>>>>> this information to the workqueue.
>>>>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>>>>> start blocking at some point. We can't expose a direct connection to
>>>>>>>> queue work like that, and let any user potentially create millions of
>>>>>>>> pending work items (and IOs).
>>>>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>>>>> that was supplied to io_setup syscall; it should have reserved any
>>>>>>> resources needed.
>>>>>> Doesn't matter what limit you apply, my point still stands - at some
>>>>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>>>>> the caller having flagged support for that change of behavior would
>>>>>> be problematic.
>>>>> Doesn't it already return EAGAIN (or some other error) if you exceed
>>>>> maxevents?
>>>> It's a setup thing. We check these limits when someone creates an IO
>>>> context, and carve out the specified entries form our global pool. Then
>>>> we free those "resources" when the io context is freed.
>>>>
>>>> Right now I can setup an IO context with 1000 entries on it, yet that
>>>> number has NO bearing on when io_submit() would potentially block or
>>>> return EAGAIN.
>>>>
>>>> We can have a huge gap on the intent signaled by io context setup, and
>>>> the reality imposed by what actually happens on the IO submission side.
>>> Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return
>>> EAGAIN?
>>>
>>> Just tested it, and maxevents is not respected for this:
>>>
>>> io_setup(1, [0x7fc64537f000])           = 0
>>> io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
>>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
>>> fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
>>> buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
>>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
>>>
>>> which is unexpected, to me.
>> ioctx_alloc()
>> {
>>          [...]
>>
>>          /*
>>           * We keep track of the number of available ringbuffer slots, to prevent
>>           * overflow (reqs_available), and we also use percpu counters for this.
>>           *
>>           * So since up to half the slots might be on other cpu's percpu counters
>>           * and unavailable, double nr_events so userspace sees what they
>>           * expected: additionally, we move req_batch slots to/from percpu
>>           * counters at a time, so make sure that isn't 0:
>>           */
>>          nr_events = max(nr_events, num_possible_cpus() * 4);
>>          nr_events *= 2;
>> }
> 
> On a 4-lcore desktop:
> 
> io_setup(1, [0x7fc210041000])           = 0
> io_submit(0x7fc210041000, 10000, [big array]) = 126
> io_submit(0x7fc210041000, 10000, [big array]) = -1 EAGAIN (Resource 
> temporarily unavailable)
> 
> so, the user should already expect EAGAIN from io_submit() due to 
> resource limits.  I'm sure the check could be tightened so that if we do 
> have to use a workqueue, we respect the user's limit rather than some 
> inflated number.

This is why I previously said that the 1000 requests you potentially
asks for when setting up your IO context has NOTHING to do with when you
will run into EAGAIN. Yes, returning EAGAIN if the app exceeds the
limit that it itself has set is existing behavior and it certainly makes
sense. And it's an easily handled condition, since the app can just
backoff and wait/reap completion events.

But if we allow EAGAIN to bubble up from block request submission, then
that's a change in behavior. This can happen without the app having any
pending IO against that IO context, hence we can return EAGAIN to the
app that then has no reasonable way to handle that condition.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/8 v2] Non-blocking AIO
  2017-03-06 18:27                     ` Jens Axboe
@ 2017-03-06 18:50                       ` Avi Kivity
  0 siblings, 0 replies; 63+ messages in thread
From: Avi Kivity @ 2017-03-06 18:50 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs

On 03/06/2017 08:27 PM, Jens Axboe wrote:
> On 03/06/2017 11:17 AM, Avi Kivity wrote:
>>
>> On 03/06/2017 07:06 PM, Jens Axboe wrote:
>>> On 03/06/2017 09:59 AM, Avi Kivity wrote:
>>>> On 03/06/2017 06:08 PM, Jens Axboe wrote:
>>>>> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>>>>>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>>>>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>>>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>>>>>
>>>>>>>>>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>>>>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>>>>>> we're saving anything by pushing it there.
>>>>>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>>>>>> this information to the workqueue.
>>>>>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>>>>>> start blocking at some point. We can't expose a direct connection to
>>>>>>>>> queue work like that, and let any user potentially create millions of
>>>>>>>>> pending work items (and IOs).
>>>>>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>>>>>> that was supplied to io_setup syscall; it should have reserved any
>>>>>>>> resources needed.
>>>>>>> Doesn't matter what limit you apply, my point still stands - at some
>>>>>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>>>>>> the caller having flagged support for that change of behavior would
>>>>>>> be problematic.
>>>>>> Doesn't it already return EAGAIN (or some other error) if you exceed
>>>>>> maxevents?
>>>>> It's a setup thing. We check these limits when someone creates an IO
>>>>> context, and carve out the specified entries form our global pool. Then
>>>>> we free those "resources" when the io context is freed.
>>>>>
>>>>> Right now I can setup an IO context with 1000 entries on it, yet that
>>>>> number has NO bearing on when io_submit() would potentially block or
>>>>> return EAGAIN.
>>>>>
>>>>> We can have a huge gap on the intent signaled by io context setup, and
>>>>> the reality imposed by what actually happens on the IO submission side.
>>>> Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return
>>>> EAGAIN?
>>>>
>>>> Just tested it, and maxevents is not respected for this:
>>>>
>>>> io_setup(1, [0x7fc64537f000])           = 0
>>>> io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
>>>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>>>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>>>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
>>>> fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
>>>> buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
>>>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>>>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>>>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
>>>>
>>>> which is unexpected, to me.
>>> ioctx_alloc()
>>> {
>>>           [...]
>>>
>>>           /*
>>>            * We keep track of the number of available ringbuffer slots, to prevent
>>>            * overflow (reqs_available), and we also use percpu counters for this.
>>>            *
>>>            * So since up to half the slots might be on other cpu's percpu counters
>>>            * and unavailable, double nr_events so userspace sees what they
>>>            * expected: additionally, we move req_batch slots to/from percpu
>>>            * counters at a time, so make sure that isn't 0:
>>>            */
>>>           nr_events = max(nr_events, num_possible_cpus() * 4);
>>>           nr_events *= 2;
>>> }
>> On a 4-lcore desktop:
>>
>> io_setup(1, [0x7fc210041000])           = 0
>> io_submit(0x7fc210041000, 10000, [big array]) = 126
>> io_submit(0x7fc210041000, 10000, [big array]) = -1 EAGAIN (Resource
>> temporarily unavailable)
>>
>> so, the user should already expect EAGAIN from io_submit() due to
>> resource limits.  I'm sure the check could be tightened so that if we do
>> have to use a workqueue, we respect the user's limit rather than some
>> inflated number.
> This is why I previously said that the 1000 requests you potentially
> asks for when setting up your IO context has NOTHING to do with when you
> will run into EAGAIN. Yes, returning EAGAIN if the app exceeds the
> limit that it itself has set is existing behavior and it certainly makes
> sense. And it's an easily handled condition, since the app can just
> backoff and wait/reap completion events.

Every time I used aio, I considered maxevents to be the maximum number 
of in-flight requests for that queue, and observed this limit 
religiously.  It's possible others don't.

> But if we allow EAGAIN to bubble up from block request submission, then
> that's a change in behavior. This can happen without the app having any
> pending IO against that IO context, hence we can return EAGAIN to the
> app that then has no reasonable way to handle that condition.
>

For sure (and it's a different EAGAIN -- it's tied to the iocb, not 
request submission).  But we do have an upper bound for the number of 
concurrent requests, even if inflated, so having the kernel convert a 
blocking iocb into a workqueue item does not allow userspace to exploit 
the kernel.

We could limit the number of workqueue submissions to maxevents, and 
queue anything between maxevents and (maxevents * inflation_factor) 
using a regular queue.  So the intent of maxevents is respected, and 
applications that overflow it are not regressed.

   if iocb would overflow inflated maxevents:
       io_submit returns EAGAIN
   elseif iocb can be submitted asynchronously:
       do that
   elseif number of iocbs running in workqueues < maxevents:
       push to a workqueue
   else
       queue somewhere, when a work_item completes it can pick up an 
iocb from the queue

this enables aio for all filesystems, and doesn't require lots of idle 
thread pools if the filesystem works or useless syscalls if it doesn't.  
It's a lot more work for the kernel, but results in a tighter and 
simpler interface.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-02-28 23:36 ` [PATCH 5/8] nowait aio: return on congested block device Goldwyn Rodrigues
@ 2017-03-08  7:03   ` Sagi Grimberg
  2017-03-08 15:00     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 63+ messages in thread
From: Sagi Grimberg @ 2017-03-08  7:03 UTC (permalink / raw)
  To: Goldwyn Rodrigues, jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues


> -		if (likely(blk_queue_enter(q, false) == 0)) {
> +		if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>  			ret = q->make_request_fn(q, bio);

I think that for ->make_request to not block we'd need to set
BLK_MQ_REQ_NOWAIT in blk_mq_alloc_data to avoid blocking on a tag
allocation.

Something like the untested addition below:
--
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff18719..40e78b57fb44 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct 
request_queue *q,
         if (likely(!data->hctx))
                 data->hctx = blk_mq_map_queue(q, data->ctx->cpu);

+       if (likely(bio) && bio_flagged(bio, BIO_NOWAIT))
+               data->flags |= BLK_MQ_REQ_NOWAIT;
+
         if (e) {
                 data->flags |= BLK_MQ_REQ_INTERNAL;
--

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-03-08  7:03   ` Sagi Grimberg
@ 2017-03-08 15:00     ` Goldwyn Rodrigues
  2017-03-08 15:28       ` Jan Kara
  2017-03-08 16:17       ` Jens Axboe
  0 siblings, 2 replies; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-03-08 15:00 UTC (permalink / raw)
  To: Sagi Grimberg, jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues



On 03/08/2017 01:03 AM, Sagi Grimberg wrote:
> 
>> -        if (likely(blk_queue_enter(q, false) == 0)) {
>> +        if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT))
>> == 0)) {
>>              ret = q->make_request_fn(q, bio);
> 
> I think that for ->make_request to not block we'd need to set
> BLK_MQ_REQ_NOWAIT in blk_mq_alloc_data to avoid blocking on a tag
> allocation.
> 
> Something like the untested addition below:

I did that in the first series, but there are too many reasons to block
in blk-mq [1]. I dropped blk-mq work in v2.

[1] https://patchwork.kernel.org/patch/9571051/

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-03-08 15:00     ` Goldwyn Rodrigues
@ 2017-03-08 15:28       ` Jan Kara
  2017-03-08 15:51         ` Christoph Hellwig
  2017-03-08 16:17       ` Jens Axboe
  1 sibling, 1 reply; 63+ messages in thread
From: Jan Kara @ 2017-03-08 15:28 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Sagi Grimberg, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs, Goldwyn Rodrigues

On Wed 08-03-17 09:00:09, Goldwyn Rodrigues wrote:
> 
> 
> On 03/08/2017 01:03 AM, Sagi Grimberg wrote:
> > 
> >> -        if (likely(blk_queue_enter(q, false) == 0)) {
> >> +        if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT))
> >> == 0)) {
> >>              ret = q->make_request_fn(q, bio);
> > 
> > I think that for ->make_request to not block we'd need to set
> > BLK_MQ_REQ_NOWAIT in blk_mq_alloc_data to avoid blocking on a tag
> > allocation.
> > 
> > Something like the untested addition below:
> 
> I did that in the first series, but there are too many reasons to block
> in blk-mq [1]. I dropped blk-mq work in v2.
> 
> [1] https://patchwork.kernel.org/patch/9571051/

Well, that's not really good. If we cannot support this for both blk-mq and
legacy block layer the feature will not be usable. So please work on blk-mq
support as well.

								Honza 

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-03-08 15:28       ` Jan Kara
@ 2017-03-08 15:51         ` Christoph Hellwig
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2017-03-08 15:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Goldwyn Rodrigues, Sagi Grimberg, jack, hch, linux-fsdevel,
	linux-block, linux-btrfs, linux-ext4, linux-xfs,
	Goldwyn Rodrigues

On Wed, Mar 08, 2017 at 04:28:06PM +0100, Jan Kara wrote:
> Well, that's not really good. If we cannot support this for both blk-mq and
> legacy block layer the feature will not be usable. So please work on blk-mq
> support as well.

Exactly.  In addition to that anything implementing a feature for the
legacy request request code but not blk-mq is grounds for an instant
NAK.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-03-08 15:00     ` Goldwyn Rodrigues
  2017-03-08 15:28       ` Jan Kara
@ 2017-03-08 16:17       ` Jens Axboe
  2017-03-09  2:18         ` Goldwyn Rodrigues
  1 sibling, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2017-03-08 16:17 UTC (permalink / raw)
  To: Goldwyn Rodrigues, Sagi Grimberg, jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues

On 03/08/2017 08:00 AM, Goldwyn Rodrigues wrote:
> 
> 
> On 03/08/2017 01:03 AM, Sagi Grimberg wrote:
>>
>>> -        if (likely(blk_queue_enter(q, false) == 0)) {
>>> +        if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT))
>>> == 0)) {
>>>              ret = q->make_request_fn(q, bio);
>>
>> I think that for ->make_request to not block we'd need to set
>> BLK_MQ_REQ_NOWAIT in blk_mq_alloc_data to avoid blocking on a tag
>> allocation.
>>
>> Something like the untested addition below:
> 
> I did that in the first series, but there are too many reasons to block
> in blk-mq [1]. I dropped blk-mq work in v2.

That's complete nonsense, there are no more places in blk-mq that will
block that in the legacy path. Most of the examples from your URL:

> [1] https://patchwork.kernel.org/patch/9571051/

are not blk-mq, but writeback throttling, and drivers that explicitly
hook into ->make_request_fn.

As others have mentioned, it's a total non-starter to focus on the
deprecated IO path and just ignore the new one. Back to the drawing
board.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-03-08 16:17       ` Jens Axboe
@ 2017-03-09  2:18         ` Goldwyn Rodrigues
  0 siblings, 0 replies; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-03-09  2:18 UTC (permalink / raw)
  To: Jens Axboe, Sagi Grimberg, jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues



On 03/08/2017 10:17 AM, Jens Axboe wrote:
> On 03/08/2017 08:00 AM, Goldwyn Rodrigues wrote:
>>
>>
>> On 03/08/2017 01:03 AM, Sagi Grimberg wrote:
>>>
>>>> -        if (likely(blk_queue_enter(q, false) == 0)) {
>>>> +        if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT))
>>>> == 0)) {
>>>>              ret = q->make_request_fn(q, bio);
>>>
>>> I think that for ->make_request to not block we'd need to set
>>> BLK_MQ_REQ_NOWAIT in blk_mq_alloc_data to avoid blocking on a tag
>>> allocation.
>>>
>>> Something like the untested addition below:
>>
>> I did that in the first series, but there are too many reasons to block
>> in blk-mq [1]. I dropped blk-mq work in v2.
> 
> That's complete nonsense, there are no more places in blk-mq that will
> block that in the legacy path. Most of the examples from your URL:
> 
>> [1] https://patchwork.kernel.org/patch/9571051/
> 
> are not blk-mq, but writeback throttling, and drivers that explicitly
> hook into ->make_request_fn.
> 
> As others have mentioned, it's a total non-starter to focus on the
> deprecated IO path and just ignore the new one. Back to the drawing
> board.
> 

Thanks, I understood what I was doing wrong. I will include blk-mq in
the patch.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
@ 2017-05-11 18:16       ` Goldwyn Rodrigues
  0 siblings, 0 replies; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-05-11 18:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, jack, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy, tom.leiming,
	Goldwyn Rodrigues



On 05/11/2017 02:44 AM, Christoph Hellwig wrote:
> Looks fine,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Although lifting the make_request limit is something a lot of users
> would appreciate in the near future..
> 

Yes, I understand. That will be on my todo list next on priority.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
@ 2017-05-11 18:16       ` Goldwyn Rodrigues
  0 siblings, 0 replies; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-05-11 18:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, jack-IBi9RG/b67k,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, sagi-NQWnxTmZq1alnMjI0IkVqw,
	avi-VrcmuVmyx1hWk0Htik3J/w, axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-api-u79uwXL29TY76Z2rM5mHXA, willy-wEGCiKHe2LqWVfeAwA7xHQ,
	tom.leiming-Re5JQEeQqe8AvxtiuMwx3w, Goldwyn Rodrigues



On 05/11/2017 02:44 AM, Christoph Hellwig wrote:
> Looks fine,
> 
> Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> 
> Although lifting the make_request limit is something a lot of users
> would appreciate in the near future..
> 

Yes, I understand. That will be on my todo list next on priority.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-05-09 12:22 ` [PATCH 5/8] nowait aio: return on congested block device Goldwyn Rodrigues
@ 2017-05-11  7:44   ` Christoph Hellwig
  2017-05-11 18:16       ` Goldwyn Rodrigues
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2017-05-11  7:44 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, jack, hch, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy, tom.leiming,
	Goldwyn Rodrigues

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

Although lifting the make_request limit is something a lot of users
would appreciate in the near future..

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 5/8] nowait aio: return on congested block device
  2017-05-09 12:22 [PATCH 0/8 v7] No wait AIO Goldwyn Rodrigues
@ 2017-05-09 12:22 ` Goldwyn Rodrigues
  2017-05-11  7:44   ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-05-09 12:22 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, axboe, linux-api, willy, tom.leiming, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

A new bio operation flag REQ_NOWAIT is introduced to identify bio's
orignating from iocb with IOCB_NOWAIT. This flag indicates
to return immediately if a request cannot be made instead
of retrying.

Stacked devices such as md (the ones with make_request_fn hooks)
currently are not supported because it may block for housekeeping.
For example, an md can have a part of the device suspended.
For this reason, only request based devices are supported.
In the future, this feature will be expanded to stacked devices
by teaching them how to handle the REQ_NOWAIT flags.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 block/blk-core.c          | 24 ++++++++++++++++++++++--
 block/blk-mq-sched.c      |  3 +++
 block/blk-mq.c            |  4 ++++
 fs/direct-io.c            | 10 ++++++++--
 include/linux/bio.h       |  6 ++++++
 include/linux/blk_types.h |  2 ++
 6 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d772c221cc17..effe934b806b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1232,6 +1232,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 	if (!IS_ERR(rq))
 		return rq;
 
+	if (op & REQ_NOWAIT) {
+		blk_put_rl(rl);
+		return ERR_PTR(-EAGAIN);
+	}
+
 	if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
 		blk_put_rl(rl);
 		return rq;
@@ -1870,6 +1875,17 @@ generic_make_request_checks(struct bio *bio)
 		goto end_io;
 	}
 
+	/*
+	 * For a REQ_NOWAIT based request, return -EOPNOTSUPP
+	 * if queue does not have QUEUE_FLAG_NOWAIT_SUPPORT set
+	 * and if it is not a request based queue.
+	 */
+
+	if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q)) {
+		err = -EOPNOTSUPP;
+		goto end_io;
+	}
+
 	part = bio->bi_bdev->bd_part;
 	if (should_fail_request(part, bio->bi_iter.bi_size) ||
 	    should_fail_request(&part_to_disk(part)->part0,
@@ -2021,7 +2037,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-		if (likely(blk_queue_enter(q, false) == 0)) {
+		if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
 			struct bio_list lower, same;
 
 			/* Create a fresh bio_list for all subordinate requests */
@@ -2046,7 +2062,11 @@ blk_qc_t generic_make_request(struct bio *bio)
 			bio_list_merge(&bio_list_on_stack[0], &same);
 			bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
 		} else {
-			bio_io_error(bio);
+			if (unlikely(!blk_queue_dying(q) &&
+					(bio->bi_opf & REQ_NOWAIT)))
+				bio_wouldblock_error(bio);
+			else
+				bio_io_error(bio);
 		}
 		bio = bio_list_pop(&bio_list_on_stack[0]);
 	} while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index c974a1bbf4cb..019d881d62b7 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
 	if (likely(!data->hctx))
 		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+	if (op & REQ_NOWAIT)
+		data->flags |= BLK_MQ_REQ_NOWAIT;
+
 	if (e) {
 		data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c7836a1ded97..d7613ae6a269 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1538,6 +1538,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
+		if (bio->bi_opf & REQ_NOWAIT)
+			bio_wouldblock_error(bio);
 		return BLK_QC_T_NONE;
 	}
 
@@ -1662,6 +1664,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
+		if (bio->bi_opf & REQ_NOWAIT)
+			bio_wouldblock_error(bio);
 		return BLK_QC_T_NONE;
 	}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..139ebd5ae1c7 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -480,8 +480,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
 	unsigned i;
 	int err;
 
-	if (bio->bi_error)
-		dio->io_error = -EIO;
+	if (bio->bi_error) {
+		if (bio->bi_error == -EAGAIN && (bio->bi_opf & REQ_NOWAIT))
+			dio->io_error = -EAGAIN;
+		else
+			dio->io_error = -EIO;
+	}
 
 	if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
 		err = bio->bi_error;
@@ -1197,6 +1201,8 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	if (iov_iter_rw(iter) == WRITE) {
 		dio->op = REQ_OP_WRITE;
 		dio->op_flags = REQ_SYNC | REQ_IDLE;
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			dio->op_flags |= REQ_NOWAIT;
 	} else {
 		dio->op = REQ_OP_READ;
 	}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..1a9270744b1e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -425,6 +425,12 @@ static inline void bio_io_error(struct bio *bio)
 	bio_endio(bio);
 }
 
+static inline void bio_wouldblock_error(struct bio *bio)
+{
+	bio->bi_error = -EAGAIN;
+	bio_endio(bio);
+}
+
 struct request_queue;
 extern int bio_phys_segments(struct request_queue *, struct bio *);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d703acb55d0f..5ce4da30ba43 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -187,6 +187,7 @@ enum req_flag_bits {
 	__REQ_PREFLUSH,		/* request for cache flush */
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
 	__REQ_BACKGROUND,	/* background IO */
+	__REQ_NOWAIT,		/* Don't wait if request will block */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -203,6 +204,7 @@ enum req_flag_bits {
 #define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
+#define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-04-24 21:10     ` Goldwyn Rodrigues
@ 2017-04-25  2:28       ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2017-04-25  2:28 UTC (permalink / raw)
  To: Goldwyn Rodrigues, Christoph Hellwig
  Cc: linux-fsdevel, jack, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, linux-api, willy, tom.leiming,
	Goldwyn Rodrigues

On 04/24/2017 03:10 PM, Goldwyn Rodrigues wrote:
> 
> 
> On 04/19/2017 01:45 AM, Christoph Hellwig wrote:
>> On Fri, Apr 14, 2017 at 07:02:54AM -0500, Goldwyn Rodrigues wrote:
>>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>>
>>
>>> +	/* Request queue supports BIO_NOWAIT */
>>> +	queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q);
>>
>> BIO_NOWAIT is gone.  And the comment would not be needed if the
>> flag had a more descriptive name, e.g. QUEUE_FLAG_NOWAIT_SUPPORT.
>>
>> And I think all request based drivers should set the flag implicitly
>> as ->queuecommand can't sleep, and ->queue_rq only when it's always
>> offloaded to a workqueue when the BLK_MQ_F_BLOCKING flag is set.
>>
> 
> We introduced QUEUE_FLAG_NOWAIT for devices which would not wait for
> request completions. The ones which wait are MD devices because of sync
> or suspend operations.
> 
> The only user of BLK_MQ_F_NONBLOCKING seems to be nbd. As you mentioned,
> it uses the flag to offload it to a workqueue.
> 
> The other way to do it implicitly is to change the flag to
> BLK_MAY_BLOCK_REQS and use it for devices which do wait such as md/dm.
> Is that what you are hinting at? Or do you have something else in mind?

You are misunderstanding. What Christoph (correctly) says is that request
based drivers do not block, so all of those are fine. What you need to
worry about is drivers that are NOT request based. In other words, drivers
that hook into make_request_fn and process bio's.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-04-19  6:45   ` Christoph Hellwig
  2017-04-19 15:21     ` Goldwyn Rodrigues
@ 2017-04-24 21:10     ` Goldwyn Rodrigues
  2017-04-25  2:28       ` Jens Axboe
  1 sibling, 1 reply; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-04-24 21:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, jack, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy, tom.leiming,
	Goldwyn Rodrigues



On 04/19/2017 01:45 AM, Christoph Hellwig wrote:
> On Fri, Apr 14, 2017 at 07:02:54AM -0500, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>
> 
>> +	/* Request queue supports BIO_NOWAIT */
>> +	queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q);
> 
> BIO_NOWAIT is gone.  And the comment would not be needed if the
> flag had a more descriptive name, e.g. QUEUE_FLAG_NOWAIT_SUPPORT.
> 
> And I think all request based drivers should set the flag implicitly
> as ->queuecommand can't sleep, and ->queue_rq only when it's always
> offloaded to a workqueue when the BLK_MQ_F_BLOCKING flag is set.
> 

We introduced QUEUE_FLAG_NOWAIT for devices which would not wait for
request completions. The ones which wait are MD devices because of sync
or suspend operations.

The only user of BLK_MQ_F_NONBLOCKING seems to be nbd. As you mentioned,
it uses the flag to offload it to a workqueue.

The other way to do it implicitly is to change the flag to
BLK_MAY_BLOCK_REQS and use it for devices which do wait such as md/dm.
Is that what you are hinting at? Or do you have something else in mind?


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-04-19 15:21     ` Goldwyn Rodrigues
@ 2017-04-20 13:43       ` Jan Kara
  0 siblings, 0 replies; 63+ messages in thread
From: Jan Kara @ 2017-04-20 13:43 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Christoph Hellwig, linux-fsdevel, jack, linux-block, linux-btrfs,
	linux-ext4, linux-xfs, sagi, avi, axboe, linux-api, willy,
	tom.leiming, Goldwyn Rodrigues

On Wed 19-04-17 10:21:39, Goldwyn Rodrigues wrote:
> 
> 
> On 04/19/2017 01:45 AM, Christoph Hellwig wrote:
> > 
> >> +	if (bio->bi_opf & REQ_NOWAIT) {
> >> +		if (!blk_queue_nowait(q)) {
> >> +			err = -EOPNOTSUPP;
> >> +			goto end_io;
> >> +		}
> >> +		if (!(bio->bi_opf & REQ_SYNC)) {
> > 
> > I don't understand this check at all..
> 
> It is to ensure at the block layer that NOWAIT comes only for DIRECT
> calls only. I should probably change it to REQ_SYNC | REQ_IDLE.

Ouch. Checking 'REQ_SYNC' for this is

a) unreliable hack
b) layering violation

You just don't care why someone marked bio with REQ_NOWAIT at this place.
Just obey the request if you can, return error if you cannot, but advisory
REQ_SYNC or REQ_IDLE flags have nothing to do with the ability of the block
layer to submit the bio without blocking...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-04-19  6:45   ` Christoph Hellwig
@ 2017-04-19 15:21     ` Goldwyn Rodrigues
  2017-04-20 13:43       ` Jan Kara
  2017-04-24 21:10     ` Goldwyn Rodrigues
  1 sibling, 1 reply; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-04-19 15:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, jack, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy, tom.leiming,
	Goldwyn Rodrigues



On 04/19/2017 01:45 AM, Christoph Hellwig wrote:
> 
>> +	if (bio->bi_opf & REQ_NOWAIT) {
>> +		if (!blk_queue_nowait(q)) {
>> +			err = -EOPNOTSUPP;
>> +			goto end_io;
>> +		}
>> +		if (!(bio->bi_opf & REQ_SYNC)) {
> 
> I don't understand this check at all..

It is to ensure at the block layer that NOWAIT comes only for DIRECT
calls only. I should probably change it to REQ_SYNC | REQ_IDLE.

> 
>> +			if (unlikely(!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT)))
> 
> Please break lines after 80 characters.
> 
>> @@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
>>  	if (likely(!data->hctx))
>>  		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
>>  
>> +	if (bio && (bio->bi_opf & REQ_NOWAIT))
>> +		data->flags |= BLK_MQ_REQ_NOWAIT;
> 
> Check the op flag again here.
> 
>> +++ b/block/blk-mq.c
>> @@ -1538,6 +1538,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>>  	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
>>  	if (unlikely(!rq)) {
>>  		__wbt_done(q->rq_wb, wb_acct);
>> +		if (bio && (bio->bi_opf & REQ_NOWAIT))
>> +			bio_wouldblock_error(bio);
> 
> bio iѕ dereferences unconditionally above, so you can do the same.
> 
>> @@ -1662,6 +1664,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
>>  	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
>>  	if (unlikely(!rq)) {
>>  		__wbt_done(q->rq_wb, wb_acct);
>> +		if (bio && (bio->bi_opf & REQ_NOWAIT))
>> +			bio_wouldblock_error(bio);
> 
> Same here.  Although blk_sq_make_request is gone anyway in the current
> block tree..
> 
>> +	/* Request queue supports BIO_NOWAIT */
>> +	queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q);
> 
> BIO_NOWAIT is gone.  And the comment would not be needed if the
> flag had a more descriptive name, e.g. QUEUE_FLAG_NOWAIT_SUPPORT.
> 
> And I think all request based drivers should set the flag implicitly
> as ->queuecommand can't sleep, and ->queue_rq only when it's always
> offloaded to a workqueue when the BLK_MQ_F_BLOCKING flag is set.
> 

Yes, Do we have a central point (like a probe() function call?) where
this can be done?


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-04-14 12:02 ` [PATCH 5/8] nowait aio: return on congested block device Goldwyn Rodrigues
@ 2017-04-19  6:45   ` Christoph Hellwig
  2017-04-19 15:21     ` Goldwyn Rodrigues
  2017-04-24 21:10     ` Goldwyn Rodrigues
  0 siblings, 2 replies; 63+ messages in thread
From: Christoph Hellwig @ 2017-04-19  6:45 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, jack, hch, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy, tom.leiming,
	Goldwyn Rodrigues

On Fri, Apr 14, 2017 at 07:02:54AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> A new bio operation flag REQ_NOWAIT is introduced to identify bio's

s/bio/block/

> @@ -1232,6 +1232,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
>  	if (!IS_ERR(rq))
>  		return rq;
>  
> +	if (bio && (bio->bi_opf & REQ_NOWAIT)) {
> +		blk_put_rl(rl);
> +		return ERR_PTR(-EAGAIN);
> +	}

Please check the op argument instead of touching bio.

> +	if (bio->bi_opf & REQ_NOWAIT) {
> +		if (!blk_queue_nowait(q)) {
> +			err = -EOPNOTSUPP;
> +			goto end_io;
> +		}
> +		if (!(bio->bi_opf & REQ_SYNC)) {

I don't understand this check at all..

> +			if (unlikely(!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT)))

Please break lines after 80 characters.

> @@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
>  	if (likely(!data->hctx))
>  		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
>  
> +	if (bio && (bio->bi_opf & REQ_NOWAIT))
> +		data->flags |= BLK_MQ_REQ_NOWAIT;

Check the op flag again here.

> +++ b/block/blk-mq.c
> @@ -1538,6 +1538,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>  	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
>  	if (unlikely(!rq)) {
>  		__wbt_done(q->rq_wb, wb_acct);
> +		if (bio && (bio->bi_opf & REQ_NOWAIT))
> +			bio_wouldblock_error(bio);

bio iѕ dereferences unconditionally above, so you can do the same.

> @@ -1662,6 +1664,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
>  	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
>  	if (unlikely(!rq)) {
>  		__wbt_done(q->rq_wb, wb_acct);
> +		if (bio && (bio->bi_opf & REQ_NOWAIT))
> +			bio_wouldblock_error(bio);

Same here.  Although blk_sq_make_request is gone anyway in the current
block tree..

> +	/* Request queue supports BIO_NOWAIT */
> +	queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q);

BIO_NOWAIT is gone.  And the comment would not be needed if the
flag had a more descriptive name, e.g. QUEUE_FLAG_NOWAIT_SUPPORT.

And I think all request based drivers should set the flag implicitly
as ->queuecommand can't sleep, and ->queue_rq only when it's always
offloaded to a workqueue when the BLK_MQ_F_BLOCKING flag is set.

> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -480,8 +480,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
>  	unsigned i;
>  	int err;
>  
> -	if (bio->bi_error)
> -		dio->io_error = -EIO;
> +	if (bio->bi_error) {
> +		if (bio->bi_opf & REQ_NOWAIT)
> +			dio->io_error = -EAGAIN;
> +		else
> +			dio->io_error = -EIO;
> +	}

Huh?  Once REQ_NOWAIT is set all errors are -EAGAIN?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 5/8] nowait aio: return on congested block device
  2017-04-14 12:02 [PATCH 0/8 v6] No wait AIO Goldwyn Rodrigues
@ 2017-04-14 12:02 ` Goldwyn Rodrigues
  2017-04-19  6:45   ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-04-14 12:02 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, axboe, linux-api, willy, tom.leiming, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

A new bio operation flag REQ_NOWAIT is introduced to identify bio's
orignating from iocb with IOCB_NOWAIT. This flag indicates
to return immediately if a request cannot be made instead
of retrying.

To facilitate this, QUEUE_FLAG_NOWAIT is set to devices
which support this. While currently this is set to
virtio and sd only. Support to more devices will be added soon
once I am sure they don't block. Currently blocks such as dm/md
block while performing sync.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 block/blk-core.c           | 24 ++++++++++++++++++++++--
 block/blk-mq-sched.c       |  3 +++
 block/blk-mq.c             |  4 ++++
 drivers/block/virtio_blk.c |  3 +++
 drivers/scsi/sd.c          |  3 +++
 fs/direct-io.c             | 10 ++++++++--
 include/linux/bio.h        |  6 ++++++
 include/linux/blk_types.h  |  2 ++
 include/linux/blkdev.h     |  3 +++
 9 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d772c221cc17..54698521756b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1232,6 +1232,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 	if (!IS_ERR(rq))
 		return rq;
 
+	if (bio && (bio->bi_opf & REQ_NOWAIT)) {
+		blk_put_rl(rl);
+		return ERR_PTR(-EAGAIN);
+	}
+
 	if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
 		blk_put_rl(rl);
 		return rq;
@@ -1870,6 +1875,18 @@ generic_make_request_checks(struct bio *bio)
 		goto end_io;
 	}
 
+	if (bio->bi_opf & REQ_NOWAIT) {
+		if (!blk_queue_nowait(q)) {
+			err = -EOPNOTSUPP;
+			goto end_io;
+		}
+		if (!(bio->bi_opf & REQ_SYNC)) {
+			err = -EINVAL;
+			goto end_io;
+		}
+	}
+
+
 	part = bio->bi_bdev->bd_part;
 	if (should_fail_request(part, bio->bi_iter.bi_size) ||
 	    should_fail_request(&part_to_disk(part)->part0,
@@ -2021,7 +2038,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-		if (likely(blk_queue_enter(q, false) == 0)) {
+		if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
 			struct bio_list lower, same;
 
 			/* Create a fresh bio_list for all subordinate requests */
@@ -2046,7 +2063,10 @@ blk_qc_t generic_make_request(struct bio *bio)
 			bio_list_merge(&bio_list_on_stack[0], &same);
 			bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
 		} else {
-			bio_io_error(bio);
+			if (unlikely(!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT)))
+				bio_wouldblock_error(bio);
+			else
+				bio_io_error(bio);
 		}
 		bio = bio_list_pop(&bio_list_on_stack[0]);
 	} while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index c974a1bbf4cb..9f88190ff395 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
 	if (likely(!data->hctx))
 		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+	if (bio && (bio->bi_opf & REQ_NOWAIT))
+		data->flags |= BLK_MQ_REQ_NOWAIT;
+
 	if (e) {
 		data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 572966f49596..8b9b1a411ce2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1538,6 +1538,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
+		if (bio && (bio->bi_opf & REQ_NOWAIT))
+			bio_wouldblock_error(bio);
 		return BLK_QC_T_NONE;
 	}
 
@@ -1662,6 +1664,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
+		if (bio && (bio->bi_opf & REQ_NOWAIT))
+			bio_wouldblock_error(bio);
 		return BLK_QC_T_NONE;
 	}
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 1d4c9f8bc1e1..7481124c5025 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -731,6 +731,9 @@ static int virtblk_probe(struct virtio_device *vdev)
 	/* No real sector limit. */
 	blk_queue_max_hw_sectors(q, -1U);
 
+	/* Request queue supports BIO_NOWAIT */
+	queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q);
+
 	/* Host can optionally specify maximum segment size and number of
 	 * segments. */
 	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_SIZE_MAX,
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fcfeddc79331..9df85ee165be 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3177,6 +3177,9 @@ static int sd_probe(struct device *dev)
 					     SD_MOD_TIMEOUT);
 	}
 
+	/* Support BIO_NOWAIT */
+	queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, sdp->request_queue);
+
 	device_initialize(&sdkp->dev);
 	sdkp->dev.parent = dev;
 	sdkp->dev.class = &sd_disk_class;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..a802168284e1 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -480,8 +480,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
 	unsigned i;
 	int err;
 
-	if (bio->bi_error)
-		dio->io_error = -EIO;
+	if (bio->bi_error) {
+		if (bio->bi_opf & REQ_NOWAIT)
+			dio->io_error = -EAGAIN;
+		else
+			dio->io_error = -EIO;
+	}
 
 	if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
 		err = bio->bi_error;
@@ -1197,6 +1201,8 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	if (iov_iter_rw(iter) == WRITE) {
 		dio->op = REQ_OP_WRITE;
 		dio->op_flags = REQ_SYNC | REQ_IDLE;
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			dio->op_flags |= REQ_NOWAIT;
 	} else {
 		dio->op = REQ_OP_READ;
 	}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..1a9270744b1e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -425,6 +425,12 @@ static inline void bio_io_error(struct bio *bio)
 	bio_endio(bio);
 }
 
+static inline void bio_wouldblock_error(struct bio *bio)
+{
+	bio->bi_error = -EAGAIN;
+	bio_endio(bio);
+}
+
 struct request_queue;
 extern int bio_phys_segments(struct request_queue *, struct bio *);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d703acb55d0f..5ce4da30ba43 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -187,6 +187,7 @@ enum req_flag_bits {
 	__REQ_PREFLUSH,		/* request for cache flush */
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
 	__REQ_BACKGROUND,	/* background IO */
+	__REQ_NOWAIT,		/* Don't wait if request will block */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -203,6 +204,7 @@ enum req_flag_bits {
 #define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
+#define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7548f332121a..df0b1245d955 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -610,6 +610,8 @@ struct request_queue {
 #define QUEUE_FLAG_FLUSH_NQ    25	/* flush not queueuable */
 #define QUEUE_FLAG_DAX         26	/* device supports DAX */
 #define QUEUE_FLAG_STATS       27	/* track rq completion times */
+/* can return immediately on congestion  (for REQ_NOWAIT) */
+#define QUEUE_FLAG_NOWAIT      28
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
@@ -700,6 +702,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_secure_erase(q) \
 	(test_bit(QUEUE_FLAG_SECERASE, &(q)->queue_flags))
 #define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
+#define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-04-03 18:53 ` [PATCH 5/8] nowait aio: return on congested block device Goldwyn Rodrigues
@ 2017-04-04  6:49   ` Christoph Hellwig
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2017-04-04  6:49 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, jack, hch, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy, tom.leiming,
	Goldwyn Rodrigues

Please make this a REQ_* flag so that it can be passed in the bio,
the request and as an argument to the get_request functions instead
of testing for a bio.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 5/8] nowait aio: return on congested block device
  2017-04-03 18:52 [PATCH 0/8 v4] No wait AIO Goldwyn Rodrigues
@ 2017-04-03 18:53 ` Goldwyn Rodrigues
  2017-04-04  6:49   ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-04-03 18:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, axboe, linux-api, willy, tom.leiming, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

A new flag BIO_NOWAIT is introduced to identify bio's
orignating from iocb with IOCB_NOWAIT. This flag indicates
to return immediately if a request cannot be made instead
of retrying.

To facilitate this, QUEUE_FLAG_NOWAIT is set to devices
which support this. While currently this is set to
virtio and sd only. Support to more devices will be added soon.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 block/blk-core.c           | 24 ++++++++++++++++++++++--
 block/blk-mq-sched.c       |  3 +++
 block/blk-mq.c             |  4 ++++
 drivers/block/virtio_blk.c |  3 +++
 drivers/scsi/sd.c          |  3 +++
 fs/direct-io.c             | 11 +++++++++--
 include/linux/bio.h        |  6 ++++++
 include/linux/blk_types.h  |  1 +
 include/linux/blkdev.h     |  2 ++
 9 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d772c22..95a9b18 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1232,6 +1232,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 	if (!IS_ERR(rq))
 		return rq;
 
+	if (bio && bio_flagged(bio, BIO_NOWAIT)) {
+		blk_put_rl(rl);
+		return ERR_PTR(-EAGAIN);
+	}
+
 	if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
 		blk_put_rl(rl);
 		return rq;
@@ -1870,6 +1875,18 @@ generic_make_request_checks(struct bio *bio)
 		goto end_io;
 	}
 
+	if (bio_flagged(bio, BIO_NOWAIT)) {
+		if (!blk_queue_nowait(q)) {
+			err = -EOPNOTSUPP;
+			goto end_io;
+		}
+		if (!(bio->bi_opf & (REQ_SYNC | REQ_IDLE))) {
+			err = -EINVAL;
+			goto end_io;
+		}
+	}
+
+
 	part = bio->bi_bdev->bd_part;
 	if (should_fail_request(part, bio->bi_iter.bi_size) ||
 	    should_fail_request(&part_to_disk(part)->part0,
@@ -2021,7 +2038,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-		if (likely(blk_queue_enter(q, false) == 0)) {
+		if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
 			struct bio_list lower, same;
 
 			/* Create a fresh bio_list for all subordinate requests */
@@ -2046,7 +2063,10 @@ blk_qc_t generic_make_request(struct bio *bio)
 			bio_list_merge(&bio_list_on_stack[0], &same);
 			bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
 		} else {
-			bio_io_error(bio);
+			if (unlikely(!blk_queue_dying(q) && bio_flagged(bio, BIO_NOWAIT)))
+				bio_wouldblock_error(bio);
+			else
+				bio_io_error(bio);
 		}
 		bio = bio_list_pop(&bio_list_on_stack[0]);
 	} while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff..40e78b5 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
 	if (likely(!data->hctx))
 		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+	if (likely(bio) && bio_flagged(bio, BIO_NOWAIT))
+		data->flags |= BLK_MQ_REQ_NOWAIT;
+
 	if (e) {
 		data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6b6e7bc..2d90b12 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1511,6 +1511,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
+		if (bio && bio_flagged(bio, BIO_NOWAIT))
+			bio_wouldblock_error(bio);
 		return BLK_QC_T_NONE;
 	}
 
@@ -1635,6 +1637,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
+		if (bio && bio_flagged(bio, BIO_NOWAIT))
+			bio_wouldblock_error(bio);
 		return BLK_QC_T_NONE;
 	}
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 1d4c9f8..7481124 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -731,6 +731,9 @@ static int virtblk_probe(struct virtio_device *vdev)
 	/* No real sector limit. */
 	blk_queue_max_hw_sectors(q, -1U);
 
+	/* Request queue supports BIO_NOWAIT */
+	queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q);
+
 	/* Host can optionally specify maximum segment size and number of
 	 * segments. */
 	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_SIZE_MAX,
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fcfeddc..9df85ee 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3177,6 +3177,9 @@ static int sd_probe(struct device *dev)
 					     SD_MOD_TIMEOUT);
 	}
 
+	/* Support BIO_NOWAIT */
+	queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, sdp->request_queue);
+
 	device_initialize(&sdkp->dev);
 	sdkp->dev.parent = dev;
 	sdkp->dev.class = &sd_disk_class;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea..f6835d3 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,9 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	else
 		bio->bi_end_io = dio_bio_end_io;
 
+	if (dio->iocb->ki_flags & IOCB_NOWAIT)
+		bio_set_flag(bio, BIO_NOWAIT);
+
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
@@ -480,8 +483,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
 	unsigned i;
 	int err;
 
-	if (bio->bi_error)
-		dio->io_error = -EIO;
+	if (bio->bi_error) {
+		if (bio_flagged(bio, BIO_NOWAIT))
+			dio->io_error = -EAGAIN;
+		else
+			dio->io_error = -EIO;
+	}
 
 	if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
 		err = bio->bi_error;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e52119..1a92707 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -425,6 +425,12 @@ static inline void bio_io_error(struct bio *bio)
 	bio_endio(bio);
 }
 
+static inline void bio_wouldblock_error(struct bio *bio)
+{
+	bio->bi_error = -EAGAIN;
+	bio_endio(bio);
+}
+
 struct request_queue;
 extern int bio_phys_segments(struct request_queue *, struct bio *);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d703acb..514c08e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -102,6 +102,7 @@ struct bio {
 #define BIO_REFFED	8	/* bio has elevated ->bi_cnt */
 #define BIO_THROTTLED	9	/* This bio has already been subjected to
 				 * throttling rules. Don't do it again. */
+#define BIO_NOWAIT	10	/* don't block over blk device congestion */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5a7da60..ae38ab6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -611,6 +611,7 @@ struct request_queue {
 #define QUEUE_FLAG_DAX         26	/* device supports DAX */
 #define QUEUE_FLAG_STATS       27	/* track rq completion times */
 #define QUEUE_FLAG_RESTART     28	/* queue needs restart at completion */
+#define QUEUE_FLAG_NOWAIT      29	/* queue supports BIO_NOWAIT */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
@@ -701,6 +702,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_secure_erase(q) \
 	(test_bit(QUEUE_FLAG_SECERASE, &(q)->queue_flags))
 #define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
+#define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-03-24 11:32       ` Goldwyn Rodrigues
  (?)
@ 2017-03-24 14:39       ` Jens Axboe
  -1 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2017-03-24 14:39 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, linux-api, willy, Goldwyn Rodrigues

On 03/24/2017 05:32 AM, Goldwyn Rodrigues wrote:
> 
> 
> On 03/16/2017 09:33 AM, Jens Axboe wrote:
>> On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 0eeb99e..2e5cba2 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>>>  	do {
>>>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>>  
>>> -		if (likely(blk_queue_enter(q, false) == 0)) {
>>> +		if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>>>  			struct bio_list hold;
>>>  			struct bio_list lower, same;
>>>  
>>> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>>>  			bio_list_merge(&bio_list_on_stack, &same);
>>>  			bio_list_merge(&bio_list_on_stack, &hold);
>>>  		} else {
>>> -			bio_io_error(bio);
>>> +			if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
>>> +				bio_wouldblock_error(bio);
>>> +			else
>>> +				bio_io_error(bio);
>>
>> This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
>> happened to be set?
>>
> 
> Yes, I need to add a condition here to check for blk_queue_dying(). Thanks.
> 
>> And you're missing wbt_wait() as well as a blocking point. Ditto in
>> blk-mq.
> 
> wbt_wait() does not apply to WRITE_ODIRECT

It doesn't _currently_ block for it, but that could change. It should
have an explicit check.

Additionally, the more weird assumptions you make, the more half assed
this will be. The other assumption made here is that only direct IO
would set nonblock. We have to assume that any IO can have NONBLOCK set,
and not make any assumptions about the caller.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
@ 2017-03-24 11:32       ` Goldwyn Rodrigues
  0 siblings, 0 replies; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-03-24 11:32 UTC (permalink / raw)
  To: Jens Axboe, linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, linux-api, willy, Goldwyn Rodrigues



On 03/16/2017 09:33 AM, Jens Axboe wrote:
> On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 0eeb99e..2e5cba2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>>  	do {
>>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>  
>> -		if (likely(blk_queue_enter(q, false) == 0)) {
>> +		if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>>  			struct bio_list hold;
>>  			struct bio_list lower, same;
>>  
>> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>>  			bio_list_merge(&bio_list_on_stack, &same);
>>  			bio_list_merge(&bio_list_on_stack, &hold);
>>  		} else {
>> -			bio_io_error(bio);
>> +			if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
>> +				bio_wouldblock_error(bio);
>> +			else
>> +				bio_io_error(bio);
> 
> This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
> happened to be set?
> 

Yes, I need to add a condition here to check for blk_queue_dying(). Thanks.

> And you're missing wbt_wait() as well as a blocking point. Ditto in
> blk-mq.

wbt_wait() does not apply to WRITE_ODIRECT


> 
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 159187a..942ce8c 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1518,6 +1518,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>>  	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
>>  	if (unlikely(!rq)) {
>>  		__wbt_done(q->rq_wb, wb_acct);
>> +		if (bio && bio_flagged(bio, BIO_NOWAIT))
>> +			bio_wouldblock_error(bio);
>>  		return BLK_QC_T_NONE;
>>  	}
>>  
> 
> This seems a little fragile now, since not both paths free the bio.
> 

Direct I/O should free the bios in bio_dio_complete(). I am not sure why
it would not free bio here originally, but IIRC, this path is for
bio==NULL only. So, with this patch we would get a rq==NULL here and
hence the bio_wouldblock_error() call.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
@ 2017-03-24 11:32       ` Goldwyn Rodrigues
  0 siblings, 0 replies; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-03-24 11:32 UTC (permalink / raw)
  To: Jens Axboe, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: jack-IBi9RG/b67k, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, sagi-NQWnxTmZq1alnMjI0IkVqw,
	avi-VrcmuVmyx1hWk0Htik3J/w, linux-api-u79uwXL29TY76Z2rM5mHXA,
	willy-wEGCiKHe2LqWVfeAwA7xHQ, Goldwyn Rodrigues



On 03/16/2017 09:33 AM, Jens Axboe wrote:
> On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 0eeb99e..2e5cba2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>>  	do {
>>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>  
>> -		if (likely(blk_queue_enter(q, false) == 0)) {
>> +		if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>>  			struct bio_list hold;
>>  			struct bio_list lower, same;
>>  
>> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>>  			bio_list_merge(&bio_list_on_stack, &same);
>>  			bio_list_merge(&bio_list_on_stack, &hold);
>>  		} else {
>> -			bio_io_error(bio);
>> +			if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
>> +				bio_wouldblock_error(bio);
>> +			else
>> +				bio_io_error(bio);
> 
> This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
> happened to be set?
> 

Yes, I need to add a condition here to check for blk_queue_dying(). Thanks.

> And you're missing wbt_wait() as well as a blocking point. Ditto in
> blk-mq.

wbt_wait() does not apply to WRITE_ODIRECT


> 
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 159187a..942ce8c 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1518,6 +1518,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>>  	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
>>  	if (unlikely(!rq)) {
>>  		__wbt_done(q->rq_wb, wb_acct);
>> +		if (bio && bio_flagged(bio, BIO_NOWAIT))
>> +			bio_wouldblock_error(bio);
>>  		return BLK_QC_T_NONE;
>>  	}
>>  
> 
> This seems a little fragile now, since not both paths free the bio.
> 

Direct I/O should free the bios in bio_dio_complete(). I am not sure why
it would not free bio here originally, but IIRC, this path is for
bio==NULL only. So, with this patch we would get a rq==NULL here and
hence the bio_wouldblock_error() call.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-03-17 12:23     ` Goldwyn Rodrigues
@ 2017-03-20 17:33       ` Jan Kara
  0 siblings, 0 replies; 63+ messages in thread
From: Jan Kara @ 2017-03-20 17:33 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Dave Chinner, Goldwyn Rodrigues, linux-fsdevel, jack, hch,
	linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi, avi,
	axboe, linux-api, willy

On Fri 17-03-17 07:23:24, Goldwyn Rodrigues wrote:
> On 03/16/2017 04:31 PM, Dave Chinner wrote:
> > On Wed, Mar 15, 2017 at 04:51:04PM -0500, Goldwyn Rodrigues wrote:
> >> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >>
> >> A new flag BIO_NOWAIT is introduced to identify bio's
> >> orignating from iocb with IOCB_NOWAIT. This flag indicates
> >> to return immediately if a request cannot be made instead
> >> of retrying.
> > 
> > So this makes a congested block device run the bio IO completion
> > callback with an -EAGAIN error present? Are all the filesystem
> > direct IO submission and completion routines OK with that? i.e. does
> > such a congestion case cause filesystems to temporarily expose stale
> > data to unprivileged users when the IO is requeued in this way?
> > 
> > e.g. filesystem does allocation without blocking, submits bio,
> > device is congested, runs IO completion with error, so nothing
> > written to allocated blocks, write gets queued, so other read
> > comes in while the write is queued, reads data from uninitialised
> > blocks that were allocated during the write....
> > 
> > Seems kinda problematic to me to have a undocumented design
> > constraint (i.e a landmine) where we submit the AIO only to have it
> > error out and then expect the filesystem to do something special and
> > different /without blocking/ on EAGAIN.
> 
> If the filesystems has to perform block allocation, we would return
> -EAGAIN early enough. However, I agree there is a problem, since not all
> filesystems know this. I worked on only three of them.

So Dave has a good point about the missing documentation explaining the
expectation from the filesystem. Other than that the filesystem is
responsible for determining in its direct IO submission routine whether it
can deal with NOWAIT direct IO without blocking (including backing out of
block layer refuses to do the IO) and if not, just return with EAGAIN right
away.

Probably we should also have a feature flag in filesystem_type telling
whether a particular filesystem knows about NOWAIT direct IO at all so that
we don't have to add stub to each and every filesystem supporting direct IO
that returns EAGAIN when NOWAIT is set (OTOH adding that stub won't be that
hard either since there are not *that* many filesystems supporting direct
IO). But one or the other has to be done.

> > Why isn't the congestion check at a higher layer like we do for page
> > cache readahead? i.e. using the bdi*congested() API at the time we
> > are doing all the other filesystem blocking checks.
> > 
> 
> Yes, that may work better. We will have to call bdi_read_congested() on
> a write path. (will have to comment that part of the code). Would it
> encompass all possible waits in the block layer?

Congestion mechanism is not really reliable. The bdi can show the device is
not congested but block layer will still block you waiting for some
resource. And for some of the stuff like tag allocation you cannot really
do a reliable check in advance so I don't think congestion checks are
really a viable alternative.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-03-16 14:33     ` Jens Axboe
  (?)
  (?)
@ 2017-03-17 12:23     ` Goldwyn Rodrigues
  -1 siblings, 0 replies; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-03-17 12:23 UTC (permalink / raw)
  To: Jens Axboe, Goldwyn Rodrigues, linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, linux-api, willy



On 03/16/2017 09:33 AM, Jens Axboe wrote:
> On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 0eeb99e..2e5cba2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>>  	do {
>>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>  
>> -		if (likely(blk_queue_enter(q, false) == 0)) {
>> +		if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>>  			struct bio_list hold;
>>  			struct bio_list lower, same;
>>  
>> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>>  			bio_list_merge(&bio_list_on_stack, &same);
>>  			bio_list_merge(&bio_list_on_stack, &hold);
>>  		} else {
>> -			bio_io_error(bio);
>> +			if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
>> +				bio_wouldblock_error(bio);
>> +			else
>> +				bio_io_error(bio);
> 
> This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
> happened to be set?

Yes, need a check for that.
> 
> And you're missing wbt_wait() as well as a blocking point. Ditto in
> blk-mq.

Agree.

> 
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 159187a..942ce8c 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1518,6 +1518,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>>  	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
>>  	if (unlikely(!rq)) {
>>  		__wbt_done(q->rq_wb, wb_acct);
>> +		if (bio && bio_flagged(bio, BIO_NOWAIT))
>> +			bio_wouldblock_error(bio);
>>  		return BLK_QC_T_NONE;
>>  	}
>>  
> 
> This seems a little fragile now, since not both paths free the bio.
> 

bio_put() is handled in dio_bio_complete(). However, I am not sure why
there is no bio_endio() call for !rq. IIRC, it was not meant to be.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-03-16 21:31     ` Dave Chinner
  (?)
@ 2017-03-17 12:23     ` Goldwyn Rodrigues
  2017-03-20 17:33       ` Jan Kara
  -1 siblings, 1 reply; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-03-17 12:23 UTC (permalink / raw)
  To: Dave Chinner, Goldwyn Rodrigues
  Cc: linux-fsdevel, jack, hch, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy



On 03/16/2017 04:31 PM, Dave Chinner wrote:
> On Wed, Mar 15, 2017 at 04:51:04PM -0500, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>
>> A new flag BIO_NOWAIT is introduced to identify bio's
>> orignating from iocb with IOCB_NOWAIT. This flag indicates
>> to return immediately if a request cannot be made instead
>> of retrying.
> 
> So this makes a congested block device run the bio IO completion
> callback with an -EAGAIN error present? Are all the filesystem
> direct IO submission and completion routines OK with that? i.e. does
> such a congestion case cause filesystems to temporarily expose stale
> data to unprivileged users when the IO is requeued in this way?
> 
> e.g. filesystem does allocation without blocking, submits bio,
> device is congested, runs IO completion with error, so nothing
> written to allocated blocks, write gets queued, so other read
> comes in while the write is queued, reads data from uninitialised
> blocks that were allocated during the write....
> 
> Seems kinda problematic to me to have a undocumented design
> constraint (i.e a landmine) where we submit the AIO only to have it
> error out and then expect the filesystem to do something special and
> different /without blocking/ on EAGAIN.


If the filesystems has to perform block allocation, we would return
-EAGAIN early enough. However, I agree there is a problem, since not all
filesystems know this. I worked on only three of them.

> 
> Why isn't the congestion check at a higher layer like we do for page
> cache readahead? i.e. using the bdi*congested() API at the time we
> are doing all the other filesystem blocking checks.
> 

Yes, that may work better. We will have to call bdi_read_congested() on
a write path. (will have to comment that part of the code). Would it
encompass all possible waits in the block layer?


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
  2017-03-16 14:33     ` Jens Axboe
  (?)
@ 2017-03-17  2:03     ` Ming Lei
  -1 siblings, 0 replies; 63+ messages in thread
From: Ming Lei @ 2017-03-17  2:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Goldwyn Rodrigues, Linux FS Devel, Jan Kara, Christoph Hellwig,
	linux-block, Btrfs BTRFS, open list:EXT4 FILE SYSTEM,
	open list:XFS FILESYSTEM, Sagi Grimberg, avi, linux-api,
	Matthew Wilcox, Goldwyn Rodrigues

On Thu, Mar 16, 2017 at 10:33 PM, Jens Axboe <axboe@kernel.dk> wrote:
> On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 0eeb99e..2e5cba2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>>       do {
>>               struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>
>> -             if (likely(blk_queue_enter(q, false) == 0)) {
>> +             if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>>                       struct bio_list hold;
>>                       struct bio_list lower, same;
>>
>> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>>                       bio_list_merge(&bio_list_on_stack, &same);
>>                       bio_list_merge(&bio_list_on_stack, &hold);
>>               } else {
>> -                     bio_io_error(bio);
>> +                     if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
>> +                             bio_wouldblock_error(bio);
>> +                     else
>> +                             bio_io_error(bio);
>
> This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
> happened to be set?
>
> And you're missing wbt_wait() as well as a blocking point. Ditto in
> blk-mq.

There are also tons of block points in dm/md/bcache, which need to be
considered or be documented at least, :-)



Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
@ 2017-03-16 21:31     ` Dave Chinner
  0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2017-03-16 21:31 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, jack, hch, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy, Goldwyn Rodrigues

On Wed, Mar 15, 2017 at 04:51:04PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> A new flag BIO_NOWAIT is introduced to identify bio's
> orignating from iocb with IOCB_NOWAIT. This flag indicates
> to return immediately if a request cannot be made instead
> of retrying.

So this makes a congested block device run the bio IO completion
callback with an -EAGAIN error present? Are all the filesystem
direct IO submission and completion routines OK with that? i.e. does
such a congestion case cause filesystems to temporarily expose stale
data to unprivileged users when the IO is requeued in this way?

e.g. filesystem does allocation without blocking, submits bio,
device is congested, runs IO completion with error, so nothing
written to allocated blocks, write gets queued, so other read
comes in while the write is queued, reads data from uninitialised
blocks that were allocated during the write....

Seems kinda problematic to me to have a undocumented design
constraint (i.e a landmine) where we submit the AIO only to have it
error out and then expect the filesystem to do something special and
different /without blocking/ on EAGAIN.

Why isn't the congestion check at a higher layer like we do for page
cache readahead? i.e. using the bdi*congested() API at the time we
are doing all the other filesystem blocking checks.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
@ 2017-03-16 21:31     ` Dave Chinner
  0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2017-03-16 21:31 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, jack-IBi9RG/b67k,
	hch-wEGCiKHe2LqWVfeAwA7xHQ, linux-block-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, sagi-NQWnxTmZq1alnMjI0IkVqw,
	avi-VrcmuVmyx1hWk0Htik3J/w, axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-api-u79uwXL29TY76Z2rM5mHXA, willy-wEGCiKHe2LqWVfeAwA7xHQ,
	Goldwyn Rodrigues

On Wed, Mar 15, 2017 at 04:51:04PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org>
> 
> A new flag BIO_NOWAIT is introduced to identify bio's
> orignating from iocb with IOCB_NOWAIT. This flag indicates
> to return immediately if a request cannot be made instead
> of retrying.

So this makes a congested block device run the bio IO completion
callback with an -EAGAIN error present? Are all the filesystem
direct IO submission and completion routines OK with that? i.e. does
such a congestion case cause filesystems to temporarily expose stale
data to unprivileged users when the IO is requeued in this way?

e.g. filesystem does allocation without blocking, submits bio,
device is congested, runs IO completion with error, so nothing
written to allocated blocks, write gets queued, so other read
comes in while the write is queued, reads data from uninitialised
blocks that were allocated during the write....

Seems kinda problematic to me to have a undocumented design
constraint (i.e a landmine) where we submit the AIO only to have it
error out and then expect the filesystem to do something special and
different /without blocking/ on EAGAIN.

Why isn't the congestion check at a higher layer like we do for page
cache readahead? i.e. using the bdi*congested() API at the time we
are doing all the other filesystem blocking checks.

-Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
@ 2017-03-16 14:33     ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2017-03-16 14:33 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, linux-api, willy, Goldwyn Rodrigues

On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 0eeb99e..2e5cba2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	do {
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
> -		if (likely(blk_queue_enter(q, false) == 0)) {
> +		if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>  			struct bio_list hold;
>  			struct bio_list lower, same;
>  
> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>  			bio_list_merge(&bio_list_on_stack, &same);
>  			bio_list_merge(&bio_list_on_stack, &hold);
>  		} else {
> -			bio_io_error(bio);
> +			if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
> +				bio_wouldblock_error(bio);
> +			else
> +				bio_io_error(bio);

This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
happened to be set?

And you're missing wbt_wait() as well as a blocking point. Ditto in
blk-mq.

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a..942ce8c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1518,6 +1518,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>  	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
>  	if (unlikely(!rq)) {
>  		__wbt_done(q->rq_wb, wb_acct);
> +		if (bio && bio_flagged(bio, BIO_NOWAIT))
> +			bio_wouldblock_error(bio);
>  		return BLK_QC_T_NONE;
>  	}
>  

This seems a little fragile now, since not both paths free the bio.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/8] nowait aio: return on congested block device
@ 2017-03-16 14:33     ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2017-03-16 14:33 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: jack-IBi9RG/b67k, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, sagi-NQWnxTmZq1alnMjI0IkVqw,
	avi-VrcmuVmyx1hWk0Htik3J/w, linux-api-u79uwXL29TY76Z2rM5mHXA,
	willy-wEGCiKHe2LqWVfeAwA7xHQ, Goldwyn Rodrigues

On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 0eeb99e..2e5cba2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	do {
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
> -		if (likely(blk_queue_enter(q, false) == 0)) {
> +		if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>  			struct bio_list hold;
>  			struct bio_list lower, same;
>  
> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>  			bio_list_merge(&bio_list_on_stack, &same);
>  			bio_list_merge(&bio_list_on_stack, &hold);
>  		} else {
> -			bio_io_error(bio);
> +			if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
> +				bio_wouldblock_error(bio);
> +			else
> +				bio_io_error(bio);

This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
happened to be set?

And you're missing wbt_wait() as well as a blocking point. Ditto in
blk-mq.

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a..942ce8c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1518,6 +1518,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>  	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
>  	if (unlikely(!rq)) {
>  		__wbt_done(q->rq_wb, wb_acct);
> +		if (bio && bio_flagged(bio, BIO_NOWAIT))
> +			bio_wouldblock_error(bio);
>  		return BLK_QC_T_NONE;
>  	}
>  

This seems a little fragile now, since not both paths free the bio.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 5/8] nowait aio: return on congested block device
  2017-03-15 21:50 [PATCH 0/8 v3] No wait AIO Goldwyn Rodrigues
@ 2017-03-15 21:51 ` Goldwyn Rodrigues
  2017-03-16 14:33     ` Jens Axboe
  2017-03-16 21:31     ` Dave Chinner
  0 siblings, 2 replies; 63+ messages in thread
From: Goldwyn Rodrigues @ 2017-03-15 21:51 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, axboe, linux-api, willy, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

A new flag BIO_NOWAIT is introduced to identify bio's
orignating from iocb with IOCB_NOWAIT. This flag indicates
to return immediately if a request cannot be made instead
of retrying.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 block/blk-core.c          | 12 ++++++++++--
 block/blk-mq-sched.c      |  3 +++
 block/blk-mq.c            |  4 ++++
 fs/direct-io.c            | 11 +++++++++--
 include/linux/bio.h       |  6 ++++++
 include/linux/blk_types.h |  1 +
 6 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0eeb99e..2e5cba2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1232,6 +1232,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 	if (!IS_ERR(rq))
 		return rq;
 
+	if (bio && bio_flagged(bio, BIO_NOWAIT)) {
+		blk_put_rl(rl);
+		return ERR_PTR(-EAGAIN);
+	}
+
 	if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
 		blk_put_rl(rl);
 		return rq;
@@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-		if (likely(blk_queue_enter(q, false) == 0)) {
+		if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
 			struct bio_list hold;
 			struct bio_list lower, same;
 
@@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
 			bio_list_merge(&bio_list_on_stack, &same);
 			bio_list_merge(&bio_list_on_stack, &hold);
 		} else {
-			bio_io_error(bio);
+			if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
+				bio_wouldblock_error(bio);
+			else
+				bio_io_error(bio);
 		}
 		bio = bio_list_pop(current->bio_list);
 	} while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff..40e78b5 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
 	if (likely(!data->hctx))
 		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+	if (likely(bio) && bio_flagged(bio, BIO_NOWAIT))
+		data->flags |= BLK_MQ_REQ_NOWAIT;
+
 	if (e) {
 		data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a..942ce8c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1518,6 +1518,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
+		if (bio && bio_flagged(bio, BIO_NOWAIT))
+			bio_wouldblock_error(bio);
 		return BLK_QC_T_NONE;
 	}
 
@@ -1642,6 +1644,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
+		if (bio && bio_flagged(bio, BIO_NOWAIT))
+			bio_wouldblock_error(bio);
 		return BLK_QC_T_NONE;
 	}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea..f6835d3 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,9 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	else
 		bio->bi_end_io = dio_bio_end_io;
 
+	if (dio->iocb->ki_flags & IOCB_NOWAIT)
+		bio_set_flag(bio, BIO_NOWAIT);
+
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
@@ -480,8 +483,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
 	unsigned i;
 	int err;
 
-	if (bio->bi_error)
-		dio->io_error = -EIO;
+	if (bio->bi_error) {
+		if (bio_flagged(bio, BIO_NOWAIT))
+			dio->io_error = -EAGAIN;
+		else
+			dio->io_error = -EIO;
+	}
 
 	if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
 		err = bio->bi_error;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e52119..1a92707 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -425,6 +425,12 @@ static inline void bio_io_error(struct bio *bio)
 	bio_endio(bio);
 }
 
+static inline void bio_wouldblock_error(struct bio *bio)
+{
+	bio->bi_error = -EAGAIN;
+	bio_endio(bio);
+}
+
 struct request_queue;
 extern int bio_phys_segments(struct request_queue *, struct bio *);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d703acb..514c08e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -102,6 +102,7 @@ struct bio {
 #define BIO_REFFED	8	/* bio has elevated ->bi_cnt */
 #define BIO_THROTTLED	9	/* This bio has already been subjected to
 				 * throttling rules. Don't do it again. */
+#define BIO_NOWAIT	10	/* don't block over blk device congestion */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2017-05-11 18:16 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-28 23:36 [PATCH 0/8 v2] Non-blocking AIO Goldwyn Rodrigues
2017-02-28 23:36 ` [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT Goldwyn Rodrigues
2017-03-01 15:36   ` Christoph Hellwig
2017-03-01 15:56     ` Christoph Hellwig
2017-03-01 16:57       ` Goldwyn Rodrigues
2017-03-01 22:44         ` Christoph Hellwig
2017-02-28 23:36 ` [PATCH 2/8] nowait aio: Return if cannot get hold of i_rwsem Goldwyn Rodrigues
2017-03-01 15:37   ` Christoph Hellwig
2017-02-28 23:36 ` [PATCH 3/8] nowait aio: return if direct write will trigger writeback Goldwyn Rodrigues
2017-03-01  3:46   ` Matthew Wilcox
2017-03-01 15:38     ` Christoph Hellwig
2017-03-02 10:38       ` Jan Kara
2017-03-02 14:12         ` Matthew Wilcox
2017-03-02 15:22           ` Jan Kara
2017-02-28 23:36 ` [PATCH 4/8] nowait aio: Introduce IOMAP_NOWAIT Goldwyn Rodrigues
2017-02-28 23:36 ` [PATCH 5/8] nowait aio: return on congested block device Goldwyn Rodrigues
2017-03-08  7:03   ` Sagi Grimberg
2017-03-08 15:00     ` Goldwyn Rodrigues
2017-03-08 15:28       ` Jan Kara
2017-03-08 15:51         ` Christoph Hellwig
2017-03-08 16:17       ` Jens Axboe
2017-03-09  2:18         ` Goldwyn Rodrigues
2017-02-28 23:36 ` [PATCH 6/8] nowait aio: ext4 Goldwyn Rodrigues
2017-02-28 23:36 ` [PATCH 7/8] nowait aio: xfs Goldwyn Rodrigues
2017-03-01 15:40   ` Christoph Hellwig
2017-02-28 23:36 ` [PATCH 8/8] nowait aio: btrfs Goldwyn Rodrigues
2017-03-05 14:56 ` [PATCH 0/8 v2] Non-blocking AIO Avi Kivity
2017-03-06  8:25   ` Jan Kara
2017-03-06  8:40     ` Avi Kivity
2017-03-06 15:19     ` Jens Axboe
2017-03-06 15:29       ` Avi Kivity
2017-03-06 15:38         ` Jens Axboe
2017-03-06 15:59           ` Avi Kivity
2017-03-06 16:08             ` Jens Axboe
2017-03-06 16:59               ` Avi Kivity
2017-03-06 17:06                 ` Jens Axboe
2017-03-06 18:17                   ` Avi Kivity
2017-03-06 18:27                     ` Jens Axboe
2017-03-06 18:50                       ` Avi Kivity
2017-03-15 21:50 [PATCH 0/8 v3] No wait AIO Goldwyn Rodrigues
2017-03-15 21:51 ` [PATCH 5/8] nowait aio: return on congested block device Goldwyn Rodrigues
2017-03-16 14:33   ` Jens Axboe
2017-03-16 14:33     ` Jens Axboe
2017-03-17  2:03     ` Ming Lei
2017-03-17 12:23     ` Goldwyn Rodrigues
2017-03-24 11:32     ` Goldwyn Rodrigues
2017-03-24 11:32       ` Goldwyn Rodrigues
2017-03-24 14:39       ` Jens Axboe
2017-03-16 21:31   ` Dave Chinner
2017-03-16 21:31     ` Dave Chinner
2017-03-17 12:23     ` Goldwyn Rodrigues
2017-03-20 17:33       ` Jan Kara
2017-04-03 18:52 [PATCH 0/8 v4] No wait AIO Goldwyn Rodrigues
2017-04-03 18:53 ` [PATCH 5/8] nowait aio: return on congested block device Goldwyn Rodrigues
2017-04-04  6:49   ` Christoph Hellwig
2017-04-14 12:02 [PATCH 0/8 v6] No wait AIO Goldwyn Rodrigues
2017-04-14 12:02 ` [PATCH 5/8] nowait aio: return on congested block device Goldwyn Rodrigues
2017-04-19  6:45   ` Christoph Hellwig
2017-04-19 15:21     ` Goldwyn Rodrigues
2017-04-20 13:43       ` Jan Kara
2017-04-24 21:10     ` Goldwyn Rodrigues
2017-04-25  2:28       ` Jens Axboe
2017-05-09 12:22 [PATCH 0/8 v7] No wait AIO Goldwyn Rodrigues
2017-05-09 12:22 ` [PATCH 5/8] nowait aio: return on congested block device Goldwyn Rodrigues
2017-05-11  7:44   ` Christoph Hellwig
2017-05-11 18:16     ` Goldwyn Rodrigues
2017-05-11 18:16       ` Goldwyn Rodrigues

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.