* buffered I/O without buffer heads in xfs and iomap v3
@ 2018-05-23 14:43 Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 01/34] block: add a lower-level bio_add_page interface Christoph Hellwig
                   ` (33 more replies)
  0 siblings, 34 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Hi all,

this series adds support for buffered I/O without buffer heads to
the iomap and XFS code.

For now this series only contains support for block size == PAGE_SIZE,
with support for block sizes smaller than the page size (the 4k case)
split into a separate series.


A git tree is available at:

    git://git.infradead.org/users/hch/xfs.git xfs-iomap-read.3

Gitweb:

    http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-iomap-read.3

Changes since v2:
 - minor page_seek_hole_data tweaks
 - don't read data entirely covered by the write operation in write_begin
 - fix zeroing on write_begin I/O failure
 - remove iomap_block_needs_zeroing to make the code clearer
 - update comments on __do_page_cache_readahead

Changes since v1:
 - fix the iomap_readpages error handling
 - use unsigned file offsets in a few places to avoid arithmetic overflows
 - allocate an iomap_page in iomap_page_mkwrite to fix generic/095
 - improve a few comments
 - add more asserts
 - warn about truncated block numbers from ->bmap
 - new patch to change the __do_page_cache_readahead return value to
   unsigned int
 - remove an incorrectly added empty line
 - make inline data an explicit iomap type instead of a flag
 - add an IOMAP_F_BUFFER_HEAD flag to force use of buffer heads for gfs2,
   and keep the basic buffer head infrastructure around for now.


* [PATCH 01/34] block: add a lower-level bio_add_page interface
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:28   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 02/34] fs: factor out a __generic_write_end helper Christoph Hellwig
                   ` (32 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

For the upcoming removal of buffer heads in XFS we need to keep track of
the number of outstanding writeback requests per page.  For this we need
to know if bio_add_page merged a region with the previous bvec or not.
Instead of adding additional arguments this refactors bio_add_page to
be implemented using three lower level helpers which users like XFS can
use directly if they care about the merge decisions.
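
As a rough illustration of how a caller that cares about the merge decision
might combine the three helpers, here is a small userspace sketch.  It only
models the logic: the model_* structures and functions are simplified
stand-ins invented for this example, not the kernel's struct bio or bio_vec,
and model_add_tracked() with its "pending" counter is a hypothetical caller
rather than the XFS implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for illustration; NOT the kernel definitions. */
#define MODEL_MAX_VECS 4

struct model_bvec {
	const void *page;	/* identity of the page */
	unsigned int len;	/* bytes covered by this segment */
	unsigned int off;	/* offset of the data within the page */
};

struct model_bio {
	struct model_bvec vecs[MODEL_MAX_VECS];
	unsigned int vcnt;	/* number of segments in use */
	unsigned int size;	/* total bytes added so far */
};

/* Models bio_full(): no room for another segment. */
static bool model_bio_full(const struct model_bio *bio)
{
	return bio->vcnt >= MODEL_MAX_VECS;
}

/* Models __bio_try_merge_page(): extend the last segment if contiguous. */
static bool model_try_merge_page(struct model_bio *bio, const void *page,
		unsigned int len, unsigned int off)
{
	if (bio->vcnt > 0) {
		struct model_bvec *bv = &bio->vecs[bio->vcnt - 1];

		if (page == bv->page && off == bv->off + bv->len) {
			bv->len += len;
			bio->size += len;
			return true;
		}
	}
	return false;
}

/* Models __bio_add_page(): the caller must have checked for space. */
static void model_add_page(struct model_bio *bio, const void *page,
		unsigned int len, unsigned int off)
{
	struct model_bvec *bv;

	assert(!model_bio_full(bio));
	bv = &bio->vecs[bio->vcnt++];
	bv->page = page;
	bv->len = len;
	bv->off = off;
	bio->size += len;
}

/*
 * Hypothetical caller: bump *pending only when the data ended up in a new
 * segment, i.e. when the merge attempt failed.  Returns false if the bio
 * is full and nothing was added.
 */
static bool model_add_tracked(struct model_bio *bio, const void *page,
		unsigned int len, unsigned int off, unsigned int *pending)
{
	if (model_try_merge_page(bio, page, len, off))
		return true;
	if (model_bio_full(bio))
		return false;
	model_add_page(bio, page, len, off);
	(*pending)++;
	return true;
}
```

Adding two adjacent 512-byte ranges of the same page through
model_add_tracked() leaves one segment and a count of one; adding a range
from a different page starts a second segment and raises the count to two.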

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c         | 96 +++++++++++++++++++++++++++++----------------
 include/linux/bio.h |  9 +++++
 2 files changed, 72 insertions(+), 33 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 53e0f0a1ed94..fdf635d42bbd 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -773,7 +773,7 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
 			return 0;
 	}
 
-	if (bio->bi_vcnt >= bio->bi_max_vecs)
+	if (bio_full(bio))
 		return 0;
 
 	/*
@@ -821,52 +821,82 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
 EXPORT_SYMBOL(bio_add_pc_page);
 
 /**
- *	bio_add_page	-	attempt to add page to bio
- *	@bio: destination bio
- *	@page: page to add
- *	@len: vec entry length
- *	@offset: vec entry offset
+ * __bio_try_merge_page - try appending data to an existing bvec.
+ * @bio: destination bio
+ * @page: page to add
+ * @len: length of the data to add
+ * @off: offset of the data in @page
  *
- *	Attempt to add a page to the bio_vec maplist. This will only fail
- *	if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio.
+ * Try to add the data at @page + @off to the last bvec of @bio.  This is a
+ * useful optimisation for file systems with a block size smaller than the
+ * page size.
+ *
+ * Return %true on success or %false on failure.
  */
-int bio_add_page(struct bio *bio, struct page *page,
-		 unsigned int len, unsigned int offset)
+bool __bio_try_merge_page(struct bio *bio, struct page *page,
+		unsigned int len, unsigned int off)
 {
-	struct bio_vec *bv;
-
-	/*
-	 * cloned bio must not modify vec list
-	 */
 	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
-		return 0;
+		return false;
 
-	/*
-	 * For filesystems with a blocksize smaller than the pagesize
-	 * we will often be called with the same page as last time and
-	 * a consecutive offset.  Optimize this special case.
-	 */
 	if (bio->bi_vcnt > 0) {
-		bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
-		if (page == bv->bv_page &&
-		    offset == bv->bv_offset + bv->bv_len) {
+		if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
 			bv->bv_len += len;
-			goto done;
+			bio->bi_iter.bi_size += len;
+			return true;
 		}
 	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(__bio_try_merge_page);
 
-	if (bio->bi_vcnt >= bio->bi_max_vecs)
-		return 0;
+/**
+ * __bio_add_page - add page to a bio in a new segment
+ * @bio: destination bio
+ * @page: page to add
+ * @len: length of the data to add
+ * @off: offset of the data in @page
+ *
+ * Add the data at @page + @off to @bio as a new bvec.  The caller must ensure
+ * that @bio has space for another bvec.
+ */
+void __bio_add_page(struct bio *bio, struct page *page,
+		unsigned int len, unsigned int off)
+{
+	struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt];
 
-	bv		= &bio->bi_io_vec[bio->bi_vcnt];
-	bv->bv_page	= page;
-	bv->bv_len	= len;
-	bv->bv_offset	= offset;
+	WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
+	WARN_ON_ONCE(bio_full(bio));
+
+	bv->bv_page = page;
+	bv->bv_offset = off;
+	bv->bv_len = len;
 
-	bio->bi_vcnt++;
-done:
 	bio->bi_iter.bi_size += len;
+	bio->bi_vcnt++;
+}
+EXPORT_SYMBOL_GPL(__bio_add_page);
+
+/**
+ *	bio_add_page	-	attempt to add page to bio
+ *	@bio: destination bio
+ *	@page: page to add
+ *	@len: vec entry length
+ *	@offset: vec entry offset
+ *
+ *	Attempt to add a page to the bio_vec maplist. This will only fail
+ *	if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio.
+ */
+int bio_add_page(struct bio *bio, struct page *page,
+		 unsigned int len, unsigned int offset)
+{
+	if (!__bio_try_merge_page(bio, page, len, offset)) {
+		if (bio_full(bio))
+			return 0;
+		__bio_add_page(bio, page, len, offset);
+	}
 	return len;
 }
 EXPORT_SYMBOL(bio_add_page);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ce547a25e8ae..3e73c8bc25ea 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -123,6 +123,11 @@ static inline void *bio_data(struct bio *bio)
 	return NULL;
 }
 
+static inline bool bio_full(struct bio *bio)
+{
+	return bio->bi_vcnt >= bio->bi_max_vecs;
+}
+
 /*
  * will die
  */
@@ -470,6 +475,10 @@ void bio_chain(struct bio *, struct bio *);
 extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
 			   unsigned int, unsigned int);
+bool __bio_try_merge_page(struct bio *bio, struct page *page,
+		unsigned int len, unsigned int off);
+void __bio_add_page(struct bio *bio, struct page *page,
+		unsigned int len, unsigned int off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
 struct rq_map_data;
 extern struct bio *bio_map_user_iov(struct request_queue *,
-- 
2.17.0


* [PATCH 02/34] fs: factor out a __generic_write_end helper
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 01/34] block: add a lower-level bio_add_page interface Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:30   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 03/34] fs: move page_cache_seek_hole_data to iomap.c Christoph Hellwig
                   ` (31 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Factor out the bits of the buffer.c based write_end implementations that
don't know about buffer_heads, so that they can be reused by other
implementations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/buffer.c   | 67 +++++++++++++++++++++++++++------------------------
 fs/internal.h |  2 ++
 2 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 249b83fafe48..bd964b2ad99a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2076,6 +2076,40 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
 }
 EXPORT_SYMBOL(block_write_begin);
 
+int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
+		struct page *page)
+{
+	loff_t old_size = inode->i_size;
+	bool i_size_changed = false;
+
+	/*
+	 * No need to use i_size_read() here, the i_size cannot change under us
+	 * because we hold i_rwsem.
+	 *
+	 * But it's important to update i_size while still holding page lock:
+	 * page writeout could otherwise come in and zero beyond i_size.
+	 */
+	if (pos + copied > inode->i_size) {
+		i_size_write(inode, pos + copied);
+		i_size_changed = true;
+	}
+
+	unlock_page(page);
+	put_page(page);
+
+	if (old_size < pos)
+		pagecache_isize_extended(inode, old_size, pos);
+	/*
+	 * Don't mark the inode dirty under page lock. First, it unnecessarily
+	 * makes the holding time of page lock longer. Second, it forces lock
+	 * ordering of page lock and transaction start for journaling
+	 * filesystems.
+	 */
+	if (i_size_changed)
+		mark_inode_dirty(inode);
+	return copied;
+}
+
 int block_write_end(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata)
@@ -2116,39 +2150,8 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata)
 {
-	struct inode *inode = mapping->host;
-	loff_t old_size = inode->i_size;
-	int i_size_changed = 0;
-
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
-
-	/*
-	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hold i_mutex.
-	 *
-	 * But it's important to update i_size while still holding page lock:
-	 * page writeout could otherwise come in and zero beyond i_size.
-	 */
-	if (pos+copied > inode->i_size) {
-		i_size_write(inode, pos+copied);
-		i_size_changed = 1;
-	}
-
-	unlock_page(page);
-	put_page(page);
-
-	if (old_size < pos)
-		pagecache_isize_extended(inode, old_size, pos);
-	/*
-	 * Don't mark the inode dirty under page lock. First, it unnecessarily
-	 * makes the holding time of page lock longer. Second, it forces lock
-	 * ordering of page lock and transaction start for journaling
-	 * filesystems.
-	 */
-	if (i_size_changed)
-		mark_inode_dirty(inode);
-
-	return copied;
+	return __generic_write_end(mapping->host, pos, copied, page);
 }
 EXPORT_SYMBOL(generic_write_end);
 
diff --git a/fs/internal.h b/fs/internal.h
index e08972db0303..b955232d3d49 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -43,6 +43,8 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
 extern void guard_bio_eod(int rw, struct bio *bio);
 extern int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
 		get_block_t *get_block, struct iomap *iomap);
+int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
+		struct page *page);
 
 /*
  * char_dev.c
-- 
2.17.0


* [PATCH 03/34] fs: move page_cache_seek_hole_data to iomap.c
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 01/34] block: add a lower-level bio_add_page interface Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 02/34] fs: factor out a __generic_write_end helper Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:31   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 04/34] fs: remove the buffer_unwritten check in page_seek_hole_data Christoph Hellwig
                   ` (30 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

This function is only used by the iomap code, depends on being called
from it, and will soon stop poking into buffer head internals.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/buffer.c                 | 114 -----------------------------------
 fs/iomap.c                  | 116 ++++++++++++++++++++++++++++++++++++
 include/linux/buffer_head.h |   2 -
 3 files changed, 116 insertions(+), 116 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index bd964b2ad99a..aba2a948b235 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3430,120 +3430,6 @@ int bh_submit_read(struct buffer_head *bh)
 }
 EXPORT_SYMBOL(bh_submit_read);
 
-/*
- * Seek for SEEK_DATA / SEEK_HOLE within @page, starting at @lastoff.
- *
- * Returns the offset within the file on success, and -ENOENT otherwise.
- */
-static loff_t
-page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
-{
-	loff_t offset = page_offset(page);
-	struct buffer_head *bh, *head;
-	bool seek_data = whence == SEEK_DATA;
-
-	if (lastoff < offset)
-		lastoff = offset;
-
-	bh = head = page_buffers(page);
-	do {
-		offset += bh->b_size;
-		if (lastoff >= offset)
-			continue;
-
-		/*
-		 * Unwritten extents that have data in the page cache covering
-		 * them can be identified by the BH_Unwritten state flag.
-		 * Pages with multiple buffers might have a mix of holes, data
-		 * and unwritten extents - any buffer with valid data in it
-		 * should have BH_Uptodate flag set on it.
-		 */
-
-		if ((buffer_unwritten(bh) || buffer_uptodate(bh)) == seek_data)
-			return lastoff;
-
-		lastoff = offset;
-	} while ((bh = bh->b_this_page) != head);
-	return -ENOENT;
-}
-
-/*
- * Seek for SEEK_DATA / SEEK_HOLE in the page cache.
- *
- * Within unwritten extents, the page cache determines which parts are holes
- * and which are data: unwritten and uptodate buffer heads count as data;
- * everything else counts as a hole.
- *
- * Returns the resulting offset on successs, and -ENOENT otherwise.
- */
-loff_t
-page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
-			  int whence)
-{
-	pgoff_t index = offset >> PAGE_SHIFT;
-	pgoff_t end = DIV_ROUND_UP(offset + length, PAGE_SIZE);
-	loff_t lastoff = offset;
-	struct pagevec pvec;
-
-	if (length <= 0)
-		return -ENOENT;
-
-	pagevec_init(&pvec);
-
-	do {
-		unsigned nr_pages, i;
-
-		nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, &index,
-						end - 1);
-		if (nr_pages == 0)
-			break;
-
-		for (i = 0; i < nr_pages; i++) {
-			struct page *page = pvec.pages[i];
-
-			/*
-			 * At this point, the page may be truncated or
-			 * invalidated (changing page->mapping to NULL), or
-			 * even swizzled back from swapper_space to tmpfs file
-			 * mapping.  However, page->index will not change
-			 * because we have a reference on the page.
-                         *
-			 * If current page offset is beyond where we've ended,
-			 * we've found a hole.
-                         */
-			if (whence == SEEK_HOLE &&
-			    lastoff < page_offset(page))
-				goto check_range;
-
-			lock_page(page);
-			if (likely(page->mapping == inode->i_mapping) &&
-			    page_has_buffers(page)) {
-				lastoff = page_seek_hole_data(page, lastoff, whence);
-				if (lastoff >= 0) {
-					unlock_page(page);
-					goto check_range;
-				}
-			}
-			unlock_page(page);
-			lastoff = page_offset(page) + PAGE_SIZE;
-		}
-		pagevec_release(&pvec);
-	} while (index < end);
-
-	/* When no page at lastoff and we are not done, we found a hole. */
-	if (whence != SEEK_HOLE)
-		goto not_found;
-
-check_range:
-	if (lastoff < offset + length)
-		goto out;
-not_found:
-	lastoff = -ENOENT;
-out:
-	pagevec_release(&pvec);
-	return lastoff;
-}
-
 void __init buffer_init(void)
 {
 	unsigned long nrpages;
diff --git a/fs/iomap.c b/fs/iomap.c
index f2456d0d8ddd..4a01d2f4e8e9 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -20,6 +20,7 @@
 #include <linux/mm.h>
 #include <linux/swap.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/file.h>
 #include <linux/uio.h>
 #include <linux/backing-dev.h>
@@ -588,6 +589,121 @@ int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
 }
 EXPORT_SYMBOL_GPL(iomap_fiemap);
 
+/*
+ * Seek for SEEK_DATA / SEEK_HOLE within @page, starting at @lastoff.
+ *
+ * Returns the offset within the file on success, and -ENOENT otherwise.
+ */
+static loff_t
+page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
+{
+	loff_t offset = page_offset(page);
+	struct buffer_head *bh, *head;
+	bool seek_data = whence == SEEK_DATA;
+
+	if (lastoff < offset)
+		lastoff = offset;
+
+	bh = head = page_buffers(page);
+	do {
+		offset += bh->b_size;
+		if (lastoff >= offset)
+			continue;
+
+		/*
+		 * Unwritten extents that have data in the page cache covering
+		 * them can be identified by the BH_Unwritten state flag.
+		 * Pages with multiple buffers might have a mix of holes, data
+		 * and unwritten extents - any buffer with valid data in it
+		 * should have BH_Uptodate flag set on it.
+		 */
+
+		if ((buffer_unwritten(bh) || buffer_uptodate(bh)) == seek_data)
+			return lastoff;
+
+		lastoff = offset;
+	} while ((bh = bh->b_this_page) != head);
+	return -ENOENT;
+}
+
+/*
+ * Seek for SEEK_DATA / SEEK_HOLE in the page cache.
+ *
+ * Within unwritten extents, the page cache determines which parts are holes
+ * and which are data: unwritten and uptodate buffer heads count as data;
+ * everything else counts as a hole.
+ *
+ * Returns the resulting offset on successs, and -ENOENT otherwise.
+ */
+static loff_t
+page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
+		int whence)
+{
+	pgoff_t index = offset >> PAGE_SHIFT;
+	pgoff_t end = DIV_ROUND_UP(offset + length, PAGE_SIZE);
+	loff_t lastoff = offset;
+	struct pagevec pvec;
+
+	if (length <= 0)
+		return -ENOENT;
+
+	pagevec_init(&pvec);
+
+	do {
+		unsigned nr_pages, i;
+
+		nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, &index,
+						end - 1);
+		if (nr_pages == 0)
+			break;
+
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			/*
+			 * At this point, the page may be truncated or
+			 * invalidated (changing page->mapping to NULL), or
+			 * even swizzled back from swapper_space to tmpfs file
+			 * mapping.  However, page->index will not change
+			 * because we have a reference on the page.
+                         *
+			 * If current page offset is beyond where we've ended,
+			 * we've found a hole.
+                         */
+			if (whence == SEEK_HOLE &&
+			    lastoff < page_offset(page))
+				goto check_range;
+
+			lock_page(page);
+			if (likely(page->mapping == inode->i_mapping) &&
+			    page_has_buffers(page)) {
+				lastoff = page_seek_hole_data(page, lastoff, whence);
+				if (lastoff >= 0) {
+					unlock_page(page);
+					goto check_range;
+				}
+			}
+			unlock_page(page);
+			lastoff = page_offset(page) + PAGE_SIZE;
+		}
+		pagevec_release(&pvec);
+	} while (index < end);
+
+	/* When no page at lastoff and we are not done, we found a hole. */
+	if (whence != SEEK_HOLE)
+		goto not_found;
+
+check_range:
+	if (lastoff < offset + length)
+		goto out;
+not_found:
+	lastoff = -ENOENT;
+out:
+	pagevec_release(&pvec);
+	return lastoff;
+}
+
+
 static loff_t
 iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length,
 		      void *data, struct iomap *iomap)
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 894e5d125de6..96225a77c112 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -205,8 +205,6 @@ void write_boundary_block(struct block_device *bdev,
 			sector_t bblock, unsigned blocksize);
 int bh_uptodate_or_lock(struct buffer_head *bh);
 int bh_submit_read(struct buffer_head *bh);
-loff_t page_cache_seek_hole_data(struct inode *inode, loff_t offset,
-				 loff_t length, int whence);
 
 extern int buffer_heads_over_limit;
 
-- 
2.17.0


* [PATCH 04/34] fs: remove the buffer_unwritten check in page_seek_hole_data
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 03/34] fs: move page_cache_seek_hole_data to iomap.c Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:36   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 05/34] fs: use ->is_partially_uptodate in page_cache_seek_hole_data Christoph Hellwig
                   ` (29 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

We only call into this function through the iomap iterators, so we already
know the buffer is unwritten.  In addition to that we always require the
uptodate flag that is ORed with the result anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 4a01d2f4e8e9..bef5e91d40bf 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -611,14 +611,9 @@ page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
 			continue;
 
 		/*
-		 * Unwritten extents that have data in the page cache covering
-		 * them can be identified by the BH_Unwritten state flag.
-		 * Pages with multiple buffers might have a mix of holes, data
-		 * and unwritten extents - any buffer with valid data in it
-		 * should have BH_Uptodate flag set on it.
+		 * Any buffer with valid data in it should have BH_Uptodate set.
 		 */
-
-		if ((buffer_unwritten(bh) || buffer_uptodate(bh)) == seek_data)
+		if (buffer_uptodate(bh) == seek_data)
 			return lastoff;
 
 		lastoff = offset;
@@ -630,8 +625,8 @@ page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
  * Seek for SEEK_DATA / SEEK_HOLE in the page cache.
  *
  * Within unwritten extents, the page cache determines which parts are holes
- * and which are data: unwritten and uptodate buffer heads count as data;
- * everything else counts as a hole.
+ * and which are data: uptodate buffer heads count as data; everything else
+ * counts as a hole.
  *
  * Returns the resulting offset on successs, and -ENOENT otherwise.
  */
-- 
2.17.0


* [PATCH 05/34] fs: use ->is_partially_uptodate in page_cache_seek_hole_data
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 04/34] fs: remove the buffer_unwritten check in page_seek_hole_data Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:41   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 06/34] mm: give the 'ret' variable a better name __do_page_cache_readahead Christoph Hellwig
                   ` (28 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

This way the implementation doesn't depend on buffer_head internals.
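
To make the per-block scan easier to follow, here is a userspace sketch of
the same walk over one page.  It is an illustration under assumptions, not
the kernel code: the model_* names, the fixed 4096-byte page and 1024-byte
block size are invented for the example, and a plain array of flags stands
in for what ->is_partially_uptodate would report for each block.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sizes only; real values come from the kernel and the fs. */
#define MODEL_PAGE_SIZE		4096u
#define MODEL_BLOCK_SIZE	1024u
#define MODEL_BLOCKS		(MODEL_PAGE_SIZE / MODEL_BLOCK_SIZE)

enum model_whence { MODEL_SEEK_DATA, MODEL_SEEK_HOLE };

/* Stand-in for a page cache page; NOT struct page. */
struct model_page {
	long long poff;			/* file offset of the page start */
	bool uptodate[MODEL_BLOCKS];	/* per-block "has valid data" */
};

/*
 * Returns true if the wanted kind of range (data or hole) starts at
 * *lastoff within this page.  Otherwise advances *lastoff past the blocks
 * that were examined and returns false.
 */
static bool model_page_seek(const struct model_page *page, long long *lastoff,
		enum model_whence whence)
{
	bool seek_data = (whence == MODEL_SEEK_DATA);
	unsigned int off;

	/* The caller must not pass an offset beyond this page. */
	assert(*lastoff < page->poff + MODEL_PAGE_SIZE);

	if (*lastoff < page->poff) {
		/* A gap before the page means we found a hole. */
		if (whence == MODEL_SEEK_HOLE)
			return true;
		*lastoff = page->poff;
	}

	for (off = 0; off < MODEL_PAGE_SIZE; off += MODEL_BLOCK_SIZE) {
		/* Skip blocks that end at or before the current offset. */
		if (*lastoff - page->poff >= off + MODEL_BLOCK_SIZE)
			continue;
		if (page->uptodate[off / MODEL_BLOCK_SIZE] == seek_data)
			return true;
		*lastoff = page->poff + off + MODEL_BLOCK_SIZE;
	}
	return false;
}
```

For a page at file offset 8192 whose four blocks are hole, data, data, hole,
a data seek starting at 8192 stops at 9216, and a subsequent hole seek from
there stops at 11264.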

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c | 85 +++++++++++++++++++++++++-----------------------------
 1 file changed, 39 insertions(+), 46 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index bef5e91d40bf..0900da23172c 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -589,36 +589,51 @@ int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
 }
 EXPORT_SYMBOL_GPL(iomap_fiemap);
 
-/*
- * Seek for SEEK_DATA / SEEK_HOLE within @page, starting at @lastoff.
- *
- * Returns the offset within the file on success, and -ENOENT otherwise.
- */
-static loff_t
-page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
+static bool
+page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
+		int whence)
 {
-	loff_t offset = page_offset(page);
-	struct buffer_head *bh, *head;
+	const struct address_space_operations *ops = inode->i_mapping->a_ops;
+	unsigned int bsize = i_blocksize(inode), off;
 	bool seek_data = whence == SEEK_DATA;
+	loff_t poff = page_offset(page);
 
-	if (lastoff < offset)
-		lastoff = offset;
-
-	bh = head = page_buffers(page);
-	do {
-		offset += bh->b_size;
-		if (lastoff >= offset)
-			continue;
+	if (WARN_ON_ONCE(*lastoff >= poff + PAGE_SIZE))
+		return false;
 
+	if (*lastoff < poff) {
 		/*
-		 * Any buffer with valid data in it should have BH_Uptodate set.
+		 * Last offset smaller than the start of the page means we found
+		 * a hole:
 		 */
-		if (buffer_uptodate(bh) == seek_data)
-			return lastoff;
+		if (whence == SEEK_HOLE)
+			return true;
+		*lastoff = poff;
+	}
+
+	/*
+	 * Just check the page unless we can and should check block ranges:
+	 */
+	if (bsize == PAGE_SIZE || !ops->is_partially_uptodate)
+		return PageUptodate(page) == seek_data;
 
-		lastoff = offset;
-	} while ((bh = bh->b_this_page) != head);
-	return -ENOENT;
+	lock_page(page);
+	if (unlikely(page->mapping != inode->i_mapping))
+		goto out_unlock_not_found;
+
+	for (off = 0; off < PAGE_SIZE; off += bsize) {
+		if ((*lastoff & ~PAGE_MASK) >= off + bsize)
+			continue;
+		if (ops->is_partially_uptodate(page, off, bsize) == seek_data) {
+			unlock_page(page);
+			return true;
+		}
+		*lastoff = poff + off + bsize;
+	}
+
+out_unlock_not_found:
+	unlock_page(page);
+	return false;
 }
 
 /*
@@ -655,30 +670,8 @@ page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
-			/*
-			 * At this point, the page may be truncated or
-			 * invalidated (changing page->mapping to NULL), or
-			 * even swizzled back from swapper_space to tmpfs file
-			 * mapping.  However, page->index will not change
-			 * because we have a reference on the page.
-                         *
-			 * If current page offset is beyond where we've ended,
-			 * we've found a hole.
-                         */
-			if (whence == SEEK_HOLE &&
-			    lastoff < page_offset(page))
+			if (page_seek_hole_data(inode, page, &lastoff, whence))
 				goto check_range;
-
-			lock_page(page);
-			if (likely(page->mapping == inode->i_mapping) &&
-			    page_has_buffers(page)) {
-				lastoff = page_seek_hole_data(page, lastoff, whence);
-				if (lastoff >= 0) {
-					unlock_page(page);
-					goto check_range;
-				}
-			}
-			unlock_page(page);
 			lastoff = page_offset(page) + PAGE_SIZE;
 		}
 		pagevec_release(&pvec);
-- 
2.17.0


* [PATCH 06/34] mm: give the 'ret' variable a better name __do_page_cache_readahead
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 05/34] fs: use ->is_partially_uptodate in page_cache_seek_hole_data Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:42   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 07/34] mm: return an unsigned int from __do_page_cache_readahead Christoph Hellwig
                   ` (27 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

It counts the number of pages acted on, so name it nr_pages to make that
obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 mm/readahead.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 539bbb6c1fad..16d0cb1e2616 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -156,7 +156,7 @@ int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
 	int page_idx;
-	int ret = 0;
+	int nr_pages = 0;
 	loff_t isize = i_size_read(inode);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 
@@ -187,7 +187,7 @@ int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		list_add(&page->lru, &page_pool);
 		if (page_idx == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
-		ret++;
+		nr_pages++;
 	}
 
 	/*
@@ -195,11 +195,11 @@ int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	 * uptodate then the caller will launch readpage again, and
 	 * will then handle the error.
 	 */
-	if (ret)
-		read_pages(mapping, filp, &page_pool, ret, gfp_mask);
+	if (nr_pages)
+		read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask);
 	BUG_ON(!list_empty(&page_pool));
 out:
-	return ret;
+	return nr_pages;
 }
 
 /*
-- 
2.17.0


* [PATCH 07/34] mm: return an unsigned int from __do_page_cache_readahead
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 06/34] mm: give the 'ret' variable a better name __do_page_cache_readahead Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:44   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 08/34] mm: split ->readpages calls to avoid non-contiguous pages lists Christoph Hellwig
                   ` (26 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

We never return an error, so switch to returning an unsigned int.  Most
callers already did implicit casts to an unsigned type, and the one that
didn't can be simplified now.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 mm/internal.h  |  2 +-
 mm/readahead.c | 15 +++++----------
 2 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 62d8c34e63d5..954003ac766a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -53,7 +53,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details);
 
-extern int __do_page_cache_readahead(struct address_space *mapping,
+extern unsigned int __do_page_cache_readahead(struct address_space *mapping,
 		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
 		unsigned long lookahead_size);
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 16d0cb1e2616..fa4d4b767130 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -147,16 +147,16 @@ static int read_pages(struct address_space *mapping, struct file *filp,
  *
  * Returns the number of pages requested, or the maximum amount of I/O allowed.
  */
-int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
-			pgoff_t offset, unsigned long nr_to_read,
-			unsigned long lookahead_size)
+unsigned int __do_page_cache_readahead(struct address_space *mapping,
+		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
+		unsigned long lookahead_size)
 {
 	struct inode *inode = mapping->host;
 	struct page *page;
 	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
 	int page_idx;
-	int nr_pages = 0;
+	unsigned int nr_pages = 0;
 	loff_t isize = i_size_read(inode);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 
@@ -223,16 +223,11 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
 	nr_to_read = min(nr_to_read, max_pages);
 	while (nr_to_read) {
-		int err;
-
 		unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE;
 
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
-		err = __do_page_cache_readahead(mapping, filp,
-						offset, this_chunk, 0);
-		if (err < 0)
-			return err;
+		__do_page_cache_readahead(mapping, filp, offset, this_chunk, 0);
 
 		offset += this_chunk;
 		nr_to_read -= this_chunk;
-- 
2.17.0

* [PATCH 08/34] mm: split ->readpages calls to avoid non-contiguous pages lists
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 07/34] mm: return an unsigned int from __do_page_cache_readahead Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:46   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 09/34] iomap: inline data should be an iomap type, not a flag Christoph Hellwig
                   ` (25 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

That way file systems don't have to look for non-contiguous pages
and work around them.  It also kicks off I/O earlier, allowing it to
finish earlier and reducing latency.
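
As a rough illustration of the batching idea, here is a userspace sketch
(not kernel code).  The cached[] array is a hypothetical stand-in for the
page cache lookup, and the struct and function names are made up for this
example.  A batch is flushed whenever an already-present page interrupts
the run, and once more at the end of the range:

```c
#include <stdbool.h>
#include <stddef.h>

/* Summary of what a readahead pass would have submitted. */
struct batch_stats {
	unsigned int batches;	/* number of separate submissions */
	unsigned int pages;	/* total pages submitted */
};

/*
 * Walk nr page slots.  cached[i] == true means the page is already
 * present, which ends the current contiguous batch.
 */
static struct batch_stats
split_into_batches(const bool *cached, size_t nr)
{
	struct batch_stats st = { 0, 0 };
	unsigned int nr_pages = 0;	/* pages in the current batch */
	size_t i;

	for (i = 0; i < nr; i++) {
		if (cached[i]) {
			/* Page already present: flush the current batch. */
			if (nr_pages) {
				st.batches++;
				st.pages += nr_pages;
			}
			nr_pages = 0;
			continue;
		}
		nr_pages++;
	}

	/* Flush whatever is left at the end of the range. */
	if (nr_pages) {
		st.batches++;
		st.pages += nr_pages;
	}
	return st;
}
```

Each submission then only ever covers a contiguous run of page indices,
which is the property the file system side can rely on.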

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 mm/readahead.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index fa4d4b767130..e273f0de3376 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -140,8 +140,8 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 }
 
 /*
- * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates all
- * the pages first, then submits them all for I/O. This avoids the very bad
+ * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
+ * the pages first, then submits them for I/O. This avoids the very bad
  * behaviour which would occur if page allocations are causing VM writeback.
  * We really don't want to intermingle reads and writes like that.
  *
@@ -177,8 +177,18 @@ unsigned int __do_page_cache_readahead(struct address_space *mapping,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->i_pages, page_offset);
 		rcu_read_unlock();
-		if (page && !radix_tree_exceptional_entry(page))
+		if (page && !radix_tree_exceptional_entry(page)) {
+			/*
+			 * Page already present?  Kick off the current batch of
+			 * contiguous pages before continuing with the next
+			 * batch.
+			 */
+			if (nr_pages)
+				read_pages(mapping, filp, &page_pool, nr_pages,
+						gfp_mask);
+			nr_pages = 0;
 			continue;
+		}
 
 		page = __page_cache_alloc(gfp_mask);
 		if (!page)
-- 
2.17.0

* [PATCH 09/34] iomap: inline data should be an iomap type, not a flag
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 08/34] mm: split ->readpages calls to avoid non-contiguous pages lists Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:49     ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 10/34] iomap: fix the comment describing IOMAP_NOWAIT Christoph Hellwig
                   ` (24 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Inline data is fundamentally different from our normal mapped case in that
it doesn't even have a block address.  So instead of having a flag for it
it should be an entirely separate iomap range type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/ext4/inline.c      |  4 ++--
 fs/gfs2/bmap.c        |  3 +--
 fs/iomap.c            | 21 ++++++++++++---------
 include/linux/iomap.h |  2 +-
 4 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 70cf4c7b268a..e1f00891ef95 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1835,8 +1835,8 @@ int ext4_inline_data_iomap(struct inode *inode, struct iomap *iomap)
 	iomap->offset = 0;
 	iomap->length = min_t(loff_t, ext4_get_inline_size(inode),
 			      i_size_read(inode));
-	iomap->type = 0;
-	iomap->flags = IOMAP_F_DATA_INLINE;
+	iomap->type = IOMAP_INLINE;
+	iomap->flags = 0;
 
 out:
 	up_read(&EXT4_I(inode)->xattr_sem);
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 278ed0869c3c..cbeedd3cfb36 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -680,8 +680,7 @@ static void gfs2_stuffed_iomap(struct inode *inode, struct iomap *iomap)
 		      sizeof(struct gfs2_dinode);
 	iomap->offset = 0;
 	iomap->length = i_size_read(inode);
-	iomap->type = IOMAP_MAPPED;
-	iomap->flags = IOMAP_F_DATA_INLINE;
+	iomap->type = IOMAP_INLINE;
 }
 
 /**
diff --git a/fs/iomap.c b/fs/iomap.c
index 0900da23172c..f52209a2c270 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -503,10 +503,13 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
 	case IOMAP_DELALLOC:
 		flags |= FIEMAP_EXTENT_DELALLOC | FIEMAP_EXTENT_UNKNOWN;
 		break;
+	case IOMAP_MAPPED:
+		break;
 	case IOMAP_UNWRITTEN:
 		flags |= FIEMAP_EXTENT_UNWRITTEN;
 		break;
-	case IOMAP_MAPPED:
+	case IOMAP_INLINE:
+		flags |= FIEMAP_EXTENT_DATA_INLINE;
 		break;
 	}
 
@@ -514,8 +517,6 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
 		flags |= FIEMAP_EXTENT_MERGED;
 	if (iomap->flags & IOMAP_F_SHARED)
 		flags |= FIEMAP_EXTENT_SHARED;
-	if (iomap->flags & IOMAP_F_DATA_INLINE)
-		flags |= FIEMAP_EXTENT_DATA_INLINE;
 
 	return fiemap_fill_next_extent(fi, iomap->offset,
 			iomap->addr != IOMAP_NULL_ADDR ? iomap->addr : 0,
@@ -1318,14 +1319,16 @@ static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
 	struct iomap_swapfile_info *isi = data;
 	int error;
 
-	/* No inline data. */
-	if (iomap->flags & IOMAP_F_DATA_INLINE) {
+	switch (iomap->type) {
+	case IOMAP_MAPPED:
+	case IOMAP_UNWRITTEN:
+		/* Only real or unwritten extents. */
+		break;
+	case IOMAP_INLINE:
+		/* No inline data. */
 		pr_err("swapon: file is inline\n");
 		return -EINVAL;
-	}
-
-	/* Only real or unwritten extents. */
-	if (iomap->type != IOMAP_MAPPED && iomap->type != IOMAP_UNWRITTEN) {
+	default:
 		pr_err("swapon: file has unallocated extents\n");
 		return -EINVAL;
 	}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4bd87294219a..8f7095fc514e 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -18,6 +18,7 @@ struct vm_fault;
 #define IOMAP_DELALLOC	0x02	/* delayed allocation blocks */
 #define IOMAP_MAPPED	0x03	/* blocks allocated at @addr */
 #define IOMAP_UNWRITTEN	0x04	/* blocks allocated at @addr in unwritten state */
+#define IOMAP_INLINE	0x05	/* data inline in the inode */
 
 /*
  * Flags for all iomap mappings:
@@ -34,7 +35,6 @@ struct vm_fault;
  */
 #define IOMAP_F_MERGED		0x10	/* contains multiple blocks/extents */
 #define IOMAP_F_SHARED		0x20	/* block shared with another file */
-#define IOMAP_F_DATA_INLINE	0x40	/* data inline in the inode */
 
 /*
  * Magic value for addr:
-- 
2.17.0

* [PATCH 10/34] iomap: fix the comment describing IOMAP_NOWAIT
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 09/34] iomap: inline data should be an iomap type, not a flag Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:49   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 11/34] iomap: move IOMAP_F_BOUNDARY to gfs2 Christoph Hellwig
                   ` (23 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/iomap.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8f7095fc514e..13d19b4c29a9 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -59,7 +59,7 @@ struct iomap {
 #define IOMAP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
-#define IOMAP_NOWAIT		(1 << 5) /* Don't wait for writeback */
+#define IOMAP_NOWAIT		(1 << 5) /* do not block */
 
 struct iomap_ops {
 	/*
-- 
2.17.0

* [PATCH 11/34] iomap: move IOMAP_F_BOUNDARY to gfs2
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (9 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 10/34] iomap: fix the comment describing IOMAP_NOWAIT Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:50   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 12/34] iomap: use __bio_add_page in iomap_dio_zero Christoph Hellwig
                   ` (22 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Just define a range of fs specific flags and use that in gfs2 instead of
exposing this internal flag globally.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/gfs2/bmap.c        | 8 +++++---
 include/linux/iomap.h | 9 +++++++--
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index cbeedd3cfb36..8efa6297e19c 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -683,6 +683,8 @@ static void gfs2_stuffed_iomap(struct inode *inode, struct iomap *iomap)
 	iomap->type = IOMAP_INLINE;
 }
 
+#define IOMAP_F_GFS2_BOUNDARY IOMAP_F_PRIVATE
+
 /**
  * gfs2_iomap_begin - Map blocks from an inode to disk blocks
  * @inode: The inode
@@ -774,7 +776,7 @@ int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 	bh = mp.mp_bh[ip->i_height - 1];
 	len = gfs2_extent_length(bh->b_data, bh->b_size, ptr, lend - lblock, &eob);
 	if (eob)
-		iomap->flags |= IOMAP_F_BOUNDARY;
+		iomap->flags |= IOMAP_F_GFS2_BOUNDARY;
 	iomap->length = (u64)len << inode->i_blkbits;
 
 out_release:
@@ -846,12 +848,12 @@ int gfs2_block_map(struct inode *inode, sector_t lblock,
 
 	if (iomap.length > bh_map->b_size) {
 		iomap.length = bh_map->b_size;
-		iomap.flags &= ~IOMAP_F_BOUNDARY;
+		iomap.flags &= ~IOMAP_F_GFS2_BOUNDARY;
 	}
 	if (iomap.addr != IOMAP_NULL_ADDR)
 		map_bh(bh_map, inode->i_sb, iomap.addr >> inode->i_blkbits);
 	bh_map->b_size = iomap.length;
-	if (iomap.flags & IOMAP_F_BOUNDARY)
+	if (iomap.flags & IOMAP_F_GFS2_BOUNDARY)
 		set_buffer_boundary(bh_map);
 	if (iomap.flags & IOMAP_F_NEW)
 		set_buffer_new(bh_map);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 13d19b4c29a9..819e0cd2a950 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -27,8 +27,7 @@ struct vm_fault;
  * written data and requires fdatasync to commit them to persistent storage.
  */
 #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
-#define IOMAP_F_BOUNDARY	0x02	/* mapping ends at metadata boundary */
-#define IOMAP_F_DIRTY		0x04	/* uncommitted metadata */
+#define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests:
@@ -36,6 +35,12 @@ struct vm_fault;
 #define IOMAP_F_MERGED		0x10	/* contains multiple blocks/extents */
 #define IOMAP_F_SHARED		0x20	/* block shared with another file */
 
+/*
+ * Flags from 0x1000 up are for file system specific usage:
+ */
+#define IOMAP_F_PRIVATE		0x1000
+
+
 /*
  * Magic value for addr:
  */
-- 
2.17.0

* [PATCH 12/34] iomap: use __bio_add_page in iomap_dio_zero
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (10 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 11/34] iomap: move IOMAP_F_BOUNDARY to gfs2 Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:51   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 13/34] iomap: add a iomap_sector helper Christoph Hellwig
                   ` (21 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

We don't need any merging logic, and this also replaces a BUG_ON with a
WARN_ON_ONCE inside __bio_add_page for the impossible overflow condition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index f52209a2c270..8e28f25f086f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -949,8 +949,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
 	get_page(page);
-	if (bio_add_page(bio, page, len, 0) != len)
-		BUG();
+	__bio_add_page(bio, page, len, 0);
 	bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
 
 	atomic_inc(&dio->ref);
-- 
2.17.0

* [PATCH 13/34] iomap: add a iomap_sector helper
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (11 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 12/34] iomap: use __bio_add_page in iomap_dio_zero Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:52   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 14/34] iomap: add an iomap-based bmap implementation Christoph Hellwig
                   ` (20 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Factor the repeated calculation of the on-disk sector for a given logical
block into a little helper.
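
For readers who want to check the arithmetic, a minimal userspace sketch
follows.  struct sketch_iomap is a hypothetical, trimmed-down stand-in for
struct iomap, and SKETCH_SECTOR_SHIFT is assumed to be 9 (512-byte
sectors):

```c
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT	9	/* assumed: 512-byte sectors */

/* Hypothetical stand-in carrying only the fields the helper needs. */
struct sketch_iomap {
	uint64_t addr;		/* disk offset of the mapping, in bytes */
	int64_t offset;		/* file offset of the mapping, in bytes */
};

/*
 * Translate a file position inside the mapping into a sector number:
 * take the byte distance from the start of the mapping, add the disk
 * address of the mapping, and convert bytes to sectors.
 */
static uint64_t
sketch_iomap_sector(const struct sketch_iomap *iomap, int64_t pos)
{
	return (iomap->addr + pos - iomap->offset) >> SKETCH_SECTOR_SHIFT;
}
```

For example, a mapping at disk byte 1048576 that covers the file from
offset 8192 places file position 12288 at sector 2056.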

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 8e28f25f086f..f928df4ab9a9 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -97,6 +97,12 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	return written ? written : ret;
 }
 
+static sector_t
+iomap_sector(struct iomap *iomap, loff_t pos)
+{
+	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
+}
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -354,11 +360,8 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
 static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
 		struct iomap *iomap)
 {
-	sector_t sector = (iomap->addr +
-			   (pos & PAGE_MASK) - iomap->offset) >> 9;
-
-	return __dax_zero_page_range(iomap->bdev, iomap->dax_dev, sector,
-			offset, bytes);
+	return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
+			iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
 }
 
 static loff_t
@@ -943,8 +946,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 
 	bio = bio_alloc(GFP_KERNEL, 1);
 	bio_set_dev(bio, iomap->bdev);
-	bio->bi_iter.bi_sector =
-		(iomap->addr + pos - iomap->offset) >> 9;
+	bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
@@ -1038,8 +1040,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		bio = bio_alloc(GFP_KERNEL, nr_pages);
 		bio_set_dev(bio, iomap->bdev);
-		bio->bi_iter.bi_sector =
-			(iomap->addr + pos - iomap->offset) >> 9;
+		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
 		bio->bi_write_hint = dio->iocb->ki_hint;
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
-- 
2.17.0

* [PATCH 14/34] iomap: add an iomap-based bmap implementation
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (12 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 13/34] iomap: add a iomap_sector helper Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  5:54   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 15/34] iomap: add an iomap-based readpage and readpages implementation Christoph Hellwig
                   ` (19 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

This adds a simple iomap-based implementation of the legacy ->bmap
interface.  Note that we can't easily add checks for rt or reflink
files, so these will have to remain in the callers.  This interface
just needs to die..
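
A userspace sketch of the lookup, for illustration only.  It assumes the
logical block number is turned into a byte offset by shifting left by the
block size bits; the struct, enum and function names are hypothetical and
not part of the patch:

```c
#include <limits.h>
#include <stdint.h>

enum sketch_type { SKETCH_HOLE, SKETCH_MAPPED };

/* Hypothetical stand-in for the mapping returned for one block. */
struct sketch_extent {
	enum sketch_type type;
	int64_t offset;		/* file offset of the extent, in bytes */
	uint64_t addr;		/* disk address of the extent, in bytes */
};

/*
 * Returns the on-disk block number for logical block bno, or 0 when
 * there is no usable answer (0 doubles as the error return).
 */
static uint64_t
sketch_bmap(const struct sketch_extent *ext, uint64_t bno, unsigned blkbits)
{
	int64_t pos = (int64_t)(bno << blkbits);	/* block -> bytes */
	uint64_t addr;

	if (ext->type != SKETCH_MAPPED)
		return 0;

	addr = (pos - ext->offset + ext->addr) >> blkbits;
	if (addr > INT_MAX)
		return 0;	/* result would not fit the legacy interface */
	return addr;
}
```

Holes and results that would be truncated both come back as 0, which
matches the "0 is the error return" convention of the legacy interface.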

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c            | 34 ++++++++++++++++++++++++++++++++++
 include/linux/iomap.h |  3 +++
 2 files changed, 37 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index f928df4ab9a9..fa278ed338ce 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1411,3 +1411,37 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
 }
 EXPORT_SYMBOL_GPL(iomap_swapfile_activate);
 #endif /* CONFIG_SWAP */
+
+static loff_t
+iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
+		void *data, struct iomap *iomap)
+{
+	sector_t *bno = data, addr;
+
+	if (iomap->type == IOMAP_MAPPED) {
+		addr = (pos - iomap->offset + iomap->addr) >> inode->i_blkbits;
+		if (addr > INT_MAX)
+			WARN(1, "would truncate bmap result\n");
+		else
+			*bno = addr;
+	}
+	return 0;
+}
+
+/* legacy ->bmap interface.  0 is the error return (!) */
+sector_t
+iomap_bmap(struct address_space *mapping, sector_t bno,
+		const struct iomap_ops *ops)
+{
+	struct inode *inode = mapping->host;
+	loff_t pos = bno << inode->i_blkbits;
+	unsigned blocksize = i_blocksize(inode);
+
+	if (filemap_write_and_wait(mapping))
+		return 0;
+
+	bno = 0;
+	iomap_apply(inode, pos, blocksize, 0, ops, &bno, iomap_bmap_actor);
+	return bno;
+}
+EXPORT_SYMBOL_GPL(iomap_bmap);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 819e0cd2a950..a044a824da85 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -4,6 +4,7 @@
 
 #include <linux/types.h>
 
+struct address_space;
 struct fiemap_extent_info;
 struct inode;
 struct iov_iter;
@@ -100,6 +101,8 @@ loff_t iomap_seek_hole(struct inode *inode, loff_t offset,
 		const struct iomap_ops *ops);
 loff_t iomap_seek_data(struct inode *inode, loff_t offset,
 		const struct iomap_ops *ops);
+sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
+		const struct iomap_ops *ops);
 
 /*
  * Flags for direct I/O ->end_io:
-- 
2.17.0

* [PATCH 15/34] iomap: add an iomap-based readpage and readpages implementation
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (13 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 14/34] iomap: add an iomap-based bmap implementation Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  6:11   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 16/34] iomap: add initial support for writes without buffer heads Christoph Hellwig
                   ` (18 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Simply use iomap_apply to iterate over the file and submit a bio for
each non-uptodate but mapped region, and zero everything else.  Note that
as-is this cannot be used for file systems with a blocksize smaller than
the page size, but that support will be added later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c            | 194 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/iomap.h |   4 +
 2 files changed, 197 insertions(+), 1 deletion(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index fa278ed338ce..78259a2249f4 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (C) 2010 Red Hat, Inc.
- * Copyright (c) 2016 Christoph Hellwig.
+ * Copyright (c) 2016-2018 Christoph Hellwig.
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
+#include <linux/mm_inline.h>
 #include <linux/swap.h>
 #include <linux/pagemap.h>
 #include <linux/pagevec.h>
@@ -103,6 +104,197 @@ iomap_sector(struct iomap *iomap, loff_t pos)
 	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
 }
 
+static void
+iomap_read_end_io(struct bio *bio)
+{
+	int error = blk_status_to_errno(bio->bi_status);
+	struct bio_vec *bvec;
+	int i;
+
+	bio_for_each_segment_all(bvec, bio, i)
+		page_endio(bvec->bv_page, false, error);
+	bio_put(bio);
+}
+
+static struct bio *
+iomap_read_bio_alloc(struct iomap *iomap, sector_t sector, loff_t length)
+{
+	int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	struct bio *bio = bio_alloc(GFP_NOFS, min(BIO_MAX_PAGES, nr_vecs));
+
+	bio->bi_opf = REQ_OP_READ;
+	bio->bi_iter.bi_sector = sector;
+	bio_set_dev(bio, iomap->bdev);
+	bio->bi_end_io = iomap_read_end_io;
+	return bio;
+}
+
+struct iomap_readpage_ctx {
+	struct page		*cur_page;
+	bool			cur_page_in_bio;
+	struct bio		*bio;
+	struct list_head	*pages;
+};
+
+static loff_t
+iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+		struct iomap *iomap)
+{
+	struct iomap_readpage_ctx *ctx = data;
+	struct page *page = ctx->cur_page;
+	unsigned poff = pos & (PAGE_SIZE - 1);
+	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+	bool is_contig = false;
+	sector_t sector;
+
+	/* we don't support blocksize < PAGE_SIZE quite yet: */
+	WARN_ON_ONCE(pos != page_offset(page));
+	WARN_ON_ONCE(plen != PAGE_SIZE);
+
+	if (iomap->type != IOMAP_MAPPED || pos >= i_size_read(inode)) {
+		zero_user(page, poff, plen);
+		SetPageUptodate(page);
+		goto done;
+	}
+
+	ctx->cur_page_in_bio = true;
+
+	/*
+	 * Try to merge into a previous segment if we can.
+	 */
+	sector = iomap_sector(iomap, pos);
+	if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
+		if (__bio_try_merge_page(ctx->bio, page, plen, poff))
+			goto done;
+		is_contig = true;
+	}
+
+	if (!ctx->bio || !is_contig || bio_full(ctx->bio)) {
+		if (ctx->bio)
+			submit_bio(ctx->bio);
+		ctx->bio = iomap_read_bio_alloc(iomap, sector, length);
+	}
+
+	__bio_add_page(ctx->bio, page, plen, poff);
+done:
+	return plen;
+}
+
+int
+iomap_readpage(struct page *page, const struct iomap_ops *ops)
+{
+	struct iomap_readpage_ctx ctx = { .cur_page = page };
+	struct inode *inode = page->mapping->host;
+	unsigned poff;
+	loff_t ret;
+
+	WARN_ON_ONCE(page_has_buffers(page));
+
+	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
+		ret = iomap_apply(inode, page_offset(page) + poff,
+				PAGE_SIZE - poff, 0, ops, &ctx,
+				iomap_readpage_actor);
+		if (ret <= 0) {
+			WARN_ON_ONCE(ret == 0);
+			SetPageError(page);
+			break;
+		}
+	}
+
+	if (ctx.bio) {
+		submit_bio(ctx.bio);
+		WARN_ON_ONCE(!ctx.cur_page_in_bio);
+	} else {
+		WARN_ON_ONCE(ctx.cur_page_in_bio);
+		unlock_page(page);
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iomap_readpage);
+
+static struct page *
+iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
+		loff_t length, loff_t *done)
+{
+	while (!list_empty(pages)) {
+		struct page *page = lru_to_page(pages);
+
+		if (page_offset(page) >= (u64)pos + length)
+			break;
+
+		list_del(&page->lru);
+		if (!add_to_page_cache_lru(page, inode->i_mapping, page->index,
+				GFP_NOFS))
+			return page;
+
+		*done += PAGE_SIZE;
+		put_page(page);
+	}
+
+	return NULL;
+}
+
+static loff_t
+iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
+		void *data, struct iomap *iomap)
+{
+	struct iomap_readpage_ctx *ctx = data;
+	loff_t done, ret;
+
+	for (done = 0; done < length; done += ret) {
+		if (ctx->cur_page && ((pos + done) & (PAGE_SIZE - 1)) == 0) {
+			if (!ctx->cur_page_in_bio)
+				unlock_page(ctx->cur_page);
+			put_page(ctx->cur_page);
+			ctx->cur_page = NULL;
+		}
+		if (!ctx->cur_page) {
+			ctx->cur_page = iomap_next_page(inode, ctx->pages,
+					pos, length, &done);
+			if (!ctx->cur_page)
+				break;
+			ctx->cur_page_in_bio = false;
+		}
+		ret = iomap_readpage_actor(inode, pos + done, length - done,
+				ctx, iomap);
+	}
+
+	return done;
+}
+
+int
+iomap_readpages(struct address_space *mapping, struct list_head *pages,
+		unsigned nr_pages, const struct iomap_ops *ops)
+{
+	struct iomap_readpage_ctx ctx = { .pages = pages };
+	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
+	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
+	loff_t length = last - pos + PAGE_SIZE, ret = 0;
+
+	while (length > 0) {
+		ret = iomap_apply(mapping->host, pos, length, 0, ops,
+				&ctx, iomap_readpages_actor);
+		if (ret <= 0) {
+			WARN_ON_ONCE(ret == 0);
+			goto done;
+		}
+		pos += ret;
+		length -= ret;
+	}
+	ret = 0;
+done:
+	if (ctx.bio)
+		submit_bio(ctx.bio);
+	if (ctx.cur_page) {
+		if (!ctx.cur_page_in_bio)
+			unlock_page(ctx.cur_page);
+		put_page(ctx.cur_page);
+	}
+	WARN_ON_ONCE(!ret && !list_empty(ctx.pages));
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iomap_readpages);
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index a044a824da85..7300d30ca495 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -9,6 +9,7 @@ struct fiemap_extent_info;
 struct inode;
 struct iov_iter;
 struct kiocb;
+struct page;
 struct vm_area_struct;
 struct vm_fault;
 
@@ -88,6 +89,9 @@ struct iomap_ops {
 
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
+int iomap_readpage(struct page *page, const struct iomap_ops *ops);
+int iomap_readpages(struct address_space *mapping, struct list_head *pages,
+		unsigned nr_pages, const struct iomap_ops *ops);
 int iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
-- 
2.17.0

* [PATCH 16/34] iomap: add initial support for writes without buffer heads
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (14 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 15/34] iomap: add an iomap-based readpage and readpages implementation Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  6:21   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 17/34] xfs: use iomap_bmap Christoph Hellwig
                   ` (17 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

For now this is limited to blocksize == PAGE_SIZE, where we can simply read
in the full page in write begin, and just set the whole page dirty after
copying data into it.  This code is enabled by default and XFS will now
be fed pages without buffer heads in ->writepage and ->writepages.

If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap the old
path will still be used.  This both helps the transition in XFS and
prepares for the gfs2 migration to the iomap infrastructure.
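
To illustrate when write begin needs to read the page first, here is a
userspace sketch of the range arithmetic.  The struct and function names
are hypothetical, and the uptodate argument stands in for the
PageUptodate() check:

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte ranges within one page, as computed for a buffered write. */
struct sketch_write_range {
	unsigned poff;		/* start of the block-aligned range in the page */
	unsigned plen;		/* length of the block-aligned range */
	unsigned from;		/* first byte the write touches */
	unsigned to;		/* one past the last byte the write touches */
	bool need_read;		/* must the range be read or zeroed first? */
};

static struct sketch_write_range
sketch_compute_write_range(int64_t pos, unsigned len, int64_t block_size,
		unsigned page_size, bool uptodate)
{
	struct sketch_write_range r;
	int64_t block_start = pos & ~(block_size - 1);
	int64_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
	int64_t span = block_end - block_start;
	int64_t left;

	r.poff = (unsigned)(block_start & (page_size - 1));
	left = (int64_t)page_size - r.poff;
	r.plen = (unsigned)(span < left ? span : left);
	r.from = (unsigned)(pos & (page_size - 1));
	r.to = r.from + len;

	/*
	 * No read is needed if the page is already uptodate, or if the
	 * write covers the whole block-aligned range.
	 */
	r.need_read = !uptodate &&
		!(r.from <= r.poff && r.to >= r.poff + r.plen);
	return r;
}
```

With 4096-byte blocks and pages, a 200 byte write at file offset 4196
gives from = 100 and to = 300 inside a range of poff = 0, plen = 4096, so
the page has to be read first, while a full 4096 byte write at offset 8192
does not.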

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c            | 129 ++++++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_iomap.c    |   6 +-
 include/linux/iomap.h |   2 +
 3 files changed, 124 insertions(+), 13 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 78259a2249f4..debb859a8a14 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -308,6 +308,49 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 		truncate_pagecache_range(inode, max(pos, i_size), pos + len);
 }
 
+static int
+iomap_read_page_sync(struct inode *inode, loff_t block_start, struct page *page,
+		unsigned poff, unsigned plen, unsigned from, unsigned to,
+		struct iomap *iomap)
+{
+	struct bio_vec bvec;
+	struct bio bio;
+
+	if (iomap->type != IOMAP_MAPPED || block_start >= i_size_read(inode)) {
+		zero_user_segments(page, poff, from, to, poff + plen);
+		return 0;
+	}
+
+	bio_init(&bio, &bvec, 1);
+	bio.bi_opf = REQ_OP_READ;
+	bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
+	bio_set_dev(&bio, iomap->bdev);
+	__bio_add_page(&bio, page, plen, poff);
+	return submit_bio_wait(&bio);
+}
+
+static int
+__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
+		struct page *page, struct iomap *iomap)
+{
+	loff_t block_size = i_blocksize(inode);
+	loff_t block_start = pos & ~(block_size - 1);
+	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
+	unsigned poff = block_start & (PAGE_SIZE - 1);
+	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, block_end - block_start);
+	unsigned from = pos & (PAGE_SIZE - 1);
+	unsigned to = from + len;
+
+	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
+
+	if (PageUptodate(page))
+		return 0;
+	if (from <= poff && to >= poff + plen)
+		return 0;
+	return iomap_read_page_sync(inode, block_start, page,
+			poff, plen, from, to, iomap);
+}
+
 static int
 iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, struct iomap *iomap)
@@ -325,7 +368,10 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 	if (!page)
 		return -ENOMEM;
 
-	status = __block_write_begin_int(page, pos, len, NULL, iomap);
+	if (iomap->flags & IOMAP_F_BUFFER_HEAD)
+		status = __block_write_begin_int(page, pos, len, NULL, iomap);
+	else
+		status = __iomap_write_begin(inode, pos, len, page, iomap);
 	if (unlikely(status)) {
 		unlock_page(page);
 		put_page(page);
@@ -338,14 +384,69 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 	return status;
 }
 
+int
+iomap_set_page_dirty(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+	int newly_dirty;
+
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
+
+	/*
+	 * Lock out page->mem_cgroup migration to keep PageDirty
+	 * synchronized with per-memcg dirty page counters.
+	 */
+	lock_page_memcg(page);
+	newly_dirty = !TestSetPageDirty(page);
+	if (newly_dirty)
+		__set_page_dirty(page, mapping, 0);
+	unlock_page_memcg(page);
+
+	if (newly_dirty)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	return newly_dirty;
+}
+EXPORT_SYMBOL_GPL(iomap_set_page_dirty);
+
+static int
+__iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
+		unsigned copied, struct page *page, struct iomap *iomap)
+{
+	flush_dcache_page(page);
+
+	/*
+	 * The blocks that were entirely written will now be uptodate, so we
+	 * don't have to worry about a readpage reading them and overwriting a
+	 * partial write.  However if we have encountered a short write and only
+	 * partially written into a block, it will not be marked uptodate, so a
+	 * readpage might come in and destroy our partial write.
+	 *
+	 * Do the simplest thing, and just treat any short write to a non
+	 * uptodate page as a zero-length write, and force the caller to redo
+	 * the whole thing.
+	 */
+	if (unlikely(copied < len && !PageUptodate(page))) {
+		copied = 0;
+	} else {
+		SetPageUptodate(page);
+		iomap_set_page_dirty(page);
+	}
+	return __generic_write_end(inode, pos, copied, page);
+}
+
 static int
 iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
-		unsigned copied, struct page *page)
+		unsigned copied, struct page *page, struct iomap *iomap)
 {
 	int ret;
 
-	ret = generic_write_end(NULL, inode->i_mapping, pos, len,
-			copied, page, NULL);
+	if (iomap->flags & IOMAP_F_BUFFER_HEAD)
+		ret = generic_write_end(NULL, inode->i_mapping, pos, len,
+				copied, page, NULL);
+	else
+		ret = __iomap_write_end(inode, pos, len, copied, page, iomap);
+
 	if (ret < len)
 		iomap_write_failed(inode, pos, len);
 	return ret;
@@ -400,7 +501,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 		flush_dcache_page(page);
 
-		status = iomap_write_end(inode, pos, bytes, copied, page);
+		status = iomap_write_end(inode, pos, bytes, copied, page,
+				iomap);
 		if (unlikely(status < 0))
 			break;
 		copied = status;
@@ -494,7 +596,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 		WARN_ON_ONCE(!PageUptodate(page));
 
-		status = iomap_write_end(inode, pos, bytes, bytes, page);
+		status = iomap_write_end(inode, pos, bytes, bytes, page, iomap);
 		if (unlikely(status <= 0)) {
 			if (WARN_ON_ONCE(status == 0))
 				return -EIO;
@@ -546,7 +648,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
 	zero_user(page, offset, bytes);
 	mark_page_accessed(page);
 
-	return iomap_write_end(inode, pos, bytes, bytes, page);
+	return iomap_write_end(inode, pos, bytes, bytes, page, iomap);
 }
 
 static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
@@ -632,11 +734,16 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct page *page = data;
 	int ret;
 
-	ret = __block_write_begin_int(page, pos, length, NULL, iomap);
-	if (ret)
-		return ret;
+	if (iomap->flags & IOMAP_F_BUFFER_HEAD) {
+		ret = __block_write_begin_int(page, pos, length, NULL, iomap);
+		if (ret)
+			return ret;
+		block_commit_write(page, 0, length);
+	} else {
+		WARN_ON_ONCE(!PageUptodate(page));
+		WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
+	}
 
-	block_commit_write(page, 0, length);
 	return length;
 }
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index c6ce6f9335b6..da6d1995e460 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -638,7 +638,7 @@ xfs_file_iomap_begin_delay(
 	 * Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
 	 * them out if the write happens to fail.
 	 */
-	iomap->flags = IOMAP_F_NEW;
+	iomap->flags |= IOMAP_F_NEW;
 	trace_xfs_iomap_alloc(ip, offset, count, 0, &got);
 done:
 	if (isnullstartblock(got.br_startblock))
@@ -1031,6 +1031,8 @@ xfs_file_iomap_begin(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
+	iomap->flags |= IOMAP_F_BUFFER_HEAD;
+
 	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
 			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
 		/* Reserve delalloc blocks for regular writeback. */
@@ -1131,7 +1133,7 @@ xfs_file_iomap_begin(
 	if (error)
 		return error;
 
-	iomap->flags = IOMAP_F_NEW;
+	iomap->flags |= IOMAP_F_NEW;
 	trace_xfs_iomap_alloc(ip, offset, length, 0, &imap);
 
 out_finish:
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7300d30ca495..4d3d9d0cd69f 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -30,6 +30,7 @@ struct vm_fault;
  */
 #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
 #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
+#define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests:
@@ -92,6 +93,7 @@ ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 int iomap_readpage(struct page *page, const struct iomap_ops *ops);
 int iomap_readpages(struct address_space *mapping, struct list_head *pages,
 		unsigned nr_pages, const struct iomap_ops *ops);
+int iomap_set_page_dirty(struct page *page);
 int iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
-- 
2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 17/34] xfs: use iomap_bmap
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (15 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 16/34] iomap: add initial support for writes without buffer heads Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  6:14   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 18/34] xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages Christoph Hellwig
                   ` (16 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Switch to the iomap based bmap implementation to get rid of one of the
last users of xfs_get_blocks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 80de476cecf8..56e405572909 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1378,10 +1378,9 @@ xfs_vm_bmap(
 	struct address_space	*mapping,
 	sector_t		block)
 {
-	struct inode		*inode = (struct inode *)mapping->host;
-	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_inode	*ip = XFS_I(mapping->host);
 
-	trace_xfs_vm_bmap(XFS_I(inode));
+	trace_xfs_vm_bmap(ip);
 
 	/*
 	 * The swap code (ab-)uses ->bmap to get a block mapping and then
@@ -1394,9 +1393,7 @@ xfs_vm_bmap(
 	 */
 	if (xfs_is_reflink_inode(ip) || XFS_IS_REALTIME_INODE(ip))
 		return 0;
-
-	filemap_write_and_wait(mapping);
-	return generic_block_bmap(mapping, block, xfs_get_blocks);
+	return iomap_bmap(mapping, block, &xfs_iomap_ops);
 }
 
 STATIC int
-- 
2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 18/34] xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (16 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 17/34] xfs: use iomap_bmap Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-30  6:22   ` Darrick J. Wong
  2018-05-23 14:43 ` [PATCH 19/34] xfs: simplify xfs_bmap_punch_delalloc_range Christoph Hellwig
                   ` (15 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

For file systems with a block size that equals the page size we never do
partial reads, so we can use the buffer_head-less iomap versions of
readpage and readpages without conflicting with the buffer_head structures
created later in write_begin.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 56e405572909..c631c457b444 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1402,6 +1402,8 @@ xfs_vm_readpage(
 	struct page		*page)
 {
 	trace_xfs_vm_readpage(page->mapping->host, 1);
+	if (i_blocksize(page->mapping->host) == PAGE_SIZE)
+		return iomap_readpage(page, &xfs_iomap_ops);
 	return mpage_readpage(page, xfs_get_blocks);
 }
 
@@ -1413,6 +1415,8 @@ xfs_vm_readpages(
 	unsigned		nr_pages)
 {
 	trace_xfs_vm_readpages(mapping->host, nr_pages);
+	if (i_blocksize(mapping->host) == PAGE_SIZE)
+		return iomap_readpages(mapping, pages, nr_pages, &xfs_iomap_ops);
 	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
 }
 
-- 
2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 19/34] xfs: simplify xfs_bmap_punch_delalloc_range
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (17 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 18/34] xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-23 16:17   ` Brian Foster
  2018-05-23 14:43 ` [PATCH 20/34] xfs: simplify xfs_aops_discard_page Christoph Hellwig
                   ` (14 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Instead of using xfs_bmapi_read to find delalloc extents and then punch
them out using xfs_bunmapi, opencode the loop to iterate over the extents
and call xfs_bmap_del_extent_delay directly.  This both simplifies the
code and reduces the number of extent tree lookups required.
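
As a rough illustration only, the single-pass walk described above can be
sketched in plain userspace C.  The struct ext type and the
punch_delalloc_blocks() helper are hypothetical stand-ins, not the XFS
extent tree API, and the sketch merely counts the blocks that would be
punched rather than removing them:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an in-core extent record. */
struct ext {
	uint64_t	startoff;	/* first file block covered */
	uint64_t	blockcount;	/* number of blocks covered */
	bool		delalloc;	/* extent is a delayed allocation */
};

/*
 * Walk a sorted extent array once and count the delalloc blocks that
 * fall inside [start, start + length).  Real code would remove the
 * trimmed range instead of counting it.
 */
static uint64_t
punch_delalloc_blocks(const struct ext *exts, size_t nr,
		uint64_t start, uint64_t length)
{
	uint64_t	end = start + length;
	uint64_t	punched = 0;
	size_t		i;

	for (i = 0; i < nr; i++) {
		uint64_t	lo = exts[i].startoff;
		uint64_t	hi = lo + exts[i].blockcount;

		if (hi <= start)
			continue;	/* entirely before the range */
		if (lo >= end)
			break;		/* sorted, nothing further overlaps */
		if (!exts[i].delalloc)
			continue;	/* real extents are left alone */

		/* Trim the extent to the requested range. */
		if (lo < start)
			lo = start;
		if (hi > end)
			hi = end;
		punched += hi - lo;
	}
	return punched;
}
```

Each extent is visited at most once, instead of doing one lookup per
block in the range.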

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_bmap_util.c | 78 ++++++++++++++----------------------------
 1 file changed, 25 insertions(+), 53 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 06badcbadeb4..c009bdf9fdce 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -695,12 +695,10 @@ xfs_getbmap(
 }
 
 /*
- * dead simple method of punching delalyed allocation blocks from a range in
- * the inode. Walks a block at a time so will be slow, but is only executed in
- * rare error cases so the overhead is not critical. This will always punch out
- * both the start and end blocks, even if the ranges only partially overlap
- * them, so it is up to the caller to ensure that partial blocks are not
- * passed in.
+ * Dead simple method of punching delayed allocation blocks from a range in
+ * the inode.  This will always punch out both the start and end blocks, even
+ * if the ranges only partially overlap them, so it is up to the caller to
+ * ensure that partial blocks are not passed in.
  */
 int
 xfs_bmap_punch_delalloc_range(
@@ -708,63 +706,37 @@ xfs_bmap_punch_delalloc_range(
 	xfs_fileoff_t		start_fsb,
 	xfs_fileoff_t		length)
 {
-	xfs_fileoff_t		remaining = length;
+	struct xfs_ifork	*ifp = &ip->i_df;
+	struct xfs_bmbt_irec	got, del;
+	struct xfs_iext_cursor	icur;
 	int			error = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
-	do {
-		int		done;
-		xfs_bmbt_irec_t	imap;
-		int		nimaps = 1;
-		xfs_fsblock_t	firstblock;
-		struct xfs_defer_ops dfops;
+	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
+		error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
+		if (error)
+			return error;
+	}
 
-		/*
-		 * Map the range first and check that it is a delalloc extent
-		 * before trying to unmap the range. Otherwise we will be
-		 * trying to remove a real extent (which requires a
-		 * transaction) or a hole, which is probably a bad idea...
-		 */
-		error = xfs_bmapi_read(ip, start_fsb, 1, &imap, &nimaps,
-				       XFS_BMAPI_ENTIRE);
+	if (!xfs_iext_lookup_extent(ip, ifp, start_fsb, &icur, &got))
+		return 0;
 
-		if (error) {
-			/* something screwed, just bail */
-			if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
-				xfs_alert(ip->i_mount,
-			"Failed delalloc mapping lookup ino %lld fsb %lld.",
-						ip->i_ino, start_fsb);
-			}
+	do {
+		if (got.br_startoff >= start_fsb + length)
 			break;
-		}
-		if (!nimaps) {
-			/* nothing there */
-			goto next_block;
-		}
-		if (imap.br_startblock != DELAYSTARTBLOCK) {
-			/* been converted, ignore */
-			goto next_block;
-		}
-		WARN_ON(imap.br_blockcount == 0);
+		if (!isnullstartblock(got.br_startblock))
+			continue;
 
-		/*
-		 * Note: while we initialise the firstblock/dfops pair, they
-		 * should never be used because blocks should never be
-		 * allocated or freed for a delalloc extent and hence we need
-		 * don't cancel or finish them after the xfs_bunmapi() call.
-		 */
-		xfs_defer_init(&dfops, &firstblock);
-		error = xfs_bunmapi(NULL, ip, start_fsb, 1, 0, 1, &firstblock,
-					&dfops, &done);
+		del = got;
+		xfs_trim_extent(&del, start_fsb, length);
+		error = xfs_bmap_del_extent_delay(ip, XFS_DATA_FORK, &icur,
+				&got, &del);
 		if (error)
 			break;
-
-		ASSERT(!xfs_defer_has_unfinished_work(&dfops));
-next_block:
-		start_fsb++;
-		remaining--;
-	} while(remaining > 0);
+		if (!xfs_iext_get_extent(ifp, &icur, &got))
+			break;
+	} while (xfs_iext_next_extent(ifp, &icur, &got));
 
 	return error;
 }
-- 
2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 20/34] xfs: simplify xfs_aops_discard_page
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (18 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 19/34] xfs: simplify xfs_bmap_punch_delalloc_range Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 21/34] xfs: move locking into xfs_bmap_punch_delalloc_range Christoph Hellwig
                   ` (13 subsequent siblings)
  33 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Instead of looking at the buffer heads to see if a block is delalloc just
call xfs_bmap_punch_delalloc_range on the whole page - this will leave
any non-delalloc block intact and handle the iteration for us.  As a side
effect one more place stops caring about buffer heads and we can remove the
xfs_check_page_type function entirely.
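
For illustration, the range handed to the punch helper for one page is
simple arithmetic.  The sketch below is userspace C with hypothetical
names (struct fsb_range, page_to_fsb_range) and assumes the block size is
non-zero and no larger than the page size:

```c
#include <stdint.h>

/* Hypothetical result type: a range expressed in filesystem blocks. */
struct fsb_range {
	uint64_t	start_fsb;	/* first block backing the page */
	uint64_t	count;		/* number of blocks in the page */
};

/*
 * Convert a page's byte offset in the file into the block range that
 * backs it.  Assumes 0 < block_size <= page_size.
 */
static struct fsb_range
page_to_fsb_range(uint64_t page_offset, unsigned int page_size,
		unsigned int block_size)
{
	struct fsb_range	r;

	r.start_fsb = page_offset / block_size;	/* rounds down */
	r.count = page_size / block_size;
	return r;
}
```

With 4096-byte pages and 1024-byte blocks, the page at byte offset 8192
maps to four blocks starting at block 8.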

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 85 +++++------------------------------------------
 1 file changed, 9 insertions(+), 76 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index c631c457b444..f2333e351e07 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -711,49 +711,6 @@ xfs_map_at_offset(
 	clear_buffer_unwritten(bh);
 }
 
-/*
- * Test if a given page contains at least one buffer of a given @type.
- * If @check_all_buffers is true, then we walk all the buffers in the page to
- * try to find one of the type passed in. If it is not set, then the caller only
- * needs to check the first buffer on the page for a match.
- */
-STATIC bool
-xfs_check_page_type(
-	struct page		*page,
-	unsigned int		type,
-	bool			check_all_buffers)
-{
-	struct buffer_head	*bh;
-	struct buffer_head	*head;
-
-	if (PageWriteback(page))
-		return false;
-	if (!page->mapping)
-		return false;
-	if (!page_has_buffers(page))
-		return false;
-
-	bh = head = page_buffers(page);
-	do {
-		if (buffer_unwritten(bh)) {
-			if (type == XFS_IO_UNWRITTEN)
-				return true;
-		} else if (buffer_delay(bh)) {
-			if (type == XFS_IO_DELALLOC)
-				return true;
-		} else if (buffer_dirty(bh) && buffer_mapped(bh)) {
-			if (type == XFS_IO_OVERWRITE)
-				return true;
-		}
-
-		/* If we are only checking the first buffer, we are done now. */
-		if (!check_all_buffers)
-			break;
-	} while ((bh = bh->b_this_page) != head);
-
-	return false;
-}
-
 STATIC void
 xfs_vm_invalidatepage(
 	struct page		*page,
@@ -785,9 +742,6 @@ xfs_vm_invalidatepage(
  * transaction. Indeed - if we get ENOSPC errors, we have to be able to do this
  * truncation without a transaction as there is no space left for block
  * reservation (typically why we see a ENOSPC in writeback).
- *
- * This is not a performance critical path, so for now just do the punching a
- * buffer head at a time.
  */
 STATIC void
 xfs_aops_discard_page(
@@ -795,47 +749,26 @@ xfs_aops_discard_page(
 {
 	struct inode		*inode = page->mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
-	struct buffer_head	*bh, *head;
+	struct xfs_mount	*mp = ip->i_mount;
 	loff_t			offset = page_offset(page);
+	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, offset);
+	int			error;
 
-	if (!xfs_check_page_type(page, XFS_IO_DELALLOC, true))
-		goto out_invalidate;
-
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
+	if (XFS_FORCED_SHUTDOWN(mp))
 		goto out_invalidate;
 
-	xfs_alert(ip->i_mount,
+	xfs_alert(mp,
 		"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
 			page, ip->i_ino, offset);
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	bh = head = page_buffers(page);
-	do {
-		int		error;
-		xfs_fileoff_t	start_fsb;
-
-		if (!buffer_delay(bh))
-			goto next_buffer;
-
-		start_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
-		error = xfs_bmap_punch_delalloc_range(ip, start_fsb, 1);
-		if (error) {
-			/* something screwed, just bail */
-			if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
-				xfs_alert(ip->i_mount,
-			"page discard unable to remove delalloc mapping.");
-			}
-			break;
-		}
-next_buffer:
-		offset += i_blocksize(inode);
-
-	} while ((bh = bh->b_this_page) != head);
-
+	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
+			PAGE_SIZE / i_blocksize(inode));
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	if (error && !XFS_FORCED_SHUTDOWN(mp))
+		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
 	xfs_vm_invalidatepage(page, 0, PAGE_SIZE);
-	return;
 }
 
 static int
-- 
2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 21/34] xfs: move locking into xfs_bmap_punch_delalloc_range
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (19 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 20/34] xfs: simplify xfs_aops_discard_page Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 22/34] xfs: make xfs_writepage_map extent map centric Christoph Hellwig
                   ` (12 subsequent siblings)
  33 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Both callers want the same locking, so do it only once.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c      | 2 --
 fs/xfs/xfs_bmap_util.c | 7 ++++---
 fs/xfs/xfs_iomap.c     | 3 ---
 3 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f2333e351e07..5dd09e83c81c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -761,10 +761,8 @@ xfs_aops_discard_page(
 		"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
 			page, ip->i_ino, offset);
 
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
 			PAGE_SIZE / i_blocksize(inode));
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index c009bdf9fdce..1a55fc06f917 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -711,12 +711,11 @@ xfs_bmap_punch_delalloc_range(
 	struct xfs_iext_cursor	icur;
 	int			error = 0;
 
-	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
-
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
 		error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
 		if (error)
-			return error;
+			goto out_unlock;
 	}
 
 	if (!xfs_iext_lookup_extent(ip, ifp, start_fsb, &icur, &got))
@@ -738,6 +737,8 @@ xfs_bmap_punch_delalloc_range(
 			break;
 	} while (xfs_iext_next_extent(ifp, &icur, &got));
 
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index da6d1995e460..f949f0dd7382 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1203,11 +1203,8 @@ xfs_file_iomap_end_delalloc(
 		truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
 					 XFS_FSB_TO_B(mp, end_fsb) - 1);
 
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
 		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
 					       end_fsb - start_fsb);
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-
 		if (error && !XFS_FORCED_SHUTDOWN(mp)) {
 			xfs_alert(mp, "%s: unable to clean up ino %lld",
 				__func__, ip->i_ino);
-- 
2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 22/34] xfs: make xfs_writepage_map extent map centric
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (20 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 21/34] xfs: move locking into xfs_bmap_punch_delalloc_range Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-24 14:59   ` Brian Foster
  2018-05-23 14:43 ` [PATCH 23/34] xfs: remove the now unused XFS_BMAPI_IGSTATE flag Christoph Hellwig
                   ` (11 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

xfs_writepage_map() iterates over the bufferheads on a page to decide
what sort of IO to do and what actions to take.  However, when it comes
to reflink and deciding when it needs to execute a COW operation, we no
longer look at the bufferhead state but instead we ignore that and look
up internal state held in the COW fork extent list.

This means xfs_writepage_map() is somewhat confused. It does stuff, then
ignores it, then tries to handle the impedance mismatch by shovelling the
results inside the existing mapping code.  It works, but it's a bit of a
mess and it makes it hard to fix the cached map bug that the writepage
code currently has.

To unify the two different mechanisms, we first have to choose a direction.
That's already been set - we're de-emphasising bufferheads so they are no
longer a control structure as we need to do that to allow for eventual
removal.  Hence we need to move away from looking at bufferhead state to
determine what operations we need to perform.

We can't completely get rid of bufferheads yet - they do contain some
state that is absolutely necessary, such as whether that part of the page
contains valid data or not (buffer_uptodate()).  Other state in the
bufferhead is redundant:

	BH_dirty - the page is dirty, so we can ignore this and just
		write it
	BH_delay - we have delalloc extent info in the DATA fork extent
		tree
	BH_unwritten - same as BH_delay
	BH_mapped - indicates we've already used it once for IO and it is
		mapped to a disk address. Needs to be ignored for COW
		blocks.

The BH_mapped flag is an interesting case - it's supposed to indicate that
it's already mapped to disk and so we can just use it "as is".  In theory,
we don't even have to do an extent lookup to find where to write it to,
but we have to do that anyway to determine we are actually writing over a
valid extent.  Hence it's not even serving the purpose of avoiding an
extent lookup during writeback, and so we can pretty much ignore it.
Especially as we have to ignore it for COW operations...

Therefore, use the extent map as the source of information to tell us
what actions we need to take and what sort of IO we should perform.  The
first step is to integrate xfs_map_blocks() and xfs_map_cow() and have
xfs_map_blocks() set the io type according to what it looks up.  This
means it can easily handle both normal overwrite and COW cases.  The
only thing we also need to add is the ability to return hole mappings.

We need to return and cache hole mappings now for the case of multiple
blocks per page.  We no longer use BH_mapped to indicate a block over
a hole, so we have to get that info from xfs_map_blocks().  We cache it so
that holes that span two pages don't need separate lookups.  This allows us
to avoid ever doing write IO over a hole, too.
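
A minimal sketch of that classification, in userspace C with hypothetical
types (enum io_type, struct mapping) rather than the real xfs_bmbt_irec
and XFS_IO_* definitions, might look like this:

```c
#include <stdbool.h>

/* Hypothetical IO types, mirroring the cases described above. */
enum io_type {
	IO_HOLE,
	IO_DELALLOC,
	IO_UNWRITTEN,
	IO_OVERWRITE,
	IO_COW,
};

/* Hypothetical allocation state of a looked-up extent. */
enum blk_state {
	BLK_HOLE,
	BLK_DELALLOC,
	BLK_REAL,
};

/* Hypothetical, simplified result of a data fork extent lookup. */
struct mapping {
	bool		found;		/* lookup returned an extent */
	enum blk_state	state;		/* allocation state of that extent */
	bool		unwritten;	/* real extent, not yet written */
};

/*
 * Pick the IO type purely from extent state.  A COW fork mapping takes
 * precedence; a failed lookup is reported as a hole so it is never
 * written.
 */
static enum io_type
classify(bool has_cow_mapping, const struct mapping *m)
{
	if (has_cow_mapping)
		return IO_COW;
	if (!m->found || m->state == BLK_HOLE)
		return IO_HOLE;
	if (m->state == BLK_DELALLOC)
		return IO_DELALLOC;
	return m->unwritten ? IO_UNWRITTEN : IO_OVERWRITE;
}
```

Note that no bufferhead state is consulted anywhere in the decision.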

Further, we need to drop the XFS_BMAPI_IGSTATE flag so that we don't
combine contiguous written and unwritten extents into a single map.  The
io type needs to match the extent type we are writing to so that we run the
correct IO completion routine for the IO. There is scope for optimisation
that would allow us to re-instate the XFS_BMAPI_IGSTATE flag, but this
requires tweaks to code outside the scope of this change.

Now that we have xfs_map_blocks() returning both a cached map and the type
of IO we need to perform, we can rewrite xfs_writepage_map() to drop all
the bufferhead control. It's also much simplified because it doesn't need
to explicitly handle COW operations.  Instead of iterating bufferheads, it
iterates blocks within the page and then looks up what per-block state is
required from the appropriate bufferhead.  It then validates the cached
map, and if it's not valid, we get a new map.  If we don't get a valid map
or it's over a hole, we skip the block.
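
The shape of that per-block loop can be sketched in userspace C.  The
names here (struct cached_map, writepage_blocks) are hypothetical, the
sorted maps[] table stands in for the extent lookup, and the uptodate[]
array stands in for the one bit of per-block state still read from the
bufferhead:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached mapping covering a run of file blocks. */
struct cached_map {
	uint64_t	start;	/* first block covered */
	uint64_t	count;	/* number of blocks covered */
	bool		hole;	/* no blocks allocated here */
};

static bool
map_covers(const struct cached_map *m, uint64_t blk)
{
	return blk >= m->start && blk < m->start + m->count;
}

/*
 * Walk the blocks of one page.  Blocks without valid data are skipped,
 * the cached map is reused while it still covers the block, and holes
 * are never written.  Returns the number of blocks that would be
 * written; *lookups counts how often a new map had to be fetched.
 */
static unsigned int
writepage_blocks(const struct cached_map *maps, size_t nr_maps,
		uint64_t first_blk, const bool *uptodate,
		unsigned int nblocks, unsigned int *lookups)
{
	const struct cached_map	*cur = NULL;
	unsigned int		written = 0;
	unsigned int		i;
	size_t			j;

	*lookups = 0;
	for (i = 0; i < nblocks; i++) {
		uint64_t	blk = first_blk + i;

		if (!uptodate[i])
			continue;	/* no valid data in this block */

		if (!cur || !map_covers(cur, blk)) {
			/* Cached map is stale, look up a new one. */
			cur = NULL;
			(*lookups)++;
			for (j = 0; j < nr_maps; j++) {
				if (map_covers(&maps[j], blk)) {
					cur = &maps[j];
					break;
				}
			}
		}
		if (!cur || cur->hole)
			continue;	/* no valid map or over a hole */
		written++;
	}
	return written;
}
```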

At this point, we have to remap the bufferhead via xfs_map_at_offset().
As previously noted, we had to do this even if the buffer was already
mapped as the mapping would be stale for XFS_IO_DELALLOC, XFS_IO_UNWRITTEN
and XFS_IO_COW IO types.  With xfs_map_blocks() now controlling the type,
even XFS_IO_OVERWRITE types need remapping, as converted-but-not-yet-
written delalloc extents beyond EOF can be reported as XFS_IO_OVERWRITE.
Bufferheads that span such regions still need their BH_Delay flags cleared
and their block numbers calculated, so we now unconditionally map each
bufferhead before submission.

But wait! There's more - remember the old "treat unwritten extents as
holes on read" hack?  Yeah, that means we can have a dirty page with
unmapped, unwritten bufferheads that contain data!  What makes these so
special is that the unwritten "hole" bufferheads do not have a valid block
device pointer, so if we attempt to write them xfs_add_to_ioend() blows
up. So we make xfs_map_at_offset() do the "realtime or data device"
lookup from the inode and ignore what was or wasn't put into the
bufferhead when the buffer was instantiated.

The astute reader will have realised by now that this code treats
unwritten extents in multiple-blocks-per-page situations differently.
If we get any combination of unwritten blocks on a dirty page that contain
valid data in the page, we're going to convert them to real extents.  This
can actually be a win, because it means that pages with interleaving
unwritten and written blocks will get converted to a single written extent
with zeros replacing the interspersed unwritten blocks.  This is actually
good for reducing extent list and conversion overhead, and it means we
issue a contiguous IO instead of lots of little ones.  The downside is
that we use up a little extra IO bandwidth.  Neither of these seem like a
bad thing given that spinning disks are seek sensitive, and SSDs/pmem have
bandwidth to burn and the lower IO latency/CPU overhead of fewer, larger
IOs will result in better performance on them...

As a result of all this, the only state we actually care about from the
bufferhead is a single flag - BH_Uptodate. We still use the bufferhead to
pass some information to the bio via xfs_add_to_ioend(), but that is
trivial to separate and pass explicitly.  This means we really only need
1 bit of state per block per page from the buffered write path in the
writeback path.  Everything else we do with the bufferhead is purely to
make the buffered IO front end continue to work correctly. i.e. we've
pretty much marginalised bufferheads in the writeback path completely.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
[hch: forward port + slight refactoring]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 277 +++++++++++++++++++++-------------------------
 fs/xfs/xfs_aops.h |   4 +-
 2 files changed, 130 insertions(+), 151 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5dd09e83c81c..a50f69c2c602 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -378,78 +378,101 @@ xfs_map_blocks(
 	struct inode		*inode,
 	loff_t			offset,
 	struct xfs_bmbt_irec	*imap,
-	int			type)
+	int			*type)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	ssize_t			count = i_blocksize(inode);
 	xfs_fileoff_t		offset_fsb, end_fsb;
+	int			whichfork = XFS_DATA_FORK;
 	int			error = 0;
-	int			bmapi_flags = XFS_BMAPI_ENTIRE;
 	int			nimaps = 1;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	/*
-	 * Truncate can race with writeback since writeback doesn't take the
-	 * iolock and truncate decreases the file size before it starts
-	 * truncating the pages between new_size and old_size.  Therefore, we
-	 * can end up in the situation where writeback gets a CoW fork mapping
-	 * but the truncate makes the mapping invalid and we end up in here
-	 * trying to get a new mapping.  Bail out here so that we simply never
-	 * get a valid mapping and so we drop the write altogether.  The page
-	 * truncation will kill the contents anyway.
-	 */
-	if (type == XFS_IO_COW && offset > i_size_read(inode))
-		return 0;
-
-	ASSERT(type != XFS_IO_COW);
-	if (type == XFS_IO_UNWRITTEN)
-		bmapi_flags |= XFS_BMAPI_IGSTATE;
-
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
 	       (ip->i_df.if_flags & XFS_IFEXTENTS));
 	ASSERT(offset <= mp->m_super->s_maxbytes);
 
+	if (xfs_is_reflink_inode(ip) &&
+	    xfs_reflink_find_cow_mapping(ip, offset, imap)) {
+		xfs_iunlock(ip, XFS_ILOCK_SHARED);
+		/*
+		 * Truncate can race with writeback since writeback doesn't
+		 * take the iolock and truncate decreases the file size before
+		 * it starts truncating the pages between new_size and old_size.
+		 * Therefore, we can end up in the situation where writeback
+		 * gets a CoW fork mapping but the truncate makes the mapping
+		 * invalid and we end up in here trying to get a new mapping.
+		 * Bail out here so that we simply never get a valid mapping
+		 * and so we drop the write altogether.  The page truncation
+		 * will kill the contents anyway.
+		 */
+		if (offset > i_size_read(inode))
+			return 0;
+		whichfork = XFS_COW_FORK;
+		*type = XFS_IO_COW;
+		goto done;
+	}
+
 	if (offset > mp->m_super->s_maxbytes - count)
 		count = mp->m_super->s_maxbytes - offset;
 	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
-				imap, &nimaps, bmapi_flags);
-	/*
-	 * Truncate an overwrite extent if there's a pending CoW
-	 * reservation before the end of this extent.  This forces us
-	 * to come back to writepage to take care of the CoW.
-	 */
-	if (nimaps && type == XFS_IO_OVERWRITE)
+				imap, &nimaps, XFS_BMAPI_ENTIRE);
+	if (!nimaps) {
+		/*
+		 * Lookup returns no match? Beyond EOF? Regardless,
+		 * return it as a hole so we don't write it.
+		 */
+		imap->br_startoff = offset_fsb;
+		imap->br_blockcount = end_fsb - offset_fsb;
+		imap->br_startblock = HOLESTARTBLOCK;
+		*type = XFS_IO_HOLE;
+	} else if (imap->br_startblock == HOLESTARTBLOCK) {
+		/* landed in a hole */
+		*type = XFS_IO_HOLE;
+	} else if (isnullstartblock(imap->br_startblock)) {
+		/* got a delalloc extent */
+		*type = XFS_IO_DELALLOC;
+	} else {
+		/*
+		 * Got an existing extent for overwrite.  Truncate it if there
+		 * is a pending CoW reservation before the end of this extent,
+		 * so that we pick up the COW extents in the next iteration.
+		 */
 		xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
+		if (imap->br_state == XFS_EXT_UNWRITTEN)
+			*type = XFS_IO_UNWRITTEN;
+		else
+			*type = XFS_IO_OVERWRITE;
+	}
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-
 	if (error)
 		return error;
 
-	if (type == XFS_IO_DELALLOC &&
-	    (!nimaps || isnullstartblock(imap->br_startblock))) {
-		error = xfs_iomap_write_allocate(ip, XFS_DATA_FORK, offset,
-				imap);
-		if (!error)
-			trace_xfs_map_blocks_alloc(ip, offset, count, type, imap);
-		return error;
-	}
-
-#ifdef DEBUG
-	if (type == XFS_IO_UNWRITTEN) {
-		ASSERT(nimaps);
-		ASSERT(imap->br_startblock != HOLESTARTBLOCK);
-		ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
+done:
+	switch (*type) {
+	case XFS_IO_HOLE:
+	case XFS_IO_OVERWRITE:
+	case XFS_IO_UNWRITTEN:
+		/* nothing to do! */
+		trace_xfs_map_blocks_found(ip, offset, count, *type, imap);
+		return 0;
+	case XFS_IO_DELALLOC:
+	case XFS_IO_COW:
+		error = xfs_iomap_write_allocate(ip, whichfork, offset, imap);
+		if (error)
+			return error;
+		trace_xfs_map_blocks_alloc(ip, offset, count, *type, imap);
+		return 0;
+	default:
+		ASSERT(0);
+		return -EFSCORRUPTED;
 	}
-#endif
-	if (nimaps)
-		trace_xfs_map_blocks_found(ip, offset, count, type, imap);
-	return 0;
 }
 
 STATIC bool
@@ -709,6 +732,14 @@ xfs_map_at_offset(
 	set_buffer_mapped(bh);
 	clear_buffer_delay(bh);
 	clear_buffer_unwritten(bh);
+
+	/*
+	 * If this is a realtime file, data may be on a different device
+	 * to that pointed to from the buffer_head b_bdev currently. We can't
+	 * trust that the bufferhead has already been mapped correctly, so
+	 * set the bdev now.
+	 */
+	bh->b_bdev = xfs_find_bdev_for_inode(inode);
 }
 
 STATIC void
@@ -769,56 +800,6 @@ xfs_aops_discard_page(
 	xfs_vm_invalidatepage(page, 0, PAGE_SIZE);
 }
 
-static int
-xfs_map_cow(
-	struct xfs_writepage_ctx *wpc,
-	struct inode		*inode,
-	loff_t			offset,
-	unsigned int		*new_type)
-{
-	struct xfs_inode	*ip = XFS_I(inode);
-	struct xfs_bmbt_irec	imap;
-	bool			is_cow = false;
-	int			error;
-
-	/*
-	 * If we already have a valid COW mapping keep using it.
-	 */
-	if (wpc->io_type == XFS_IO_COW) {
-		wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap, offset);
-		if (wpc->imap_valid) {
-			*new_type = XFS_IO_COW;
-			return 0;
-		}
-	}
-
-	/*
-	 * Else we need to check if there is a COW mapping at this offset.
-	 */
-	xfs_ilock(ip, XFS_ILOCK_SHARED);
-	is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap);
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-
-	if (!is_cow)
-		return 0;
-
-	/*
-	 * And if the COW mapping has a delayed extent here we need to
-	 * allocate real space for it now.
-	 */
-	if (isnullstartblock(imap.br_startblock)) {
-		error = xfs_iomap_write_allocate(ip, XFS_COW_FORK, offset,
-				&imap);
-		if (error)
-			return error;
-	}
-
-	wpc->io_type = *new_type = XFS_IO_COW;
-	wpc->imap_valid = true;
-	wpc->imap = imap;
-	return 0;
-}
-
 /*
  * We implement an immediate ioend submission policy here to avoid needing to
  * chain multiple ioends and hence nest mempool allocations which can violate
@@ -845,85 +826,81 @@ xfs_writepage_map(
 {
 	LIST_HEAD(submit_list);
 	struct xfs_ioend	*ioend, *next;
-	struct buffer_head	*bh, *head;
+	struct buffer_head	*bh;
 	ssize_t			len = i_blocksize(inode);
-	uint64_t		offset;
 	int			error = 0;
 	int			count = 0;
-	int			uptodate = 1;
-	unsigned int		new_type;
+	bool			uptodate = true;
+	loff_t			file_offset;	/* file offset of page */
+	unsigned		poffset;	/* offset into page */
 
-	bh = head = page_buffers(page);
-	offset = page_offset(page);
-	do {
-		if (offset >= end_offset)
+	/*
+	 * Walk the blocks on the page, and when we run off the end of the
+	 * current map or find the current map invalid, grab a new one.
+	 * We only use bufferheads here to check per-block state - they no
+	 * longer control the iteration through the page. This allows us to
+	 * replace the bufferhead with some other state tracking mechanism in
+	 * future.
+	 */
+	file_offset = page_offset(page);
+	bh = page_buffers(page);
+	for (poffset = 0;
+	     poffset < PAGE_SIZE;
+	     poffset += len, file_offset += len, bh = bh->b_this_page) {
+		/* past the range we are writing, so nothing more to write. */
+		if (file_offset >= end_offset)
 			break;
-		if (!buffer_uptodate(bh))
-			uptodate = 0;
 
 		/*
-		 * set_page_dirty dirties all buffers in a page, independent
-		 * of their state.  The dirty state however is entirely
-		 * meaningless for holes (!mapped && uptodate), so skip
-		 * buffers covering holes here.
+		 * Block does not contain valid data, skip it, mark the current
+		 * map as invalid because we have a discontiguity. This ensures
+		 * we put subsequent writeable buffers into a new ioend.
 		 */
-		if (!buffer_mapped(bh) && buffer_uptodate(bh)) {
-			wpc->imap_valid = false;
-			continue;
-		}
-
-		if (buffer_unwritten(bh))
-			new_type = XFS_IO_UNWRITTEN;
-		else if (buffer_delay(bh))
-			new_type = XFS_IO_DELALLOC;
-		else if (buffer_uptodate(bh))
-			new_type = XFS_IO_OVERWRITE;
-		else {
+		if (!buffer_uptodate(bh)) {
 			if (PageUptodate(page))
 				ASSERT(buffer_mapped(bh));
-			/*
-			 * This buffer is not uptodate and will not be
-			 * written to disk.  Ensure that we will put any
-			 * subsequent writeable buffers into a new
-			 * ioend.
-			 */
+			uptodate = false;
 			wpc->imap_valid = false;
 			continue;
 		}
 
-		if (xfs_is_reflink_inode(XFS_I(inode))) {
-			error = xfs_map_cow(wpc, inode, offset, &new_type);
-			if (error)
-				goto out;
-		}
-
-		if (wpc->io_type != new_type) {
-			wpc->io_type = new_type;
-			wpc->imap_valid = false;
-		}
-
+		/* Check to see if current map spans this file offset */
 		if (wpc->imap_valid)
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
-							 offset);
+							 file_offset);
+		/*
+		 * If we don't have a valid map, now it's time to get a new one
+		 * for this offset.  This will convert delayed allocations
+		 * (including COW ones) into real extents.  If we return without
+		 * a valid map, it means we landed in a hole and we skip the
+		 * block.
+		 */
 		if (!wpc->imap_valid) {
-			error = xfs_map_blocks(inode, offset, &wpc->imap,
-					     wpc->io_type);
+			error = xfs_map_blocks(inode, file_offset, &wpc->imap,
+					     &wpc->io_type);
 			if (error)
 				goto out;
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
-							 offset);
+							 file_offset);
 		}
-		if (wpc->imap_valid) {
-			lock_buffer(bh);
-			if (wpc->io_type != XFS_IO_OVERWRITE)
-				xfs_map_at_offset(inode, bh, &wpc->imap, offset);
-			xfs_add_to_ioend(inode, bh, offset, wpc, wbc, &submit_list);
-			count++;
+
+		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
+			/*
+			 * set_page_dirty dirties all buffers in a page, independent
+			 * of their state.  The dirty state however is entirely
+			 * meaningless for holes (!mapped && uptodate), so check we did
+			 * have a buffer covering a hole here and continue.
+			 */
+			continue;
 		}
 
-	} while (offset += len, ((bh = bh->b_this_page) != head));
+		lock_buffer(bh);
+		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
+		xfs_add_to_ioend(inode, bh, file_offset, wpc, wbc, &submit_list);
+		count++;
+	}
 
-	if (uptodate && bh == head)
+	if (uptodate && poffset == PAGE_SIZE)
 		SetPageUptodate(page);
 
 	ASSERT(wpc->ioend || list_empty(&submit_list));
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 69346d460dfa..b2ef5b661761 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -29,6 +29,7 @@ enum {
 	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
 	XFS_IO_OVERWRITE,	/* covers already allocated extent */
 	XFS_IO_COW,		/* covers copy-on-write extent */
+	XFS_IO_HOLE,		/* covers region without any block allocation */
 };
 
 #define XFS_IO_TYPES \
@@ -36,7 +37,8 @@ enum {
 	{ XFS_IO_DELALLOC,		"delalloc" }, \
 	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
 	{ XFS_IO_OVERWRITE,		"overwrite" }, \
-	{ XFS_IO_COW,			"CoW" }
+	{ XFS_IO_COW,			"CoW" }, \
+	{ XFS_IO_HOLE,			"hole" }
 
 /*
  * Structure for buffered I/O completions.
-- 
2.17.0

* [PATCH 23/34] xfs: remove the now unused XFS_BMAPI_IGSTATE flag
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (21 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 22/34] xfs: make xfs_writepage_map extent map centric Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 24/34] xfs: remove xfs_reflink_find_cow_mapping Christoph Hellwig
                   ` (10 subsequent siblings)
  33 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_bmap.c | 6 ++----
 fs/xfs/libxfs/xfs_bmap.h | 3 ---
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7b0e2b551e23..4b5e014417d2 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3799,8 +3799,7 @@ xfs_bmapi_update_map(
 		   mval[-1].br_startblock != HOLESTARTBLOCK &&
 		   mval->br_startblock == mval[-1].br_startblock +
 					  mval[-1].br_blockcount &&
-		   ((flags & XFS_BMAPI_IGSTATE) ||
-			mval[-1].br_state == mval->br_state)) {
+		   mval[-1].br_state == mval->br_state) {
 		ASSERT(mval->br_startoff ==
 		       mval[-1].br_startoff + mval[-1].br_blockcount);
 		mval[-1].br_blockcount += mval->br_blockcount;
@@ -3845,7 +3844,7 @@ xfs_bmapi_read(
 
 	ASSERT(*nmap >= 1);
 	ASSERT(!(flags & ~(XFS_BMAPI_ATTRFORK|XFS_BMAPI_ENTIRE|
-			   XFS_BMAPI_IGSTATE|XFS_BMAPI_COWFORK)));
+			   XFS_BMAPI_COWFORK)));
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED|XFS_ILOCK_EXCL));
 
 	if (unlikely(XFS_TEST_ERROR(
@@ -4290,7 +4289,6 @@ xfs_bmapi_write(
 
 	ASSERT(*nmap >= 1);
 	ASSERT(*nmap <= XFS_BMAP_MAX_NMAP);
-	ASSERT(!(flags & XFS_BMAPI_IGSTATE));
 	ASSERT(tp != NULL ||
 	       (flags & (XFS_BMAPI_CONVERT | XFS_BMAPI_COWFORK)) ==
 			(XFS_BMAPI_CONVERT | XFS_BMAPI_COWFORK));
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 2c233f9f1a26..a845fe57d1b5 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -80,8 +80,6 @@ struct xfs_extent_free_item
 #define XFS_BMAPI_METADATA	0x002	/* mapping metadata not user data */
 #define XFS_BMAPI_ATTRFORK	0x004	/* use attribute fork not data */
 #define XFS_BMAPI_PREALLOC	0x008	/* preallocation op: unwritten space */
-#define XFS_BMAPI_IGSTATE	0x010	/* Ignore state - */
-					/* combine contig. space */
 #define XFS_BMAPI_CONTIG	0x020	/* must allocate only one extent */
 /*
  * unwritten extent conversion - this needs write cache flushing and no additional
@@ -128,7 +126,6 @@ struct xfs_extent_free_item
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
 	{ XFS_BMAPI_ATTRFORK,	"ATTRFORK" }, \
 	{ XFS_BMAPI_PREALLOC,	"PREALLOC" }, \
-	{ XFS_BMAPI_IGSTATE,	"IGSTATE" }, \
 	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
 	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
 	{ XFS_BMAPI_ZERO,	"ZERO" }, \
-- 
2.17.0

* [PATCH 24/34] xfs: remove xfs_reflink_find_cow_mapping
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (22 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 23/34] xfs: remove the now unused XFS_BMAPI_IGSTATE flag Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 25/34] xfs: remove xfs_reflink_trim_irec_to_next_cow Christoph Hellwig
                   ` (9 subsequent siblings)
  33 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

We only have one caller left, and open coding the simple extent list
lookup in it lets us both make the code more understandable and reuse
calculations and variables already present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
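Reader's note, not part of the patch: a minimal userspace sketch of the
open-coded check, assuming a sorted array stands in for the COW fork's
extent list.  struct ext, lookup_extent() and cow_covers() are hypothetical
names; only the "br_startoff <= offset_fsb" test mirrors the hunk below.
The lookup may return the next extent after the offset, so a hit still has
to be checked for actually covering the offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xfs_bmbt_irec (units: fs blocks). */
struct ext {
	uint64_t startoff;
	uint64_t blockcount;
};

/*
 * Return the first extent that ends after off: the extent covering off
 * or, failing that, the next one.  Returns false past the last extent.
 */
static bool lookup_extent(const struct ext *list, size_t n, uint64_t off,
			  struct ext *got)
{
	assert(list != NULL && got != NULL);
	for (size_t i = 0; i < n; i++) {
		if (list[i].startoff + list[i].blockcount > off) {
			*got = list[i];
			return true;
		}
	}
	return false;
}

/* True only if the extent that was found actually covers off. */
static bool cow_covers(const struct ext *list, size_t n, uint64_t off,
		       struct ext *got)
{
	return lookup_extent(list, n, off, got) && got->startoff <= off;
}
```

With extents at [10, 15) and [30, 32), offset 12 is covered, while offset
20 finds the extent starting at 30 and is rejected by the second test.
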
 fs/xfs/xfs_aops.c    | 17 ++++++++++++-----
 fs/xfs/xfs_reflink.c | 30 ------------------------------
 fs/xfs/xfs_reflink.h |  2 --
 fs/xfs/xfs_trace.h   |  1 -
 4 files changed, 12 insertions(+), 38 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a50f69c2c602..a4b4a7037deb 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -385,6 +385,7 @@ xfs_map_blocks(
 	ssize_t			count = i_blocksize(inode);
 	xfs_fileoff_t		offset_fsb, end_fsb;
 	int			whichfork = XFS_DATA_FORK;
+	struct xfs_iext_cursor	icur;
 	int			error = 0;
 	int			nimaps = 1;
 
@@ -396,8 +397,18 @@ xfs_map_blocks(
 	       (ip->i_df.if_flags & XFS_IFEXTENTS));
 	ASSERT(offset <= mp->m_super->s_maxbytes);
 
+	if (offset > mp->m_super->s_maxbytes - count)
+		count = mp->m_super->s_maxbytes - offset;
+	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+	offset_fsb = XFS_B_TO_FSBT(mp, offset);
+
+	/*
+	 * Check if this offset is covered by a COW extent, and if yes use
+	 * it directly instead of looking up anything in the data fork.
+	 */
 	if (xfs_is_reflink_inode(ip) &&
-	    xfs_reflink_find_cow_mapping(ip, offset, imap)) {
+	    xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, imap) &&
+	    imap->br_startoff <= offset_fsb) {
 		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 		/*
 		 * Truncate can race with writeback since writeback doesn't
@@ -417,10 +428,6 @@ xfs_map_blocks(
 		goto done;
 	}
 
-	if (offset > mp->m_super->s_maxbytes - count)
-		count = mp->m_super->s_maxbytes - offset;
-	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
-	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
 				imap, &nimaps, XFS_BMAPI_ENTIRE);
 	if (!nimaps) {
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 713e857d9ffa..8e5eb8e70c89 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -484,36 +484,6 @@ xfs_reflink_allocate_cow(
 	return error;
 }
 
-/*
- * Find the CoW reservation for a given byte offset of a file.
- */
-bool
-xfs_reflink_find_cow_mapping(
-	struct xfs_inode		*ip,
-	xfs_off_t			offset,
-	struct xfs_bmbt_irec		*imap)
-{
-	struct xfs_ifork		*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
-	xfs_fileoff_t			offset_fsb;
-	struct xfs_bmbt_irec		got;
-	struct xfs_iext_cursor		icur;
-
-	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED));
-
-	if (!xfs_is_reflink_inode(ip))
-		return false;
-	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
-	if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, &icur, &got))
-		return false;
-	if (got.br_startoff > offset_fsb)
-		return false;
-
-	trace_xfs_reflink_find_cow_mapping(ip, offset, 1, XFS_IO_OVERWRITE,
-			&got);
-	*imap = got;
-	return true;
-}
-
 /*
  * Trim an extent to end at the next CoW reservation past offset_fsb.
  */
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 701487bab468..15a456492667 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -32,8 +32,6 @@ extern int xfs_reflink_allocate_cow(struct xfs_inode *ip,
 		struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode);
 extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
-extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
-		struct xfs_bmbt_irec *imap);
 extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
 		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 9d4c4ca24fe6..ed8f774944ba 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3227,7 +3227,6 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_convert_cow);
 DEFINE_RW_EVENT(xfs_reflink_reserve_cow);
 
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_bounce_dio_write);
-DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_irec);
 
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cancel_cow_range);
-- 
2.17.0

* [PATCH 25/34] xfs: remove xfs_reflink_trim_irec_to_next_cow
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (23 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 24/34] xfs: remove xfs_reflink_find_cow_mapping Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-24 14:59   ` Brian Foster
  2018-05-23 14:43 ` [PATCH 26/34] xfs: simplify xfs_map_blocks by using xfs_iext_lookup_extent directly Christoph Hellwig
                   ` (8 subsequent siblings)
  33 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

In the only caller we just did a lookup in the COW extent tree for
the same offset.  Reuse that result and save a lookup, as well as
shortening the ilock hold time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
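Reader's note, not part of the patch: a small sketch of the trimming step,
assuming the cached lookup result is passed in as cow_valid/cow_fsb just
like the new local variables.  struct ext and trim_to_next_cow() are
hypothetical names.  The sketch also assumes, as in the hunk below, that a
COW extent covering the offset itself has already been handled earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xfs_bmbt_irec (units: fs blocks). */
struct ext {
	uint64_t startoff;
	uint64_t blockcount;
};

/*
 * Shorten *imap so that it ends where the next COW extent begins.
 * Returns true if the mapping was trimmed.
 */
static bool trim_to_next_cow(struct ext *imap, bool cow_valid,
			     uint64_t cow_fsb)
{
	if (!cow_valid || cow_fsb >= imap->startoff + imap->blockcount)
		return false;

	/* Assumed precondition: the COW extent starts inside the mapping. */
	assert(cow_fsb > imap->startoff);
	imap->blockcount = cow_fsb - imap->startoff;
	return true;
}
```

A data fork extent covering blocks [100, 150) with a COW extent starting
at block 120 is cut down to [100, 120); the next loop iteration then picks
up the COW extent.
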
 fs/xfs/xfs_aops.c    | 25 +++++++++++++++++--------
 fs/xfs/xfs_reflink.c | 33 ---------------------------------
 fs/xfs/xfs_reflink.h |  2 --
 3 files changed, 17 insertions(+), 43 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a4b4a7037deb..354d26d66c12 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -383,11 +383,12 @@ xfs_map_blocks(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	ssize_t			count = i_blocksize(inode);
-	xfs_fileoff_t		offset_fsb, end_fsb;
+	xfs_fileoff_t		offset_fsb, end_fsb, cow_fsb = 0;
 	int			whichfork = XFS_DATA_FORK;
 	struct xfs_iext_cursor	icur;
 	int			error = 0;
 	int			nimaps = 1;
+	bool			cow_valid = false;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
@@ -407,8 +408,11 @@ xfs_map_blocks(
 	 * it directly instead of looking up anything in the data fork.
 	 */
 	if (xfs_is_reflink_inode(ip) &&
-	    xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, imap) &&
-	    imap->br_startoff <= offset_fsb) {
+	    xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, imap)) {
+		cow_fsb = imap->br_startoff;
+		cow_valid = true;
+	}
+	if (cow_valid && cow_fsb <= offset_fsb) {
 		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 		/*
 		 * Truncate can race with writeback since writeback doesn't
@@ -430,6 +434,10 @@ xfs_map_blocks(
 
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
 				imap, &nimaps, XFS_BMAPI_ENTIRE);
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	if (error)
+		return error;
+
 	if (!nimaps) {
 		/*
 		 * Lookup returns no match? Beyond eof? regardless,
@@ -451,16 +459,17 @@ xfs_map_blocks(
 		 * is a pending CoW reservation before the end of this extent,
 		 * so that we pick up the COW extents in the next iteration.
 		 */
-		xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
+		if (cow_valid &&
+		    cow_fsb < imap->br_startoff + imap->br_blockcount) {
+			imap->br_blockcount = cow_fsb - imap->br_startoff;
+			trace_xfs_reflink_trim_irec(ip, imap);
+		}
+
 		if (imap->br_state == XFS_EXT_UNWRITTEN)
 			*type = XFS_IO_UNWRITTEN;
 		else
 			*type = XFS_IO_OVERWRITE;
 	}
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-	if (error)
-		return error;
-
 done:
 	switch (*type) {
 	case XFS_IO_HOLE:
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 8e5eb8e70c89..ff76bc56ff3d 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -484,39 +484,6 @@ xfs_reflink_allocate_cow(
 	return error;
 }
 
-/*
- * Trim an extent to end at the next CoW reservation past offset_fsb.
- */
-void
-xfs_reflink_trim_irec_to_next_cow(
-	struct xfs_inode		*ip,
-	xfs_fileoff_t			offset_fsb,
-	struct xfs_bmbt_irec		*imap)
-{
-	struct xfs_ifork		*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
-	struct xfs_bmbt_irec		got;
-	struct xfs_iext_cursor		icur;
-
-	if (!xfs_is_reflink_inode(ip))
-		return;
-
-	/* Find the extent in the CoW fork. */
-	if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, &icur, &got))
-		return;
-
-	/* This is the extent before; try sliding up one. */
-	if (got.br_startoff < offset_fsb) {
-		if (!xfs_iext_next_extent(ifp, &icur, &got))
-			return;
-	}
-
-	if (got.br_startoff >= imap->br_startoff + imap->br_blockcount)
-		return;
-
-	imap->br_blockcount = got.br_startoff - imap->br_startoff;
-	trace_xfs_reflink_trim_irec(ip, imap);
-}
-
 /*
  * Cancel CoW reservations for some block range of an inode.
  *
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 15a456492667..e8d4d50c629f 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -32,8 +32,6 @@ extern int xfs_reflink_allocate_cow(struct xfs_inode *ip,
 		struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode);
 extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
-extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
-		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
 
 extern int xfs_reflink_cancel_cow_blocks(struct xfs_inode *ip,
 		struct xfs_trans **tpp, xfs_fileoff_t offset_fsb,
-- 
2.17.0

* [PATCH 26/34] xfs: simplify xfs_map_blocks by using xfs_iext_lookup_extent directly
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (24 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 25/34] xfs: remove xfs_reflink_trim_irec_to_next_cow Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 27/34] xfs: don't clear imap_valid for a non-uptodate buffers Christoph Hellwig
                   ` (7 subsequent siblings)
  33 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

xfs_bmapi_read adds zero value in xfs_map_blocks.  Replace it with a
direct call to the low-level extent lookup function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
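Reader's note, not part of the patch: a sketch of how a failed or "too
late" lookup result is turned into a hole mapping.  struct ext with its
hole flag and map_or_hole() are hypothetical; the flag stands in for
setting br_startblock to HOLESTARTBLOCK and *type to XFS_IO_HOLE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xfs_bmbt_irec (units: fs blocks). */
struct ext {
	uint64_t startoff;
	uint64_t blockcount;
	bool hole;
};

/*
 * found says whether the extent lookup returned anything in *imap.
 * On return *imap either covers offset_fsb or describes the hole
 * starting at offset_fsb.
 */
static void map_or_hole(bool found, struct ext *imap, uint64_t offset_fsb,
			uint64_t end_fsb)
{
	assert(offset_fsb < end_fsb);

	if (!found)
		imap->startoff = end_fsb;	/* fake a hole past EOF */

	if (imap->startoff > offset_fsb) {
		/* the hole runs up to the start of the extent we found */
		imap->blockcount = imap->startoff - offset_fsb;
		imap->startoff = offset_fsb;
		imap->hole = true;
	} else {
		imap->hole = false;
	}
}
```

Folding the "nothing found" case into the "found a later extent" case is
what lets the separate nimaps and HOLESTARTBLOCK branches go away.
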
 fs/xfs/xfs_aops.c | 19 +++++--------------
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 354d26d66c12..b1dee2171194 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -387,7 +387,6 @@ xfs_map_blocks(
 	int			whichfork = XFS_DATA_FORK;
 	struct xfs_iext_cursor	icur;
 	int			error = 0;
-	int			nimaps = 1;
 	bool			cow_valid = false;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
@@ -432,24 +431,16 @@ xfs_map_blocks(
 		goto done;
 	}
 
-	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
-				imap, &nimaps, XFS_BMAPI_ENTIRE);
+	if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, imap))
+		imap->br_startoff = end_fsb;	/* fake a hole past EOF */
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-	if (error)
-		return error;
 
-	if (!nimaps) {
-		/*
-		 * Lookup returns no match? Beyond eof? regardless,
-		 * return it as a hole so we don't write it
-		 */
+	if (imap->br_startoff > offset_fsb) {
+		/* landed in a hole or beyond EOF */
+		imap->br_blockcount = imap->br_startoff - offset_fsb;
 		imap->br_startoff = offset_fsb;
-		imap->br_blockcount = end_fsb - offset_fsb;
 		imap->br_startblock = HOLESTARTBLOCK;
 		*type = XFS_IO_HOLE;
-	} else if (imap->br_startblock == HOLESTARTBLOCK) {
-		/* landed in a hole */
-		*type = XFS_IO_HOLE;
 	} else if (isnullstartblock(imap->br_startblock)) {
 		/* got a delalloc extent */
 		*type = XFS_IO_DELALLOC;
-- 
2.17.0

* [PATCH 27/34] xfs: don't clear imap_valid for a non-uptodate buffers
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (25 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 26/34] xfs: simplify xfs_map_blocks by using xfs_iext_lookup_extent directly Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 28/34] xfs: remove the imap_valid flag Christoph Hellwig
                   ` (6 subsequent siblings)
  33 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Finding a buffer that isn't uptodate doesn't invalidate the mapping for
any given block.  The last_sector check will already take care of starting
another ioend as soon as we find any non-uptodate buffer, and if the current
mapping doesn't include the next uptodate buffer the xfs_imap_valid check
will take care of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b1dee2171194..82fd08c29f7f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -859,15 +859,12 @@ xfs_writepage_map(
 			break;
 
 		/*
-		 * Block does not contain valid data, skip it, mark the current
-		 * map as invalid because we have a discontiguity. This ensures
-		 * we put subsequent writeable buffers into a new ioend.
+		 * Block does not contain valid data, skip it.
 		 */
 		if (!buffer_uptodate(bh)) {
 			if (PageUptodate(page))
 				ASSERT(buffer_mapped(bh));
 			uptodate = false;
-			wpc->imap_valid = false;
 			continue;
 		}
 
-- 
2.17.0

* [PATCH 28/34] xfs: remove the imap_valid flag
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (26 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 27/34] xfs: don't clear imap_valid for a non-uptodate buffers Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 29/34] xfs: don't look at buffer heads in xfs_add_to_ioend Christoph Hellwig
                   ` (5 subsequent siblings)
  33 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Simplify the way we check for a valid imap - we know we have a valid
mapping after xfs_map_blocks returned successfully, and we know we can
call xfs_imap_valid on any imap, as it will always fail on a
zero-initialized map.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 82fd08c29f7f..f01c1dd737ec 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -42,7 +42,6 @@
  */
 struct xfs_writepage_ctx {
 	struct xfs_bmbt_irec    imap;
-	bool			imap_valid;
 	unsigned int		io_type;
 	struct xfs_ioend	*ioend;
 	sector_t		last_block;
@@ -868,10 +867,6 @@ xfs_writepage_map(
 			continue;
 		}
 
-		/* Check to see if current map spans this file offset */
-		if (wpc->imap_valid)
-			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
-							 file_offset);
 		/*
 		 * If we don't have a valid map, now it's time to get a new one
 		 * for this offset.  This will convert delayed allocations
@@ -879,16 +874,14 @@ xfs_writepage_map(
 		 * a valid map, it means we landed in a hole and we skip the
 		 * block.
 		 */
-		if (!wpc->imap_valid) {
+		if (!xfs_imap_valid(inode, &wpc->imap, file_offset)) {
 			error = xfs_map_blocks(inode, file_offset, &wpc->imap,
 					     &wpc->io_type);
 			if (error)
 				goto out;
-			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
-							 file_offset);
 		}
 
-		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
+		if (wpc->io_type == XFS_IO_HOLE) {
 			/*
 			 * set_page_dirty dirties all buffers in a page, independent
 			 * of their state.  The dirty state however is entirely
-- 
2.17.0

* [PATCH 29/34] xfs: don't look at buffer heads in xfs_add_to_ioend
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (27 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 28/34] xfs: remove the imap_valid flag Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 30/34] xfs: move all writeback buffer_head manipulation into xfs_map_at_offset Christoph Hellwig
                   ` (4 subsequent siblings)
  33 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Calculate all information for the bio based on the passed in information
without requiring a buffer_head structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
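Reader's note, not part of the patch: a sketch of the sector arithmetic
that replaces bh->b_blocknr.  block_to_sector() is a hypothetical helper,
and it assumes a flat, linear block address space; the real xfs_fsb_to_db()
also has to translate filesystem block numbers to disk addresses, which
this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * startblock/startoff describe the extent in fs blocks, offset is the
 * byte offset in the file, blkbits is log2 of the fs block size.
 * Returns the 512-byte sector that backs offset.
 */
static uint64_t block_to_sector(uint64_t startblock, uint64_t startoff,
				unsigned int blkbits, uint64_t offset)
{
	assert(blkbits >= 9 && blkbits < 64);
	assert(offset >= (startoff << blkbits));

	return (startblock << (blkbits - 9)) +
	       ((offset - (startoff << blkbits)) >> 9);
}
```

With 4k blocks, consecutive file blocks inside one extent come out exactly
8 sectors apart, which is what makes the comparison against
bio_end_sector() a contiguity check.
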
 fs/xfs/xfs_aops.c | 68 ++++++++++++++++++++++-------------------------
 1 file changed, 32 insertions(+), 36 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f01c1dd737ec..592b33b35a30 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -44,7 +44,6 @@ struct xfs_writepage_ctx {
 	struct xfs_bmbt_irec    imap;
 	unsigned int		io_type;
 	struct xfs_ioend	*ioend;
-	sector_t		last_block;
 };
 
 void
@@ -545,11 +544,6 @@ xfs_start_page_writeback(
 	unlock_page(page);
 }
 
-static inline int xfs_bio_add_buffer(struct bio *bio, struct buffer_head *bh)
-{
-	return bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
-}
-
 /*
  * Submit the bio for an ioend. We are passed an ioend with a bio attached to
  * it, and we submit that bio. The ioend may be used for multiple bio
@@ -604,27 +598,20 @@ xfs_submit_ioend(
 	return 0;
 }
 
-static void
-xfs_init_bio_from_bh(
-	struct bio		*bio,
-	struct buffer_head	*bh)
-{
-	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
-	bio_set_dev(bio, bh->b_bdev);
-}
-
 static struct xfs_ioend *
 xfs_alloc_ioend(
 	struct inode		*inode,
 	unsigned int		type,
 	xfs_off_t		offset,
-	struct buffer_head	*bh)
+	struct block_device	*bdev,
+	sector_t		sector)
 {
 	struct xfs_ioend	*ioend;
 	struct bio		*bio;
 
 	bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, xfs_ioend_bioset);
-	xfs_init_bio_from_bh(bio, bh);
+	bio_set_dev(bio, bdev);
+	bio->bi_iter.bi_sector = sector;
 
 	ioend = container_of(bio, struct xfs_ioend, io_inline_bio);
 	INIT_LIST_HEAD(&ioend->io_list);
@@ -649,13 +636,14 @@ static void
 xfs_chain_bio(
 	struct xfs_ioend	*ioend,
 	struct writeback_control *wbc,
-	struct buffer_head	*bh)
+	struct block_device	*bdev,
+	sector_t		sector)
 {
 	struct bio *new;
 
 	new = bio_alloc(GFP_NOFS, BIO_MAX_PAGES);
-	xfs_init_bio_from_bh(new, bh);
-
+	bio_set_dev(new, bdev);
+	new->bi_iter.bi_sector = sector;
 	bio_chain(ioend->io_bio, new);
 	bio_get(ioend->io_bio);		/* for xfs_destroy_ioend */
 	ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
@@ -665,39 +653,45 @@ xfs_chain_bio(
 }
 
 /*
- * Test to see if we've been building up a completion structure for
- * earlier buffers -- if so, we try to append to this ioend if we
- * can, otherwise we finish off any current ioend and start another.
- * Return the ioend we finished off so that the caller can submit it
- * once it has finished processing the dirty page.
+ * Test to see if we have an existing ioend structure that we could append to
+ * first, otherwise finish off the current ioend and start another.
  */
 STATIC void
 xfs_add_to_ioend(
 	struct inode		*inode,
-	struct buffer_head	*bh,
 	xfs_off_t		offset,
+	struct page		*page,
 	struct xfs_writepage_ctx *wpc,
 	struct writeback_control *wbc,
 	struct list_head	*iolist)
 {
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct block_device	*bdev = xfs_find_bdev_for_inode(inode);
+	unsigned		len = i_blocksize(inode);
+	unsigned		poff = offset & (PAGE_SIZE - 1);
+	sector_t		sector;
+
+	sector = xfs_fsb_to_db(ip, wpc->imap.br_startblock) +
+		((offset - XFS_FSB_TO_B(mp, wpc->imap.br_startoff)) >> 9);
+
 	if (!wpc->ioend || wpc->io_type != wpc->ioend->io_type ||
-	    bh->b_blocknr != wpc->last_block + 1 ||
+	    sector != bio_end_sector(wpc->ioend->io_bio) ||
 	    offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
 		if (wpc->ioend)
 			list_add(&wpc->ioend->io_list, iolist);
-		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
+		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset,
+				bdev, sector);
 	}
 
 	/*
-	 * If the buffer doesn't fit into the bio we need to allocate a new
-	 * one.  This shouldn't happen more than once for a given buffer.
+	 * If the block doesn't fit into the bio we need to allocate a new
+	 * one.  This shouldn't happen more than once for a given block.
 	 */
-	while (xfs_bio_add_buffer(wpc->ioend->io_bio, bh) != bh->b_size)
-		xfs_chain_bio(wpc->ioend, wbc, bh);
+	while (bio_add_page(wpc->ioend->io_bio, page, len, poff) != len)
+		xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
 
-	wpc->ioend->io_size += bh->b_size;
-	wpc->last_block = bh->b_blocknr;
-	xfs_start_buffer_writeback(bh);
+	wpc->ioend->io_size += len;
 }
 
 STATIC void
@@ -893,7 +887,9 @@ xfs_writepage_map(
 
 		lock_buffer(bh);
 		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
-		xfs_add_to_ioend(inode, bh, file_offset, wpc, wbc, &submit_list);
+		xfs_add_to_ioend(inode, file_offset, page, wpc, wbc,
+				&submit_list);
+		xfs_start_buffer_writeback(bh);
 		count++;
 	}
 
-- 
2.17.0

* [PATCH 30/34] xfs: move all writeback buffer_head manipulation into xfs_map_at_offset
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (28 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 29/34] xfs: don't look at buffer heads in xfs_add_to_ioend Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 31/34] xfs: remove xfs_start_page_writeback Christoph Hellwig
                   ` (3 subsequent siblings)
  33 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

This keeps it in a single place so it can be made optional more easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 22 +++++-----------------
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 592b33b35a30..951b329abb23 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -505,21 +505,6 @@ xfs_imap_valid(
 		offset < imap->br_startoff + imap->br_blockcount;
 }
 
-STATIC void
-xfs_start_buffer_writeback(
-	struct buffer_head	*bh)
-{
-	ASSERT(buffer_mapped(bh));
-	ASSERT(buffer_locked(bh));
-	ASSERT(!buffer_delay(bh));
-	ASSERT(!buffer_unwritten(bh));
-
-	bh->b_end_io = NULL;
-	set_buffer_async_write(bh);
-	set_buffer_uptodate(bh);
-	clear_buffer_dirty(bh);
-}
-
 STATIC void
 xfs_start_page_writeback(
 	struct page		*page,
@@ -728,6 +713,7 @@ xfs_map_at_offset(
 	ASSERT(imap->br_startblock != HOLESTARTBLOCK);
 	ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
 
+	lock_buffer(bh);
 	xfs_map_buffer(inode, bh, imap, offset);
 	set_buffer_mapped(bh);
 	clear_buffer_delay(bh);
@@ -740,6 +726,10 @@ xfs_map_at_offset(
 	 * set the bdev now.
 	 */
 	bh->b_bdev = xfs_find_bdev_for_inode(inode);
+	bh->b_end_io = NULL;
+	set_buffer_async_write(bh);
+	set_buffer_uptodate(bh);
+	clear_buffer_dirty(bh);
 }
 
 STATIC void
@@ -885,11 +875,9 @@ xfs_writepage_map(
 			continue;
 		}
 
-		lock_buffer(bh);
 		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
 		xfs_add_to_ioend(inode, file_offset, page, wpc, wbc,
 				&submit_list);
-		xfs_start_buffer_writeback(bh);
 		count++;
 	}
 
-- 
2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 31/34] xfs: remove xfs_start_page_writeback
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (29 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 30/34] xfs: move all writeback buffer_head manipulation into xfs_map_at_offset Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 32/34] xfs: refactor the tail of xfs_writepage_map Christoph Hellwig
                   ` (2 subsequent siblings)
  33 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

This helper only has two callers, one of them with a constant error
argument.  Remove it to make pending changes to the code a little easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 47 +++++++++++++++++++++--------------------------
 1 file changed, 21 insertions(+), 26 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 951b329abb23..dd92d99df51f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -505,30 +505,6 @@ xfs_imap_valid(
 		offset < imap->br_startoff + imap->br_blockcount;
 }
 
-STATIC void
-xfs_start_page_writeback(
-	struct page		*page,
-	int			clear_dirty)
-{
-	ASSERT(PageLocked(page));
-	ASSERT(!PageWriteback(page));
-
-	/*
-	 * if the page was not fully cleaned, we need to ensure that the higher
-	 * layers come back to it correctly. That means we need to keep the page
-	 * dirty, and for WB_SYNC_ALL writeback we need to ensure the
-	 * PAGECACHE_TAG_TOWRITE index mark is not removed so another attempt to
-	 * write this page in this writeback sweep will be made.
-	 */
-	if (clear_dirty) {
-		clear_page_dirty_for_io(page);
-		set_page_writeback(page);
-	} else
-		set_page_writeback_keepwrite(page);
-
-	unlock_page(page);
-}
-
 /*
  * Submit the bio for an ioend. We are passed an ioend with a bio attached to
  * it, and we submit that bio. The ioend may be used for multiple bio
@@ -887,6 +863,9 @@ xfs_writepage_map(
 	ASSERT(wpc->ioend || list_empty(&submit_list));
 
 out:
+	ASSERT(PageLocked(page));
+	ASSERT(!PageWriteback(page));
+
 	/*
 	 * On error, we have to fail the ioend here because we have locked
 	 * buffers in the ioend. If we don't do this, we'll deadlock
@@ -905,7 +884,21 @@ xfs_writepage_map(
 	 * treated correctly on error.
 	 */
 	if (count) {
-		xfs_start_page_writeback(page, !error);
+		/*
+		 * If the page was not fully cleaned, we need to ensure that the
+		 * higher layers come back to it correctly.  That means we need
+		 * to keep the page dirty, and for WB_SYNC_ALL writeback we need
+		 * to ensure the PAGECACHE_TAG_TOWRITE index mark is not removed
+		 * so another attempt to write this page in this writeback sweep
+		 * will be made.
+		 */
+		if (error) {
+			set_page_writeback_keepwrite(page);
+		} else {
+			clear_page_dirty_for_io(page);
+			set_page_writeback(page);
+		}
+		unlock_page(page);
 
 		/*
 		 * Preserve the original error if there was one, otherwise catch
@@ -930,7 +923,9 @@ xfs_writepage_map(
 		 * race with a partial page truncate on a sub-page block sized
 		 * filesystem. In that case we need to mark the page clean.
 		 */
-		xfs_start_page_writeback(page, 1);
+		clear_page_dirty_for_io(page);
+		set_page_writeback(page);
+		unlock_page(page);
 		end_page_writeback(page);
 	}
 
-- 
2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 32/34] xfs: refactor the tail of xfs_writepage_map
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (30 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 31/34] xfs: remove xfs_start_page_writeback Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 33/34] xfs: do not set the page uptodate in xfs_writepage_map Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 34/34] xfs: allow writeback on pages without buffer heads Christoph Hellwig
  33 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Rework how we handle the error vs non-error and ioend vs no-ioend cases
to keep the fast path streamlined and the duplicated code to a minimum.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 65 +++++++++++++++++++++++------------------------
 1 file changed, 32 insertions(+), 33 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index dd92d99df51f..a4e53e0a57c2 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -883,7 +883,14 @@ xfs_writepage_map(
 	 * submission of outstanding ioends on the writepage context so they are
 	 * treated correctly on error.
 	 */
-	if (count) {
+	if (unlikely(error)) {
+		if (!count) {
+			xfs_aops_discard_page(page);
+			ClearPageUptodate(page);
+			unlock_page(page);
+			goto done;
+		}
+
 		/*
 		 * If the page was not fully cleaned, we need to ensure that the
 		 * higher layers come back to it correctly.  That means we need
@@ -892,43 +899,35 @@ xfs_writepage_map(
 		 * so another attempt to write this page in this writeback sweep
 		 * will be made.
 		 */
-		if (error) {
-			set_page_writeback_keepwrite(page);
-		} else {
-			clear_page_dirty_for_io(page);
-			set_page_writeback(page);
-		}
-		unlock_page(page);
-
-		/*
-		 * Preserve the original error if there was one, otherwise catch
-		 * submission errors here and propagate into subsequent ioend
-		 * submissions.
-		 */
-		list_for_each_entry_safe(ioend, next, &submit_list, io_list) {
-			int error2;
-
-			list_del_init(&ioend->io_list);
-			error2 = xfs_submit_ioend(wbc, ioend, error);
-			if (error2 && !error)
-				error = error2;
-		}
-	} else if (error) {
-		xfs_aops_discard_page(page);
-		ClearPageUptodate(page);
-		unlock_page(page);
+		set_page_writeback_keepwrite(page);
 	} else {
-		/*
-		 * We can end up here with no error and nothing to write if we
-		 * race with a partial page truncate on a sub-page block sized
-		 * filesystem. In that case we need to mark the page clean.
-		 */
 		clear_page_dirty_for_io(page);
 		set_page_writeback(page);
-		unlock_page(page);
-		end_page_writeback(page);
 	}
 
+	unlock_page(page);
+
+	/*
+	 * Preserve the original error if there was one, otherwise catch
+	 * submission errors here and propagate into subsequent ioend
+	 * submissions.
+	 */
+	list_for_each_entry_safe(ioend, next, &submit_list, io_list) {
+		int error2;
+
+		list_del_init(&ioend->io_list);
+		error2 = xfs_submit_ioend(wbc, ioend, error);
+		if (error2 && !error)
+			error = error2;
+	}
+
+	/*
+	 * We can end up here with no error and nothing to write if we race with
+	 * a partial page truncate on a sub-page block sized filesystem.
+	 */
+	if (!count)
+		end_page_writeback(page);
+done:
 	mapping_set_error(page->mapping, error);
 	return error;
 }
-- 
2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 33/34] xfs: do not set the page uptodate in xfs_writepage_map
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (31 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 32/34] xfs: refactor the tail of xfs_writepage_map Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  2018-05-23 14:43 ` [PATCH 34/34] xfs: allow writeback on pages without buffer heads Christoph Hellwig
  33 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

We already track the page uptodate status based on the buffer uptodate
status, which is updated whenever reading or zeroing blocks.

This code has been there since a ptool commit in 2002, which
claims to:

    "merge" the 2.4 fsx fix for block size < page size to 2.5.  This needed
    major changes to actually fit.

and isn't present in other writepage implementations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a4e53e0a57c2..492f4a4b1deb 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -796,7 +796,6 @@ xfs_writepage_map(
 	ssize_t			len = i_blocksize(inode);
 	int			error = 0;
 	int			count = 0;
-	bool			uptodate = true;
 	loff_t			file_offset;	/* file offset of page */
 	unsigned		poffset;	/* offset into page */
 
@@ -823,7 +822,6 @@ xfs_writepage_map(
 		if (!buffer_uptodate(bh)) {
 			if (PageUptodate(page))
 				ASSERT(buffer_mapped(bh));
-			uptodate = false;
 			continue;
 		}
 
@@ -857,9 +855,6 @@ xfs_writepage_map(
 		count++;
 	}
 
-	if (uptodate && poffset == PAGE_SIZE)
-		SetPageUptodate(page);
-
 	ASSERT(wpc->ioend || list_empty(&submit_list));
 
 out:
-- 
2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 34/34] xfs: allow writeback on pages without buffer heads
  2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
                   ` (32 preceding siblings ...)
  2018-05-23 14:43 ` [PATCH 33/34] xfs: do not set the page uptodate in xfs_writepage_map Christoph Hellwig
@ 2018-05-23 14:43 ` Christoph Hellwig
  33 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-23 14:43 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-mm

Disable the IOMAP_F_BUFFER_HEAD flag on file systems with a block size
equal to the page size, and deal with pages without buffer heads in
writeback.  Thanks to the previous refactoring this is basically trivial
now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c  | 47 +++++++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_iomap.c |  3 ++-
 2 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 492f4a4b1deb..efa2cbb27d67 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -91,6 +91,19 @@ xfs_find_daxdev_for_inode(
 		return mp->m_ddev_targp->bt_daxdev;
 }
 
+static void
+xfs_finish_page_writeback(
+	struct inode		*inode,
+	struct bio_vec		*bvec,
+	int			error)
+{
+	if (error) {
+		SetPageError(bvec->bv_page);
+		mapping_set_error(inode->i_mapping, -EIO);
+	}
+	end_page_writeback(bvec->bv_page);
+}
+
 /*
  * We're now finished for good with this page.  Update the page state via the
  * associated buffer_heads, paying attention to the start and end offsets that
@@ -103,7 +116,7 @@ xfs_find_daxdev_for_inode(
  * and buffers potentially freed after every call to end_buffer_async_write.
  */
 static void
-xfs_finish_page_writeback(
+xfs_finish_buffer_writeback(
 	struct inode		*inode,
 	struct bio_vec		*bvec,
 	int			error)
@@ -178,9 +191,12 @@ xfs_destroy_ioend(
 			next = bio->bi_private;
 
 		/* walk each page on bio, ending page IO on them */
-		bio_for_each_segment_all(bvec, bio, i)
-			xfs_finish_page_writeback(inode, bvec, error);
-
+		bio_for_each_segment_all(bvec, bio, i) {
+			if (page_has_buffers(bvec->bv_page))
+				xfs_finish_buffer_writeback(inode, bvec, error);
+			else
+				xfs_finish_page_writeback(inode, bvec, error);
+		}
 		bio_put(bio);
 	}
 
@@ -792,13 +808,16 @@ xfs_writepage_map(
 {
 	LIST_HEAD(submit_list);
 	struct xfs_ioend	*ioend, *next;
-	struct buffer_head	*bh;
+	struct buffer_head	*bh = NULL;
 	ssize_t			len = i_blocksize(inode);
 	int			error = 0;
 	int			count = 0;
 	loff_t			file_offset;	/* file offset of page */
 	unsigned		poffset;	/* offset into page */
 
+	if (page_has_buffers(page))
+		bh = page_buffers(page);
+
 	/*
 	 * Walk the blocks on the page, and we we run off then end of the
 	 * current map or find the current map invalid, grab a new one.
@@ -807,11 +826,9 @@ xfs_writepage_map(
 	 * replace the bufferhead with some other state tracking mechanism in
 	 * future.
 	 */
-	file_offset = page_offset(page);
-	bh = page_buffers(page);
-	for (poffset = 0;
+	for (poffset = 0, file_offset = page_offset(page);
 	     poffset < PAGE_SIZE;
-	     poffset += len, file_offset += len, bh = bh->b_this_page) {
+	     poffset += len, file_offset += len) {
 		/* past the range we are writing, so nothing more to write. */
 		if (file_offset >= end_offset)
 			break;
@@ -819,9 +836,10 @@ xfs_writepage_map(
 		/*
 		 * Block does not contain valid data, skip it.
 		 */
-		if (!buffer_uptodate(bh)) {
+		if (bh && !buffer_uptodate(bh)) {
 			if (PageUptodate(page))
 				ASSERT(buffer_mapped(bh));
+			bh = bh->b_this_page;
 			continue;
 		}
 
@@ -846,10 +864,15 @@ xfs_writepage_map(
 			 * meaningless for holes (!mapped && uptodate), so check we did
 			 * have a buffer covering a hole here and continue.
 			 */
+			if (bh)
+				bh = bh->b_this_page;
 			continue;
 		}
 
-		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
+		if (bh) {
+			xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
+			bh = bh->b_this_page;
+		}
 		xfs_add_to_ioend(inode, file_offset, page, wpc, wbc,
 				&submit_list);
 		count++;
@@ -949,8 +972,6 @@ xfs_do_writepage(
 
 	trace_xfs_writepage(inode, page, 0, 0);
 
-	ASSERT(page_has_buffers(page));
-
 	/*
 	 * Refuse to write the page out if we are called from reclaim context.
 	 *
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f949f0dd7382..93c40da3378a 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1031,7 +1031,8 @@ xfs_file_iomap_begin(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	iomap->flags |= IOMAP_F_BUFFER_HEAD;
+	if (i_blocksize(inode) < PAGE_SIZE)
+		iomap->flags |= IOMAP_F_BUFFER_HEAD;
 
 	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
 			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
-- 
2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 19/34] xfs: simplify xfs_bmap_punch_delalloc_range
  2018-05-23 14:43 ` [PATCH 19/34] xfs: simplify xfs_bmap_punch_delalloc_range Christoph Hellwig
@ 2018-05-23 16:17   ` Brian Foster
  2018-05-24  8:01     ` Christoph Hellwig
  0 siblings, 1 reply; 78+ messages in thread
From: Brian Foster @ 2018-05-23 16:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:42PM +0200, Christoph Hellwig wrote:
> Instead of using xfs_bmapi_read to find delalloc extents and then punch
> them out using xfs_bunmapi, opencode the loop to iterate over the extents
> and call xfs_bmap_del_extent_delay directly.  This both simplifies the
> code and reduces the number of extent tree lookups required.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_bmap_util.c | 78 ++++++++++++++----------------------------
>  1 file changed, 25 insertions(+), 53 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 06badcbadeb4..c009bdf9fdce 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
...
> @@ -708,63 +706,37 @@ xfs_bmap_punch_delalloc_range(
>  	xfs_fileoff_t		start_fsb,
>  	xfs_fileoff_t		length)
>  {
> -	xfs_fileoff_t		remaining = length;
> +	struct xfs_ifork	*ifp = &ip->i_df;
> +	struct xfs_bmbt_irec	got, del;
> +	struct xfs_iext_cursor	icur;
>  	int			error = 0;
>  
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  
> -	do {
> -		int		done;
> -		xfs_bmbt_irec_t	imap;
> -		int		nimaps = 1;
> -		xfs_fsblock_t	firstblock;
> -		struct xfs_defer_ops dfops;
> +	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> +		error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
> +		if (error)
> +			return error;
> +	}
>  
> -		/*
> -		 * Map the range first and check that it is a delalloc extent
> -		 * before trying to unmap the range. Otherwise we will be
> -		 * trying to remove a real extent (which requires a
> -		 * transaction) or a hole, which is probably a bad idea...
> -		 */
> -		error = xfs_bmapi_read(ip, start_fsb, 1, &imap, &nimaps,
> -				       XFS_BMAPI_ENTIRE);
> +	if (!xfs_iext_lookup_extent(ip, ifp, start_fsb, &icur, &got))
> +		return 0;
>  
> -		if (error) {
> -			/* something screwed, just bail */
> -			if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> -				xfs_alert(ip->i_mount,
> -			"Failed delalloc mapping lookup ino %lld fsb %lld.",
> -						ip->i_ino, start_fsb);
> -			}
> +	do {
> +		if (got.br_startoff >= start_fsb + length)
>  			break;
> -		}
> -		if (!nimaps) {
> -			/* nothing there */
> -			goto next_block;
> -		}
> -		if (imap.br_startblock != DELAYSTARTBLOCK) {
> -			/* been converted, ignore */
> -			goto next_block;
> -		}
> -		WARN_ON(imap.br_blockcount == 0);
> +		if (!isnullstartblock(got.br_startblock))
> +			continue;
>  
> -		/*
> -		 * Note: while we initialise the firstblock/dfops pair, they
> -		 * should never be used because blocks should never be
> -		 * allocated or freed for a delalloc extent and hence we need
> -		 * don't cancel or finish them after the xfs_bunmapi() call.
> -		 */
> -		xfs_defer_init(&dfops, &firstblock);
> -		error = xfs_bunmapi(NULL, ip, start_fsb, 1, 0, 1, &firstblock,
> -					&dfops, &done);
> +		del = got;
> +		xfs_trim_extent(&del, start_fsb, length);
> +		error = xfs_bmap_del_extent_delay(ip, XFS_DATA_FORK, &icur,
> +				&got, &del);
>  		if (error)
>  			break;
> -
> -		ASSERT(!xfs_defer_has_unfinished_work(&dfops));
> -next_block:
> -		start_fsb++;
> -		remaining--;
> -	} while(remaining > 0);
> +		if (!xfs_iext_get_extent(ifp, &icur, &got))
> +			break;

Mostly looks Ok, but I'm not following what this get_extent() call is
for..? It also doesn't look like it would always do the right thing with
sub-page blocks. Consider a page with a couple discontig delalloc blocks
that happen to be the first extents in the file. The first
xfs_bmap_del_extent_delay() would do:

	xfs_iext_remove(ip, icur, state);
	xfs_iext_prev(ifp, icur);

... which I think sets cur->pos to -1, causes the get_extent() to fail
and thus fails to remove the subsequent delalloc blocks. Hm?

Brian

> +	} while (xfs_iext_next_extent(ifp, &icur, &got));
>  
>  	return error;
>  }
> -- 
> 2.17.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 19/34] xfs: simplify xfs_bmap_punch_delalloc_range
  2018-05-23 16:17   ` Brian Foster
@ 2018-05-24  8:01     ` Christoph Hellwig
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-24  8:01 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 12:17:11PM -0400, Brian Foster wrote:
> Mostly looks Ok, but I'm not following what this get_extent() call is
> for..? It also doesn't look like it would always do the right thing with
> sub-page blocks. Consider a page with a couple discontig delalloc blocks
> that happen to be the first extents in the file. The first
> xfs_bmap_del_extent_delay() would do:
> 
> 	xfs_iext_remove(ip, icur, state);
> 	xfs_iext_prev(ifp, icur);
> 
> ... which I think sets cur->pos to -1, causes the get_extent() to fail
> and thus fails to remove the subsequent delalloc blocks. Hm?

True.  This function should probably walk the extent list backwards
like xfs_bunmapi as that is the model that xfs_bmap_del_extent_* is
built around.
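
To make the cursor problem concrete, here is a minimal userspace sketch.
It is a toy model, not kernel code: toy_extent, toy_fork and the toy_*
helpers are invented for illustration, the cursor is a plain array index,
and every delalloc extent is assumed to lie entirely inside the punched
range, so no trimming is modelled.  toy_del_extent() mimics only the
"remove, then step back" behaviour quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for an in-core extent record (hypothetical). */
struct toy_extent {
	unsigned long	startoff;
	unsigned long	blockcount;
	bool		delalloc;
};

struct toy_fork {
	struct toy_extent	ext[16];
	int			count;
};

/* Copy out the record at the cursor; false if the cursor is out of range. */
bool toy_get_extent(const struct toy_fork *f, int pos, struct toy_extent *got)
{
	if (pos < 0 || pos >= f->count)
		return false;
	*got = f->ext[pos];
	return true;
}

/* Remove the record at *pos, then move the cursor back one slot. */
void toy_del_extent(struct toy_fork *f, int *pos)
{
	memmove(&f->ext[*pos], &f->ext[*pos + 1],
		(f->count - *pos - 1) * sizeof(f->ext[0]));
	f->count--;
	(*pos)--;
}

/* Forward walk shaped like the loop in the patch; returns extents removed. */
int toy_punch_forward(struct toy_fork *f, unsigned long start,
		      unsigned long len)
{
	struct toy_extent got;
	int pos = 0, removed = 0;

	/* Lookup: first extent that contains or follows 'start'. */
	while (pos < f->count &&
	       f->ext[pos].startoff + f->ext[pos].blockcount <= start)
		pos++;
	if (!toy_get_extent(f, pos, &got))
		return 0;

	do {
		if (got.startoff >= start + len)
			break;
		if (!got.delalloc)
			continue;
		toy_del_extent(f, &pos);
		removed++;
		/* Fails when the removed extent was the first in the fork. */
		if (!toy_get_extent(f, pos, &got))
			break;
	} while (toy_get_extent(f, ++pos, &got));

	return removed;
}

/* Backward walk: the step back after removal is the natural iteration. */
int toy_punch_backward(struct toy_fork *f, unsigned long start,
		       unsigned long len)
{
	struct toy_extent got;
	int pos = f->count - 1, removed = 0;

	/* Lookup: last extent that starts before the end of the range. */
	while (pos >= 0 && f->ext[pos].startoff >= start + len)
		pos--;

	while (toy_get_extent(f, pos, &got)) {
		if (got.startoff + got.blockcount <= start)
			break;
		if (!got.delalloc) {
			pos--;
			continue;
		}
		toy_del_extent(f, &pos);	/* cursor now on previous record */
		removed++;
	}
	return removed;
}
```

With two discontiguous delalloc extents at the start of the fork, the
forward variant in this model stops after the first removal, while the
backward variant removes both, because running off the front of the list
is where a backward walk is supposed to end anyway.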

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 22/34] xfs: make xfs_writepage_map extent map centric
  2018-05-23 14:43 ` [PATCH 22/34] xfs: make xfs_writepage_map extent map centric Christoph Hellwig
@ 2018-05-24 14:59   ` Brian Foster
  2018-05-24 16:53     ` Christoph Hellwig
  0 siblings, 1 reply; 78+ messages in thread
From: Brian Foster @ 2018-05-24 14:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm, Dave Chinner

On Wed, May 23, 2018 at 04:43:45PM +0200, Christoph Hellwig wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
...
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 5dd09e83c81c..a50f69c2c602 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
...
> @@ -845,85 +826,81 @@ xfs_writepage_map(
>  {
>  	LIST_HEAD(submit_list);
>  	struct xfs_ioend	*ioend, *next;
> -	struct buffer_head	*bh, *head;
> +	struct buffer_head	*bh;
>  	ssize_t			len = i_blocksize(inode);
> -	uint64_t		offset;
>  	int			error = 0;
>  	int			count = 0;
> -	int			uptodate = 1;
> -	unsigned int		new_type;
> +	bool			uptodate = true;
> +	loff_t			file_offset;	/* file offset of page */
> +	unsigned		poffset;	/* offset into page */
>  
> -	bh = head = page_buffers(page);
> -	offset = page_offset(page);
> -	do {
> -		if (offset >= end_offset)
> +	/*
> +	 * Walk the blocks on the page, and we we run off then end of the
> +	 * current map or find the current map invalid, grab a new one.
> +	 * We only use bufferheads here to check per-block state - they no
> +	 * longer control the iteration through the page. This allows us to
> +	 * replace the bufferhead with some other state tracking mechanism in
> +	 * future.
> +	 */
> +	file_offset = page_offset(page);
> +	bh = page_buffers(page);
> +	for (poffset = 0;
> +	     poffset < PAGE_SIZE;
> +	     poffset += len, file_offset += len, bh = bh->b_this_page) {
> +		/* past the range we are writing, so nothing more to write. */
> +		if (file_offset >= end_offset)
>  			break;
> -		if (!buffer_uptodate(bh))
> -			uptodate = 0;
>  
>  		/*
> -		 * set_page_dirty dirties all buffers in a page, independent
> -		 * of their state.  The dirty state however is entirely
> -		 * meaningless for holes (!mapped && uptodate), so skip
> -		 * buffers covering holes here.
> +		 * Block does not contain valid data, skip it, mark the current
> +		 * map as invalid because we have a discontiguity. This ensures
> +		 * we put subsequent writeable buffers into a new ioend.
>  		 */
> -		if (!buffer_mapped(bh) && buffer_uptodate(bh)) {
> -			wpc->imap_valid = false;
> -			continue;
> -		}
> -
> -		if (buffer_unwritten(bh))
> -			new_type = XFS_IO_UNWRITTEN;
> -		else if (buffer_delay(bh))
> -			new_type = XFS_IO_DELALLOC;
> -		else if (buffer_uptodate(bh))
> -			new_type = XFS_IO_OVERWRITE;
> -		else {
> +		if (!buffer_uptodate(bh)) {
>  			if (PageUptodate(page))
>  				ASSERT(buffer_mapped(bh));
> -			/*
> -			 * This buffer is not uptodate and will not be
> -			 * written to disk.  Ensure that we will put any
> -			 * subsequent writeable buffers into a new
> -			 * ioend.
> -			 */
> +			uptodate = false;
>  			wpc->imap_valid = false;
>  			continue;
>  		}
>  
> -		if (xfs_is_reflink_inode(XFS_I(inode))) {
> -			error = xfs_map_cow(wpc, inode, offset, &new_type);
> -			if (error)
> -				goto out;
> -		}
> -
> -		if (wpc->io_type != new_type) {
> -			wpc->io_type = new_type;
> -			wpc->imap_valid = false;
> -		}
> -
> +		/* Check to see if current map spans this file offset */
>  		if (wpc->imap_valid)
>  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> -							 offset);
> +							 file_offset);
> +		/*
> +		 * If we don't have a valid map, now it's time to get a new one
> +		 * for this offset.  This will convert delayed allocations
> +		 * (including COW ones) into real extents.  If we return without
> +		 * a valid map, it means we landed in a hole and we skip the
> +		 * block.
> +		 */
>  		if (!wpc->imap_valid) {
> -			error = xfs_map_blocks(inode, offset, &wpc->imap,
> -					     wpc->io_type);
> +			error = xfs_map_blocks(inode, file_offset, &wpc->imap,
> +					     &wpc->io_type);
>  			if (error)
>  				goto out;
>  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> -							 offset);
> +							 file_offset);
>  		}
> -		if (wpc->imap_valid) {
> -			lock_buffer(bh);
> -			if (wpc->io_type != XFS_IO_OVERWRITE)
> -				xfs_map_at_offset(inode, bh, &wpc->imap, offset);
> -			xfs_add_to_ioend(inode, bh, offset, wpc, wbc, &submit_list);
> -			count++;
> +
> +		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
> +			/*
> +			 * set_page_dirty dirties all buffers in a page, independent
> +			 * of their state.  The dirty state however is entirely
> +			 * meaningless for holes (!mapped && uptodate), so check we did
> +			 * have a buffer covering a hole here and continue.
> +			 */

The comment above doesn't make much sense given that we don't check for
anything here and just continue the loop.

That aside, the concern I had with this patch when it was last posted is
that it indirectly dropped the error/consistency check between page
state and extent state provided by the XFS_BMAPI_DELALLOC flag. What was
historically an accounting/reservation issue was turned into something
like this by XFS_BMAPI_DELALLOC:

# xfs_io -c "pwrite 0 4k" -c fsync /mnt/file
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0041 sec (974.184 KiB/sec and 243.5460 ops/sec)
fsync: Input/output error

As of this patch, that same error condition now behaves something like
this:

[root@localhost ~]# xfs_io -c "pwrite 0 4k" -c fsync /mnt/file
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0029 sec (1.325 MiB/sec and 339.2130 ops/sec)
[root@localhost ~]# ls -al /mnt/file
-rw-r--r--. 1 root root 4096 May 24 08:27 /mnt/file
[root@localhost ~]# umount  /mnt ; mount /dev/test/scratch /mnt/
[root@localhost ~]# ls -al /mnt/file
-rw-r--r--. 1 root root 0 May 24 08:27 /mnt/file

So our behavior has changed from forced block allocation (violating
reservation) and writing the data, to instead return an error, and now
to silently skip the page. I suppose there are situations (i.e., races
with truncate) where a hole is valid and the correct behavior is to skip
the page, and this is admittedly an error condition that "should never
happen," but can we at least add an assert somewhere in this series that
ensures if uptodate data maps over a hole that the associated block
offset is beyond EOF (or something of that nature)?
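
As a rough illustration of the kind of check meant here, the following
userspace sketch expresses the condition as a predicate.  The names
(toy_io_type, toy_hole_skip_ok) are made up for this example and nothing
below is existing kernel API; in the kernel this would presumably be an
ASSERT() in the hole branch of xfs_writepage_map, and the exact condition
is open to discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the XFS_IO_* types used by writeback. */
enum toy_io_type {
	TOY_IO_OVERWRITE,
	TOY_IO_HOLE,
};

/*
 * Return true if skipping this block is harmless.  Skipping an uptodate
 * block that maps to a hole is only expected when the block starts at or
 * beyond EOF (for example after racing with truncate).  Anything else
 * means page state and extent state disagree.
 */
bool toy_hole_skip_ok(enum toy_io_type type, bool uptodate,
		      uint64_t file_offset, uint64_t isize)
{
	if (type != TOY_IO_HOLE || !uptodate)
		return true;
	return file_offset >= isize;
}
```

For the reproducer above (a 4k write at offset 0 with an inode size of
4096), the predicate is false for the block at offset 0, so a debug build
asserting on it would trip instead of silently dropping the data.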

Brian

> +			continue;
>  		}
>  
> -	} while (offset += len, ((bh = bh->b_this_page) != head));
> +		lock_buffer(bh);
> +		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
> +		xfs_add_to_ioend(inode, bh, file_offset, wpc, wbc, &submit_list);
> +		count++;
> +	}
>  
> -	if (uptodate && bh == head)
> +	if (uptodate && poffset == PAGE_SIZE)
>  		SetPageUptodate(page);
>  
>  	ASSERT(wpc->ioend || list_empty(&submit_list));
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index 69346d460dfa..b2ef5b661761 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -29,6 +29,7 @@ enum {
>  	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
>  	XFS_IO_OVERWRITE,	/* covers already allocated extent */
>  	XFS_IO_COW,		/* covers copy-on-write extent */
> +	XFS_IO_HOLE,		/* covers region without any block allocation */
>  };
>  
>  #define XFS_IO_TYPES \
> @@ -36,7 +37,8 @@ enum {
>  	{ XFS_IO_DELALLOC,		"delalloc" }, \
>  	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
>  	{ XFS_IO_OVERWRITE,		"overwrite" }, \
> -	{ XFS_IO_COW,			"CoW" }
> +	{ XFS_IO_COW,			"CoW" }, \
> +	{ XFS_IO_HOLE,			"hole" }
>  
>  /*
>   * Structure for buffered I/O completions.
> -- 
> 2.17.0
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 25/34] xfs: remove xfs_reflink_trim_irec_to_next_cow
  2018-05-23 14:43 ` [PATCH 25/34] xfs: remove xfs_reflink_trim_irec_to_next_cow Christoph Hellwig
@ 2018-05-24 14:59   ` Brian Foster
  2018-05-24 15:06     ` Brian Foster
  0 siblings, 1 reply; 78+ messages in thread
From: Brian Foster @ 2018-05-24 14:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:48PM +0200, Christoph Hellwig wrote:
> In the only caller we just did a lookup in the COW extent tree for
> the same offset.  Reuse that result and save a lookup, as well as
> shortening the ilock hold time.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_aops.c    | 25 +++++++++++++++++--------
>  fs/xfs/xfs_reflink.c | 33 ---------------------------------
>  fs/xfs/xfs_reflink.h |  2 --
>  3 files changed, 17 insertions(+), 43 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index a4b4a7037deb..354d26d66c12 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -383,11 +383,12 @@ xfs_map_blocks(
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
>  	ssize_t			count = i_blocksize(inode);
> -	xfs_fileoff_t		offset_fsb, end_fsb;
> +	xfs_fileoff_t		offset_fsb, end_fsb, cow_fsb = 0;

cow_fsb should probably be initialized to NULLFSBLOCK rather than 0.
With that, you also shouldn't need cow_valid. Otherwise looks Ok to me.
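
A small userspace sketch of why the sentinel makes the flag redundant,
assuming the sentinel is an all-ones value as NULLFSBLOCK is.  TOY_NULLFSB
and both toy_trim_* helpers are invented for illustration and only model
the trimming comparison, not the rest of xfs_map_blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for an all-ones "no block" sentinel (assumption). */
#define TOY_NULLFSB	UINT64_MAX

/* Flag-based variant: cow_fsb is only meaningful when cow_valid is set. */
uint64_t toy_trim_with_flag(uint64_t startoff, uint64_t blockcount,
			    bool cow_valid, uint64_t cow_fsb)
{
	if (cow_valid && cow_fsb < startoff + blockcount)
		blockcount = cow_fsb - startoff;
	return blockcount;
}

/*
 * Sentinel variant: the largest possible value can never be below the
 * end of a real extent, so the comparison alone rejects "no COW extent".
 */
uint64_t toy_trim_with_sentinel(uint64_t startoff, uint64_t blockcount,
				uint64_t cow_fsb)
{
	if (cow_fsb < startoff + blockcount)
		blockcount = cow_fsb - startoff;
	return blockcount;
}
```

Both variants agree whenever a COW extent was found.  When none was found,
the sentinel version leaves the mapping untrimmed without a separate
boolean, whereas an initial value of 0 would need the flag to avoid being
mistaken for a COW extent at offset 0.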

Brian

>  	int			whichfork = XFS_DATA_FORK;
>  	struct xfs_iext_cursor	icur;
>  	int			error = 0;
>  	int			nimaps = 1;
> +	bool			cow_valid = false;
>  
>  	if (XFS_FORCED_SHUTDOWN(mp))
>  		return -EIO;
> @@ -407,8 +408,11 @@ xfs_map_blocks(
>  	 * it directly instead of looking up anything in the data fork.
>  	 */
>  	if (xfs_is_reflink_inode(ip) &&
> -	    xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, imap) &&
> -	    imap->br_startoff <= offset_fsb) {
> +	    xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, imap)) {
> +		cow_fsb = imap->br_startoff;
> +		cow_valid = true;
> +	}
> +	if (cow_valid && cow_fsb <= offset_fsb) {
>  		xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  		/*
>  		 * Truncate can race with writeback since writeback doesn't
> @@ -430,6 +434,10 @@ xfs_map_blocks(
>  
>  	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
>  				imap, &nimaps, XFS_BMAPI_ENTIRE);
> +	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +	if (error)
> +		return error;
> +
>  	if (!nimaps) {
>  		/*
>  		 * Lookup returns no match? Beyond eof? regardless,
> @@ -451,16 +459,17 @@ xfs_map_blocks(
>  		 * is a pending CoW reservation before the end of this extent,
>  		 * so that we pick up the COW extents in the next iteration.
>  		 */
> -		xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
> +		if (cow_valid &&
> +		    cow_fsb < imap->br_startoff + imap->br_blockcount) {
> +			imap->br_blockcount = cow_fsb - imap->br_startoff;
> +			trace_xfs_reflink_trim_irec(ip, imap);
> +		}
> +
>  		if (imap->br_state == XFS_EXT_UNWRITTEN)
>  			*type = XFS_IO_UNWRITTEN;
>  		else
>  			*type = XFS_IO_OVERWRITE;
>  	}
> -	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> -	if (error)
> -		return error;
> -
>  done:
>  	switch (*type) {
>  	case XFS_IO_HOLE:
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 8e5eb8e70c89..ff76bc56ff3d 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -484,39 +484,6 @@ xfs_reflink_allocate_cow(
>  	return error;
>  }
>  
> -/*
> - * Trim an extent to end at the next CoW reservation past offset_fsb.
> - */
> -void
> -xfs_reflink_trim_irec_to_next_cow(
> -	struct xfs_inode		*ip,
> -	xfs_fileoff_t			offset_fsb,
> -	struct xfs_bmbt_irec		*imap)
> -{
> -	struct xfs_ifork		*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> -	struct xfs_bmbt_irec		got;
> -	struct xfs_iext_cursor		icur;
> -
> -	if (!xfs_is_reflink_inode(ip))
> -		return;
> -
> -	/* Find the extent in the CoW fork. */
> -	if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, &icur, &got))
> -		return;
> -
> -	/* This is the extent before; try sliding up one. */
> -	if (got.br_startoff < offset_fsb) {
> -		if (!xfs_iext_next_extent(ifp, &icur, &got))
> -			return;
> -	}
> -
> -	if (got.br_startoff >= imap->br_startoff + imap->br_blockcount)
> -		return;
> -
> -	imap->br_blockcount = got.br_startoff - imap->br_startoff;
> -	trace_xfs_reflink_trim_irec(ip, imap);
> -}
> -
>  /*
>   * Cancel CoW reservations for some block range of an inode.
>   *
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index 15a456492667..e8d4d50c629f 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -32,8 +32,6 @@ extern int xfs_reflink_allocate_cow(struct xfs_inode *ip,
>  		struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode);
>  extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
>  		xfs_off_t count);
> -extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> -		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
>  
>  extern int xfs_reflink_cancel_cow_blocks(struct xfs_inode *ip,
>  		struct xfs_trans **tpp, xfs_fileoff_t offset_fsb,
> -- 
> 2.17.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 25/34] xfs: remove xfs_reflink_trim_irec_to_next_cow
  2018-05-24 14:59   ` Brian Foster
@ 2018-05-24 15:06     ` Brian Foster
  2018-05-24 17:10       ` Christoph Hellwig
  0 siblings, 1 reply; 78+ messages in thread
From: Brian Foster @ 2018-05-24 15:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Thu, May 24, 2018 at 10:59:43AM -0400, Brian Foster wrote:
> On Wed, May 23, 2018 at 04:43:48PM +0200, Christoph Hellwig wrote:
> > In the only caller we just did a lookup in the COW extent tree for
> > the same offset.  Reuse that result and save a lookup, as well as
> > shortening the ilock hold time.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/xfs/xfs_aops.c    | 25 +++++++++++++++++--------
> >  fs/xfs/xfs_reflink.c | 33 ---------------------------------
> >  fs/xfs/xfs_reflink.h |  2 --
> >  3 files changed, 17 insertions(+), 43 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index a4b4a7037deb..354d26d66c12 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -383,11 +383,12 @@ xfs_map_blocks(
> >  	struct xfs_inode	*ip = XFS_I(inode);
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	ssize_t			count = i_blocksize(inode);
> > -	xfs_fileoff_t		offset_fsb, end_fsb;
> > +	xfs_fileoff_t		offset_fsb, end_fsb, cow_fsb = 0;
> 
> cow_fsb should probably be initialized to NULLFSBLOCK rather than 0.
> With that, you also shouldn't need cow_valid. Otherwise looks Ok to me.
> 

Err.. I guess NULLFILEOFF would be more appropriate here, but same
idea..

> Brian
> 
> >  	int			whichfork = XFS_DATA_FORK;
> >  	struct xfs_iext_cursor	icur;
> >  	int			error = 0;
> >  	int			nimaps = 1;
> > +	bool			cow_valid = false;
> >  
> >  	if (XFS_FORCED_SHUTDOWN(mp))
> >  		return -EIO;
> > @@ -407,8 +408,11 @@ xfs_map_blocks(
> >  	 * it directly instead of looking up anything in the data fork.
> >  	 */
> >  	if (xfs_is_reflink_inode(ip) &&
> > -	    xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, imap) &&
> > -	    imap->br_startoff <= offset_fsb) {
> > +	    xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, imap)) {
> > +		cow_fsb = imap->br_startoff;
> > +		cow_valid = true;
> > +	}
> > +	if (cow_valid && cow_fsb <= offset_fsb) {
> >  		xfs_iunlock(ip, XFS_ILOCK_SHARED);
> >  		/*
> >  		 * Truncate can race with writeback since writeback doesn't
> > @@ -430,6 +434,10 @@ xfs_map_blocks(
> >  
> >  	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
> >  				imap, &nimaps, XFS_BMAPI_ENTIRE);
> > +	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> > +	if (error)
> > +		return error;
> > +
> >  	if (!nimaps) {
> >  		/*
> >  		 * Lookup returns no match? Beyond eof? regardless,
> > @@ -451,16 +459,17 @@ xfs_map_blocks(
> >  		 * is a pending CoW reservation before the end of this extent,
> >  		 * so that we pick up the COW extents in the next iteration.
> >  		 */
> > -		xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
> > +		if (cow_valid &&
> > +		    cow_fsb < imap->br_startoff + imap->br_blockcount) {
> > +			imap->br_blockcount = cow_fsb - imap->br_startoff;
> > +			trace_xfs_reflink_trim_irec(ip, imap);
> > +		}
> > +
> >  		if (imap->br_state == XFS_EXT_UNWRITTEN)
> >  			*type = XFS_IO_UNWRITTEN;
> >  		else
> >  			*type = XFS_IO_OVERWRITE;
> >  	}
> > -	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> > -	if (error)
> > -		return error;
> > -
> >  done:
> >  	switch (*type) {
> >  	case XFS_IO_HOLE:
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 8e5eb8e70c89..ff76bc56ff3d 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -484,39 +484,6 @@ xfs_reflink_allocate_cow(
> >  	return error;
> >  }
> >  
> > -/*
> > - * Trim an extent to end at the next CoW reservation past offset_fsb.
> > - */
> > -void
> > -xfs_reflink_trim_irec_to_next_cow(
> > -	struct xfs_inode		*ip,
> > -	xfs_fileoff_t			offset_fsb,
> > -	struct xfs_bmbt_irec		*imap)
> > -{
> > -	struct xfs_ifork		*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > -	struct xfs_bmbt_irec		got;
> > -	struct xfs_iext_cursor		icur;
> > -
> > -	if (!xfs_is_reflink_inode(ip))
> > -		return;
> > -
> > -	/* Find the extent in the CoW fork. */
> > -	if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, &icur, &got))
> > -		return;
> > -
> > -	/* This is the extent before; try sliding up one. */
> > -	if (got.br_startoff < offset_fsb) {
> > -		if (!xfs_iext_next_extent(ifp, &icur, &got))
> > -			return;
> > -	}
> > -
> > -	if (got.br_startoff >= imap->br_startoff + imap->br_blockcount)
> > -		return;
> > -
> > -	imap->br_blockcount = got.br_startoff - imap->br_startoff;
> > -	trace_xfs_reflink_trim_irec(ip, imap);
> > -}
> > -
> >  /*
> >   * Cancel CoW reservations for some block range of an inode.
> >   *
> > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > index 15a456492667..e8d4d50c629f 100644
> > --- a/fs/xfs/xfs_reflink.h
> > +++ b/fs/xfs/xfs_reflink.h
> > @@ -32,8 +32,6 @@ extern int xfs_reflink_allocate_cow(struct xfs_inode *ip,
> >  		struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode);
> >  extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
> >  		xfs_off_t count);
> > -extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> > -		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
> >  
> >  extern int xfs_reflink_cancel_cow_blocks(struct xfs_inode *ip,
> >  		struct xfs_trans **tpp, xfs_fileoff_t offset_fsb,
> > -- 
> > 2.17.0
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 22/34] xfs: make xfs_writepage_map extent map centric
  2018-05-24 14:59   ` Brian Foster
@ 2018-05-24 16:53     ` Christoph Hellwig
  2018-05-24 18:13       ` Brian Foster
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-24 16:53 UTC (permalink / raw)
  To: Brian Foster
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, linux-mm, Dave Chinner

> > +		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
> > +			/*
> > +			 * set_page_dirty dirties all buffers in a page, independent
> > +			 * of their state.  The dirty state however is entirely
> > +			 * meaningless for holes (!mapped && uptodate), so check we did
> > +			 * have a buffer covering a hole here and continue.
> > +			 */
> 
> The comment above doesn't make much sense given that we don't check for
> anything here and just continue the loop.

It gets removed in the last patch of the original series when we
kill buffer heads.  But I can fold the removal into this patch as well.

> That aside, the concern I had with this patch when it was last posted is
> that it indirectly dropped the error/consistency check between page
> state and extent state provided by the XFS_BMAPI_DELALLOC flag. What was
> historically an accounting/reservation issue was turned into something
> like this by XFS_BMAPI_DELALLOC:
> 
> # xfs_io -c "pwrite 0 4k" -c fsync /mnt/file
> wrote 4096/4096 bytes at offset 0
> 4 KiB, 1 ops; 0.0041 sec (974.184 KiB/sec and 243.5460 ops/sec)
> fsync: Input/output error

What is the issue that gets you an I/O error on a 4k write?  What
is missing in the above reproducer?

> As of this patch, that same error condition now behaves something like
> this:
> 
> [root@localhost ~]# xfs_io -c "pwrite 0 4k" -c fsync /mnt/file
> wrote 4096/4096 bytes at offset 0
> 4 KiB, 1 ops; 0.0029 sec (1.325 MiB/sec and 339.2130 ops/sec)
> [root@localhost ~]# ls -al /mnt/file
> -rw-r--r--. 1 root root 4096 May 24 08:27 /mnt/file
> [root@localhost ~]# umount  /mnt ; mount /dev/test/scratch /mnt/
> [root@localhost ~]# ls -al /mnt/file
> -rw-r--r--. 1 root root 0 May 24 08:27 /mnt/file
> 
> So our behavior has changed from forced block allocation (violating
> reservation) and writing the data, to instead return an error, and now
> to silently skip the page.

We should never, ever allocate space that we didn't have a delalloc
reservation for in writepage/writepages.  But I agree that we should
record an error.  I have to admit I'm lost on where we did record
the error and why we don't do that now.  I'd be happy to fix it.

> I suppose there are situations (i.e., races
> with truncate) where a hole is valid and the correct behavior is to skip
> the page, and this is admittedly an error condition that "should never
> happen," but can we at least add an assert somewhere in this series that
> ensures if uptodate data maps over a hole that the associated block
> offset is beyond EOF (or something of that nature)?

We can have plenty of holes in dirty pages.  However we should never
allocate blocks for them.  Fortunately we stop even looking at anything
but the extent tree for block status by the end of this series for 4k
file systems, and with the next series even for small block sizes, so
that whole mismatch is a thing of the past now.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 25/34] xfs: remove xfs_reflink_trim_irec_to_next_cow
  2018-05-24 15:06     ` Brian Foster
@ 2018-05-24 17:10       ` Christoph Hellwig
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-24 17:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, linux-mm

On Thu, May 24, 2018 at 11:06:59AM -0400, Brian Foster wrote:
> On Thu, May 24, 2018 at 10:59:43AM -0400, Brian Foster wrote:
> > On Wed, May 23, 2018 at 04:43:48PM +0200, Christoph Hellwig wrote:
> > > In the only caller we just did a lookup in the COW extent tree for
> > > the same offset.  Reuse that result and save a lookup, as well as
> > > shortening the ilock hold time.
> > > 
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > >  fs/xfs/xfs_aops.c    | 25 +++++++++++++++++--------
> > >  fs/xfs/xfs_reflink.c | 33 ---------------------------------
> > >  fs/xfs/xfs_reflink.h |  2 --
> > >  3 files changed, 17 insertions(+), 43 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index a4b4a7037deb..354d26d66c12 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -383,11 +383,12 @@ xfs_map_blocks(
> > >  	struct xfs_inode	*ip = XFS_I(inode);
> > >  	struct xfs_mount	*mp = ip->i_mount;
> > >  	ssize_t			count = i_blocksize(inode);
> > > -	xfs_fileoff_t		offset_fsb, end_fsb;
> > > +	xfs_fileoff_t		offset_fsb, end_fsb, cow_fsb = 0;
> > 
> > cow_fsb should probably be initialized to NULLFSBLOCK rather than 0.
> > With that, you also shouldn't need cow_valid. Otherwise looks Ok to me.
> > 
> 
> Err.. I guess NULLFILEOFF would be more appropriate here, but same
> idea..

Yes, I'll start using it.
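
For illustration, the sentinel approach suggested above can be sketched
in user space.  This is a minimal model under stated assumptions, not the
code from the series: the typedef, the NULLFILEOFF definition, struct irec
and both helper names are stand-ins written for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the kernel definitions; assumptions for illustration. */
typedef uint64_t xfs_fileoff_t;
#define NULLFILEOFF	((xfs_fileoff_t)-1)

/* Simplified model of the two struct xfs_bmbt_irec fields used here. */
struct irec {
	xfs_fileoff_t	br_startoff;
	uint64_t	br_blockcount;
};

/*
 * cow_fsb starts out as NULLFILEOFF and is only overwritten when the COW
 * fork lookup finds an extent, so no separate cow_valid flag is needed.
 */
static bool cow_covers(xfs_fileoff_t cow_fsb, xfs_fileoff_t offset_fsb)
{
	return cow_fsb != NULLFILEOFF && cow_fsb <= offset_fsb;
}

/*
 * Trim the data fork mapping so that it ends where the next COW extent
 * begins.  The caller is assumed to have handled cow_fsb <= offset_fsb
 * already, so cow_fsb lies at or after imap->br_startoff here.
 */
static void trim_to_next_cow(struct irec *imap, xfs_fileoff_t cow_fsb)
{
	if (cow_fsb != NULLFILEOFF &&
	    cow_fsb < imap->br_startoff + imap->br_blockcount)
		imap->br_blockcount = cow_fsb - imap->br_startoff;
}
```

With no COW extent found, cow_fsb keeps the sentinel value and both
helpers do nothing, which is the behaviour cow_valid == false gave in the
posted patch.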

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 22/34] xfs: make xfs_writepage_map extent map centric
  2018-05-24 16:53     ` Christoph Hellwig
@ 2018-05-24 18:13       ` Brian Foster
  2018-05-25  6:19         ` Christoph Hellwig
  0 siblings, 1 reply; 78+ messages in thread
From: Brian Foster @ 2018-05-24 18:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm, Dave Chinner

On Thu, May 24, 2018 at 06:53:50PM +0200, Christoph Hellwig wrote:
> > > +		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
> > > +			/*
> > > +			 * set_page_dirty dirties all buffers in a page, independent
> > > +			 * of their state.  The dirty state however is entirely
> > > +			 * meaningless for holes (!mapped && uptodate), so check we did
> > > +			 * have a buffer covering a hole here and continue.
> > > +			 */
> > 
> > The comment above doesn't make much sense given that we don't check for
> > anything here and just continue the loop.
> 
> It gets removed in the last patch of the original series when we
> kill buffer heads.  But I can fold the removal into this patch as well.
> 

Ah, I was thinking this patch added that comment when it actually mostly
moves it (it does tweak it a bit). Eh, no big deal either way.

> > That aside, the concern I had with this patch when it was last posted is
> > that it indirectly dropped the error/consistency check between page
> > state and extent state provided by the XFS_BMAPI_DELALLOC flag. What was
> > historically an accounting/reservation issue was turned into something
> > like this by XFS_BMAPI_DELALLOC:
> > 
> > # xfs_io -c "pwrite 0 4k" -c fsync /mnt/file
> > wrote 4096/4096 bytes at offset 0
> > 4 KiB, 1 ops; 0.0041 sec (974.184 KiB/sec and 243.5460 ops/sec)
> > fsync: Input/output error
> 
> What is the issue that gets you an I/O error on a 4k write?  What
> is missing in the above reproducer?
> 

Sorry... I should have mentioned this is a simulated error and not
something that actually occurs right now. You can manufacture it easily
enough by using the drop_writes error tag and commenting out the pagecache
truncate code in xfs_file_iomap_end_delalloc().

> > As of this patch, that same error condition now behaves something like
> > this:
> > 
> > [root@localhost ~]# xfs_io -c "pwrite 0 4k" -c fsync /mnt/file
> > wrote 4096/4096 bytes at offset 0
> > 4 KiB, 1 ops; 0.0029 sec (1.325 MiB/sec and 339.2130 ops/sec)
> > [root@localhost ~]# ls -al /mnt/file
> > -rw-r--r--. 1 root root 4096 May 24 08:27 /mnt/file
> > [root@localhost ~]# umount  /mnt ; mount /dev/test/scratch /mnt/
> > [root@localhost ~]# ls -al /mnt/file
> > -rw-r--r--. 1 root root 0 May 24 08:27 /mnt/file
> > 
> > So our behavior has changed from forced block allocation (violating
> > reservation) and writing the data, to instead return an error, and now
> > to silently skip the page.
> 
> We should never, ever allocate space that we didn't have a delalloc
> reservation for in writepage/writepages.  But I agree that we should
> record an error.  I have to admit I'm lost on where we did record
> the error and why we don't do that now.  I'd be happy to fix it.
> 

Right, the error behavior came from the XFS_BMAPI_DELALLOC flag that was
passed from xfs_iomap_write_allocate(). It caused xfs_bmapi_write() to
detect that we were in a hole and return an error in the !COW_FORK case
since we were expecting to do delalloc conversion from writeback.

Note that I'm not saying there's a vector to reproduce this problem in
the current code that I'm aware of. I'm just saying it's happened in the
past due to bugs and I'd like to preserve some kind of basic sanity
check (as an error or assert) if we have enough state available to do
so.

> > I suppose there are situations (i.e., races
> > with truncate) where a hole is valid and the correct behavior is to skip
> > the page, and this is admittedly an error condition that "should never
> > happen," but can we at least add an assert somewhere in this series that
> > ensures if uptodate data maps over a hole that the associated block
> > offset is beyond EOF (or something of that nature)?
> 
> We can have plenty of holes in dirty pages.  However we should never
> allocate blocks for them.  Fortunately we stop even looking at anything
> but the extent tree for block status by the end of this series for 4k
> file systems, and with the next series even for small block sizes, so
> that whole mismatch is a thing of the past now.

Ok, so I guess writeback can see uptodate blocks over a hole if some
other block in that page is dirty. Perhaps we could make sure that a
dirty page has at least one block that maps to an actual extent or
otherwise the page has been truncated..?

I guess having another dirty block bitmap similar to
iomap_page->uptodate could be required to tell for sure whether a
particular block should definitely have a block on-disk or not. It may
not be worth doing that just for additional error checks, but I still
have to look into the last few patches to grok all the iomap_page stuff.

Brian
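
As a rough illustration of the per-block dirty bitmap idea floated above,
here is a user-space sketch.  Everything in it is hypothetical: the struct,
the dirty field and the helper names are made up for this example, and only
the uptodate bitmap corresponds to something mentioned in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-page state.  One bit per block in the page, so a
 * 32-bit word covers up to 32 blocks per page in this model.
 */
struct page_block_state {
	uint32_t uptodate;	/* bit n set: block n holds valid data */
	uint32_t dirty;		/* hypothetical: block n must be written */
};

/* Dirtying a block implies its contents are valid as well. */
static void mark_block_dirty(struct page_block_state *s, unsigned int block)
{
	s->uptodate |= 1u << block;
	s->dirty |= 1u << block;
}

/*
 * With per-block dirty state, writeback could tell "uptodate but clean
 * block over a hole" (fine, skip it) apart from "dirty block over a
 * hole" (data that has no extent to land in).
 */
static bool dirty_block_over_hole(const struct page_block_state *s,
				  unsigned int block, bool mapped)
{
	return !mapped && (s->dirty & (1u << block)) != 0;
}
```

The cost is a second bitmap that has to be kept in sync with the page
dirty flag on every write, truncate and writeback completion.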

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 22/34] xfs: make xfs_writepage_map extent map centric
  2018-05-24 18:13       ` Brian Foster
@ 2018-05-25  6:19         ` Christoph Hellwig
  2018-05-25 11:35           ` Brian Foster
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-25  6:19 UTC (permalink / raw)
  To: Brian Foster
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, linux-mm, Dave Chinner

On Thu, May 24, 2018 at 02:13:56PM -0400, Brian Foster wrote:
> Ok, so I guess writeback can see uptodate blocks over a hole if some
> other block in that page is dirty.

Yes.

> Perhaps we could make sure that a
> dirty page has at least one block that maps to an actual extent or
> otherwise the page has been truncated..?

We have the following comment near the end of xfs_writepage_map:

	/*
	 * We can end up here with no error and nothing to write if we
	 * race with a partial page truncate on a sub-page block sized
	 * filesystem. In that case we need to mark the page clean.
	 */

And I'm pretty sure I managed to hit that case easily in xfstests,
as my initial handling of it was wrong.  So I don't think we can
even check for that.

> I guess having another dirty block bitmap similar to
> iomap_page->uptodate could be required to tell for sure whether a
> particular block should definitely have a block on-disk or not. It may
> not be worth doing that just for additional error checks, but I still
> have to look into the last few patches to grok all the iomap_page stuff.

I don't think it's worth it.  The sub-page dirty tracking has been one
of the issues with the buffer head code that caused a lot of problems,
and that we want to get rid of.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 22/34] xfs: make xfs_writepage_map extent map centric
  2018-05-25  6:19         ` Christoph Hellwig
@ 2018-05-25 11:35           ` Brian Foster
  2018-05-28  7:15             ` Christoph Hellwig
  0 siblings, 1 reply; 78+ messages in thread
From: Brian Foster @ 2018-05-25 11:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm, Dave Chinner

On Fri, May 25, 2018 at 08:19:00AM +0200, Christoph Hellwig wrote:
> On Thu, May 24, 2018 at 02:13:56PM -0400, Brian Foster wrote:
> > Ok, so I guess writeback can see uptodate blocks over a hole if some
> > other block in that page is dirty.
> 
> Yes.
> 
> > Perhaps we could make sure that a
> > dirty page has at least one block that maps to an actual extent or
> > otherwise the page has been truncated..?
> 
> We have the following comment near the end of xfs_writepage_map:
> 

That comment is what I'm basing this on...

> 	/*
> 	 * We can end up here with no error and nothing to write if we
> 	 * race with a partial page truncate on a sub-page block sized
> 	 * filesystem. In that case we need to mark the page clean.
> 	 */
> 

So we can correctly end up with nothing to write on a dirty page, but it
presumes a race with truncate. So suppose we end up with a dirty page,
at least one uptodate block, count is zero (i.e., due to holes) and
i_size is beyond the page. Would that not be completely bogus? If bogus,
I think that would at least detect the dumb example I posted earlier.

Brian
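
The condition proposed above can be written down as a small predicate.
This is a sketch only, in user space, with a made-up struct and field
names.  It shows what would be checked; whether the check holds in
practice is discussed in the replies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of what writeback knows about one page. */
struct writeback_page_state {
	bool		dirty;			/* page was dirty on entry */
	unsigned int	uptodate_blocks;	/* blocks with valid data */
	unsigned int	count;			/* blocks queued for I/O */
	uint64_t	page_end;		/* file offset just past the page */
	uint64_t	i_size;			/* current inode size */
};

/*
 * Nothing to write on a dirty page is expected after a truncate race,
 * but then i_size falls inside or before the page.  If i_size is at or
 * beyond the end of the page, the proposal treats the state as bogus.
 */
static bool page_state_looks_bogus(const struct writeback_page_state *s)
{
	return s->dirty && s->uptodate_blocks > 0 && s->count == 0 &&
	       s->i_size >= s->page_end;
}
```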

> And I'm pretty sure I managed to hit that case easily in xfstests,
> as my initial handling of it was wrong.  So I don't think we can
> even check for that.
> 
> > I guess having another dirty block bitmap similar to
> > iomap_page->uptodate could be required to tell for sure whether a
> > particular block should definitely have a block on-disk or not. It may
> > not be worth doing that just for additional error checks, but I still
> > have to look into the last few patches to grok all the iomap_page stuff.
> 
> I don't think it's worth it.  The sub-page dirty tracking has been one
> of the issues with the buffer head code that caused a lot of problems,
> and that we want to get rid of.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 22/34] xfs: make xfs_writepage_map extent map centric
  2018-05-25 11:35           ` Brian Foster
@ 2018-05-28  7:15             ` Christoph Hellwig
  2018-05-29 11:26               ` Brian Foster
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-28  7:15 UTC (permalink / raw)
  To: Brian Foster
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, linux-mm, Dave Chinner

On Fri, May 25, 2018 at 07:35:33AM -0400, Brian Foster wrote:
> That comment is what I'm basing this on...
> 
> > 	/*
> > 	 * We can end up here with no error and nothing to write if we
> > 	 * race with a partial page truncate on a sub-page block sized
> > 	 * filesystem. In that case we need to mark the page clean.
> > 	 */
> > 
> 
> So we can correctly end up with nothing to write on a dirty page, but it
> presumes a race with truncate. So suppose we end up with a dirty page,
> at least one uptodate block, count is zero (i.e., due to holes) and
> i_size is beyond the page. Would that not be completely bogus? If bogus,
> I think that would at least detect the dumb example I posted earlier.

The trivial file_offset >= i_size_read assert explodes pretty soon
in generic/091, and already does so with the existing mainline code.

I'd rather not open another can of worms right now..

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 22/34] xfs: make xfs_writepage_map extent map centric
  2018-05-28  7:15             ` Christoph Hellwig
@ 2018-05-29 11:26               ` Brian Foster
  2018-05-29 13:08                 ` Christoph Hellwig
  0 siblings, 1 reply; 78+ messages in thread
From: Brian Foster @ 2018-05-29 11:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm, Dave Chinner

On Mon, May 28, 2018 at 09:15:43AM +0200, Christoph Hellwig wrote:
> On Fri, May 25, 2018 at 07:35:33AM -0400, Brian Foster wrote:
> > That comment is what I'm basing this on...
> > 
> > > 	/*
> > > 	 * We can end up here with no error and nothing to write if we
> > > 	 * race with a partial page truncate on a sub-page block sized
> > > 	 * filesystem. In that case we need to mark the page clean.
> > > 	 */
> > > 
> > 
> > So we can correctly end up with nothing to write on a dirty page, but it
> > presumes a race with truncate. So suppose we end up with a dirty page,
> > at least one uptodate block, count is zero (i.e., due to holes) and
> > i_size is beyond the page. Would that not be completely bogus? If bogus,
> > I think that would at least detect the dumb example I posted earlier.
> 
> The trivial file_offset >= i_size_read assert explodes pretty soon
> in generic/091, and already does so with the existing mainline code.
> 

What exactly is the trivial check? Can you show the code please?

Brian

> I'd rather not open another can of worms right now..

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 22/34] xfs: make xfs_writepage_map extent map centric
  2018-05-29 11:26               ` Brian Foster
@ 2018-05-29 13:08                 ` Christoph Hellwig
  2018-05-29 17:04                   ` Brian Foster
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-29 13:08 UTC (permalink / raw)
  To: Brian Foster
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, linux-mm, Dave Chinner

On Tue, May 29, 2018 at 07:26:31AM -0400, Brian Foster wrote:
> What exactly is the trivial check? Can you show the code please?

ASSERT(file_offset > i_size_read(inode)) in the !count block
at the end of xfs_writepage_map.

(file_offset replaced with page_offset(page) + offset for the mainline
code).

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 22/34] xfs: make xfs_writepage_map extent map centric
  2018-05-29 13:08                 ` Christoph Hellwig
@ 2018-05-29 17:04                   ` Brian Foster
  0 siblings, 0 replies; 78+ messages in thread
From: Brian Foster @ 2018-05-29 17:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm, Dave Chinner

On Tue, May 29, 2018 at 03:08:46PM +0200, Christoph Hellwig wrote:
> On Tue, May 29, 2018 at 07:26:31AM -0400, Brian Foster wrote:
> > What exactly is the trivial check? Can you show the code please?
> 
> ASSERT(file_offset > i_size_read(inode)) in the !count block
> at the end of xfs_writepage_map.
> 
> (file_offset replaced with page_offset(page) + offset for the mainline
> code).

Ok, so we (mainline) somehow or another end up in writeback with a page
(inside EOF) with a combination of (!mapped && !uptodate) and (!mapped
&& uptodate) (unwritten?) buffers, none of them actually being dirty.
I'm not quite sure how that happens, but I think it does rule out the
count == 0 && at least one uptodate segment logic I proposed earlier.

Fair enough. Given that, I'm not sure there's a good way to trigger such
error detection without actual dirty state, and it's certainly not worth
complicating the design just for that. Thanks for trying, at least.

Hmm, that does have me wondering a bit if/how we'd end up writing back
zeroed blocks over unwritten extents with no other dirty user data in
the page (since the initial xfs_writepage_map() rework patch factors out
the uptodate && !mapped skipping logic).

Brian


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/34] block: add a lower-level bio_add_page interface
  2018-05-23 14:43 ` [PATCH 01/34] block: add a lower-level bio_add_page interface Christoph Hellwig
@ 2018-05-30  5:28   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:24PM +0200, Christoph Hellwig wrote:
> For the upcoming removal of buffer heads in XFS we need to keep track of
> the number of outstanding writeback requests per page.  For this we need
> to know if bio_add_page merged a region with the previous bvec or not.
> Instead of adding additional arguments this refactors bio_add_page to
> be implemented using three lower level helpers which users like XFS can
> use directly if they care about the merge decisions.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Jens Axboe <axboe@kernel.dk>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
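
To illustrate why a caller might want the merge decision, here is a small
user-space model of the three-step split described in the commit message.
It is a sketch under assumptions: the model_* types and functions only
mimic the shape of bio_full(), __bio_try_merge_page() and
__bio_add_page(), and the per-page segment counter is a hypothetical
stand-in for whatever the caller tracks.

```c
#include <stdbool.h>

#define MODEL_MAX_VECS	4	/* arbitrary capacity for the model */

struct model_vec {
	const void	*page;
	unsigned int	off;
	unsigned int	len;
};

struct model_bio {
	struct model_vec vecs[MODEL_MAX_VECS];
	unsigned int	 vcnt;	/* number of vecs in use */
	unsigned int	 size;	/* total bytes added */
};

static bool model_bio_full(const struct model_bio *bio)
{
	return bio->vcnt >= MODEL_MAX_VECS;
}

/* Extend the last vec if the new data directly follows it in the same page. */
static bool model_try_merge(struct model_bio *bio, const void *page,
			    unsigned int len, unsigned int off)
{
	if (bio->vcnt > 0) {
		struct model_vec *bv = &bio->vecs[bio->vcnt - 1];

		if (page == bv->page && off == bv->off + bv->len) {
			bv->len += len;
			bio->size += len;
			return true;
		}
	}
	return false;
}

/* Add a new vec; the caller must have checked that there is room. */
static void model_add(struct model_bio *bio, const void *page,
		      unsigned int len, unsigned int off)
{
	struct model_vec *bv = &bio->vecs[bio->vcnt];

	bv->page = page;
	bv->off = off;
	bv->len = len;
	bio->size += len;
	bio->vcnt++;
}

/*
 * A caller that cares about the merge decision: *segments is only bumped
 * when a new vec was started, not when data was merged into the last one.
 */
static bool model_add_and_count(struct model_bio *bio, const void *page,
				unsigned int len, unsigned int off,
				unsigned int *segments)
{
	if (model_try_merge(bio, page, len, off))
		return true;
	if (model_bio_full(bio))
		return false;
	model_add(bio, page, len, off);
	(*segments)++;
	return true;
}
```

Two consecutive 512-byte additions from the same page end up in one vec
and bump the counter once; data from a different page starts a new vec
and bumps it again.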

> ---
>  block/bio.c         | 96 +++++++++++++++++++++++++++++----------------
>  include/linux/bio.h |  9 +++++
>  2 files changed, 72 insertions(+), 33 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 53e0f0a1ed94..fdf635d42bbd 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -773,7 +773,7 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
>  			return 0;
>  	}
>  
> -	if (bio->bi_vcnt >= bio->bi_max_vecs)
> +	if (bio_full(bio))
>  		return 0;
>  
>  	/*
> @@ -821,52 +821,82 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
>  EXPORT_SYMBOL(bio_add_pc_page);
>  
>  /**
> - *	bio_add_page	-	attempt to add page to bio
> - *	@bio: destination bio
> - *	@page: page to add
> - *	@len: vec entry length
> - *	@offset: vec entry offset
> + * __bio_try_merge_page - try appending data to an existing bvec.
> + * @bio: destination bio
> + * @page: page to add
> + * @len: length of the data to add
> + * @off: offset of the data in @page
>   *
> - *	Attempt to add a page to the bio_vec maplist. This will only fail
> - *	if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio.
> + * Try to add the data at @page + @off to the last bvec of @bio.  This is a
> + * useful optimisation for file systems with a block size smaller than the
> + * page size.
> + *
> + * Return %true on success or %false on failure.
>   */
> -int bio_add_page(struct bio *bio, struct page *page,
> -		 unsigned int len, unsigned int offset)
> +bool __bio_try_merge_page(struct bio *bio, struct page *page,
> +		unsigned int len, unsigned int off)
>  {
> -	struct bio_vec *bv;
> -
> -	/*
> -	 * cloned bio must not modify vec list
> -	 */
>  	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
> -		return 0;
> +		return false;
>  
> -	/*
> -	 * For filesystems with a blocksize smaller than the pagesize
> -	 * we will often be called with the same page as last time and
> -	 * a consecutive offset.  Optimize this special case.
> -	 */
>  	if (bio->bi_vcnt > 0) {
> -		bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> +		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>  
> -		if (page == bv->bv_page &&
> -		    offset == bv->bv_offset + bv->bv_len) {
> +		if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
>  			bv->bv_len += len;
> -			goto done;
> +			bio->bi_iter.bi_size += len;
> +			return true;
>  		}
>  	}
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(__bio_try_merge_page);
>  
> -	if (bio->bi_vcnt >= bio->bi_max_vecs)
> -		return 0;
> +/**
> + * __bio_add_page - add page to a bio in a new segment
> + * @bio: destination bio
> + * @page: page to add
> + * @len: length of the data to add
> + * @off: offset of the data in @page
> + *
> + * Add the data at @page + @off to @bio as a new bvec.  The caller must ensure
> + * that @bio has space for another bvec.
> + */
> +void __bio_add_page(struct bio *bio, struct page *page,
> +		unsigned int len, unsigned int off)
> +{
> +	struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt];
>  
> -	bv		= &bio->bi_io_vec[bio->bi_vcnt];
> -	bv->bv_page	= page;
> -	bv->bv_len	= len;
> -	bv->bv_offset	= offset;
> +	WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
> +	WARN_ON_ONCE(bio_full(bio));
> +
> +	bv->bv_page = page;
> +	bv->bv_offset = off;
> +	bv->bv_len = len;
>  
> -	bio->bi_vcnt++;
> -done:
>  	bio->bi_iter.bi_size += len;
> +	bio->bi_vcnt++;
> +}
> +EXPORT_SYMBOL_GPL(__bio_add_page);
> +
> +/**
> + *	bio_add_page	-	attempt to add page to bio
> + *	@bio: destination bio
> + *	@page: page to add
> + *	@len: vec entry length
> + *	@offset: vec entry offset
> + *
> + *	Attempt to add a page to the bio_vec maplist. This will only fail
> + *	if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio.
> + */
> +int bio_add_page(struct bio *bio, struct page *page,
> +		 unsigned int len, unsigned int offset)
> +{
> +	if (!__bio_try_merge_page(bio, page, len, offset)) {
> +		if (bio_full(bio))
> +			return 0;
> +		__bio_add_page(bio, page, len, offset);
> +	}
>  	return len;
>  }
>  EXPORT_SYMBOL(bio_add_page);
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index ce547a25e8ae..3e73c8bc25ea 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -123,6 +123,11 @@ static inline void *bio_data(struct bio *bio)
>  	return NULL;
>  }
>  
> +static inline bool bio_full(struct bio *bio)
> +{
> +	return bio->bi_vcnt >= bio->bi_max_vecs;
> +}
> +
>  /*
>   * will die
>   */
> @@ -470,6 +475,10 @@ void bio_chain(struct bio *, struct bio *);
>  extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
>  extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
>  			   unsigned int, unsigned int);
> +bool __bio_try_merge_page(struct bio *bio, struct page *page,
> +		unsigned int len, unsigned int off);
> +void __bio_add_page(struct bio *bio, struct page *page,
> +		unsigned int len, unsigned int off);
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
>  struct rq_map_data;
>  extern struct bio *bio_map_user_iov(struct request_queue *,
> -- 
> 2.17.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 78+ messages in thread
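
For readers following the patch 01 review, the split of bio_add_page()
into a merge step and an append step can be modelled outside the kernel.
This is only a userspace sketch under stated assumptions, not the kernel
code: struct model_bio, MODEL_MAX_VECS and the model_* helpers are
hypothetical names, a plain pointer stands in for struct page, and the
BIO_CLONED check is left out.

```c
/* Userspace model of the bio_add_page() split; hypothetical names throughout. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_VECS 4	/* assumption: tiny limit so "full" is easy to hit */

struct model_bvec {
	const void *page;	/* stands in for struct page * */
	unsigned int len;
	unsigned int off;
};

struct model_bio {
	struct model_bvec vec[MODEL_MAX_VECS];
	unsigned int vcnt;	/* plays the role of bio->bi_vcnt */
	unsigned int size;	/* plays the role of bio->bi_iter.bi_size */
};

static bool model_bio_full(const struct model_bio *bio)
{
	return bio->vcnt >= MODEL_MAX_VECS;
}

/* Grow the last vec if the new data is on the same page and contiguous. */
static bool model_try_merge(struct model_bio *bio, const void *page,
		unsigned int len, unsigned int off)
{
	if (bio->vcnt > 0) {
		struct model_bvec *bv = &bio->vec[bio->vcnt - 1];

		if (page == bv->page && off == bv->off + bv->len) {
			bv->len += len;
			bio->size += len;
			return true;
		}
	}
	return false;
}

/* Append a new vec; the caller must have checked that there is space. */
static void model_add(struct model_bio *bio, const void *page,
		unsigned int len, unsigned int off)
{
	struct model_bvec *bv;

	assert(!model_bio_full(bio));
	bv = &bio->vec[bio->vcnt];
	bv->page = page;
	bv->off = off;
	bv->len = len;
	bio->size += len;
	bio->vcnt++;
}

/* Same composition as the rewritten bio_add_page(): returns len or 0. */
static unsigned int model_bio_add_page(struct model_bio *bio, const void *page,
		unsigned int len, unsigned int off)
{
	if (!model_try_merge(bio, page, len, off)) {
		if (model_bio_full(bio))
			return 0;
		model_add(bio, page, len, off);
	}
	return len;
}
```

Note that in this model, as in the patch, a merge into the last vec can
still succeed when the vec array is full, because the fullness check only
runs after the merge attempt has failed.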

* Re: [PATCH 02/34] fs: factor out a __generic_write_end helper
  2018-05-23 14:43 ` [PATCH 02/34] fs: factor out a __generic_write_end helper Christoph Hellwig
@ 2018-05-30  5:30   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:25PM +0200, Christoph Hellwig wrote:
> Factor out the bits of the buffer.c based write_end implementations that
> don't know about buffer_heads so that they can be reused by other
> implementations.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/buffer.c   | 67 +++++++++++++++++++++++++++------------------------
>  fs/internal.h |  2 ++
>  2 files changed, 37 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 249b83fafe48..bd964b2ad99a 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2076,6 +2076,40 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
>  }
>  EXPORT_SYMBOL(block_write_begin);
>  
> +int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
> +		struct page *page)
> +{
> +	loff_t old_size = inode->i_size;
> +	bool i_size_changed = false;
> +
> +	/*
> +	 * No need to use i_size_read() here, the i_size cannot change under us
> +	 * because we hold i_rwsem.
> +	 *
> +	 * But it's important to update i_size while still holding page lock:
> +	 * page writeout could otherwise come in and zero beyond i_size.
> +	 */
> +	if (pos + copied > inode->i_size) {
> +		i_size_write(inode, pos + copied);
> +		i_size_changed = true;
> +	}
> +
> +	unlock_page(page);
> +	put_page(page);
> +
> +	if (old_size < pos)
> +		pagecache_isize_extended(inode, old_size, pos);
> +	/*
> +	 * Don't mark the inode dirty under page lock. First, it unnecessarily
> +	 * makes the holding time of page lock longer. Second, it forces lock
> +	 * ordering of page lock and transaction start for journaling
> +	 * filesystems.
> +	 */
> +	if (i_size_changed)
> +		mark_inode_dirty(inode);
> +	return copied;
> +}
> +
>  int block_write_end(struct file *file, struct address_space *mapping,
>  			loff_t pos, unsigned len, unsigned copied,
>  			struct page *page, void *fsdata)
> @@ -2116,39 +2150,8 @@ int generic_write_end(struct file *file, struct address_space *mapping,
>  			loff_t pos, unsigned len, unsigned copied,
>  			struct page *page, void *fsdata)
>  {
> -	struct inode *inode = mapping->host;
> -	loff_t old_size = inode->i_size;
> -	int i_size_changed = 0;
> -
>  	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> -
> -	/*
> -	 * No need to use i_size_read() here, the i_size
> -	 * cannot change under us because we hold i_mutex.
> -	 *
> -	 * But it's important to update i_size while still holding page lock:
> -	 * page writeout could otherwise come in and zero beyond i_size.
> -	 */
> -	if (pos+copied > inode->i_size) {
> -		i_size_write(inode, pos+copied);
> -		i_size_changed = 1;
> -	}
> -
> -	unlock_page(page);
> -	put_page(page);
> -
> -	if (old_size < pos)
> -		pagecache_isize_extended(inode, old_size, pos);
> -	/*
> -	 * Don't mark the inode dirty under page lock. First, it unnecessarily
> -	 * makes the holding time of page lock longer. Second, it forces lock
> -	 * ordering of page lock and transaction start for journaling
> -	 * filesystems.
> -	 */
> -	if (i_size_changed)
> -		mark_inode_dirty(inode);
> -
> -	return copied;
> +	return __generic_write_end(mapping->host, pos, copied, page);
>  }
>  EXPORT_SYMBOL(generic_write_end);
>  
> diff --git a/fs/internal.h b/fs/internal.h
> index e08972db0303..b955232d3d49 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -43,6 +43,8 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
>  extern void guard_bio_eod(int rw, struct bio *bio);
>  extern int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
>  		get_block_t *get_block, struct iomap *iomap);
> +int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
> +		struct page *page);
>  
>  /*
>   * char_dev.c
> -- 
> 2.17.0
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread
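
The ordering that __generic_write_end() preserves in patch 02 (update
i_size under the page lock, then drop the lock, then handle the extended
range and dirty the inode) can be sketched in userspace. This is an
illustrative model only: struct model_inode, its flag fields and
model_write_end() are hypothetical, long long stands in for loff_t, and
the page lock itself is represented only by comments.

```c
/* Userspace model of the __generic_write_end() steps; hypothetical names. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_inode {
	long long i_size;	/* stands in for loff_t i_size */
	bool dirty;		/* set where the kernel calls mark_inode_dirty() */
	bool isize_extended;	/* set where pagecache_isize_extended() would run */
};

static unsigned int model_write_end(struct model_inode *inode, long long pos,
		unsigned int copied)
{
	long long old_size;
	bool i_size_changed = false;

	assert(inode != NULL);
	old_size = inode->i_size;

	/* Step 1 (page still locked in the kernel): extend i_size if needed. */
	if (pos + copied > inode->i_size) {
		inode->i_size = pos + copied;
		i_size_changed = true;
	}

	/* Step 2: the kernel unlocks and releases the page here. */

	/* Step 3: the write started beyond the old EOF, so a gap was created. */
	if (old_size < pos)
		inode->isize_extended = true;

	/* Step 4: dirty the inode only after the page lock has been dropped. */
	if (i_size_changed)
		inode->dirty = true;
	return copied;
}
```

A write that stays within the old file size leaves both flags clear; only
a write that moves EOF dirties the modelled inode.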

* Re: [PATCH 03/34] fs: move page_cache_seek_hole_data to iomap.c
  2018-05-23 14:43 ` [PATCH 03/34] fs: move page_cache_seek_hole_data to iomap.c Christoph Hellwig
@ 2018-05-30  5:31   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:26PM +0200, Christoph Hellwig wrote:
> This function is only used by the iomap code, depends on being called
> from it, and will soon stop poking into buffer head internals.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/buffer.c                 | 114 -----------------------------------
>  fs/iomap.c                  | 116 ++++++++++++++++++++++++++++++++++++
>  include/linux/buffer_head.h |   2 -
>  3 files changed, 116 insertions(+), 116 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index bd964b2ad99a..aba2a948b235 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3430,120 +3430,6 @@ int bh_submit_read(struct buffer_head *bh)
>  }
>  EXPORT_SYMBOL(bh_submit_read);
>  
> -/*
> - * Seek for SEEK_DATA / SEEK_HOLE within @page, starting at @lastoff.
> - *
> - * Returns the offset within the file on success, and -ENOENT otherwise.
> - */
> -static loff_t
> -page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
> -{
> -	loff_t offset = page_offset(page);
> -	struct buffer_head *bh, *head;
> -	bool seek_data = whence == SEEK_DATA;
> -
> -	if (lastoff < offset)
> -		lastoff = offset;
> -
> -	bh = head = page_buffers(page);
> -	do {
> -		offset += bh->b_size;
> -		if (lastoff >= offset)
> -			continue;
> -
> -		/*
> -		 * Unwritten extents that have data in the page cache covering
> -		 * them can be identified by the BH_Unwritten state flag.
> -		 * Pages with multiple buffers might have a mix of holes, data
> -		 * and unwritten extents - any buffer with valid data in it
> -		 * should have BH_Uptodate flag set on it.
> -		 */
> -
> -		if ((buffer_unwritten(bh) || buffer_uptodate(bh)) == seek_data)
> -			return lastoff;
> -
> -		lastoff = offset;
> -	} while ((bh = bh->b_this_page) != head);
> -	return -ENOENT;
> -}
> -
> -/*
> - * Seek for SEEK_DATA / SEEK_HOLE in the page cache.
> - *
> - * Within unwritten extents, the page cache determines which parts are holes
> - * and which are data: unwritten and uptodate buffer heads count as data;
> - * everything else counts as a hole.
> - *
> - * Returns the resulting offset on successs, and -ENOENT otherwise.
> - */
> -loff_t
> -page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
> -			  int whence)
> -{
> -	pgoff_t index = offset >> PAGE_SHIFT;
> -	pgoff_t end = DIV_ROUND_UP(offset + length, PAGE_SIZE);
> -	loff_t lastoff = offset;
> -	struct pagevec pvec;
> -
> -	if (length <= 0)
> -		return -ENOENT;
> -
> -	pagevec_init(&pvec);
> -
> -	do {
> -		unsigned nr_pages, i;
> -
> -		nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, &index,
> -						end - 1);
> -		if (nr_pages == 0)
> -			break;
> -
> -		for (i = 0; i < nr_pages; i++) {
> -			struct page *page = pvec.pages[i];
> -
> -			/*
> -			 * At this point, the page may be truncated or
> -			 * invalidated (changing page->mapping to NULL), or
> -			 * even swizzled back from swapper_space to tmpfs file
> -			 * mapping.  However, page->index will not change
> -			 * because we have a reference on the page.
> -                         *
> -			 * If current page offset is beyond where we've ended,
> -			 * we've found a hole.
> -                         */
> -			if (whence == SEEK_HOLE &&
> -			    lastoff < page_offset(page))
> -				goto check_range;
> -
> -			lock_page(page);
> -			if (likely(page->mapping == inode->i_mapping) &&
> -			    page_has_buffers(page)) {
> -				lastoff = page_seek_hole_data(page, lastoff, whence);
> -				if (lastoff >= 0) {
> -					unlock_page(page);
> -					goto check_range;
> -				}
> -			}
> -			unlock_page(page);
> -			lastoff = page_offset(page) + PAGE_SIZE;
> -		}
> -		pagevec_release(&pvec);
> -	} while (index < end);
> -
> -	/* When no page at lastoff and we are not done, we found a hole. */
> -	if (whence != SEEK_HOLE)
> -		goto not_found;
> -
> -check_range:
> -	if (lastoff < offset + length)
> -		goto out;
> -not_found:
> -	lastoff = -ENOENT;
> -out:
> -	pagevec_release(&pvec);
> -	return lastoff;
> -}
> -
>  void __init buffer_init(void)
>  {
>  	unsigned long nrpages;
> diff --git a/fs/iomap.c b/fs/iomap.c
> index f2456d0d8ddd..4a01d2f4e8e9 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -20,6 +20,7 @@
>  #include <linux/mm.h>
>  #include <linux/swap.h>
>  #include <linux/pagemap.h>
> +#include <linux/pagevec.h>
>  #include <linux/file.h>
>  #include <linux/uio.h>
>  #include <linux/backing-dev.h>
> @@ -588,6 +589,121 @@ int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
>  }
>  EXPORT_SYMBOL_GPL(iomap_fiemap);
>  
> +/*
> + * Seek for SEEK_DATA / SEEK_HOLE within @page, starting at @lastoff.
> + *
> + * Returns the offset within the file on success, and -ENOENT otherwise.
> + */
> +static loff_t
> +page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
> +{
> +	loff_t offset = page_offset(page);
> +	struct buffer_head *bh, *head;
> +	bool seek_data = whence == SEEK_DATA;
> +
> +	if (lastoff < offset)
> +		lastoff = offset;
> +
> +	bh = head = page_buffers(page);
> +	do {
> +		offset += bh->b_size;
> +		if (lastoff >= offset)
> +			continue;
> +
> +		/*
> +		 * Unwritten extents that have data in the page cache covering
> +		 * them can be identified by the BH_Unwritten state flag.
> +		 * Pages with multiple buffers might have a mix of holes, data
> +		 * and unwritten extents - any buffer with valid data in it
> +		 * should have BH_Uptodate flag set on it.
> +		 */
> +
> +		if ((buffer_unwritten(bh) || buffer_uptodate(bh)) == seek_data)
> +			return lastoff;
> +
> +		lastoff = offset;
> +	} while ((bh = bh->b_this_page) != head);
> +	return -ENOENT;
> +}
> +
> +/*
> + * Seek for SEEK_DATA / SEEK_HOLE in the page cache.
> + *
> + * Within unwritten extents, the page cache determines which parts are holes
> + * and which are data: unwritten and uptodate buffer heads count as data;
> + * everything else counts as a hole.
> + *
> + * Returns the resulting offset on successs, and -ENOENT otherwise.
> + */
> +static loff_t
> +page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
> +		int whence)
> +{
> +	pgoff_t index = offset >> PAGE_SHIFT;
> +	pgoff_t end = DIV_ROUND_UP(offset + length, PAGE_SIZE);
> +	loff_t lastoff = offset;
> +	struct pagevec pvec;
> +
> +	if (length <= 0)
> +		return -ENOENT;
> +
> +	pagevec_init(&pvec);
> +
> +	do {
> +		unsigned nr_pages, i;
> +
> +		nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, &index,
> +						end - 1);
> +		if (nr_pages == 0)
> +			break;
> +
> +		for (i = 0; i < nr_pages; i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			/*
> +			 * At this point, the page may be truncated or
> +			 * invalidated (changing page->mapping to NULL), or
> +			 * even swizzled back from swapper_space to tmpfs file
> +			 * mapping.  However, page->index will not change
> +			 * because we have a reference on the page.
> +                         *
> +			 * If current page offset is beyond where we've ended,
> +			 * we've found a hole.
> +                         */
> +			if (whence == SEEK_HOLE &&
> +			    lastoff < page_offset(page))
> +				goto check_range;
> +
> +			lock_page(page);
> +			if (likely(page->mapping == inode->i_mapping) &&
> +			    page_has_buffers(page)) {
> +				lastoff = page_seek_hole_data(page, lastoff, whence);
> +				if (lastoff >= 0) {
> +					unlock_page(page);
> +					goto check_range;
> +				}
> +			}
> +			unlock_page(page);
> +			lastoff = page_offset(page) + PAGE_SIZE;
> +		}
> +		pagevec_release(&pvec);
> +	} while (index < end);
> +
> +	/* When no page at lastoff and we are not done, we found a hole. */
> +	if (whence != SEEK_HOLE)
> +		goto not_found;
> +
> +check_range:
> +	if (lastoff < offset + length)
> +		goto out;
> +not_found:
> +	lastoff = -ENOENT;
> +out:
> +	pagevec_release(&pvec);
> +	return lastoff;
> +}
> +
> +
>  static loff_t
>  iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length,
>  		      void *data, struct iomap *iomap)
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 894e5d125de6..96225a77c112 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -205,8 +205,6 @@ void write_boundary_block(struct block_device *bdev,
>  			sector_t bblock, unsigned blocksize);
>  int bh_uptodate_or_lock(struct buffer_head *bh);
>  int bh_submit_read(struct buffer_head *bh);
> -loff_t page_cache_seek_hole_data(struct inode *inode, loff_t offset,
> -				 loff_t length, int whence);
>  
>  extern int buffer_heads_over_limit;
>  
> -- 
> 2.17.0
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 04/34] fs: remove the buffer_unwritten check in page_seek_hole_data
  2018-05-23 14:43 ` [PATCH 04/34] fs: remove the buffer_unwritten check in page_seek_hole_data Christoph Hellwig
@ 2018-05-30  5:36   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:36 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:27PM +0200, Christoph Hellwig wrote:
> We only call into this function through the iomap iterators, so we already
> know the buffer is unwritten.  In addition to that we always require the
> uptodate flag that is ORed with the result anyway.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok, though it took me a while to dig through all the twisty
bits...

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/iomap.c | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 4a01d2f4e8e9..bef5e91d40bf 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -611,14 +611,9 @@ page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
>  			continue;
>  
>  		/*
> -		 * Unwritten extents that have data in the page cache covering
> -		 * them can be identified by the BH_Unwritten state flag.
> -		 * Pages with multiple buffers might have a mix of holes, data
> -		 * and unwritten extents - any buffer with valid data in it
> -		 * should have BH_Uptodate flag set on it.
> +		 * Any buffer with valid data in it should have BH_Uptodate set.
>  		 */
> -
> -		if ((buffer_unwritten(bh) || buffer_uptodate(bh)) == seek_data)
> +		if (buffer_uptodate(bh) == seek_data)
>  			return lastoff;
>  
>  		lastoff = offset;
> @@ -630,8 +625,8 @@ page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
>   * Seek for SEEK_DATA / SEEK_HOLE in the page cache.
>   *
>   * Within unwritten extents, the page cache determines which parts are holes
> - * and which are data: unwritten and uptodate buffer heads count as data;
> - * everything else counts as a hole.
> + * and which are data: uptodate buffer heads count as data; everything else
> + * counts as a hole.
>   *
>   * Returns the resulting offset on successs, and -ENOENT otherwise.
>   */
> -- 
> 2.17.0
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/34] fs: use ->is_partially_uptodate in page_cache_seek_hole_data
  2018-05-23 14:43 ` [PATCH 05/34] fs: use ->is_partially_uptodate in page_cache_seek_hole_data Christoph Hellwig
@ 2018-05-30  5:41   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:28PM +0200, Christoph Hellwig wrote:
> This way the implementation doesn't depend on buffer_head internals.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/iomap.c | 85 +++++++++++++++++++++++++-----------------------------
>  1 file changed, 39 insertions(+), 46 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index bef5e91d40bf..0900da23172c 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -589,36 +589,51 @@ int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
>  }
>  EXPORT_SYMBOL_GPL(iomap_fiemap);
>  
> -/*
> - * Seek for SEEK_DATA / SEEK_HOLE within @page, starting at @lastoff.
> - *
> - * Returns the offset within the file on success, and -ENOENT otherwise.
> - */
> -static loff_t
> -page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
> +static bool
> +page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
> +		int whence)
>  {
> -	loff_t offset = page_offset(page);
> -	struct buffer_head *bh, *head;
> +	const struct address_space_operations *ops = inode->i_mapping->a_ops;
> +	unsigned int bsize = i_blocksize(inode), off;
>  	bool seek_data = whence == SEEK_DATA;
> +	loff_t poff = page_offset(page);
>  
> -	if (lastoff < offset)
> -		lastoff = offset;
> -
> -	bh = head = page_buffers(page);
> -	do {
> -		offset += bh->b_size;
> -		if (lastoff >= offset)
> -			continue;
> +	if (WARN_ON_ONCE(*lastoff >= poff + PAGE_SIZE))
> +		return false;
>  
> +	if (*lastoff < poff) {
>  		/*
> -		 * Any buffer with valid data in it should have BH_Uptodate set.
> +		 * Last offset smaller than the start of the page means we found
> +		 * a hole:
>  		 */
> -		if (buffer_uptodate(bh) == seek_data)
> -			return lastoff;
> +		if (whence == SEEK_HOLE)
> +			return true;
> +		*lastoff = poff;
> +	}
> +
> +	/*
> +	 * Just check the page unless we can and should check block ranges:
> +	 */
> +	if (bsize == PAGE_SIZE || !ops->is_partially_uptodate)
> +		return PageUptodate(page) == seek_data;
>  
> -		lastoff = offset;
> -	} while ((bh = bh->b_this_page) != head);
> -	return -ENOENT;
> +	lock_page(page);
> +	if (unlikely(page->mapping != inode->i_mapping))
> +		goto out_unlock_not_found;
> +
> +	for (off = 0; off < PAGE_SIZE; off += bsize) {
> +		if ((*lastoff & ~PAGE_MASK) >= off + bsize)
> +			continue;
> +		if (ops->is_partially_uptodate(page, off, bsize) == seek_data) {
> +			unlock_page(page);
> +			return true;
> +		}
> +		*lastoff = poff + off + bsize;
> +	}
> +
> +out_unlock_not_found:
> +	unlock_page(page);
> +	return false;
>  }
>  
>  /*
> @@ -655,30 +670,8 @@ page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
>  		for (i = 0; i < nr_pages; i++) {
>  			struct page *page = pvec.pages[i];
>  
> -			/*
> -			 * At this point, the page may be truncated or
> -			 * invalidated (changing page->mapping to NULL), or
> -			 * even swizzled back from swapper_space to tmpfs file
> -			 * mapping.  However, page->index will not change
> -			 * because we have a reference on the page.
> -                         *
> -			 * If current page offset is beyond where we've ended,
> -			 * we've found a hole.
> -                         */
> -			if (whence == SEEK_HOLE &&
> -			    lastoff < page_offset(page))
> +			if (page_seek_hole_data(inode, page, &lastoff, whence))
>  				goto check_range;
> -
> -			lock_page(page);
> -			if (likely(page->mapping == inode->i_mapping) &&
> -			    page_has_buffers(page)) {
> -				lastoff = page_seek_hole_data(page, lastoff, whence);
> -				if (lastoff >= 0) {
> -					unlock_page(page);
> -					goto check_range;
> -				}
> -			}
> -			unlock_page(page);
>  			lastoff = page_offset(page) + PAGE_SIZE;
>  		}
>  		pagevec_release(&pvec);
> -- 
> 2.17.0
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread
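
The per-block loop that patch 05 adds to page_seek_hole_data() can be
approximated in userspace as follows. This is a sketch under stated
assumptions rather than the kernel implementation: a bitmask stands in
for ->is_partially_uptodate, MODEL_PAGE_SIZE is fixed at 4k, all model_*
names are hypothetical, and the page locking, the mapping check and the
whole-page PageUptodate() fast path are left out.

```c
/* Userspace model of the block-granular SEEK_DATA/SEEK_HOLE scan. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption: 4k pages */

/* Bit n of @mask set means block n of the page holds valid data. */
static bool model_block_uptodate(unsigned int mask, unsigned int off,
		unsigned int bsize)
{
	return ((mask >> (off / bsize)) & 1u) != 0;
}

/*
 * Returns true if the wanted kind of range (data or hole) was found at
 * *lastoff; otherwise advances *lastoff past the blocks it has ruled out.
 */
static bool model_page_seek(long long poff, unsigned int bsize,
		unsigned int uptodate_mask, long long *lastoff, bool seek_data)
{
	unsigned int off;

	assert(lastoff != NULL);
	if (*lastoff >= poff + (long long)MODEL_PAGE_SIZE)
		return false;

	if (*lastoff < poff) {
		/* A gap before the page starts is itself a hole. */
		if (!seek_data)
			return true;
		*lastoff = poff;
	}

	for (off = 0; off < MODEL_PAGE_SIZE; off += bsize) {
		/* Skip blocks that end at or before the current position. */
		if ((*lastoff % (long long)MODEL_PAGE_SIZE) >=
		    (long long)(off + bsize))
			continue;
		if (model_block_uptodate(uptodate_mask, off, bsize) == seek_data)
			return true;
		*lastoff = poff + off + bsize;
	}
	return false;
}
```

With a 1k block size and only block 2 of the page at offset 8192 marked
uptodate, a data seek starting at 8192 stops at 10240, and a hole seek
starting at 10240 stops at 11264.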

* Re: [PATCH 06/34] mm: give the 'ret' variable a better name __do_page_cache_readahead
  2018-05-23 14:43 ` [PATCH 06/34] mm: give the 'ret' variable a better name __do_page_cache_readahead Christoph Hellwig
@ 2018-05-30  5:42   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:29PM +0200, Christoph Hellwig wrote:
> It counts the number of pages acted on, so name it nr_pages to make that
> obvious.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok, so long as the mm folks don't have any strong opinions...
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  mm/readahead.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 539bbb6c1fad..16d0cb1e2616 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -156,7 +156,7 @@ int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
>  	unsigned long end_index;	/* The last page we want to read */
>  	LIST_HEAD(page_pool);
>  	int page_idx;
> -	int ret = 0;
> +	int nr_pages = 0;
>  	loff_t isize = i_size_read(inode);
>  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
>  
> @@ -187,7 +187,7 @@ int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
>  		list_add(&page->lru, &page_pool);
>  		if (page_idx == nr_to_read - lookahead_size)
>  			SetPageReadahead(page);
> -		ret++;
> +		nr_pages++;
>  	}
>  
>  	/*
> @@ -195,11 +195,11 @@ int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
>  	 * uptodate then the caller will launch readpage again, and
>  	 * will then handle the error.
>  	 */
> -	if (ret)
> -		read_pages(mapping, filp, &page_pool, ret, gfp_mask);
> +	if (nr_pages)
> +		read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask);
>  	BUG_ON(!list_empty(&page_pool));
>  out:
> -	return ret;
> +	return nr_pages;
>  }
>  
>  /*
> -- 
> 2.17.0
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/34] mm: return an unsigned int from __do_page_cache_readahead
  2018-05-23 14:43 ` [PATCH 07/34] mm: return an unsigned int from __do_page_cache_readahead Christoph Hellwig
@ 2018-05-30  5:44   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:30PM +0200, Christoph Hellwig wrote:
> We never return an error, so switch to returning an unsigned int.  Most
> callers already did implicit casts to an unsigned type, and the one that
> didn't can be simplified now.
> 
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok; does anyone from the mm side have a strong opinion?
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  mm/internal.h  |  2 +-
>  mm/readahead.c | 15 +++++----------
>  2 files changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 62d8c34e63d5..954003ac766a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -53,7 +53,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>  			     unsigned long addr, unsigned long end,
>  			     struct zap_details *details);
>  
> -extern int __do_page_cache_readahead(struct address_space *mapping,
> +extern unsigned int __do_page_cache_readahead(struct address_space *mapping,
>  		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
>  		unsigned long lookahead_size);
>  
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 16d0cb1e2616..fa4d4b767130 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -147,16 +147,16 @@ static int read_pages(struct address_space *mapping, struct file *filp,
>   *
>   * Returns the number of pages requested, or the maximum amount of I/O allowed.
>   */
> -int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
> -			pgoff_t offset, unsigned long nr_to_read,
> -			unsigned long lookahead_size)
> +unsigned int __do_page_cache_readahead(struct address_space *mapping,
> +		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
> +		unsigned long lookahead_size)
>  {
>  	struct inode *inode = mapping->host;
>  	struct page *page;
>  	unsigned long end_index;	/* The last page we want to read */
>  	LIST_HEAD(page_pool);
>  	int page_idx;
> -	int nr_pages = 0;
> +	unsigned int nr_pages = 0;
>  	loff_t isize = i_size_read(inode);
>  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
>  
> @@ -223,16 +223,11 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
>  	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
>  	nr_to_read = min(nr_to_read, max_pages);
>  	while (nr_to_read) {
> -		int err;
> -
>  		unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE;
>  
>  		if (this_chunk > nr_to_read)
>  			this_chunk = nr_to_read;
> -		err = __do_page_cache_readahead(mapping, filp,
> -						offset, this_chunk, 0);
> -		if (err < 0)
> -			return err;
> +		__do_page_cache_readahead(mapping, filp, offset, this_chunk, 0);
>  
>  		offset += this_chunk;
>  		nr_to_read -= this_chunk;
> -- 
> 2.17.0
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread
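
With the error return gone in patch 07, the loop in
force_page_cache_readahead() reduces to plain chunking in 2 MiB steps.
The userspace sketch below models only that loop: MODEL_PAGE_SIZE,
model_force_readahead() and the nr_calls counter are hypothetical, the
clamping of nr_to_read against max_pages is omitted, and the call to
__do_page_cache_readahead() is represented by a counter increment.

```c
/* Userspace model of the 2 MiB chunking loop; hypothetical names. */
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4k pages */

/*
 * Walks nr_to_read pages starting at page index @offset in chunks of at
 * most 2 MiB.  Returns the page index after the last chunk and stores the
 * number of modelled readahead calls in *nr_calls.
 */
static unsigned long model_force_readahead(unsigned long offset,
		unsigned long nr_to_read, unsigned long *nr_calls)
{
	assert(nr_calls != NULL);
	*nr_calls = 0;

	while (nr_to_read) {
		unsigned long this_chunk = (2 * 1024 * 1024) / MODEL_PAGE_SIZE;

		if (this_chunk > nr_to_read)
			this_chunk = nr_to_read;

		/*
		 * The kernel calls __do_page_cache_readahead() here and,
		 * after the patch, ignores its return value.
		 */
		(*nr_calls)++;

		offset += this_chunk;
		nr_to_read -= this_chunk;
	}
	return offset;
}
```

For 1300 pages with 4k pages the model issues three calls (512, 512 and
276 pages).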

* Re: [PATCH 08/34] mm: split ->readpages calls to avoid non-contiguous pages lists
  2018-05-23 14:43 ` [PATCH 08/34] mm: split ->readpages calls to avoid non-contiguous pages lists Christoph Hellwig
@ 2018-05-30  5:46   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:31PM +0200, Christoph Hellwig wrote:
> That way file systems don't have to go spotting for non-contiguous pages
> and work around them.  It also kicks off I/O earlier, allowing it to
> finish earlier and reduce latency.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok, if anyone has a strong opinion they better yell soon...
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  mm/readahead.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index fa4d4b767130..e273f0de3376 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -140,8 +140,8 @@ static int read_pages(struct address_space *mapping, struct file *filp,
>  }
>  
>  /*
> - * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates all
> - * the pages first, then submits them all for I/O. This avoids the very bad
> + * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
> + * the pages first, then submits them for I/O. This avoids the very bad
>   * behaviour which would occur if page allocations are causing VM writeback.
>   * We really don't want to intermingle reads and writes like that.
>   *
> @@ -177,8 +177,18 @@ unsigned int __do_page_cache_readahead(struct address_space *mapping,
>  		rcu_read_lock();
>  		page = radix_tree_lookup(&mapping->i_pages, page_offset);
>  		rcu_read_unlock();
> -		if (page && !radix_tree_exceptional_entry(page))
> +		if (page && !radix_tree_exceptional_entry(page)) {
> +			/*
> +			 * Page already present?  Kick off the current batch of
> +			 * contiguous pages before continuing with the next
> +			 * batch.
> +			 */
> +			if (nr_pages)
> +				read_pages(mapping, filp, &page_pool, nr_pages,
> +						gfp_mask);
> +			nr_pages = 0;
>  			continue;
> +		}
>  
>  		page = __page_cache_alloc(gfp_mask);
>  		if (!page)
> -- 
> 2.17.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

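To illustrate the batching described in the commit message, here is a
minimal userspace sketch, not the kernel code.  The names
readahead_batches() and struct batch are made up for the example, the
cached[] array stands in for the radix tree lookup, and recording a batch
stands in for the read_pages() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one submitted batch: pages [start, start + nr). */
struct batch {
	size_t start;
	size_t nr;
};

/*
 * Walk page indices [0, nr_to_read).  cached[i] says whether page i is
 * already present.  A run of missing pages is emitted as one batch as soon
 * as a cached page ends the run, mirroring the early read_pages() call.
 * Returns the number of batches written to out[] (at most max_out).
 */
static size_t readahead_batches(const bool *cached, size_t nr_to_read,
				struct batch *out, size_t max_out)
{
	size_t nr_batches = 0, nr_pages = 0, start = 0;

	assert(cached && out);

	for (size_t i = 0; i < nr_to_read; i++) {
		if (cached[i]) {
			/* Page already present: kick off the current batch. */
			if (nr_pages && nr_batches < max_out)
				out[nr_batches++] = (struct batch){ start, nr_pages };
			nr_pages = 0;
			continue;
		}
		if (!nr_pages)
			start = i;
		nr_pages++;
	}

	/* Submit whatever is left at the end of the window. */
	if (nr_pages && nr_batches < max_out)
		out[nr_batches++] = (struct batch){ start, nr_pages };
	return nr_batches;
}
```

With pages 2, 4 and 5 already cached in a window of nine pages, the sketch
produces three batches covering pages 0-1, 3 and 6-8, so no batch ever
spans a page that was skipped.
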
^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/34] iomap: inline data should be an iomap type, not a flag
  2018-05-23 14:43 ` [PATCH 09/34] iomap: inline data should be an iomap type, not a flag Christoph Hellwig
@ 2018-05-30  5:49     ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, linux-mm, linux-ext4, cluster-devel

On Wed, May 23, 2018 at 04:43:32PM +0200, Christoph Hellwig wrote:
> Inline data is fundamentally different from our normal mapped case in that
> it doesn't even have a block address.  So instead of having a flag for it
> it should be an entirely separate iomap range type.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok to me, anyone from gfs2/ext4 want to ack this?
Let's cc those lists and see what happens...

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/ext4/inline.c      |  4 ++--
>  fs/gfs2/bmap.c        |  3 +--
>  fs/iomap.c            | 21 ++++++++++++---------
>  include/linux/iomap.h |  2 +-
>  4 files changed, 16 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 70cf4c7b268a..e1f00891ef95 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -1835,8 +1835,8 @@ int ext4_inline_data_iomap(struct inode *inode, struct iomap *iomap)
>  	iomap->offset = 0;
>  	iomap->length = min_t(loff_t, ext4_get_inline_size(inode),
>  			      i_size_read(inode));
> -	iomap->type = 0;
> -	iomap->flags = IOMAP_F_DATA_INLINE;
> +	iomap->type = IOMAP_INLINE;
> +	iomap->flags = 0;
>  
>  out:
>  	up_read(&EXT4_I(inode)->xattr_sem);
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index 278ed0869c3c..cbeedd3cfb36 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -680,8 +680,7 @@ static void gfs2_stuffed_iomap(struct inode *inode, struct iomap *iomap)
>  		      sizeof(struct gfs2_dinode);
>  	iomap->offset = 0;
>  	iomap->length = i_size_read(inode);
> -	iomap->type = IOMAP_MAPPED;
> -	iomap->flags = IOMAP_F_DATA_INLINE;
> +	iomap->type = IOMAP_INLINE;
>  }
>  
>  /**
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 0900da23172c..f52209a2c270 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -503,10 +503,13 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
>  	case IOMAP_DELALLOC:
>  		flags |= FIEMAP_EXTENT_DELALLOC | FIEMAP_EXTENT_UNKNOWN;
>  		break;
> +	case IOMAP_MAPPED:
> +		break;
>  	case IOMAP_UNWRITTEN:
>  		flags |= FIEMAP_EXTENT_UNWRITTEN;
>  		break;
> -	case IOMAP_MAPPED:
> +	case IOMAP_INLINE:
> +		flags |= FIEMAP_EXTENT_DATA_INLINE;
>  		break;
>  	}
>  
> @@ -514,8 +517,6 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
>  		flags |= FIEMAP_EXTENT_MERGED;
>  	if (iomap->flags & IOMAP_F_SHARED)
>  		flags |= FIEMAP_EXTENT_SHARED;
> -	if (iomap->flags & IOMAP_F_DATA_INLINE)
> -		flags |= FIEMAP_EXTENT_DATA_INLINE;
>  
>  	return fiemap_fill_next_extent(fi, iomap->offset,
>  			iomap->addr != IOMAP_NULL_ADDR ? iomap->addr : 0,
> @@ -1318,14 +1319,16 @@ static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
>  	struct iomap_swapfile_info *isi = data;
>  	int error;
>  
> -	/* No inline data. */
> -	if (iomap->flags & IOMAP_F_DATA_INLINE) {
> +	switch (iomap->type) {
> +	case IOMAP_MAPPED:
> +	case IOMAP_UNWRITTEN:
> +		/* Only real or unwritten extents. */
> +		break;
> +	case IOMAP_INLINE:
> +		/* No inline data. */
>  		pr_err("swapon: file is inline\n");
>  		return -EINVAL;
> -	}
> -
> -	/* Only real or unwritten extents. */
> -	if (iomap->type != IOMAP_MAPPED && iomap->type != IOMAP_UNWRITTEN) {
> +	default:
>  		pr_err("swapon: file has unallocated extents\n");
>  		return -EINVAL;
>  	}
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 4bd87294219a..8f7095fc514e 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -18,6 +18,7 @@ struct vm_fault;
>  #define IOMAP_DELALLOC	0x02	/* delayed allocation blocks */
>  #define IOMAP_MAPPED	0x03	/* blocks allocated at @addr */
>  #define IOMAP_UNWRITTEN	0x04	/* blocks allocated at @addr in unwritten state */
> +#define IOMAP_INLINE	0x05	/* data inline in the inode */
>  
>  /*
>   * Flags for all iomap mappings:
> @@ -34,7 +35,6 @@ struct vm_fault;
>   */
>  #define IOMAP_F_MERGED		0x10	/* contains multiple blocks/extents */
>  #define IOMAP_F_SHARED		0x20	/* block shared with another file */
> -#define IOMAP_F_DATA_INLINE	0x40	/* data inline in the inode */
>  
>  /*
>   * Magic value for addr:
> -- 
> 2.17.0
> 

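A small userspace sketch of what consumers gain once inline data is a type
rather than a flag: a single switch on the type classifies every mapping.
This is only an illustration.  The IOMAP_* values from 0x02 up are copied
from the patch, IOMAP_HOLE and the FIEMAP_EXTENT_* values are assumed to
match the kernel headers of that time, and the two function names are
invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IOMAP_HOLE	0x01	/* assumed from include/linux/iomap.h */
#define IOMAP_DELALLOC	0x02
#define IOMAP_MAPPED	0x03
#define IOMAP_UNWRITTEN	0x04
#define IOMAP_INLINE	0x05

/* Assumed to match include/uapi/linux/fiemap.h. */
#define FIEMAP_EXTENT_UNKNOWN		0x00000002
#define FIEMAP_EXTENT_DELALLOC		0x00000004
#define FIEMAP_EXTENT_DATA_INLINE	0x00000200
#define FIEMAP_EXTENT_UNWRITTEN		0x00000800

/* Translate an iomap type into FIEMAP extent flags (hypothetical helper). */
static uint32_t iomap_type_to_fiemap_flags(uint16_t type)
{
	/* 0 is not a defined type; the old ext4 code paired it with the flag. */
	assert(type != 0);

	switch (type) {
	case IOMAP_DELALLOC:
		return FIEMAP_EXTENT_DELALLOC | FIEMAP_EXTENT_UNKNOWN;
	case IOMAP_UNWRITTEN:
		return FIEMAP_EXTENT_UNWRITTEN;
	case IOMAP_INLINE:
		return FIEMAP_EXTENT_DATA_INLINE;
	default:
		/* IOMAP_MAPPED and holes carry no extra flags here. */
		return 0;
	}
}

/* Only real or unwritten extents are usable for swap (hypothetical helper). */
static bool swap_type_ok(uint16_t type)
{
	return type == IOMAP_MAPPED || type == IOMAP_UNWRITTEN;
}
```
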
^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/34] iomap: fix the comment describing IOMAP_NOWAIT
  2018-05-23 14:43 ` [PATCH 10/34] iomap: fix the comment describing IOMAP_NOWAIT Christoph Hellwig
@ 2018-05-30  5:49   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:33PM +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  include/linux/iomap.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 8f7095fc514e..13d19b4c29a9 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -59,7 +59,7 @@ struct iomap {
>  #define IOMAP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
>  #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
>  #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
> -#define IOMAP_NOWAIT		(1 << 5) /* Don't wait for writeback */
> +#define IOMAP_NOWAIT		(1 << 5) /* do not block */
>  
>  struct iomap_ops {
>  	/*
> -- 
> 2.17.0
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/34] iomap: move IOMAP_F_BOUNDARY to gfs2
  2018-05-23 14:43 ` [PATCH 11/34] iomap: move IOMAP_F_BOUNDARY to gfs2 Christoph Hellwig
@ 2018-05-30  5:50   ` Darrick J. Wong
  2018-05-30  9:30     ` [Cluster-devel] " Steven Whitehouse
  0 siblings, 1 reply; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm, cluster-devel

On Wed, May 23, 2018 at 04:43:34PM +0200, Christoph Hellwig wrote:
> Just define a range of fs specific flags and use that in gfs2 instead of
> exposing this internal flag globally.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok to me, but better if the gfs2 folks [cc'd now] ack this...
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/gfs2/bmap.c        | 8 +++++---
>  include/linux/iomap.h | 9 +++++++--
>  2 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index cbeedd3cfb36..8efa6297e19c 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -683,6 +683,8 @@ static void gfs2_stuffed_iomap(struct inode *inode, struct iomap *iomap)
>  	iomap->type = IOMAP_INLINE;
>  }
>  
> +#define IOMAP_F_GFS2_BOUNDARY IOMAP_F_PRIVATE
> +
>  /**
>   * gfs2_iomap_begin - Map blocks from an inode to disk blocks
>   * @inode: The inode
> @@ -774,7 +776,7 @@ int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
>  	bh = mp.mp_bh[ip->i_height - 1];
>  	len = gfs2_extent_length(bh->b_data, bh->b_size, ptr, lend - lblock, &eob);
>  	if (eob)
> -		iomap->flags |= IOMAP_F_BOUNDARY;
> +		iomap->flags |= IOMAP_F_GFS2_BOUNDARY;
>  	iomap->length = (u64)len << inode->i_blkbits;
>  
>  out_release:
> @@ -846,12 +848,12 @@ int gfs2_block_map(struct inode *inode, sector_t lblock,
>  
>  	if (iomap.length > bh_map->b_size) {
>  		iomap.length = bh_map->b_size;
> -		iomap.flags &= ~IOMAP_F_BOUNDARY;
> +		iomap.flags &= ~IOMAP_F_GFS2_BOUNDARY;
>  	}
>  	if (iomap.addr != IOMAP_NULL_ADDR)
>  		map_bh(bh_map, inode->i_sb, iomap.addr >> inode->i_blkbits);
>  	bh_map->b_size = iomap.length;
> -	if (iomap.flags & IOMAP_F_BOUNDARY)
> +	if (iomap.flags & IOMAP_F_GFS2_BOUNDARY)
>  		set_buffer_boundary(bh_map);
>  	if (iomap.flags & IOMAP_F_NEW)
>  		set_buffer_new(bh_map);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 13d19b4c29a9..819e0cd2a950 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -27,8 +27,7 @@ struct vm_fault;
>   * written data and requires fdatasync to commit them to persistent storage.
>   */
>  #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
> -#define IOMAP_F_BOUNDARY	0x02	/* mapping ends at metadata boundary */
> -#define IOMAP_F_DIRTY		0x04	/* uncommitted metadata */
> +#define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
>  
>  /*
>   * Flags that only need to be reported for IOMAP_REPORT requests:
> @@ -36,6 +35,12 @@ struct vm_fault;
>  #define IOMAP_F_MERGED		0x10	/* contains multiple blocks/extents */
>  #define IOMAP_F_SHARED		0x20	/* block shared with another file */
>  
> +/*
> + * Flags from 0x1000 up are for file system specific usage:
> + */
> +#define IOMAP_F_PRIVATE		0x1000
> +
> +
>  /*
>   * Magic value for addr:
>   */
> -- 
> 2.17.0
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 12/34] iomap: use __bio_add_page in iomap_dio_zero
  2018-05-23 14:43 ` [PATCH 12/34] iomap: use __bio_add_page in iomap_dio_zero Christoph Hellwig
@ 2018-05-30  5:51   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:35PM +0200, Christoph Hellwig wrote:
> We don't need any merging logic, and this also replaces a BUG_ON with a
> WARN_ON_ONCE inside __bio_add_page for the impossible overflow condition.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/iomap.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index f52209a2c270..8e28f25f086f 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -949,8 +949,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>  
>  	get_page(page);
> -	if (bio_add_page(bio, page, len, 0) != len)
> -		BUG();
> +	__bio_add_page(bio, page, len, 0);
>  	bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
>  
>  	atomic_inc(&dio->ref);
> -- 
> 2.17.0
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 13/34] iomap: add a iomap_sector helper
  2018-05-23 14:43 ` [PATCH 13/34] iomap: add a iomap_sector helper Christoph Hellwig
@ 2018-05-30  5:52   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:36PM +0200, Christoph Hellwig wrote:
> Factor the repeated calculation of the on-disk sector for a given logical
> block into a little helper.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/iomap.c | 19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 8e28f25f086f..f928df4ab9a9 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -97,6 +97,12 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  	return written ? written : ret;
>  }
>  
> +static sector_t
> +iomap_sector(struct iomap *iomap, loff_t pos)
> +{
> +	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
> +}
> +
>  static void
>  iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>  {
> @@ -354,11 +360,8 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
>  static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
>  		struct iomap *iomap)
>  {
> -	sector_t sector = (iomap->addr +
> -			   (pos & PAGE_MASK) - iomap->offset) >> 9;
> -
> -	return __dax_zero_page_range(iomap->bdev, iomap->dax_dev, sector,
> -			offset, bytes);
> +	return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
> +			iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
>  }
>  
>  static loff_t
> @@ -943,8 +946,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  
>  	bio = bio_alloc(GFP_KERNEL, 1);
>  	bio_set_dev(bio, iomap->bdev);
> -	bio->bi_iter.bi_sector =
> -		(iomap->addr + pos - iomap->offset) >> 9;
> +	bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>  	bio->bi_private = dio;
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>  
> @@ -1038,8 +1040,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>  		bio = bio_alloc(GFP_KERNEL, nr_pages);
>  		bio_set_dev(bio, iomap->bdev);
> -		bio->bi_iter.bi_sector =
> -			(iomap->addr + pos - iomap->offset) >> 9;
> +		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>  		bio->bi_write_hint = dio->iocb->ki_hint;
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
> -- 
> 2.17.0
> 

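A worked example of the arithmetic the helper factors out, as a userspace
sketch.  struct mini_iomap and mini_iomap_sector() are stand-ins invented
for the example; only the formula is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT	9	/* 512-byte sectors */

/* Cut-down stand-in for struct iomap: only the fields the helper reads. */
struct mini_iomap {
	uint64_t addr;		/* disk offset of the mapping, in bytes */
	uint64_t offset;	/* file offset of the mapping, in bytes */
};

/*
 * Sector on disk that backs file position pos: take the distance of pos
 * from the start of the mapping, add the mapping's disk address, and
 * convert bytes to 512-byte sectors.
 */
static uint64_t mini_iomap_sector(const struct mini_iomap *iomap, uint64_t pos)
{
	assert(pos >= iomap->offset);
	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
}
```

For a mapping that starts at file offset 8192 and disk address 1 MiB,
position 8192 maps to sector 2048 and position 12288 (one 4k page later)
to sector 2056.
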
^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 14/34] iomap: add an iomap-based bmap implementation
  2018-05-23 14:43 ` [PATCH 14/34] iomap: add an iomap-based bmap implementation Christoph Hellwig
@ 2018-05-30  5:54   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  5:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:37PM +0200, Christoph Hellwig wrote:
> This adds a simple iomap-based implementation of the legacy ->bmap
> interface.  Note that we can't easily add checks for rt or reflink
> files, so these will have to remain in the callers.  This interface
> just needs to die..
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/iomap.c            | 34 ++++++++++++++++++++++++++++++++++
>  include/linux/iomap.h |  3 +++
>  2 files changed, 37 insertions(+)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index f928df4ab9a9..fa278ed338ce 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1411,3 +1411,37 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
>  }
>  EXPORT_SYMBOL_GPL(iomap_swapfile_activate);
>  #endif /* CONFIG_SWAP */
> +
> +static loff_t
> +iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
> +		void *data, struct iomap *iomap)
> +{
> +	sector_t *bno = data, addr;
> +
> +	if (iomap->type == IOMAP_MAPPED) {
> +		addr = (pos - iomap->offset + iomap->addr) >> inode->i_blkbits;
> +		if (addr > INT_MAX)
> +			WARN(1, "would truncate bmap result\n");
> +		else
> +			*bno = addr;
> +	}
> +	return 0;
> +}
> +
> +/* legacy ->bmap interface.  0 is the error return (!) */
> +sector_t
> +iomap_bmap(struct address_space *mapping, sector_t bno,
> +		const struct iomap_ops *ops)
> +{
> +	struct inode *inode = mapping->host;
> +	loff_t pos = bno << inode->i_blkbits;
> +	unsigned blocksize = i_blocksize(inode);
> +
> +	if (filemap_write_and_wait(mapping))
> +		return 0;
> +
> +	bno = 0;
> +	iomap_apply(inode, pos, blocksize, 0, ops, &bno, iomap_bmap_actor);
> +	return bno;
> +}
> +EXPORT_SYMBOL_GPL(iomap_bmap);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 819e0cd2a950..a044a824da85 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -4,6 +4,7 @@
>  
>  #include <linux/types.h>
>  
> +struct address_space;
>  struct fiemap_extent_info;
>  struct inode;
>  struct iov_iter;
> @@ -100,6 +101,8 @@ loff_t iomap_seek_hole(struct inode *inode, loff_t offset,
>  		const struct iomap_ops *ops);
>  loff_t iomap_seek_data(struct inode *inode, loff_t offset,
>  		const struct iomap_ops *ops);
> +sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
> +		const struct iomap_ops *ops);
>  
>  /*
>   * Flags for direct I/O ->end_io:
> -- 
> 2.17.0
> 

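The unit conversions in the bmap path are easy to get wrong, so here is a
userspace sketch of them under simplifying assumptions.  struct mini_iomap
and mini_bmap() are invented names, and the lookup that iomap_apply would
perform is replaced by a mapping passed in by the caller.  Note that the
block number is shifted left by the block-size bits to obtain a byte
position, and the resulting disk address is shifted right to get back to
blocks.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Cut-down stand-in for struct iomap. */
struct mini_iomap {
	bool mapped;		/* true for an IOMAP_MAPPED extent */
	uint64_t offset;	/* file offset of the mapping, in bytes */
	uint64_t addr;		/* disk offset of the mapping, in bytes */
};

/*
 * Translate file block bno into a disk block number.  As with the legacy
 * ->bmap interface, 0 doubles as the "no answer" return value.
 */
static uint64_t mini_bmap(const struct mini_iomap *iomap, uint64_t bno,
			  unsigned int blkbits)
{
	uint64_t pos = bno << blkbits;	/* block number -> byte position */
	uint64_t addr;

	if (!iomap->mapped)
		return 0;
	assert(pos >= iomap->offset);

	/* byte position -> disk byte address -> disk block number */
	addr = (pos - iomap->offset + iomap->addr) >> blkbits;
	if (addr > INT_MAX)
		return 0;	/* would truncate, so report nothing */
	return addr;
}
```

With 4k blocks and a mapping that places file offset 0 at disk block 100,
file block 3 comes back as disk block 103.
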
^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 15/34] iomap: add an iomap-based readpage and readpages implementation
  2018-05-23 14:43 ` [PATCH 15/34] iomap: add an iomap-based readpage and readpages implementation Christoph Hellwig
@ 2018-05-30  6:11   ` Darrick J. Wong
  2018-05-30  6:23     ` Christoph Hellwig
  0 siblings, 1 reply; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  6:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:38PM +0200, Christoph Hellwig wrote:
> Simply use iomap_apply to iterate over the file and submit a bio for
> each non-uptodate but mapped region and zero everything else.  Note that
> as-is this cannot be used for file systems with a blocksize smaller than
> the page size, but that support will be added later.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap.c            | 194 +++++++++++++++++++++++++++++++++++++++++-
>  include/linux/iomap.h |   4 +
>  2 files changed, 197 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index fa278ed338ce..78259a2249f4 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (C) 2010 Red Hat, Inc.
> - * Copyright (c) 2016 Christoph Hellwig.
> + * Copyright (c) 2016-2018 Christoph Hellwig.
>   *
>   * This program is free software; you can redistribute it and/or modify it
>   * under the terms and conditions of the GNU General Public License,
> @@ -18,6 +18,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/gfp.h>
>  #include <linux/mm.h>
> +#include <linux/mm_inline.h>
>  #include <linux/swap.h>
>  #include <linux/pagemap.h>
>  #include <linux/pagevec.h>
> @@ -103,6 +104,197 @@ iomap_sector(struct iomap *iomap, loff_t pos)
>  	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
>  }
>  
> +static void
> +iomap_read_end_io(struct bio *bio)
> +{
> +	int error = blk_status_to_errno(bio->bi_status);
> +	struct bio_vec *bvec;
> +	int i;
> +
> +	bio_for_each_segment_all(bvec, bio, i)
> +		page_endio(bvec->bv_page, false, error);
> +	bio_put(bio);
> +}
> +
> +static struct bio *
> +iomap_read_bio_alloc(struct iomap *iomap, sector_t sector, loff_t length)
> +{
> +	int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	struct bio *bio = bio_alloc(GFP_NOFS, min(BIO_MAX_PAGES, nr_vecs));
> +
> +	bio->bi_opf = REQ_OP_READ;
> +	bio->bi_iter.bi_sector = sector;
> +	bio_set_dev(bio, iomap->bdev);
> +	bio->bi_end_io = iomap_read_end_io;
> +	return bio;
> +}
> +
> +struct iomap_readpage_ctx {
> +	struct page		*cur_page;
> +	bool			cur_page_in_bio;
> +	struct bio		*bio;
> +	struct list_head	*pages;
> +};
> +
> +static loff_t
> +iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> +		struct iomap *iomap)
> +{
> +	struct iomap_readpage_ctx *ctx = data;
> +	struct page *page = ctx->cur_page;
> +	unsigned poff = pos & (PAGE_SIZE - 1);
> +	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
> +	bool is_contig = false;
> +	sector_t sector;
> +
> +	/* we don't support blocksize < PAGE_SIZE quite yet: */
> +	WARN_ON_ONCE(pos != page_offset(page));
> +	WARN_ON_ONCE(plen != PAGE_SIZE);
> +
> +	if (iomap->type != IOMAP_MAPPED || pos >= i_size_read(inode)) {
> +		zero_user(page, poff, plen);
> +		SetPageUptodate(page);
> +		goto done;
> +	}
> +
> +	ctx->cur_page_in_bio = true;
> +
> +	/*
> +	 * Try to merge into a previous segment if we can.
> +	 */
> +	sector = iomap_sector(iomap, pos);
> +	if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
> +		if (__bio_try_merge_page(ctx->bio, page, plen, poff))
> +			goto done;
> +		is_contig = true;
> +	}
> +
> +	if (!ctx->bio || !is_contig || bio_full(ctx->bio)) {
> +		if (ctx->bio)
> +			submit_bio(ctx->bio);
> +		ctx->bio = iomap_read_bio_alloc(iomap, sector, length);
> +	}
> +
> +	__bio_add_page(ctx->bio, page, plen, poff);
> +done:
> +	return plen;
> +}
> +
> +int
> +iomap_readpage(struct page *page, const struct iomap_ops *ops)
> +{
> +	struct iomap_readpage_ctx ctx = { .cur_page = page };
> +	struct inode *inode = page->mapping->host;
> +	unsigned poff;
> +	loff_t ret;
> +
> +	WARN_ON_ONCE(page_has_buffers(page));
> +
> +	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
> +		ret = iomap_apply(inode, page_offset(page) + poff,
> +				PAGE_SIZE - poff, 0, ops, &ctx,
> +				iomap_readpage_actor);
> +		if (ret <= 0) {
> +			WARN_ON_ONCE(ret == 0);
> +			SetPageError(page);
> +			break;
> +		}
> +	}
> +
> +	if (ctx.bio) {
> +		submit_bio(ctx.bio);
> +		WARN_ON_ONCE(!ctx.cur_page_in_bio);
> +	} else {
> +		WARN_ON_ONCE(ctx.cur_page_in_bio);
> +		unlock_page(page);
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iomap_readpage);
> +
> +static struct page *
> +iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
> +		loff_t length, loff_t *done)
> +{
> +	while (!list_empty(pages)) {
> +		struct page *page = lru_to_page(pages);
> +
> +		if (page_offset(page) >= (u64)pos + length)
> +			break;
> +
> +		list_del(&page->lru);
> +		if (!add_to_page_cache_lru(page, inode->i_mapping, page->index,
> +				GFP_NOFS))

I'm curious about this line -- if add_to_page_cache_lru returns an
error, why don't we want to send that back up the stack?  Is the idea
that the page doesn't become uptodate and something else notices?   It
seems a little odd that on error we just skip to the next page.

(If this /is/ correct then a comment is needed here.)

--D

> +			return page;
> +
> +		*done += PAGE_SIZE;
> +		put_page(page);
> +	}
> +
> +	return NULL;
> +}
> +
> +static loff_t
> +iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> +		void *data, struct iomap *iomap)
> +{
> +	struct iomap_readpage_ctx *ctx = data;
> +	loff_t done, ret;
> +
> +	for (done = 0; done < length; done += ret) {
> +		if (ctx->cur_page && ((pos + done) & (PAGE_SIZE - 1)) == 0) {
> +			if (!ctx->cur_page_in_bio)
> +				unlock_page(ctx->cur_page);
> +			put_page(ctx->cur_page);
> +			ctx->cur_page = NULL;
> +		}
> +		if (!ctx->cur_page) {
> +			ctx->cur_page = iomap_next_page(inode, ctx->pages,
> +					pos, length, &done);
> +			if (!ctx->cur_page)
> +				break;
> +			ctx->cur_page_in_bio = false;
> +		}
> +		ret = iomap_readpage_actor(inode, pos + done, length - done,
> +				ctx, iomap);
> +	}
> +
> +	return done;
> +}
> +
> +int
> +iomap_readpages(struct address_space *mapping, struct list_head *pages,
> +		unsigned nr_pages, const struct iomap_ops *ops)
> +{
> +	struct iomap_readpage_ctx ctx = { .pages = pages };
> +	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
> +	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
> +	loff_t length = last - pos + PAGE_SIZE, ret = 0;
> +
> +	while (length > 0) {
> +		ret = iomap_apply(mapping->host, pos, length, 0, ops,
> +				&ctx, iomap_readpages_actor);
> +		if (ret <= 0) {
> +			WARN_ON_ONCE(ret == 0);
> +			goto done;
> +		}
> +		pos += ret;
> +		length -= ret;
> +	}
> +	ret = 0;
> +done:
> +	if (ctx.bio)
> +		submit_bio(ctx.bio);
> +	if (ctx.cur_page) {
> +		if (!ctx.cur_page_in_bio)
> +			unlock_page(ctx.cur_page);
> +		put_page(ctx.cur_page);
> +	}
> +	WARN_ON_ONCE(!ret && !list_empty(ctx.pages));
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iomap_readpages);
> +
>  static void
>  iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>  {
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index a044a824da85..7300d30ca495 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -9,6 +9,7 @@ struct fiemap_extent_info;
>  struct inode;
>  struct iov_iter;
>  struct kiocb;
> +struct page;
>  struct vm_area_struct;
>  struct vm_fault;
>  
> @@ -88,6 +89,9 @@ struct iomap_ops {
>  
>  ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
>  		const struct iomap_ops *ops);
> +int iomap_readpage(struct page *page, const struct iomap_ops *ops);
> +int iomap_readpages(struct address_space *mapping, struct list_head *pages,
> +		unsigned nr_pages, const struct iomap_ops *ops);
>  int iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
>  		const struct iomap_ops *ops);
>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> -- 
> 2.17.0
> 

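The bio handling in iomap_readpage_actor boils down to "extend the current
bio while the next page is contiguous on disk and the bio has room,
otherwise start a new one".  A userspace sketch of just that decision
follows.  It assumes 4k pages and 512-byte sectors, handles only mapped
pages (no zeroing of holes), and struct fake_bio, FAKE_BIO_MAX_PAGES and
build_read_bios() are hypothetical names, with the page limit kept tiny so
the "bio full" case is easy to hit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTORS_PER_PAGE	8	/* assumes 4096-byte pages, 512-byte sectors */
#define FAKE_BIO_MAX_PAGES	4	/* hypothetical, much smaller than the kernel's */

struct fake_bio {
	uint64_t sector;	/* first sector covered by the bio */
	unsigned int nr_pages;	/* pages added so far */
};

static uint64_t fake_bio_end_sector(const struct fake_bio *bio)
{
	return bio->sector + (uint64_t)bio->nr_pages * SECTORS_PER_PAGE;
}

/*
 * page_sectors[i] is the disk sector backing page i.  Pages are appended
 * to the current bio while they are contiguous on disk and the bio is not
 * full; otherwise a new bio is started.  Returns the number of bios
 * written to out[] (at most max_out).
 */
static size_t build_read_bios(const uint64_t *page_sectors, size_t nr_pages,
			      struct fake_bio *out, size_t max_out)
{
	size_t nr_bios = 0;

	assert(page_sectors && out);

	for (size_t i = 0; i < nr_pages; i++) {
		struct fake_bio *cur = nr_bios ? &out[nr_bios - 1] : NULL;

		if (cur && fake_bio_end_sector(cur) == page_sectors[i] &&
		    cur->nr_pages < FAKE_BIO_MAX_PAGES) {
			cur->nr_pages++;	/* contiguous and not full: extend */
			continue;
		}
		if (nr_bios == max_out)
			break;
		/* Discontiguous or full: the old bio is done, start a new one. */
		out[nr_bios++] = (struct fake_bio){ page_sectors[i], 1 };
	}
	return nr_bios;
}
```
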
^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 17/34] xfs: use iomap_bmap
  2018-05-23 14:43 ` [PATCH 17/34] xfs: use iomap_bmap Christoph Hellwig
@ 2018-05-30  6:14   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  6:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:40PM +0200, Christoph Hellwig wrote:
> Switch to the iomap based bmap implementation to get rid of one of the
> last users of xfs_get_blocks.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_aops.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 80de476cecf8..56e405572909 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1378,10 +1378,9 @@ xfs_vm_bmap(
>  	struct address_space	*mapping,
>  	sector_t		block)
>  {
> -	struct inode		*inode = (struct inode *)mapping->host;
> -	struct xfs_inode	*ip = XFS_I(inode);
> +	struct xfs_inode	*ip = XFS_I(mapping->host);
>  
> -	trace_xfs_vm_bmap(XFS_I(inode));
> +	trace_xfs_vm_bmap(ip);
>  
>  	/*
>  	 * The swap code (ab-)uses ->bmap to get a block mapping and then
> @@ -1394,9 +1393,7 @@ xfs_vm_bmap(
>  	 */
>  	if (xfs_is_reflink_inode(ip) || XFS_IS_REALTIME_INODE(ip))
>  		return 0;
> -
> -	filemap_write_and_wait(mapping);
> -	return generic_block_bmap(mapping, block, xfs_get_blocks);
> +	return iomap_bmap(mapping, block, &xfs_iomap_ops);
>  }
>  
>  STATIC int
> -- 
> 2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 16/34] iomap: add initial support for writes without buffer heads
  2018-05-23 14:43 ` [PATCH 16/34] iomap: add initial support for writes without buffer heads Christoph Hellwig
@ 2018-05-30  6:21   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  6:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:39PM +0200, Christoph Hellwig wrote:
> For now just limited to blocksize == PAGE_SIZE, where we can simply read
> in the full page in write begin, and just set the whole page dirty after
> copying data into it.  This code is enabled by default and XFS will now
> be fed pages without buffer heads in ->writepage and ->writepages.
> 
> If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap, the old
> path will still be used.  This both helps the transition in XFS and
> prepares for the gfs2 migration to the iomap infrastructure.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok, stopping here for the night...
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/iomap.c            | 129 ++++++++++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_iomap.c    |   6 +-
>  include/linux/iomap.h |   2 +
>  3 files changed, 124 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 78259a2249f4..debb859a8a14 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -308,6 +308,49 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>  		truncate_pagecache_range(inode, max(pos, i_size), pos + len);
>  }
>  
> +static int
> +iomap_read_page_sync(struct inode *inode, loff_t block_start, struct page *page,
> +		unsigned poff, unsigned plen, unsigned from, unsigned to,
> +		struct iomap *iomap)
> +{
> +	struct bio_vec bvec;
> +	struct bio bio;
> +
> +	if (iomap->type != IOMAP_MAPPED || block_start >= i_size_read(inode)) {
> +		zero_user_segments(page, poff, from, to, poff + plen);
> +		return 0;
> +	}
> +
> +	bio_init(&bio, &bvec, 1);
> +	bio.bi_opf = REQ_OP_READ;
> +	bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
> +	bio_set_dev(&bio, iomap->bdev);
> +	__bio_add_page(&bio, page, plen, poff);
> +	return submit_bio_wait(&bio);
> +}
> +
> +static int
> +__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
> +		struct page *page, struct iomap *iomap)
> +{
> +	loff_t block_size = i_blocksize(inode);
> +	loff_t block_start = pos & ~(block_size - 1);
> +	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
> +	unsigned poff = block_start & (PAGE_SIZE - 1);
> +	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, block_end - block_start);
> +	unsigned from = pos & (PAGE_SIZE - 1);
> +	unsigned to = from + len;
> +
> +	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
> +
> +	if (PageUptodate(page))
> +		return 0;
> +	if (from <= poff && to >= poff + plen)
> +		return 0;
> +	return iomap_read_page_sync(inode, block_start, page,
> +			poff, plen, from, to, iomap);
> +}
> +
>  static int
>  iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  		struct page **pagep, struct iomap *iomap)
> @@ -325,7 +368,10 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  	if (!page)
>  		return -ENOMEM;
>  
> -	status = __block_write_begin_int(page, pos, len, NULL, iomap);
> +	if (iomap->flags & IOMAP_F_BUFFER_HEAD)
> +		status = __block_write_begin_int(page, pos, len, NULL, iomap);
> +	else
> +		status = __iomap_write_begin(inode, pos, len, page, iomap);
>  	if (unlikely(status)) {
>  		unlock_page(page);
>  		put_page(page);
> @@ -338,14 +384,69 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  	return status;
>  }
>  
> +int
> +iomap_set_page_dirty(struct page *page)
> +{
> +	struct address_space *mapping = page_mapping(page);
> +	int newly_dirty;
> +
> +	if (unlikely(!mapping))
> +		return !TestSetPageDirty(page);
> +
> +	/*
> +	 * Lock out page->mem_cgroup migration to keep PageDirty
> +	 * synchronized with per-memcg dirty page counters.
> +	 */
> +	lock_page_memcg(page);
> +	newly_dirty = !TestSetPageDirty(page);
> +	if (newly_dirty)
> +		__set_page_dirty(page, mapping, 0);
> +	unlock_page_memcg(page);
> +
> +	if (newly_dirty)
> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +	return newly_dirty;
> +}
> +EXPORT_SYMBOL_GPL(iomap_set_page_dirty);
> +
> +static int
> +__iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
> +		unsigned copied, struct page *page, struct iomap *iomap)
> +{
> +	flush_dcache_page(page);
> +
> +	/*
> +	 * The blocks that were entirely written will now be uptodate, so we
> +	 * don't have to worry about a readpage reading them and overwriting a
> +	 * partial write.  However if we have encountered a short write and only
> +	 * partially written into a block, it will not be marked uptodate, so a
> +	 * readpage might come in and destroy our partial write.
> +	 *
> +	 * Do the simplest thing, and just treat any short write to a non
> +	 * uptodate page as a zero-length write, and force the caller to redo
> +	 * the whole thing.
> +	 */
> +	if (unlikely(copied < len && !PageUptodate(page))) {
> +		copied = 0;
> +	} else {
> +		SetPageUptodate(page);
> +		iomap_set_page_dirty(page);
> +	}
> +	return __generic_write_end(inode, pos, copied, page);
> +}
> +
>  static int
>  iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
> -		unsigned copied, struct page *page)
> +		unsigned copied, struct page *page, struct iomap *iomap)
>  {
>  	int ret;
>  
> -	ret = generic_write_end(NULL, inode->i_mapping, pos, len,
> -			copied, page, NULL);
> +	if (iomap->flags & IOMAP_F_BUFFER_HEAD)
> +		ret = generic_write_end(NULL, inode->i_mapping, pos, len,
> +				copied, page, NULL);
> +	else
> +		ret = __iomap_write_end(inode, pos, len, copied, page, iomap);
> +
>  	if (ret < len)
>  		iomap_write_failed(inode, pos, len);
>  	return ret;
> @@ -400,7 +501,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  
>  		flush_dcache_page(page);
>  
> -		status = iomap_write_end(inode, pos, bytes, copied, page);
> +		status = iomap_write_end(inode, pos, bytes, copied, page,
> +				iomap);
>  		if (unlikely(status < 0))
>  			break;
>  		copied = status;
> @@ -494,7 +596,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  
>  		WARN_ON_ONCE(!PageUptodate(page));
>  
> -		status = iomap_write_end(inode, pos, bytes, bytes, page);
> +		status = iomap_write_end(inode, pos, bytes, bytes, page, iomap);
>  		if (unlikely(status <= 0)) {
>  			if (WARN_ON_ONCE(status == 0))
>  				return -EIO;
> @@ -546,7 +648,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
>  	zero_user(page, offset, bytes);
>  	mark_page_accessed(page);
>  
> -	return iomap_write_end(inode, pos, bytes, bytes, page);
> +	return iomap_write_end(inode, pos, bytes, bytes, page, iomap);
>  }
>  
>  static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
> @@ -632,11 +734,16 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
>  	struct page *page = data;
>  	int ret;
>  
> -	ret = __block_write_begin_int(page, pos, length, NULL, iomap);
> -	if (ret)
> -		return ret;
> +	if (iomap->flags & IOMAP_F_BUFFER_HEAD) {
> +		ret = __block_write_begin_int(page, pos, length, NULL, iomap);
> +		if (ret)
> +			return ret;
> +		block_commit_write(page, 0, length);
> +	} else {
> +		WARN_ON_ONCE(!PageUptodate(page));
> +		WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
> +	}
>  
> -	block_commit_write(page, 0, length);
>  	return length;
>  }
>  
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index c6ce6f9335b6..da6d1995e460 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -638,7 +638,7 @@ xfs_file_iomap_begin_delay(
>  	 * Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
>  	 * them out if the write happens to fail.
>  	 */
> -	iomap->flags = IOMAP_F_NEW;
> +	iomap->flags |= IOMAP_F_NEW;
>  	trace_xfs_iomap_alloc(ip, offset, count, 0, &got);
>  done:
>  	if (isnullstartblock(got.br_startblock))
> @@ -1031,6 +1031,8 @@ xfs_file_iomap_begin(
>  	if (XFS_FORCED_SHUTDOWN(mp))
>  		return -EIO;
>  
> +	iomap->flags |= IOMAP_F_BUFFER_HEAD;
> +
>  	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
>  			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>  		/* Reserve delalloc blocks for regular writeback. */
> @@ -1131,7 +1133,7 @@ xfs_file_iomap_begin(
>  	if (error)
>  		return error;
>  
> -	iomap->flags = IOMAP_F_NEW;
> +	iomap->flags |= IOMAP_F_NEW;
>  	trace_xfs_iomap_alloc(ip, offset, length, 0, &imap);
>  
>  out_finish:
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 7300d30ca495..4d3d9d0cd69f 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -30,6 +30,7 @@ struct vm_fault;
>   */
>  #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
>  #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
> +#define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
>  
>  /*
>   * Flags that only need to be reported for IOMAP_REPORT requests:
> @@ -92,6 +93,7 @@ ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
>  int iomap_readpage(struct page *page, const struct iomap_ops *ops);
>  int iomap_readpages(struct address_space *mapping, struct list_head *pages,
>  		unsigned nr_pages, const struct iomap_ops *ops);
> +int iomap_set_page_dirty(struct page *page);
>  int iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
>  		const struct iomap_ops *ops);
>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> -- 
> 2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread
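
For readers following the arithmetic in __iomap_write_begin from the patch
quoted above, here is a minimal userspace sketch of the same rounding and of
the "write covers the whole block range" test that lets the read be skipped.
It is not kernel code: SKETCH_PAGE_SIZE, struct write_span, compute_span and
write_covers_range are names made up for illustration, and the page size is
assumed to be 4096 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration only */

/* Hypothetical container for the locals computed in __iomap_write_begin. */
struct write_span {
	int64_t block_start;	/* pos rounded down to a block boundary */
	int64_t block_end;	/* pos + len rounded up to a block boundary */
	unsigned poff;		/* page-relative offset of block_start */
	unsigned plen;		/* bytes of the block range inside this page */
	unsigned from;		/* page-relative first byte being written */
	unsigned to;		/* page-relative end of the write (exclusive) */
};

/* Mirrors the rounding in the quoted patch; block_size must be a power of 2. */
static struct write_span compute_span(int64_t pos, unsigned len,
		int64_t block_size)
{
	struct write_span s;
	int64_t span;

	assert(block_size > 0 && (block_size & (block_size - 1)) == 0);

	s.block_start = pos & ~(block_size - 1);
	s.block_end = (pos + len + block_size - 1) & ~(block_size - 1);
	s.poff = (unsigned)(s.block_start & (SKETCH_PAGE_SIZE - 1));
	span = s.block_end - s.block_start;
	/* same effect as min_t(loff_t, PAGE_SIZE - poff, block_end - block_start) */
	s.plen = (unsigned)(span < (int64_t)(SKETCH_PAGE_SIZE - s.poff) ?
			span : (int64_t)(SKETCH_PAGE_SIZE - s.poff));
	s.from = (unsigned)(pos & (SKETCH_PAGE_SIZE - 1));
	s.to = s.from + len;
	return s;
}

/* True when the write overwrites the whole range, so nothing must be read. */
static bool write_covers_range(const struct write_span *s)
{
	return s->from <= s->poff && s->to >= s->poff + s->plen;
}
```

With block size == page size, a full-page write at offset 0 covers the range
and needs no read.  A 200-byte write at offset 100 does not, so unless the
page is already uptodate the patch would go through iomap_read_page_sync.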

* Re: [PATCH 18/34] xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages
  2018-05-23 14:43 ` [PATCH 18/34] xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages Christoph Hellwig
@ 2018-05-30  6:22   ` Darrick J. Wong
  0 siblings, 0 replies; 78+ messages in thread
From: Darrick J. Wong @ 2018-05-30  6:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Wed, May 23, 2018 at 04:43:41PM +0200, Christoph Hellwig wrote:
> For file systems with a block size that equals the page size we never do
> partial reads, so we can use the buffer_head-less iomap versions of
> readpage and readpages without conflicting with the buffer_head structures
> created later in write_begin.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_aops.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 56e405572909..c631c457b444 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1402,6 +1402,8 @@ xfs_vm_readpage(
>  	struct page		*page)
>  {
>  	trace_xfs_vm_readpage(page->mapping->host, 1);
> +	if (i_blocksize(page->mapping->host) == PAGE_SIZE)
> +		return iomap_readpage(page, &xfs_iomap_ops);
>  	return mpage_readpage(page, xfs_get_blocks);
>  }
>  
> @@ -1413,6 +1415,8 @@ xfs_vm_readpages(
>  	unsigned		nr_pages)
>  {
>  	trace_xfs_vm_readpages(mapping->host, nr_pages);
> +	if (i_blocksize(mapping->host) == PAGE_SIZE)
> +		return iomap_readpages(mapping, pages, nr_pages, &xfs_iomap_ops);
>  	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
>  }
>  
> -- 
> 2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 15/34] iomap: add an iomap-based readpage and readpages implementation
  2018-05-30  6:11   ` Darrick J. Wong
@ 2018-05-30  6:23     ` Christoph Hellwig
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-30  6:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, linux-mm

On Tue, May 29, 2018 at 11:11:46PM -0700, Darrick J. Wong wrote:
> > +		list_del(&page->lru);
> > +		if (!add_to_page_cache_lru(page, inode->i_mapping, page->index,
> > +				GFP_NOFS))
> 
> I'm curious about this line -- if add_to_page_cache_lru returns an
> error, why don't we want to send that back up the stack?  Is the idea
> that the page doesn't become uptodate and something else notices?   It
> seems a little odd that on error we just skip to the next page.
> 
> (If this /is/ correct then a comment is needed here.)

readpages is only used for read-ahead, so the upper layers literally
don't care as long as we don't mess up the page refcount.  This logic
is taken straight from mpage_readpages, but I'll add a comment anyway.

^ permalink raw reply	[flat|nested] 78+ messages in thread
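
The behaviour described in the reply above (skip a page that cannot be added
to the page cache, report no error, keep the refcount balanced) can be
modelled in userspace roughly as follows.  This is only a sketch under stated
assumptions: struct sketch_page, its fields and readahead_sketch are
hypothetical stand-ins, and the refcount handling follows the pattern in
mpage_readpages rather than any code quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page, just enough to model the loop. */
struct sketch_page {
	bool cache_add_fails;	/* simulates add_to_page_cache_lru() failing */
	bool read_started;	/* set once a read would have been issued */
	int refcount;		/* starts at 1: the reference held by readahead */
};

/*
 * Walk the readahead pages.  A page that cannot be inserted into the page
 * cache is simply skipped; no error is propagated because readahead is
 * best effort.  Returns the number of pages a read was started for.
 */
static size_t readahead_sketch(struct sketch_page *pages, size_t nr)
{
	size_t started = 0;

	for (size_t i = 0; i < nr; i++) {
		struct sketch_page *p = &pages[i];

		if (!p->cache_add_fails) {
			p->refcount++;		/* page cache takes its own reference */
			p->read_started = true;
			started++;
		}
		p->refcount--;			/* readahead drops its reference either way */
	}
	return started;
}
```

A page that failed insertion ends with a refcount of zero and is freed, while
the others stay alive through the page cache reference.  A later read of the
skipped range would find no uptodate page and fall back to ->readpage.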

* Re: [Cluster-devel] [PATCH 11/34] iomap: move IOMAP_F_BOUNDARY to gfs2
  2018-05-30  5:50   ` Darrick J. Wong
@ 2018-05-30  9:30     ` Steven Whitehouse
  2018-05-30  9:59       ` Christoph Hellwig
  0 siblings, 1 reply; 78+ messages in thread
From: Steven Whitehouse @ 2018-05-30  9:30 UTC (permalink / raw)
  To: Darrick J. Wong, Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, cluster-devel, linux-mm

Hi,


On 30/05/18 06:50, Darrick J. Wong wrote:
> On Wed, May 23, 2018 at 04:43:34PM +0200, Christoph Hellwig wrote:
>> Just define a range of fs specific flags and use that in gfs2 instead of
>> exposing this internal flag globally.
>>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Looks ok to me, but better if the gfs2 folks [cc'd now] ack this...
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
>
> --D
I may have missed the context here, but I thought that the boundary was 
a generic thing meaning "there will have to be a metadata read before 
more blocks can be mapped" so I'm not sure why that would now be GFS2 
specific?

Steve.

>> ---
>>   fs/gfs2/bmap.c        | 8 +++++---
>>   include/linux/iomap.h | 9 +++++++--
>>   2 files changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
>> index cbeedd3cfb36..8efa6297e19c 100644
>> --- a/fs/gfs2/bmap.c
>> +++ b/fs/gfs2/bmap.c
>> @@ -683,6 +683,8 @@ static void gfs2_stuffed_iomap(struct inode *inode, struct iomap *iomap)
>>   	iomap->type = IOMAP_INLINE;
>>   }
>>   
>> +#define IOMAP_F_GFS2_BOUNDARY IOMAP_F_PRIVATE
>> +
>>   /**
>>    * gfs2_iomap_begin - Map blocks from an inode to disk blocks
>>    * @inode: The inode
>> @@ -774,7 +776,7 @@ int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
>>   	bh = mp.mp_bh[ip->i_height - 1];
>>   	len = gfs2_extent_length(bh->b_data, bh->b_size, ptr, lend - lblock, &eob);
>>   	if (eob)
>> -		iomap->flags |= IOMAP_F_BOUNDARY;
>> +		iomap->flags |= IOMAP_F_GFS2_BOUNDARY;
>>   	iomap->length = (u64)len << inode->i_blkbits;
>>   
>>   out_release:
>> @@ -846,12 +848,12 @@ int gfs2_block_map(struct inode *inode, sector_t lblock,
>>   
>>   	if (iomap.length > bh_map->b_size) {
>>   		iomap.length = bh_map->b_size;
>> -		iomap.flags &= ~IOMAP_F_BOUNDARY;
>> +		iomap.flags &= ~IOMAP_F_GFS2_BOUNDARY;
>>   	}
>>   	if (iomap.addr != IOMAP_NULL_ADDR)
>>   		map_bh(bh_map, inode->i_sb, iomap.addr >> inode->i_blkbits);
>>   	bh_map->b_size = iomap.length;
>> -	if (iomap.flags & IOMAP_F_BOUNDARY)
>> +	if (iomap.flags & IOMAP_F_GFS2_BOUNDARY)
>>   		set_buffer_boundary(bh_map);
>>   	if (iomap.flags & IOMAP_F_NEW)
>>   		set_buffer_new(bh_map);
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 13d19b4c29a9..819e0cd2a950 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -27,8 +27,7 @@ struct vm_fault;
>>    * written data and requires fdatasync to commit them to persistent storage.
>>    */
>>   #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
>> -#define IOMAP_F_BOUNDARY	0x02	/* mapping ends at metadata boundary */
>> -#define IOMAP_F_DIRTY		0x04	/* uncommitted metadata */
>> +#define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
>>   
>>   /*
>>    * Flags that only need to be reported for IOMAP_REPORT requests:
>> @@ -36,6 +35,12 @@ struct vm_fault;
>>   #define IOMAP_F_MERGED		0x10	/* contains multiple blocks/extents */
>>   #define IOMAP_F_SHARED		0x20	/* block shared with another file */
>>   
>> +/*
>> + * Flags from 0x1000 up are for file system specific usage:
>> + */
>> +#define IOMAP_F_PRIVATE		0x1000
>> +
>> +
>>   /*
>>    * Magic value for addr:
>>    */
>> -- 
>> 2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Cluster-devel] [PATCH 11/34] iomap: move IOMAP_F_BOUNDARY to gfs2
  2018-05-30  9:30     ` [Cluster-devel] " Steven Whitehouse
@ 2018-05-30  9:59       ` Christoph Hellwig
  2018-05-30 10:02         ` Steven Whitehouse
  0 siblings, 1 reply; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-30  9:59 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Darrick J. Wong, Christoph Hellwig, linux-xfs, linux-fsdevel,
	cluster-devel, linux-mm

On Wed, May 30, 2018 at 10:30:32AM +0100, Steven Whitehouse wrote:
> I may have missed the context here, but I thought that the boundary was a 
> generic thing meaning "there will have to be a metadata read before more 
> blocks can be mapped" so I'm not sure why that would now be GFS2 specific?

It was always a hack.  But with iomap it doesn't make any sense to start
with: all metadata I/O happens in iomap_begin, so there is no point in
marking an iomap with flags like this for the actual iomap interface.

^ permalink raw reply	[flat|nested] 78+ messages in thread
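
To make the flag split in patch 11/34 concrete, the sketch below restates the
values from the quoted include/linux/iomap.h and fs/gfs2/bmap.c hunks and
checks that the gfs2 alias cannot collide with a generic flag.  The macro
SKETCH_GENERIC_MASK and the helpers is_fs_private and strip_private are
hypothetical and exist only for this illustration; nothing in the series
defines them.

```c
#include <assert.h>
#include <stdbool.h>

/* Generic flag values as they appear in the quoted iomap.h hunk. */
#define IOMAP_F_NEW		0x01u	/* blocks have been newly allocated */
#define IOMAP_F_DIRTY		0x02u	/* uncommitted metadata */
#define IOMAP_F_MERGED		0x10u	/* contains multiple blocks/extents */
#define IOMAP_F_SHARED		0x20u	/* block shared with another file */

/* Flags from 0x1000 up are for file system specific usage. */
#define IOMAP_F_PRIVATE		0x1000u

/* The gfs2 alias from the quoted fs/gfs2/bmap.c hunk. */
#define IOMAP_F_GFS2_BOUNDARY	IOMAP_F_PRIVATE

/* Hypothetical: every bit below the private range. */
#define SKETCH_GENERIC_MASK	(IOMAP_F_PRIVATE - 1u)

/* Hypothetical helper: does this single flag bit belong to the fs range? */
static bool is_fs_private(unsigned flag)
{
	return flag >= IOMAP_F_PRIVATE;
}

/* Hypothetical helper: keep only the flags generic iomap code understands. */
static unsigned strip_private(unsigned flags)
{
	return flags & SKETCH_GENERIC_MASK;
}
```

Because the boundary flag now lives in the private range, generic iomap code
can ignore it entirely, while gfs2_block_map can still translate it into
set_buffer_boundary for the buffer_head interface.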

* Re: [Cluster-devel] [PATCH 11/34] iomap: move IOMAP_F_BOUNDARY to gfs2
  2018-05-30  9:59       ` Christoph Hellwig
@ 2018-05-30 10:02         ` Steven Whitehouse
  2018-05-30 10:10             ` Christoph Hellwig
  0 siblings, 1 reply; 78+ messages in thread
From: Steven Whitehouse @ 2018-05-30 10:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, cluster-devel,
	linux-mm, Andreas Grünbacher

Hi,


On 30/05/18 10:59, Christoph Hellwig wrote:
> On Wed, May 30, 2018 at 10:30:32AM +0100, Steven Whitehouse wrote:
>> I may have missed the context here, but I thought that the boundary was a
>> generic thing meaning "there will have to be a metadata read before more
>> blocks can be mapped" so I'm not sure why that would now be GFS2 specific?
> It was always a hack.  But with iomap it doesn't make any sense to start
> with: all metadata I/O happens in iomap_begin, so there is no point in
> marking an iomap with flags like this for the actual iomap interface.

In that case,  maybe it would be simpler to drop it for GFS2. Unless we 
are getting a lot of benefit from it, then we should probably just 
follow the generic pattern here. Eventually we'll move everything to 
iomap, so that the bh mapping interface will be gone. That implies that 
we might be able to drop it now, to avoid this complication during the 
conversion.

Andreas, do you see any issues with that?

Steve.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Cluster-devel] [PATCH 11/34] iomap: move IOMAP_F_BOUNDARY to gfs2
  2018-05-30 10:02         ` Steven Whitehouse
  2018-05-30 10:10             ` Christoph Hellwig
@ 2018-05-30 10:10             ` Christoph Hellwig
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-30 10:10 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	cluster-devel, linux-mm, Andreas Grünbacher

On Wed, May 30, 2018 at 11:02:08AM +0100, Steven Whitehouse wrote:
> In that case, maybe it would be simpler to drop it for GFS2. Unless we 
> are getting a lot of benefit from it, then we should probably just follow 
> the generic pattern here. Eventually we'll move everything to iomap, so 
> that the bh mapping interface will be gone. That implies that we might be 
> able to drop it now, to avoid this complication during the conversion.
>
> Andreas, do you see any issues with that?

I suspect it actually is doing the wrong thing today.  It certainly
does for SSDs, and it probably doesn't do a useful thing for modern
disks with intelligent caches either.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Cluster-devel] [PATCH 11/34] iomap: move IOMAP_F_BOUNDARY to gfs2
  2018-05-30 10:10             ` Christoph Hellwig
@ 2018-05-30 10:12             ` Steven Whitehouse
  2018-05-30 11:03               ` Andreas Gruenbacher
  -1 siblings, 1 reply; 78+ messages in thread
From: Steven Whitehouse @ 2018-05-30 10:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, cluster-devel,
	linux-mm, Andreas Grünbacher

Hi,


On 30/05/18 11:10, Christoph Hellwig wrote:
> On Wed, May 30, 2018 at 11:02:08AM +0100, Steven Whitehouse wrote:
>> In that case,  maybe it would be simpler to drop it for GFS2. Unless we
>> are getting a lot of benefit from it, then we should probably just follow
>> the generic pattern here. Eventually we'll move everything to iomap, so
>> that the bh mapping interface will be gone. That implies that we might be
>> able to drop it now, to avoid this complication during the conversion.
>>
>> Andreas, do you see any issues with that?
> I suspect it actually is doing the wrong thing today.  It certainly
> does for SSDs, and it probably doesn't do a useful thing for modern
> disks with intelligent caches either.

Yes, agreed that it makes no sense for SSDs,

Steve.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Cluster-devel] [PATCH 11/34] iomap: move IOMAP_F_BOUNDARY to gfs2
  2018-05-30 10:12             ` Steven Whitehouse
@ 2018-05-30 11:03               ` Andreas Gruenbacher
  0 siblings, 0 replies; 78+ messages in thread
From: Andreas Gruenbacher @ 2018-05-30 11:03 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	cluster-devel, linux-mm

On 30 May 2018 at 12:12, Steven Whitehouse <swhiteho@redhat.com> wrote:
> Hi,
>
> On 30/05/18 11:10, Christoph Hellwig wrote:
>>
>> On Wed, May 30, 2018 at 11:02:08AM +0100, Steven Whitehouse wrote:
>>>
>>> In that case,  maybe it would be simpler to drop it for GFS2. Unless we
>>> are getting a lot of benefit from it, then we should probably just follow
>>> the generic pattern here. Eventually we'll move everything to iomap, so
>>> that the bh mapping interface will be gone. That implies that we might be
>>> able to drop it now, to avoid this complication during the conversion.
>>>
>>> Andreas, do you see any issues with that?

We're not handling reads through iomap yet, so I'd be happier with
keeping that flag in one form or the other until we get there. This
will go away eventually anyway.

>> I suspect it actually is doing the wrong thing today.  It certainly
>> does for SSDs, and it probably doesn't do a useful thing for modern
>> disks with intelligent caches either.
>
>
> Yes, agreed that it makes no sense for SSDs,

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 30/34] xfs: move all writeback buffer_head manipulation into xfs_map_at_offset
  2018-05-18 16:47 buffered I/O without buffer heads in xfs and iomap v2 Christoph Hellwig
@ 2018-05-18 16:48 ` Christoph Hellwig
  0 siblings, 0 replies; 78+ messages in thread
From: Christoph Hellwig @ 2018-05-18 16:48 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-block, linux-mm

This keeps it in a single place so it can be made optional more easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 22 +++++-----------------
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 592b33b35a30..951b329abb23 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -505,21 +505,6 @@ xfs_imap_valid(
 		offset < imap->br_startoff + imap->br_blockcount;
 }
 
-STATIC void
-xfs_start_buffer_writeback(
-	struct buffer_head	*bh)
-{
-	ASSERT(buffer_mapped(bh));
-	ASSERT(buffer_locked(bh));
-	ASSERT(!buffer_delay(bh));
-	ASSERT(!buffer_unwritten(bh));
-
-	bh->b_end_io = NULL;
-	set_buffer_async_write(bh);
-	set_buffer_uptodate(bh);
-	clear_buffer_dirty(bh);
-}
-
 STATIC void
 xfs_start_page_writeback(
 	struct page		*page,
@@ -728,6 +713,7 @@ xfs_map_at_offset(
 	ASSERT(imap->br_startblock != HOLESTARTBLOCK);
 	ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
 
+	lock_buffer(bh);
 	xfs_map_buffer(inode, bh, imap, offset);
 	set_buffer_mapped(bh);
 	clear_buffer_delay(bh);
@@ -740,6 +726,10 @@ xfs_map_at_offset(
 	 * set the bdev now.
 	 */
 	bh->b_bdev = xfs_find_bdev_for_inode(inode);
+	bh->b_end_io = NULL;
+	set_buffer_async_write(bh);
+	set_buffer_uptodate(bh);
+	clear_buffer_dirty(bh);
 }
 
 STATIC void
@@ -885,11 +875,9 @@ xfs_writepage_map(
 			continue;
 		}
 
-		lock_buffer(bh);
 		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
 		xfs_add_to_ioend(inode, file_offset, page, wpc, wbc,
 				&submit_list);
-		xfs_start_buffer_writeback(bh);
 		count++;
 	}
 
-- 
2.17.0

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2018-05-30 11:03 UTC | newest]

Thread overview: 78+ messages
2018-05-23 14:43 buffered I/O without buffer heads in xfs and iomap v3 Christoph Hellwig
2018-05-23 14:43 ` [PATCH 01/34] block: add a lower-level bio_add_page interface Christoph Hellwig
2018-05-30  5:28   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 02/34] fs: factor out a __generic_write_end helper Christoph Hellwig
2018-05-30  5:30   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 03/34] fs: move page_cache_seek_hole_data to iomap.c Christoph Hellwig
2018-05-30  5:31   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 04/34] fs: remove the buffer_unwritten check in page_seek_hole_data Christoph Hellwig
2018-05-30  5:36   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 05/34] fs: use ->is_partially_uptodate in page_cache_seek_hole_data Christoph Hellwig
2018-05-30  5:41   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 06/34] mm: give the 'ret' variable a better name __do_page_cache_readahead Christoph Hellwig
2018-05-30  5:42   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 07/34] mm: return an unsigned int from __do_page_cache_readahead Christoph Hellwig
2018-05-30  5:44   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 08/34] mm: split ->readpages calls to avoid non-contiguous pages lists Christoph Hellwig
2018-05-30  5:46   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 09/34] iomap: inline data should be an iomap type, not a flag Christoph Hellwig
2018-05-30  5:49   ` Darrick J. Wong
2018-05-30  5:49     ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 10/34] iomap: fix the comment describing IOMAP_NOWAIT Christoph Hellwig
2018-05-30  5:49   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 11/34] iomap: move IOMAP_F_BOUNDARY to gfs2 Christoph Hellwig
2018-05-30  5:50   ` Darrick J. Wong
2018-05-30  9:30     ` [Cluster-devel] " Steven Whitehouse
2018-05-30  9:59       ` Christoph Hellwig
2018-05-30 10:02         ` Steven Whitehouse
2018-05-30 10:10           ` Christoph Hellwig
2018-05-30 10:10             ` Christoph Hellwig
2018-05-30 10:10             ` Christoph Hellwig
2018-05-30 10:12             ` Steven Whitehouse
2018-05-30 11:03               ` Andreas Gruenbacher
2018-05-23 14:43 ` [PATCH 12/34] iomap: use __bio_add_page in iomap_dio_zero Christoph Hellwig
2018-05-30  5:51   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 13/34] iomap: add a iomap_sector helper Christoph Hellwig
2018-05-30  5:52   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 14/34] iomap: add an iomap-based bmap implementation Christoph Hellwig
2018-05-30  5:54   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 15/34] iomap: add an iomap-based readpage and readpages implementation Christoph Hellwig
2018-05-30  6:11   ` Darrick J. Wong
2018-05-30  6:23     ` Christoph Hellwig
2018-05-23 14:43 ` [PATCH 16/34] iomap: add initial support for writes without buffer heads Christoph Hellwig
2018-05-30  6:21   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 17/34] xfs: use iomap_bmap Christoph Hellwig
2018-05-30  6:14   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 18/34] xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages Christoph Hellwig
2018-05-30  6:22   ` Darrick J. Wong
2018-05-23 14:43 ` [PATCH 19/34] xfs: simplify xfs_bmap_punch_delalloc_range Christoph Hellwig
2018-05-23 16:17   ` Brian Foster
2018-05-24  8:01     ` Christoph Hellwig
2018-05-23 14:43 ` [PATCH 20/34] xfs: simplify xfs_aops_discard_page Christoph Hellwig
2018-05-23 14:43 ` [PATCH 21/34] xfs: move locking into xfs_bmap_punch_delalloc_range Christoph Hellwig
2018-05-23 14:43 ` [PATCH 22/34] xfs: make xfs_writepage_map extent map centric Christoph Hellwig
2018-05-24 14:59   ` Brian Foster
2018-05-24 16:53     ` Christoph Hellwig
2018-05-24 18:13       ` Brian Foster
2018-05-25  6:19         ` Christoph Hellwig
2018-05-25 11:35           ` Brian Foster
2018-05-28  7:15             ` Christoph Hellwig
2018-05-29 11:26               ` Brian Foster
2018-05-29 13:08                 ` Christoph Hellwig
2018-05-29 17:04                   ` Brian Foster
2018-05-23 14:43 ` [PATCH 23/34] xfs: remove the now unused XFS_BMAPI_IGSTATE flag Christoph Hellwig
2018-05-23 14:43 ` [PATCH 24/34] xfs: remove xfs_reflink_find_cow_mapping Christoph Hellwig
2018-05-23 14:43 ` [PATCH 25/34] xfs: remove xfs_reflink_trim_irec_to_next_cow Christoph Hellwig
2018-05-24 14:59   ` Brian Foster
2018-05-24 15:06     ` Brian Foster
2018-05-24 17:10       ` Christoph Hellwig
2018-05-23 14:43 ` [PATCH 26/34] xfs: simplify xfs_map_blocks by using xfs_iext_lookup_extent directly Christoph Hellwig
2018-05-23 14:43 ` [PATCH 27/34] xfs: don't clear imap_valid for a non-uptodate buffers Christoph Hellwig
2018-05-23 14:43 ` [PATCH 28/34] xfs: remove the imap_valid flag Christoph Hellwig
2018-05-23 14:43 ` [PATCH 29/34] xfs: don't look at buffer heads in xfs_add_to_ioend Christoph Hellwig
2018-05-23 14:43 ` [PATCH 30/34] xfs: move all writeback buffer_head manipulation into xfs_map_at_offset Christoph Hellwig
2018-05-23 14:43 ` [PATCH 31/34] xfs: remove xfs_start_page_writeback Christoph Hellwig
2018-05-23 14:43 ` [PATCH 32/34] xfs: refactor the tail of xfs_writepage_map Christoph Hellwig
2018-05-23 14:43 ` [PATCH 33/34] xfs: do not set the page uptodate in xfs_writepage_map Christoph Hellwig
2018-05-23 14:43 ` [PATCH 34/34] xfs: allow writeback on pages without buffer heads Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2018-05-18 16:47 buffered I/O without buffer heads in xfs and iomap v2 Christoph Hellwig
2018-05-18 16:48 ` [PATCH 30/34] xfs: move all writeback buffer_head manipulation into xfs_map_at_offset Christoph Hellwig
