* stop using buffer heads in xfs v6
@ 2018-06-15 13:01 Christoph Hellwig
  2018-06-15 13:01 ` [PATCH 01/24] iomap: add an iomap-based readpage and readpages implementation Christoph Hellwig
                   ` (23 more replies)
  0 siblings, 24 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Hi all,

This series adds support for buffered writes without buffer heads to
the iomap and XFS code.

Note: the iomap patches should go into a shared branch for the gfs2
iomap work.

A git tree is available at:

    git://git.infradead.org/users/hch/xfs.git xfs-iomap-write.6

Gitweb:

    http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-iomap-write.6

Changes since v5:
 - merged the leftovers of the previous series into one
 - rebased on top of the gfs2 iomap prep series
 - fixed the map_blocks map conditions
 - clean up a loop in __iomap_write_begin

Changes since v4:
 - make the logic more clear in xfs_bmap_punch_delalloc_range
 - fix a missing unlock in xfs_bmap_punch_delalloc_range
 - always look up COW extents to check for overlaps
 - lots of patch reshuffling and splitting without impact to the final
   result

Changes since v3:
 - iterate backwards in xfs_bmap_punch_delalloc_range
 - remove the cow_valid variable in xfs_reflink_trim_irec_to_next_cow
 - additional trivial xfs_map_blocks simplifications
 - split the read side into a separate prep series
 - moved the SEEK_HOLE/DATA patches that are not strictly required out of
   the series

Changes since v2:
 - minor page_seek_hole_data tweaks
 - don't read data entirely covered by the write operation in write_begin
 - fix zeroing on write_begin I/O failure
 - remove iomap_block_needs_zeroing to make the code more clear
 - update comments on __do_page_cache_readahead

Changes since v1:
 - fix the iomap_readpages error handling
 - use unsigned file offsets in a few places to avoid arithmetic overflows
 - allocate an iomap_page in iomap_page_mkwrite to fix generic/095
 - improve a few comments
 - add more asserts
 - warn about truncated block numbers from ->bmap
 - new patch to change the __do_page_cache_readahead return value to
   unsigned int
 - remove an incorrectly added empty line
 - make inline data an explicit iomap type instead of a flag
 - add an IOMAP_F_BUFFER_HEAD flag to force use of buffer heads for gfs2,
   and keep the basic buffer head infrastructure around for now.

* [PATCH 01/24] iomap: add an iomap-based readpage and readpages implementation
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-29 14:44   ` [PATCH] iomap: Add inline data support to iomap_readpage_actor Andreas Gruenbacher
  2018-06-15 13:01 ` [PATCH 02/24] xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages Christoph Hellwig
                   ` (22 subsequent siblings)
  23 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Simply use iomap_apply to iterate over the file and submit a bio for
each non-uptodate but mapped region, zeroing everything else.  Note that
as-is this cannot be used for file systems with a blocksize smaller than
the page size, but that support will be added later.
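
For reference, here is a condensed sketch of the iomap_apply calling
convention that the new readpage code builds on (error handling and
range clamping are elided; the real implementation lives in fs/iomap.c):

    static loff_t
    iomap_apply(struct inode *inode, loff_t pos, loff_t length,
                unsigned flags, const struct iomap_ops *ops, void *data,
                iomap_actor_t actor)
    {
            struct iomap iomap = { 0 };
            loff_t written;
            int ret;

            /* ask the file system for the mapping covering pos */
            ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
            if (ret)
                    return ret;

            /* let the actor (e.g. iomap_readpage_actor) consume a prefix */
            written = actor(inode, pos, length, data, &iomap);

            /* give the file system a chance to clean up */
            if (ops->iomap_end)
                    ops->iomap_end(inode, pos, length,
                                    written > 0 ? written : 0, flags, &iomap);
            return written;
    }

Each actor returns how many bytes it processed, so the caller can loop
until the whole range is covered, as the iomap_readpage loop below does.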

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/iomap.c            | 214 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/iomap.h |   3 +
 2 files changed, 216 insertions(+), 1 deletion(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 9e265118e8b0..e99e4fbc31f4 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (C) 2010 Red Hat, Inc.
- * Copyright (c) 2016 Christoph Hellwig.
+ * Copyright (c) 2016-2018 Christoph Hellwig.
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
+#include <linux/mm_inline.h>
 #include <linux/swap.h>
 #include <linux/pagemap.h>
 #include <linux/pagevec.h>
@@ -123,6 +124,217 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
 	SetPageUptodate(page);
 }
 
+static void
+iomap_read_end_io(struct bio *bio)
+{
+	int error = blk_status_to_errno(bio->bi_status);
+	struct bio_vec *bvec;
+	int i;
+
+	bio_for_each_segment_all(bvec, bio, i)
+		page_endio(bvec->bv_page, false, error);
+	bio_put(bio);
+}
+
+struct iomap_readpage_ctx {
+	struct page		*cur_page;
+	bool			cur_page_in_bio;
+	bool			is_readahead;
+	struct bio		*bio;
+	struct list_head	*pages;
+};
+
+static loff_t
+iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+		struct iomap *iomap)
+{
+	struct iomap_readpage_ctx *ctx = data;
+	struct page *page = ctx->cur_page;
+	unsigned poff = pos & (PAGE_SIZE - 1);
+	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+	bool is_contig = false;
+	sector_t sector;
+
+	/* we don't support blocksize < PAGE_SIZE quite yet. */
+	WARN_ON_ONCE(pos != page_offset(page));
+	WARN_ON_ONCE(plen != PAGE_SIZE);
+
+	if (iomap->type != IOMAP_MAPPED || pos >= i_size_read(inode)) {
+		zero_user(page, poff, plen);
+		SetPageUptodate(page);
+		goto done;
+	}
+
+	ctx->cur_page_in_bio = true;
+
+	/*
+	 * Try to merge into a previous segment if we can.
+	 */
+	sector = iomap_sector(iomap, pos);
+	if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
+		if (__bio_try_merge_page(ctx->bio, page, plen, poff))
+			goto done;
+		is_contig = true;
+	}
+
+	if (!ctx->bio || !is_contig || bio_full(ctx->bio)) {
+		gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
+		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+		if (ctx->bio)
+			submit_bio(ctx->bio);
+
+		if (ctx->is_readahead) /* same as readahead_gfp_mask */
+			gfp |= __GFP_NORETRY | __GFP_NOWARN;
+		ctx->bio = bio_alloc(gfp, min(BIO_MAX_PAGES, nr_vecs));
+		ctx->bio->bi_opf = REQ_OP_READ;
+		if (ctx->is_readahead)
+			ctx->bio->bi_opf |= REQ_RAHEAD;
+		ctx->bio->bi_iter.bi_sector = sector;
+		bio_set_dev(ctx->bio, iomap->bdev);
+		ctx->bio->bi_end_io = iomap_read_end_io;
+	}
+
+	__bio_add_page(ctx->bio, page, plen, poff);
+done:
+	return plen;
+}
+
+int
+iomap_readpage(struct page *page, const struct iomap_ops *ops)
+{
+	struct iomap_readpage_ctx ctx = { .cur_page = page };
+	struct inode *inode = page->mapping->host;
+	unsigned poff;
+	loff_t ret;
+
+	WARN_ON_ONCE(page_has_buffers(page));
+
+	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
+		ret = iomap_apply(inode, page_offset(page) + poff,
+				PAGE_SIZE - poff, 0, ops, &ctx,
+				iomap_readpage_actor);
+		if (ret <= 0) {
+			WARN_ON_ONCE(ret == 0);
+			SetPageError(page);
+			break;
+		}
+	}
+
+	if (ctx.bio) {
+		submit_bio(ctx.bio);
+		WARN_ON_ONCE(!ctx.cur_page_in_bio);
+	} else {
+		WARN_ON_ONCE(ctx.cur_page_in_bio);
+		unlock_page(page);
+	}
+
+	/*
+	 * Just like mpage_readpages and block_read_full_page we always
+	 * return 0 and just mark the page as PageError on errors.  This
+	 * should be cleaned up all through the stack eventually.
+	 */
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iomap_readpage);
+
+static struct page *
+iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
+		loff_t length, loff_t *done)
+{
+	while (!list_empty(pages)) {
+		struct page *page = lru_to_page(pages);
+
+		if (page_offset(page) >= (u64)pos + length)
+			break;
+
+		list_del(&page->lru);
+		if (!add_to_page_cache_lru(page, inode->i_mapping, page->index,
+				GFP_NOFS))
+			return page;
+
+		/*
+		 * If we already have a page in the page cache at index we are
+		 * done.  Upper layers don't care if it is uptodate after the
+		 * readpages call itself as every page gets checked again once
+		 * actually needed.
+		 */
+		*done += PAGE_SIZE;
+		put_page(page);
+	}
+
+	return NULL;
+}
+
+static loff_t
+iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
+		void *data, struct iomap *iomap)
+{
+	struct iomap_readpage_ctx *ctx = data;
+	loff_t done, ret;
+
+	for (done = 0; done < length; done += ret) {
+		if (ctx->cur_page && ((pos + done) & (PAGE_SIZE - 1)) == 0) {
+			if (!ctx->cur_page_in_bio)
+				unlock_page(ctx->cur_page);
+			put_page(ctx->cur_page);
+			ctx->cur_page = NULL;
+		}
+		if (!ctx->cur_page) {
+			ctx->cur_page = iomap_next_page(inode, ctx->pages,
+					pos, length, &done);
+			if (!ctx->cur_page)
+				break;
+			ctx->cur_page_in_bio = false;
+		}
+		ret = iomap_readpage_actor(inode, pos + done, length - done,
+				ctx, iomap);
+	}
+
+	return done;
+}
+
+int
+iomap_readpages(struct address_space *mapping, struct list_head *pages,
+		unsigned nr_pages, const struct iomap_ops *ops)
+{
+	struct iomap_readpage_ctx ctx = {
+		.pages		= pages,
+		.is_readahead	= true,
+	};
+	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
+	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
+	loff_t length = last - pos + PAGE_SIZE, ret = 0;
+
+	while (length > 0) {
+		ret = iomap_apply(mapping->host, pos, length, 0, ops,
+				&ctx, iomap_readpages_actor);
+		if (ret <= 0) {
+			WARN_ON_ONCE(ret == 0);
+			goto done;
+		}
+		pos += ret;
+		length -= ret;
+	}
+	ret = 0;
+done:
+	if (ctx.bio)
+		submit_bio(ctx.bio);
+	if (ctx.cur_page) {
+		if (!ctx.cur_page_in_bio)
+			unlock_page(ctx.cur_page);
+		put_page(ctx.cur_page);
+	}
+
+	/*
+	 * Check that we didn't lose a page due to the arcane calling
+	 * conventions.
+	 */
+	WARN_ON_ONCE(!ret && !list_empty(ctx.pages));
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iomap_readpages);
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 43cc9208599f..cfbc28b4005c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -100,6 +100,9 @@ struct iomap_ops {
 
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
+int iomap_readpage(struct page *page, const struct iomap_ops *ops);
+int iomap_readpages(struct address_space *mapping, struct list_head *pages,
+		unsigned nr_pages, const struct iomap_ops *ops);
 int iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
-- 
2.17.1

* [PATCH 02/24] xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
  2018-06-15 13:01 ` [PATCH 01/24] iomap: add an iomap-based readpage and readpages implementation Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-15 13:01 ` [PATCH 03/24] iomap: add initial support for writes without buffer heads Christoph Hellwig
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

For file systems with a block size that equals the page size we never do
partial reads, so we can use the buffer_head-less iomap versions of
readpage and readpages without conflicting with the buffer_head structures
created later in write_begin.
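
For orientation, these helpers sit behind the ->readpage and ->readpages
methods in xfs_address_space_operations (existing wiring, trimmed here
for brevity):

    const struct address_space_operations xfs_address_space_operations = {
            .readpage               = xfs_vm_readpage,
            .readpages              = xfs_vm_readpages,
            .writepage              = xfs_vm_writepage,
            .writepages             = xfs_vm_writepages,
            /* ... remaining methods unchanged ... */
    };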

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index ca6903726689..dba439ebfe4c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1402,6 +1402,8 @@ xfs_vm_readpage(
 	struct page		*page)
 {
 	trace_xfs_vm_readpage(page->mapping->host, 1);
+	if (i_blocksize(page->mapping->host) == PAGE_SIZE)
+		return iomap_readpage(page, &xfs_iomap_ops);
 	return mpage_readpage(page, xfs_get_blocks);
 }
 
@@ -1413,6 +1415,8 @@ xfs_vm_readpages(
 	unsigned		nr_pages)
 {
 	trace_xfs_vm_readpages(mapping->host, nr_pages);
+	if (i_blocksize(mapping->host) == PAGE_SIZE)
+		return iomap_readpages(mapping, pages, nr_pages, &xfs_iomap_ops);
 	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
 }
 
-- 
2.17.1

* [PATCH 03/24] iomap: add initial support for writes without buffer heads
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
  2018-06-15 13:01 ` [PATCH 01/24] iomap: add an iomap-based readpage and readpages implementation Christoph Hellwig
  2018-06-15 13:01 ` [PATCH 02/24] xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-15 13:01 ` [PATCH 04/24] xfs: simplify xfs_bmap_punch_delalloc_range Christoph Hellwig
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

For now this is limited to blocksize == PAGE_SIZE, where we can simply read
in the full page in write_begin, and just set the whole page dirty after
copying data into it.  This code is enabled by default and XFS will now
be fed pages without buffer heads in ->writepage and ->writepages.

If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap the old
path will still be used; this both helps the transition in XFS and
prepares for the gfs2 migration to the iomap infrastructure.
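
As a usage illustration (not part of this patch), a file system that has
converted to iomap drives this buffered write path from its ->write_iter
method roughly as follows; example_file_write_iter and example_iomap_ops
are made-up names:

    static ssize_t
    example_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
    {
            struct inode *inode = file_inode(iocb->ki_filp);
            ssize_t ret;

            inode_lock(inode);
            ret = generic_write_checks(iocb, from);
            if (ret > 0)
                    /* calls ->iomap_begin/->iomap_end around each copy */
                    ret = iomap_file_buffered_write(iocb, from,
                                    &example_iomap_ops);
            inode_unlock(inode);

            if (ret > 0)
                    ret = generic_write_sync(iocb, ret);
            return ret;
    }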

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/iomap.c            | 115 +++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_iomap.c    |   6 ++-
 include/linux/iomap.h |   2 +
 3 files changed, 114 insertions(+), 9 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index e99e4fbc31f4..a504077bb38f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -348,6 +348,48 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 		truncate_pagecache_range(inode, max(pos, i_size), pos + len);
 }
 
+static int
+iomap_read_page_sync(struct inode *inode, loff_t block_start, struct page *page,
+		unsigned poff, unsigned plen, unsigned from, unsigned to,
+		struct iomap *iomap)
+{
+	struct bio_vec bvec;
+	struct bio bio;
+
+	if (iomap->type != IOMAP_MAPPED || block_start >= i_size_read(inode)) {
+		zero_user_segments(page, poff, from, to, poff + plen);
+		return 0;
+	}
+
+	bio_init(&bio, &bvec, 1);
+	bio.bi_opf = REQ_OP_READ;
+	bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
+	bio_set_dev(&bio, iomap->bdev);
+	__bio_add_page(&bio, page, plen, poff);
+	return submit_bio_wait(&bio);
+}
+
+static int
+__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
+		struct page *page, struct iomap *iomap)
+{
+	loff_t block_size = i_blocksize(inode);
+	loff_t block_start = pos & ~(block_size - 1);
+	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
+	unsigned poff = block_start & (PAGE_SIZE - 1);
+	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, block_end - block_start);
+	unsigned from = pos & (PAGE_SIZE - 1), to = from + len;
+
+	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
+
+	if (PageUptodate(page))
+		return 0;
+	if (from <= poff && to >= poff + plen)
+		return 0;
+	return iomap_read_page_sync(inode, block_start, page,
+			poff, plen, from, to, iomap);
+}
+
 static int
 iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, struct iomap *iomap)
@@ -367,9 +409,10 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 
 	if (iomap->type == IOMAP_INLINE)
 		iomap_read_inline_data(inode, page, iomap);
-	else
+	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, iomap);
-
+	else
+		status = __iomap_write_begin(inode, pos, len, page, iomap);
 	if (unlikely(status)) {
 		unlock_page(page);
 		put_page(page);
@@ -382,6 +425,57 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 	return status;
 }
 
+int
+iomap_set_page_dirty(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+	int newly_dirty;
+
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
+
+	/*
+	 * Lock out page->mem_cgroup migration to keep PageDirty
+	 * synchronized with per-memcg dirty page counters.
+	 */
+	lock_page_memcg(page);
+	newly_dirty = !TestSetPageDirty(page);
+	if (newly_dirty)
+		__set_page_dirty(page, mapping, 0);
+	unlock_page_memcg(page);
+
+	if (newly_dirty)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	return newly_dirty;
+}
+EXPORT_SYMBOL_GPL(iomap_set_page_dirty);
+
+static int
+__iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
+		unsigned copied, struct page *page, struct iomap *iomap)
+{
+	flush_dcache_page(page);
+
+	/*
+	 * The blocks that were entirely written will now be uptodate, so we
+	 * don't have to worry about a readpage reading them and overwriting a
+	 * partial write.  However if we have encountered a short write and only
+	 * partially written into a block, it will not be marked uptodate, so a
+	 * readpage might come in and destroy our partial write.
+	 *
+	 * Do the simplest thing, and just treat any short write to a non
+	 * uptodate page as a zero-length write, and force the caller to redo
+	 * the whole thing.
+	 */
+	if (unlikely(copied < len && !PageUptodate(page))) {
+		copied = 0;
+	} else {
+		SetPageUptodate(page);
+		iomap_set_page_dirty(page);
+	}
+	return __generic_write_end(inode, pos, copied, page);
+}
+
 static int
 iomap_write_end_inline(struct inode *inode, struct page *page,
 		struct iomap *iomap, loff_t pos, unsigned copied)
@@ -408,9 +502,11 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 
 	if (iomap->type == IOMAP_INLINE) {
 		ret = iomap_write_end_inline(inode, page, iomap, pos, copied);
-	} else {
+	} else if (iomap->flags & IOMAP_F_BUFFER_HEAD) {
 		ret = generic_write_end(NULL, inode->i_mapping, pos, len,
 				copied, page, NULL);
+	} else {
+		ret = __iomap_write_end(inode, pos, len, copied, page, iomap);
 	}
 
 	if (iomap->page_done)
@@ -703,11 +799,16 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct page *page = data;
 	int ret;
 
-	ret = __block_write_begin_int(page, pos, length, NULL, iomap);
-	if (ret)
-		return ret;
+	if (iomap->flags & IOMAP_F_BUFFER_HEAD) {
+		ret = __block_write_begin_int(page, pos, length, NULL, iomap);
+		if (ret)
+			return ret;
+		block_commit_write(page, 0, length);
+	} else {
+		WARN_ON_ONCE(!PageUptodate(page));
+		WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
+	}
 
-	block_commit_write(page, 0, length);
 	return length;
 }
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index c9c3d0df5e4c..85c66e78f9e8 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -640,7 +640,7 @@ xfs_file_iomap_begin_delay(
 	 * Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
 	 * them out if the write happens to fail.
 	 */
-	iomap->flags = IOMAP_F_NEW;
+	iomap->flags |= IOMAP_F_NEW;
 	trace_xfs_iomap_alloc(ip, offset, count, 0, &got);
 done:
 	if (isnullstartblock(got.br_startblock))
@@ -1033,6 +1033,8 @@ xfs_file_iomap_begin(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
+	iomap->flags |= IOMAP_F_BUFFER_HEAD;
+
 	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
 			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
 		/* Reserve delalloc blocks for regular writeback. */
@@ -1133,7 +1135,7 @@ xfs_file_iomap_begin(
 	if (error)
 		return error;
 
-	iomap->flags = IOMAP_F_NEW;
+	iomap->flags |= IOMAP_F_NEW;
 	trace_xfs_iomap_alloc(ip, offset, length, 0, &imap);
 
 out_finish:
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index cfbc28b4005c..c2706cfb27c7 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -30,6 +30,7 @@ struct vm_fault;
  */
 #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
 #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
+#define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests:
@@ -103,6 +104,7 @@ ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 int iomap_readpage(struct page *page, const struct iomap_ops *ops);
 int iomap_readpages(struct address_space *mapping, struct list_head *pages,
 		unsigned nr_pages, const struct iomap_ops *ops);
+int iomap_set_page_dirty(struct page *page);
 int iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
-- 
2.17.1

* [PATCH 04/24] xfs: simplify xfs_bmap_punch_delalloc_range
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2018-06-15 13:01 ` [PATCH 03/24] iomap: add initial support for writes without buffer heads Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-15 13:01 ` [PATCH 05/24] xfs: simplify xfs_aops_discard_page Christoph Hellwig
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Instead of using xfs_bmapi_read to find delalloc extents and then punch
them out using xfs_bunmapi, opencode the loop to iterate over the extents
and call xfs_bmap_del_extent_delay directly.  This both simplifies the
code and reduces the number of extent tree lookups required.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c | 85 ++++++++++++++++--------------------------
 1 file changed, 32 insertions(+), 53 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 06badcbadeb4..0070b877ed94 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -695,12 +695,10 @@ xfs_getbmap(
 }
 
 /*
- * dead simple method of punching delalyed allocation blocks from a range in
- * the inode. Walks a block at a time so will be slow, but is only executed in
- * rare error cases so the overhead is not critical. This will always punch out
- * both the start and end blocks, even if the ranges only partially overlap
- * them, so it is up to the caller to ensure that partial blocks are not
- * passed in.
+ * Dead simple method of punching delayed allocation blocks from a range in
+ * the inode.  This will always punch out both the start and end blocks, even
+ * if the ranges only partially overlap them, so it is up to the caller to
+ * ensure that partial blocks are not passed in.
  */
 int
 xfs_bmap_punch_delalloc_range(
@@ -708,63 +706,44 @@ xfs_bmap_punch_delalloc_range(
 	xfs_fileoff_t		start_fsb,
 	xfs_fileoff_t		length)
 {
-	xfs_fileoff_t		remaining = length;
+	struct xfs_ifork	*ifp = &ip->i_df;
+	xfs_fileoff_t		end_fsb = start_fsb + length;
+	struct xfs_bmbt_irec	got, del;
+	struct xfs_iext_cursor	icur;
 	int			error = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
-	do {
-		int		done;
-		xfs_bmbt_irec_t	imap;
-		int		nimaps = 1;
-		xfs_fsblock_t	firstblock;
-		struct xfs_defer_ops dfops;
+	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
+		error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
+		if (error)
+			return error;
+	}
 
-		/*
-		 * Map the range first and check that it is a delalloc extent
-		 * before trying to unmap the range. Otherwise we will be
-		 * trying to remove a real extent (which requires a
-		 * transaction) or a hole, which is probably a bad idea...
-		 */
-		error = xfs_bmapi_read(ip, start_fsb, 1, &imap, &nimaps,
-				       XFS_BMAPI_ENTIRE);
+	if (!xfs_iext_lookup_extent_before(ip, ifp, &end_fsb, &icur, &got))
+		return 0;
 
-		if (error) {
-			/* something screwed, just bail */
-			if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
-				xfs_alert(ip->i_mount,
-			"Failed delalloc mapping lookup ino %lld fsb %lld.",
-						ip->i_ino, start_fsb);
-			}
-			break;
-		}
-		if (!nimaps) {
-			/* nothing there */
-			goto next_block;
-		}
-		if (imap.br_startblock != DELAYSTARTBLOCK) {
-			/* been converted, ignore */
-			goto next_block;
-		}
-		WARN_ON(imap.br_blockcount == 0);
+	while (got.br_startoff + got.br_blockcount > start_fsb) {
+		del = got;
+		xfs_trim_extent(&del, start_fsb, length);
 
 		/*
-		 * Note: while we initialise the firstblock/dfops pair, they
-		 * should never be used because blocks should never be
-		 * allocated or freed for a delalloc extent and hence we need
-		 * don't cancel or finish them after the xfs_bunmapi() call.
+		 * A delete can push the cursor forward. Step back to the
+		 * previous extent on non-delalloc or extents outside the
+		 * target range.
 		 */
-		xfs_defer_init(&dfops, &firstblock);
-		error = xfs_bunmapi(NULL, ip, start_fsb, 1, 0, 1, &firstblock,
-					&dfops, &done);
-		if (error)
-			break;
+		if (!del.br_blockcount ||
+		    !isnullstartblock(del.br_startblock)) {
+			if (!xfs_iext_prev_extent(ifp, &icur, &got))
+				break;
+			continue;
+		}
 
-		ASSERT(!xfs_defer_has_unfinished_work(&dfops));
-next_block:
-		start_fsb++;
-		remaining--;
-	} while(remaining > 0);
+		error = xfs_bmap_del_extent_delay(ip, XFS_DATA_FORK, &icur,
+						  &got, &del);
+		if (error || !xfs_iext_get_extent(ifp, &icur, &got))
+			break;
+	}
 
 	return error;
 }
-- 
2.17.1

* [PATCH 05/24] xfs: simplify xfs_aops_discard_page
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2018-06-15 13:01 ` [PATCH 04/24] xfs: simplify xfs_bmap_punch_delalloc_range Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-15 13:01 ` [PATCH 06/24] xfs: move locking into xfs_bmap_punch_delalloc_range Christoph Hellwig
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Instead of looking at the buffer heads to see if a block is delalloc, just
call xfs_bmap_punch_delalloc_range on the whole page - this will leave
any non-delalloc block intact and handle the iteration for us.  As a side
effect one more place stops caring about buffer heads and we can remove the
xfs_check_page_type function entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c | 85 +++++------------------------------------------
 1 file changed, 9 insertions(+), 76 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index dba439ebfe4c..d94dd6fe5127 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -711,49 +711,6 @@ xfs_map_at_offset(
 	clear_buffer_unwritten(bh);
 }
 
-/*
- * Test if a given page contains at least one buffer of a given @type.
- * If @check_all_buffers is true, then we walk all the buffers in the page to
- * try to find one of the type passed in. If it is not set, then the caller only
- * needs to check the first buffer on the page for a match.
- */
-STATIC bool
-xfs_check_page_type(
-	struct page		*page,
-	unsigned int		type,
-	bool			check_all_buffers)
-{
-	struct buffer_head	*bh;
-	struct buffer_head	*head;
-
-	if (PageWriteback(page))
-		return false;
-	if (!page->mapping)
-		return false;
-	if (!page_has_buffers(page))
-		return false;
-
-	bh = head = page_buffers(page);
-	do {
-		if (buffer_unwritten(bh)) {
-			if (type == XFS_IO_UNWRITTEN)
-				return true;
-		} else if (buffer_delay(bh)) {
-			if (type == XFS_IO_DELALLOC)
-				return true;
-		} else if (buffer_dirty(bh) && buffer_mapped(bh)) {
-			if (type == XFS_IO_OVERWRITE)
-				return true;
-		}
-
-		/* If we are only checking the first buffer, we are done now. */
-		if (!check_all_buffers)
-			break;
-	} while ((bh = bh->b_this_page) != head);
-
-	return false;
-}
-
 STATIC void
 xfs_vm_invalidatepage(
 	struct page		*page,
@@ -785,9 +742,6 @@ xfs_vm_invalidatepage(
  * transaction. Indeed - if we get ENOSPC errors, we have to be able to do this
  * truncation without a transaction as there is no space left for block
  * reservation (typically why we see a ENOSPC in writeback).
- *
- * This is not a performance critical path, so for now just do the punching a
- * buffer head at a time.
  */
 STATIC void
 xfs_aops_discard_page(
@@ -795,47 +749,26 @@ xfs_aops_discard_page(
 {
 	struct inode		*inode = page->mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
-	struct buffer_head	*bh, *head;
+	struct xfs_mount	*mp = ip->i_mount;
 	loff_t			offset = page_offset(page);
+	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, offset);
+	int			error;
 
-	if (!xfs_check_page_type(page, XFS_IO_DELALLOC, true))
-		goto out_invalidate;
-
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
+	if (XFS_FORCED_SHUTDOWN(mp))
 		goto out_invalidate;
 
-	xfs_alert(ip->i_mount,
+	xfs_alert(mp,
 		"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
 			page, ip->i_ino, offset);
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	bh = head = page_buffers(page);
-	do {
-		int		error;
-		xfs_fileoff_t	start_fsb;
-
-		if (!buffer_delay(bh))
-			goto next_buffer;
-
-		start_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
-		error = xfs_bmap_punch_delalloc_range(ip, start_fsb, 1);
-		if (error) {
-			/* something screwed, just bail */
-			if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
-				xfs_alert(ip->i_mount,
-			"page discard unable to remove delalloc mapping.");
-			}
-			break;
-		}
-next_buffer:
-		offset += i_blocksize(inode);
-
-	} while ((bh = bh->b_this_page) != head);
-
+	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
+			PAGE_SIZE / i_blocksize(inode));
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	if (error && !XFS_FORCED_SHUTDOWN(mp))
+		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
 	xfs_vm_invalidatepage(page, 0, PAGE_SIZE);
-	return;
 }
 
 static int
-- 
2.17.1

* [PATCH 06/24] xfs: move locking into xfs_bmap_punch_delalloc_range
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2018-06-15 13:01 ` [PATCH 05/24] xfs: simplify xfs_aops_discard_page Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-19  5:26   ` Darrick J. Wong
  2018-06-15 13:01 ` [PATCH 07/24] xfs: do not set the page uptodate in xfs_writepage_map Christoph Hellwig
                   ` (17 subsequent siblings)
  23 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Both callers want the same locking, so do it only once.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_aops.c      | 2 --
 fs/xfs/xfs_bmap_util.c | 9 +++++----
 fs/xfs/xfs_iomap.c     | 3 ---
 3 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d94dd6fe5127..026721c4991e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -761,10 +761,8 @@ xfs_aops_discard_page(
 		"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
 			page, ip->i_ino, offset);
 
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
 			PAGE_SIZE / i_blocksize(inode));
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 0070b877ed94..6690a038892d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -712,16 +712,15 @@ xfs_bmap_punch_delalloc_range(
 	struct xfs_iext_cursor	icur;
 	int			error = 0;
 
-	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
-
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
 		error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
 		if (error)
-			return error;
+			goto out_unlock;
 	}
 
 	if (!xfs_iext_lookup_extent_before(ip, ifp, &end_fsb, &icur, &got))
-		return 0;
+		goto out_unlock;
 
 	while (got.br_startoff + got.br_blockcount > start_fsb) {
 		del = got;
@@ -745,6 +744,8 @@ xfs_bmap_punch_delalloc_range(
 			break;
 	}
 
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 85c66e78f9e8..758ad9572d04 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1205,11 +1205,8 @@ xfs_file_iomap_end_delalloc(
 		truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
 					 XFS_FSB_TO_B(mp, end_fsb) - 1);
 
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
 		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
 					       end_fsb - start_fsb);
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-
 		if (error && !XFS_FORCED_SHUTDOWN(mp)) {
 			xfs_alert(mp, "%s: unable to clean up ino %lld",
 				__func__, ip->i_ino);
-- 
2.17.1

* [PATCH 07/24] xfs: do not set the page uptodate in xfs_writepage_map
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2018-06-15 13:01 ` [PATCH 06/24] xfs: move locking into xfs_bmap_punch_delalloc_range Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-15 13:01 ` [PATCH 08/24] xfs: don't clear imap_valid for non-uptodate buffers Christoph Hellwig
                   ` (16 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

We already track the page uptodate status based on the buffer uptodate
status, which is updated whenever reading or zeroing blocks.

This code has been there since a ptool commit in 2002, which claims to:
claims to:

    "merge" the 2.4 fsx fix for block size < page size to 2.5.  This needed
    major changes to actually fit.

and isn't present in other writepage implementations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 026721c4991e..5c611c0784ca 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -850,7 +850,6 @@ xfs_writepage_map(
 	uint64_t		offset;
 	int			error = 0;
 	int			count = 0;
-	int			uptodate = 1;
 	unsigned int		new_type;
 
 	bh = head = page_buffers(page);
@@ -858,8 +857,6 @@ xfs_writepage_map(
 	do {
 		if (offset >= end_offset)
 			break;
-		if (!buffer_uptodate(bh))
-			uptodate = 0;
 
 		/*
 		 * set_page_dirty dirties all buffers in a page, independent
@@ -923,9 +920,6 @@ xfs_writepage_map(
 
 	} while (offset += len, ((bh = bh->b_this_page) != head));
 
-	if (uptodate && bh == head)
-		SetPageUptodate(page);
-
 	ASSERT(wpc->ioend || list_empty(&submit_list));
 
 out:
-- 
2.17.1

* [PATCH 08/24] xfs: don't clear imap_valid for non-uptodate buffers
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2018-06-15 13:01 ` [PATCH 07/24] xfs: do not set the page uptodate in xfs_writepage_map Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-15 13:01 ` [PATCH 09/24] xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocks Christoph Hellwig
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Finding a buffer that isn't uptodate doesn't invalidate the mapping for
any given block.  The last_sector check will already take care of starting
another ioend as soon as we find any non-uptodate buffer, and if the current
mapping doesn't include the next uptodate buffer the xfs_imap_valid check
will take care of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5c611c0784ca..307ca66879b3 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -864,10 +864,8 @@ xfs_writepage_map(
 		 * meaningless for holes (!mapped && uptodate), so skip
 		 * buffers covering holes here.
 		 */
-		if (!buffer_mapped(bh) && buffer_uptodate(bh)) {
-			wpc->imap_valid = false;
+		if (!buffer_mapped(bh) && buffer_uptodate(bh))
 			continue;
-		}
 
 		if (buffer_unwritten(bh))
 			new_type = XFS_IO_UNWRITTEN;
@@ -880,11 +878,8 @@ xfs_writepage_map(
 				ASSERT(buffer_mapped(bh));
 			/*
 			 * This buffer is not uptodate and will not be
-			 * written to disk.  Ensure that we will put any
-			 * subsequent writeable buffers into a new
-			 * ioend.
+			 * written to disk.
 			 */
-			wpc->imap_valid = false;
 			continue;
 		}
 
-- 
2.17.1

* [PATCH 09/24] xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocks
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2018-06-15 13:01 ` [PATCH 08/24] xfs: don't clear imap_valid for non-uptodate buffers Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-19  5:27   ` Darrick J. Wong
  2018-06-15 13:01 ` [PATCH 10/24] xfs: remove xfs_reflink_trim_irec_to_next_cow Christoph Hellwig
                   ` (14 subsequent siblings)
  23 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

We want to be able to use the extent state as a reliable indicator for
the type of I/O, and stop using the buffer head state.  For this we
need to stop using the XFS_BMAPI_IGSTATE flag so that we don't see merged
extents of different types.

Based on a patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_aops.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 307ca66879b3..39b933e17192 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -385,7 +385,6 @@ xfs_map_blocks(
 	ssize_t			count = i_blocksize(inode);
 	xfs_fileoff_t		offset_fsb, end_fsb;
 	int			error = 0;
-	int			bmapi_flags = XFS_BMAPI_ENTIRE;
 	int			nimaps = 1;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
@@ -405,8 +404,6 @@ xfs_map_blocks(
 		return 0;
 
 	ASSERT(type != XFS_IO_COW);
-	if (type == XFS_IO_UNWRITTEN)
-		bmapi_flags |= XFS_BMAPI_IGSTATE;
 
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
@@ -418,7 +415,7 @@ xfs_map_blocks(
 	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
-				imap, &nimaps, bmapi_flags);
+				imap, &nimaps, XFS_BMAPI_ENTIRE);
 	/*
 	 * Truncate an overwrite extent if there's a pending CoW
 	 * reservation before the end of this extent.  This forces us
-- 
2.17.1

* [PATCH 10/24] xfs: remove xfs_reflink_trim_irec_to_next_cow
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2018-06-15 13:01 ` [PATCH 09/24] xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocks Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-19  5:30   ` Darrick J. Wong
  2018-06-15 13:01 ` [PATCH 11/24] xfs: remove xfs_map_cow Christoph Hellwig
                   ` (13 subsequent siblings)
  23 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

We already have to check for overlapping COW extents every time we
come back to a page in xfs_writepage_map / xfs_map_cow, so this
additional trim is not required.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_aops.c    |  7 -------
 fs/xfs/xfs_reflink.c | 33 ---------------------------------
 fs/xfs/xfs_reflink.h |  2 --
 fs/xfs/xfs_trace.h   |  1 -
 4 files changed, 43 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 39b933e17192..c8fc6786b06c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -416,13 +416,6 @@ xfs_map_blocks(
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
 				imap, &nimaps, XFS_BMAPI_ENTIRE);
-	/*
-	 * Truncate an overwrite extent if there's a pending CoW
-	 * reservation before the end of this extent.  This forces us
-	 * to come back to writepage to take care of the CoW.
-	 */
-	if (nimaps && type == XFS_IO_OVERWRITE)
-		xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	if (error)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 713e857d9ffa..d0092c864458 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -514,39 +514,6 @@ xfs_reflink_find_cow_mapping(
 	return true;
 }
 
-/*
- * Trim an extent to end at the next CoW reservation past offset_fsb.
- */
-void
-xfs_reflink_trim_irec_to_next_cow(
-	struct xfs_inode		*ip,
-	xfs_fileoff_t			offset_fsb,
-	struct xfs_bmbt_irec		*imap)
-{
-	struct xfs_ifork		*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
-	struct xfs_bmbt_irec		got;
-	struct xfs_iext_cursor		icur;
-
-	if (!xfs_is_reflink_inode(ip))
-		return;
-
-	/* Find the extent in the CoW fork. */
-	if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, &icur, &got))
-		return;
-
-	/* This is the extent before; try sliding up one. */
-	if (got.br_startoff < offset_fsb) {
-		if (!xfs_iext_next_extent(ifp, &icur, &got))
-			return;
-	}
-
-	if (got.br_startoff >= imap->br_startoff + imap->br_blockcount)
-		return;
-
-	imap->br_blockcount = got.br_startoff - imap->br_startoff;
-	trace_xfs_reflink_trim_irec(ip, imap);
-}
-
 /*
  * Cancel CoW reservations for some block range of an inode.
  *
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 701487bab468..c0a25609007b 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -34,8 +34,6 @@ extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
 		struct xfs_bmbt_irec *imap);
-extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
-		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
 
 extern int xfs_reflink_cancel_cow_blocks(struct xfs_inode *ip,
 		struct xfs_trans **tpp, xfs_fileoff_t offset_fsb,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 9d4c4ca24fe6..041045a8d65c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3228,7 +3228,6 @@ DEFINE_RW_EVENT(xfs_reflink_reserve_cow);
 
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_bounce_dio_write);
 DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
-DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_irec);
 
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cancel_cow_range);
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_end_cow);
-- 
2.17.1

* [PATCH 11/24] xfs: remove xfs_map_cow
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (9 preceding siblings ...)
  2018-06-15 13:01 ` [PATCH 10/24] xfs: remove xfs_reflink_trim_irec_to_next_cow Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-18 17:38   ` Brian Foster
  2018-06-15 13:01 ` [PATCH 12/24] xfs: rename the offset variable in xfs_writepage_map Christoph Hellwig
                   ` (12 subsequent siblings)
  23 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

We can handle the existing cow mapping case as a special case directly
in xfs_writepage_map, and share code for allocating delalloc blocks
with regular I/O in xfs_map_blocks.  This means we need to always
call xfs_map_blocks for reflink inodes, but we can still skip most of
the work if it turns out that there is no COW mapping overlapping the
current block.

As a subtle detail we need to start caching holes in the wpc to deal
with the case of COW reservations beyond EOF.  But we'll need that
infrastructure later anyway, so this is no big deal.
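
For reference, the writeback context that carries this cached state looks
roughly like this at this point in the series (from fs/xfs/xfs_aops.c;
comments added here for illustration):

    struct xfs_writepage_ctx {
            struct xfs_bmbt_irec    imap;        /* cached extent mapping */
            bool                    imap_valid;  /* does imap cover the block? */
            unsigned int            io_type;     /* XFS_IO_* type for imap */
            struct xfs_ioend        *ioend;      /* ioend being built */
            sector_t                last_block;
    };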

Based on a patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 186 ++++++++++++++++++++++------------------------
 fs/xfs/xfs_aops.h |   4 +-
 2 files changed, 92 insertions(+), 98 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index c8fc6786b06c..ff27bdcc49a1 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -375,70 +375,107 @@ xfs_end_bio(
 
 STATIC int
 xfs_map_blocks(
+	struct xfs_writepage_ctx *wpc,
 	struct inode		*inode,
-	loff_t			offset,
-	struct xfs_bmbt_irec	*imap,
-	int			type)
+	loff_t			offset)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	ssize_t			count = i_blocksize(inode);
-	xfs_fileoff_t		offset_fsb, end_fsb;
+	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset), end_fsb;
+	struct xfs_bmbt_irec	imap;
+	int			whichfork = XFS_DATA_FORK;
 	int			error = 0;
 	int			nimaps = 1;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	/*
-	 * Truncate can race with writeback since writeback doesn't take the
-	 * iolock and truncate decreases the file size before it starts
-	 * truncating the pages between new_size and old_size.  Therefore, we
-	 * can end up in the situation where writeback gets a CoW fork mapping
-	 * but the truncate makes the mapping invalid and we end up in here
-	 * trying to get a new mapping.  Bail out here so that we simply never
-	 * get a valid mapping and so we drop the write altogether.  The page
-	 * truncation will kill the contents anyway.
-	 */
-	if (type == XFS_IO_COW && offset > i_size_read(inode))
-		return 0;
-
-	ASSERT(type != XFS_IO_COW);
-
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
 	       (ip->i_df.if_flags & XFS_IFEXTENTS));
 	ASSERT(offset <= mp->m_super->s_maxbytes);
 
+	if (xfs_is_reflink_inode(ip) &&
+	    xfs_reflink_find_cow_mapping(ip, offset, &imap)) {
+		xfs_iunlock(ip, XFS_ILOCK_SHARED);
+		/*
+		 * Truncate can race with writeback since writeback doesn't
+		 * take the iolock and truncate decreases the file size before
+		 * it starts truncating the pages between new_size and old_size.
+		 * Therefore, we can end up in the situation where writeback
+		 * gets a CoW fork mapping but the truncate makes the mapping
+		 * invalid and we end up in here trying to get a new mapping.
+		 * Bail out here so that we simply never get a valid mapping
+		 * and so we drop the write altogether.  The page truncation
+		 * will kill the contents anyway.
+		 */
+		if (offset > i_size_read(inode)) {
+			wpc->io_type = XFS_IO_HOLE;
+			return 0;
+		}
+		whichfork = XFS_COW_FORK;
+		wpc->io_type = XFS_IO_COW;
+		goto allocate_blocks;
+	}
+
+	/*
+	 * Map valid and no COW extent in the way?  We're done.
+	 */
+	if (wpc->imap_valid) {
+		xfs_iunlock(ip, XFS_ILOCK_SHARED);
+		return 0;
+	}
+
+	/*
+	 * If we don't have a valid map, now it's time to get a new one for this
+	 * offset.  This will convert delayed allocations (including COW ones)
+	 * into real extents.
+	 */
 	if (offset > mp->m_super->s_maxbytes - count)
 		count = mp->m_super->s_maxbytes - offset;
 	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
-				imap, &nimaps, XFS_BMAPI_ENTIRE);
+				&imap, &nimaps, XFS_BMAPI_ENTIRE);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-
 	if (error)
 		return error;
 
-	if (type == XFS_IO_DELALLOC &&
-	    (!nimaps || isnullstartblock(imap->br_startblock))) {
-		error = xfs_iomap_write_allocate(ip, XFS_DATA_FORK, offset,
-				imap);
-		if (!error)
-			trace_xfs_map_blocks_alloc(ip, offset, count, type, imap);
-		return error;
+	if (!nimaps) {
+		/*
+		 * Lookup returns no match? Beyond eof? regardless,
+		 * return it as a hole so we don't write it
+		 */
+		imap.br_startoff = offset_fsb;
+		imap.br_blockcount = end_fsb - offset_fsb;
+		imap.br_startblock = HOLESTARTBLOCK;
+		wpc->io_type = XFS_IO_HOLE;
+	} else if (imap.br_startblock == HOLESTARTBLOCK) {
+		/* landed in a hole */
+		wpc->io_type = XFS_IO_HOLE;
 	}
 
+	if (wpc->io_type == XFS_IO_DELALLOC &&
+	    (!nimaps || isnullstartblock(imap.br_startblock)))
+		goto allocate_blocks;
+
 #ifdef DEBUG
-	if (type == XFS_IO_UNWRITTEN) {
+	if (wpc->io_type == XFS_IO_UNWRITTEN) {
 		ASSERT(nimaps);
-		ASSERT(imap->br_startblock != HOLESTARTBLOCK);
-		ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
+		ASSERT(imap.br_startblock != HOLESTARTBLOCK);
+		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
 	}
 #endif
-	if (nimaps)
-		trace_xfs_map_blocks_found(ip, offset, count, type, imap);
+	wpc->imap = imap;
+	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
+	return 0;
+allocate_blocks:
+	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap);
+	if (error)
+		return error;
+	wpc->imap = imap;
+	trace_xfs_map_blocks_alloc(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 }
 
@@ -759,56 +796,6 @@ xfs_aops_discard_page(
 	xfs_vm_invalidatepage(page, 0, PAGE_SIZE);
 }
 
-static int
-xfs_map_cow(
-	struct xfs_writepage_ctx *wpc,
-	struct inode		*inode,
-	loff_t			offset,
-	unsigned int		*new_type)
-{
-	struct xfs_inode	*ip = XFS_I(inode);
-	struct xfs_bmbt_irec	imap;
-	bool			is_cow = false;
-	int			error;
-
-	/*
-	 * If we already have a valid COW mapping keep using it.
-	 */
-	if (wpc->io_type == XFS_IO_COW) {
-		wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap, offset);
-		if (wpc->imap_valid) {
-			*new_type = XFS_IO_COW;
-			return 0;
-		}
-	}
-
-	/*
-	 * Else we need to check if there is a COW mapping at this offset.
-	 */
-	xfs_ilock(ip, XFS_ILOCK_SHARED);
-	is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap);
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-
-	if (!is_cow)
-		return 0;
-
-	/*
-	 * And if the COW mapping has a delayed extent here we need to
-	 * allocate real space for it now.
-	 */
-	if (isnullstartblock(imap.br_startblock)) {
-		error = xfs_iomap_write_allocate(ip, XFS_COW_FORK, offset,
-				&imap);
-		if (error)
-			return error;
-	}
-
-	wpc->io_type = *new_type = XFS_IO_COW;
-	wpc->imap_valid = true;
-	wpc->imap = imap;
-	return 0;
-}
-
 /*
  * We implement an immediate ioend submission policy here to avoid needing to
  * chain multiple ioends and hence nest mempool allocations which can violate
@@ -873,10 +860,13 @@ xfs_writepage_map(
 			continue;
 		}
 
-		if (xfs_is_reflink_inode(XFS_I(inode))) {
-			error = xfs_map_cow(wpc, inode, offset, &new_type);
-			if (error)
-				goto out;
+		/*
+		 * If we already have a valid COW mapping keep using it.
+		 */
+		if (wpc->io_type == XFS_IO_COW &&
+		    xfs_imap_valid(inode, &wpc->imap, offset)) {
+			wpc->imap_valid = true;
+			new_type = XFS_IO_COW;
 		}
 
 		if (wpc->io_type != new_type) {
@@ -887,22 +877,24 @@ xfs_writepage_map(
 		if (wpc->imap_valid)
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
 							 offset);
-		if (!wpc->imap_valid) {
-			error = xfs_map_blocks(inode, offset, &wpc->imap,
-					     wpc->io_type);
+		if (!wpc->imap_valid ||
+		    (xfs_is_reflink_inode(XFS_I(inode)) &&
+		     wpc->io_type != XFS_IO_COW)) {
+			error = xfs_map_blocks(wpc, inode, offset);
 			if (error)
 				goto out;
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
 							 offset);
 		}
-		if (wpc->imap_valid) {
-			lock_buffer(bh);
-			if (wpc->io_type != XFS_IO_OVERWRITE)
-				xfs_map_at_offset(inode, bh, &wpc->imap, offset);
-			xfs_add_to_ioend(inode, bh, offset, wpc, wbc, &submit_list);
-			count++;
-		}
 
+		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE)
+			continue;
+
+		lock_buffer(bh);
+		if (wpc->io_type != XFS_IO_OVERWRITE)
+			xfs_map_at_offset(inode, bh, &wpc->imap, offset);
+		xfs_add_to_ioend(inode, bh, offset, wpc, wbc, &submit_list);
+		count++;
 	} while (offset += len, ((bh = bh->b_this_page) != head));
 
 	ASSERT(wpc->ioend || list_empty(&submit_list));
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 694c85b03813..466d22e5bc36 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -29,6 +29,7 @@ enum {
 	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
 	XFS_IO_OVERWRITE,	/* covers already allocated extent */
 	XFS_IO_COW,		/* covers copy-on-write extent */
+	XFS_IO_HOLE,		/* covers region without any block allocation */
 };
 
 #define XFS_IO_TYPES \
@@ -36,7 +37,8 @@ enum {
 	{ XFS_IO_DELALLOC,		"delalloc" }, \
 	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
 	{ XFS_IO_OVERWRITE,		"overwrite" }, \
-	{ XFS_IO_COW,			"CoW" }
+	{ XFS_IO_COW,			"CoW" }, \
+	{ XFS_IO_HOLE,			"hole" }
 
 /*
  * Structure for buffered I/O completions.
-- 
2.17.1

* [PATCH 12/24] xfs: rename the offset variable in xfs_writepage_map
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (10 preceding siblings ...)
  2018-06-15 13:01 ` [PATCH 11/24] xfs: remove xfs_map_cow Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-19  5:37   ` Darrick J. Wong
  2018-06-15 13:01 ` [PATCH 13/24] xfs: make xfs_writepage_map extent map centric Christoph Hellwig
                   ` (11 subsequent siblings)
  23 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Calling it file_offset makes the usage more clear, especially with
a new poffset variable that will be added soon for the offset inside
the page.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_aops.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index ff27bdcc49a1..8c1a28f39197 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -824,15 +824,15 @@ xfs_writepage_map(
 	struct xfs_ioend	*ioend, *next;
 	struct buffer_head	*bh, *head;
 	ssize_t			len = i_blocksize(inode);
-	uint64_t		offset;
+	uint64_t		file_offset;	/* file offset of page */
 	int			error = 0;
 	int			count = 0;
 	unsigned int		new_type;
 
 	bh = head = page_buffers(page);
-	offset = page_offset(page);
+	file_offset = page_offset(page);
 	do {
-		if (offset >= end_offset)
+		if (file_offset >= end_offset)
 			break;
 
 		/*
@@ -864,7 +864,7 @@ xfs_writepage_map(
 		 * If we already have a valid COW mapping keep using it.
 		 */
 		if (wpc->io_type == XFS_IO_COW &&
-		    xfs_imap_valid(inode, &wpc->imap, offset)) {
+		    xfs_imap_valid(inode, &wpc->imap, file_offset)) {
 			wpc->imap_valid = true;
 			new_type = XFS_IO_COW;
 		}
@@ -876,15 +876,15 @@ xfs_writepage_map(
 
 		if (wpc->imap_valid)
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
-							 offset);
+							 file_offset);
 		if (!wpc->imap_valid ||
 		    (xfs_is_reflink_inode(XFS_I(inode)) &&
 		     wpc->io_type != XFS_IO_COW)) {
-			error = xfs_map_blocks(wpc, inode, offset);
+			error = xfs_map_blocks(wpc, inode, file_offset);
 			if (error)
 				goto out;
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
-							 offset);
+							 file_offset);
 		}
 
 		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE)
@@ -892,10 +892,10 @@ xfs_writepage_map(
 
 		lock_buffer(bh);
 		if (wpc->io_type != XFS_IO_OVERWRITE)
-			xfs_map_at_offset(inode, bh, &wpc->imap, offset);
-		xfs_add_to_ioend(inode, bh, offset, wpc, wbc, &submit_list);
+			xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
+		xfs_add_to_ioend(inode, bh, file_offset, wpc, wbc, &submit_list);
 		count++;
-	} while (offset += len, ((bh = bh->b_this_page) != head));
+	} while (file_offset += len, ((bh = bh->b_this_page) != head));
 
 	ASSERT(wpc->ioend || list_empty(&submit_list));
 
-- 
2.17.1

* [PATCH 13/24] xfs: make xfs_writepage_map extent map centric
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (11 preceding siblings ...)
  2018-06-15 13:01 ` [PATCH 12/24] xfs: rename the offset variable in xfs_writepage_map Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-18 17:38   ` Brian Foster
  2018-06-19  5:43   ` Darrick J. Wong
  2018-06-15 13:01 ` [PATCH 14/24] xfs: remove the now unused XFS_BMAPI_IGSTATE flag Christoph Hellwig
                   ` (10 subsequent siblings)
  23 siblings, 2 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

xfs_writepage_map() iterates over the bufferheads on a page to decide
what sort of IO to do and what actions to take.  However, when it comes
to reflink and deciding when it needs to execute a COW operation, we no
longer look at the bufferhead state but instead ignore it and look
up internal state held in the COW fork extent list.

This means xfs_writepage_map() is somewhat confused. It does stuff, then
ignores it, then tries to handle the impedance mismatch by shovelling the
results inside the existing mapping code.  It works, but it's a bit of a
mess and it makes it hard to fix the cached map bug that the writepage
code currently has.

To unify the two different mechanisms, we first have to choose a direction.
That's already been set - we're de-emphasising bufferheads so they are no
longer a control structure, as we need to do that to allow for eventual
removal.  Hence we need to move away from looking at bufferhead state to
determine what operations we need to perform.

We can't completely get rid of bufferheads yet - they do contain some
state that is absolutely necessary, such as whether that part of the page
contains valid data or not (buffer_uptodate()).  Other state in the
bufferhead is redundant:

	BH_dirty - the page is dirty, so we can ignore this and just
		write it
	BH_delay - we have delalloc extent info in the DATA fork extent
		tree
	BH_unwritten - same as BH_delay
	BH_mapped - indicates we've already used it once for IO and it is
		mapped to a disk address. Needs to be ignored for COW
		blocks.

The BH_mapped flag is an interesting case - it's supposed to indicate that
it's already mapped to disk and so we can just use it "as is".  In theory,
we don't even have to do an extent lookup to find where to write it to,
but we have to do that anyway to determine we are actually writing over a
valid extent.  Hence it's not even serving the purpose of avoiding an
extent lookup during writeback, and so we can pretty much ignore it.
Especially as we have to ignore it for COW operations...

Therefore, use the extent map as the source of information to tell us
what actions we need to take and what sort of IO we should perform.  The
first step is to have xfs_map_blocks() set the io type according to what
it looks up.  This means it can easily handle both normal overwrite and
COW cases.  The only thing we also need to add is the ability to return
hole mappings.
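
For reference, the resulting classification is a small decision ladder.
A condensed sketch (pieced together from the hunks in this patch and in
patch 16; locking, the COW fork lookup and error handling are elided):

	if (imap.br_startoff > offset_fsb) {
		/* no extent covers this offset: a hole, possibly beyond EOF */
		wpc->io_type = XFS_IO_HOLE;
	} else if (isnullstartblock(imap.br_startblock)) {
		/* delalloc extent: still needs real blocks allocated */
		wpc->io_type = XFS_IO_DELALLOC;
	} else if (imap.br_state == XFS_EXT_UNWRITTEN) {
		/* allocated but never written: convert after the write I/O */
		wpc->io_type = XFS_IO_UNWRITTEN;
	} else {
		/* plain written extent: overwrite in place */
		wpc->io_type = XFS_IO_OVERWRITE;
	}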

We need to return and cache hole mappings now for the case of multiple
blocks per page.  We no longer use BH_mapped to indicate a block over
a hole, so we have to get that info from xfs_map_blocks().  We cache it so
that holes that span two pages don't need separate lookups.  This allows us
to avoid ever doing write IO over a hole, too.

Now that we have xfs_map_blocks() returning both a cached map and the type
of IO we need to perform, we can rewrite xfs_writepage_map() to drop all
the bufferhead control. It's also much simplified because it doesn't need
to explicitly handle COW operations.  Instead of iterating bufferheads, it
iterates blocks within the page and then looks up what per-block state is
required from the appropriate bufferhead.  It then validates the cached
map, and if it's not valid, we get a new map.  If we don't get a valid map
or it's over a hole, we skip the block.

At this point, we have to remap the bufferhead via xfs_map_at_offset().
As previously noted, we had to do this even if the buffer was already
mapped as the mapping would be stale for XFS_IO_DELALLOC, XFS_IO_UNWRITTEN
and XFS_IO_COW IO types.  With xfs_map_blocks() now controlling the type,
even XFS_IO_OVERWRITE types need remapping, as converted-but-not-yet-
written delalloc extents beyond EOF can be reported as XFS_IO_OVERWRITE.
Bufferheads that span such regions still need their BH_Delay flags cleared
and their block numbers calculated, so we now unconditionally map each
bufferhead before submission.

But wait! There's more - remember the old "treat unwritten extents as
holes on read" hack?  Yeah, that means we can have a dirty page with
unmapped, unwritten bufferheads that contain data!  What makes these so
special is that the unwritten "hole" bufferheads do not have a valid block
device pointer, so if we attempt to write them xfs_add_to_ioend() blows
up. So we make xfs_map_at_offset() do the "realtime or data device"
lookup from the inode and ignore what was or wasn't put into the
bufferhead when the buffer was instantiated.

The astute reader will have realised by now that this code treats
unwritten extents in multiple-blocks-per-page situations differently.
If we get any combination of unwritten blocks on a dirty page that contain
valid data in the page, we're going to convert them to real extents.  This
can actually be a win, because it means that pages with interleaving
unwritten and written blocks will get converted to a single written extent
with zeros replacing the interspersed unwritten blocks.  This is actually
good for reducing extent list and conversion overhead, and it means we
issue a contiguous IO instead of lots of little ones.  The downside is
that we use up a little extra IO bandwidth.  Neither of these seems like a
bad thing given that spinning disks are seek sensitive, and SSDs/pmem have
bandwidth to burn and the lower IO latency/CPU overhead of fewer, larger
IOs will result in better performance on them...

As a result of all this, the only state we actually care about from the
bufferhead is a single flag - BH_Uptodate. We still use the bufferhead to
pass some information to the bio via xfs_add_to_ioend(), but that is
trivial to separate and pass explicitly.  This means we really only need
1 bit of state per block per page from the buffered write path in the
writeback path.  Everything else we do with the bufferhead is purely to
make the buffered IO front end continue to work correctly, i.e. we've
pretty much marginalised bufferheads in the writeback path completely.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
[hch: forward port + slight refactoring]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 89 ++++++++++++++++++++---------------------------
 1 file changed, 37 insertions(+), 52 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 8c1a28f39197..165891ecb4ba 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -454,19 +454,19 @@ xfs_map_blocks(
 	} else if (imap.br_startblock == HOLESTARTBLOCK) {
 		/* landed in a hole */
 		wpc->io_type = XFS_IO_HOLE;
-	}
-
-	if (wpc->io_type == XFS_IO_DELALLOC &&
-	    (!nimaps || isnullstartblock(imap.br_startblock)))
-		goto allocate_blocks;
+	} else {
+		if (isnullstartblock(imap.br_startblock)) {
+			/* got a delalloc extent */
+			wpc->io_type = XFS_IO_DELALLOC;
+			goto allocate_blocks;
+		}
 
-#ifdef DEBUG
-	if (wpc->io_type == XFS_IO_UNWRITTEN) {
-		ASSERT(nimaps);
-		ASSERT(imap.br_startblock != HOLESTARTBLOCK);
-		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
+		if (imap.br_state == XFS_EXT_UNWRITTEN)
+			wpc->io_type = XFS_IO_UNWRITTEN;
+		else
+			wpc->io_type = XFS_IO_OVERWRITE;
 	}
-#endif
+
 	wpc->imap = imap;
 	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
 	return 0;
@@ -736,6 +736,14 @@ xfs_map_at_offset(
 	set_buffer_mapped(bh);
 	clear_buffer_delay(bh);
 	clear_buffer_unwritten(bh);
+
+	/*
+	 * If this is a realtime file, data may be on a different device
+	 * to that pointed to by the buffer_head b_bdev currently. We can't
+	 * trust that the bufferhead has already been mapped correctly, so
+	 * set the bdev now.
+	 */
+	bh->b_bdev = xfs_find_bdev_for_inode(inode);
 }
 
 STATIC void
@@ -822,58 +830,36 @@ xfs_writepage_map(
 {
 	LIST_HEAD(submit_list);
 	struct xfs_ioend	*ioend, *next;
-	struct buffer_head	*bh, *head;
+	struct buffer_head	*bh;
 	ssize_t			len = i_blocksize(inode);
 	uint64_t		file_offset;	/* file offset of page */
+	unsigned		poffset;	/* offset into page */
 	int			error = 0;
 	int			count = 0;
-	unsigned int		new_type;
 
-	bh = head = page_buffers(page);
+	/*
+	 * Walk the blocks on the page, and if we run off the end of the
+	 * current map or find the current map invalid, grab a new one.
+	 * We only use bufferheads here to check per-block state - they no
+	 * longer control the iteration through the page. This allows us to
+	 * replace the bufferhead with some other state tracking mechanism in
+	 * future.
+	 */
 	file_offset = page_offset(page);
-	do {
+	bh = page_buffers(page);
+	for (poffset = 0;
+	     poffset < PAGE_SIZE;
+	     poffset += len, file_offset += len, bh = bh->b_this_page) {
+		/* past the range we are writing, so nothing more to write. */
 		if (file_offset >= end_offset)
 			break;
 
-		/*
-		 * set_page_dirty dirties all buffers in a page, independent
-		 * of their state.  The dirty state however is entirely
-		 * meaningless for holes (!mapped && uptodate), so skip
-		 * buffers covering holes here.
-		 */
-		if (!buffer_mapped(bh) && buffer_uptodate(bh))
-			continue;
-
-		if (buffer_unwritten(bh))
-			new_type = XFS_IO_UNWRITTEN;
-		else if (buffer_delay(bh))
-			new_type = XFS_IO_DELALLOC;
-		else if (buffer_uptodate(bh))
-			new_type = XFS_IO_OVERWRITE;
-		else {
+		if (!buffer_uptodate(bh)) {
 			if (PageUptodate(page))
 				ASSERT(buffer_mapped(bh));
-			/*
-			 * This buffer is not uptodate and will not be
-			 * written to disk.
-			 */
 			continue;
 		}
 
-		/*
-		 * If we already have a valid COW mapping keep using it.
-		 */
-		if (wpc->io_type == XFS_IO_COW &&
-		    xfs_imap_valid(inode, &wpc->imap, file_offset)) {
-			wpc->imap_valid = true;
-			new_type = XFS_IO_COW;
-		}
-
-		if (wpc->io_type != new_type) {
-			wpc->io_type = new_type;
-			wpc->imap_valid = false;
-		}
-
 		if (wpc->imap_valid)
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
 							 file_offset);
@@ -891,11 +877,10 @@ xfs_writepage_map(
 			continue;
 
 		lock_buffer(bh);
-		if (wpc->io_type != XFS_IO_OVERWRITE)
-			xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
+		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
 		xfs_add_to_ioend(inode, bh, file_offset, wpc, wbc, &submit_list);
 		count++;
-	} while (file_offset += len, ((bh = bh->b_this_page) != head));
+	}
 
 	ASSERT(wpc->ioend || list_empty(&submit_list));
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 14/24] xfs: remove the now unused XFS_BMAPI_IGSTATE flag
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (12 preceding siblings ...)
  2018-06-15 13:01 ` [PATCH 13/24] xfs: make xfs_writepage_map extent map centric Christoph Hellwig
@ 2018-06-15 13:01 ` Christoph Hellwig
  2018-06-15 13:02 ` [PATCH 15/24] xfs: remove xfs_reflink_find_cow_mapping Christoph Hellwig
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 6 ++----
 fs/xfs/libxfs/xfs_bmap.h | 3 ---
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7b0e2b551e23..4b5e014417d2 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3799,8 +3799,7 @@ xfs_bmapi_update_map(
 		   mval[-1].br_startblock != HOLESTARTBLOCK &&
 		   mval->br_startblock == mval[-1].br_startblock +
 					  mval[-1].br_blockcount &&
-		   ((flags & XFS_BMAPI_IGSTATE) ||
-			mval[-1].br_state == mval->br_state)) {
+		   mval[-1].br_state == mval->br_state) {
 		ASSERT(mval->br_startoff ==
 		       mval[-1].br_startoff + mval[-1].br_blockcount);
 		mval[-1].br_blockcount += mval->br_blockcount;
@@ -3845,7 +3844,7 @@ xfs_bmapi_read(
 
 	ASSERT(*nmap >= 1);
 	ASSERT(!(flags & ~(XFS_BMAPI_ATTRFORK|XFS_BMAPI_ENTIRE|
-			   XFS_BMAPI_IGSTATE|XFS_BMAPI_COWFORK)));
+			   XFS_BMAPI_COWFORK)));
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED|XFS_ILOCK_EXCL));
 
 	if (unlikely(XFS_TEST_ERROR(
@@ -4290,7 +4289,6 @@ xfs_bmapi_write(
 
 	ASSERT(*nmap >= 1);
 	ASSERT(*nmap <= XFS_BMAP_MAX_NMAP);
-	ASSERT(!(flags & XFS_BMAPI_IGSTATE));
 	ASSERT(tp != NULL ||
 	       (flags & (XFS_BMAPI_CONVERT | XFS_BMAPI_COWFORK)) ==
 			(XFS_BMAPI_CONVERT | XFS_BMAPI_COWFORK));
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 2c233f9f1a26..a845fe57d1b5 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -80,8 +80,6 @@ struct xfs_extent_free_item
 #define XFS_BMAPI_METADATA	0x002	/* mapping metadata not user data */
 #define XFS_BMAPI_ATTRFORK	0x004	/* use attribute fork not data */
 #define XFS_BMAPI_PREALLOC	0x008	/* preallocation op: unwritten space */
-#define XFS_BMAPI_IGSTATE	0x010	/* Ignore state - */
-					/* combine contig. space */
 #define XFS_BMAPI_CONTIG	0x020	/* must allocate only one extent */
 /*
  * unwritten extent conversion - this needs write cache flushing and no additional
@@ -128,7 +126,6 @@ struct xfs_extent_free_item
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
 	{ XFS_BMAPI_ATTRFORK,	"ATTRFORK" }, \
 	{ XFS_BMAPI_PREALLOC,	"PREALLOC" }, \
-	{ XFS_BMAPI_IGSTATE,	"IGSTATE" }, \
 	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
 	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
 	{ XFS_BMAPI_ZERO,	"ZERO" }, \
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 15/24] xfs: remove xfs_reflink_find_cow_mapping
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (13 preceding siblings ...)
  2018-06-15 13:01 ` [PATCH 14/24] xfs: remove the now unused XFS_BMAPI_IGSTATE flag Christoph Hellwig
@ 2018-06-15 13:02 ` Christoph Hellwig
  2018-06-15 13:02 ` [PATCH 16/24] xfs: simplify xfs_map_blocks by using xfs_iext_lookup_extent directly Christoph Hellwig
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:02 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

We only have one caller left, and open coding the simple extent list
lookup in it allows us both to make the code more understandable and to
reuse calculations and variables already present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c    | 19 +++++++++++++------
 fs/xfs/xfs_reflink.c | 30 ------------------------------
 fs/xfs/xfs_reflink.h |  2 --
 fs/xfs/xfs_trace.h   |  1 -
 4 files changed, 13 insertions(+), 39 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 165891ecb4ba..17aa9dc9a8c3 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -382,9 +382,10 @@ xfs_map_blocks(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	ssize_t			count = i_blocksize(inode);
-	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset), end_fsb;
+	xfs_fileoff_t		offset_fsb, end_fsb;
 	struct xfs_bmbt_irec	imap;
 	int			whichfork = XFS_DATA_FORK;
+	struct xfs_iext_cursor	icur;
 	int			error = 0;
 	int			nimaps = 1;
 
@@ -396,8 +397,18 @@ xfs_map_blocks(
 	       (ip->i_df.if_flags & XFS_IFEXTENTS));
 	ASSERT(offset <= mp->m_super->s_maxbytes);
 
+	if (offset > mp->m_super->s_maxbytes - count)
+		count = mp->m_super->s_maxbytes - offset;
+	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+	offset_fsb = XFS_B_TO_FSBT(mp, offset);
+
+	/*
+	 * Check if this is offset is covered by a COW extents, and if yes use
+	 * Check if this offset is covered by a COW extent, and if so use
+	 */
 	if (xfs_is_reflink_inode(ip) &&
-	    xfs_reflink_find_cow_mapping(ip, offset, &imap)) {
+	    xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, &imap) &&
+	    imap.br_startoff <= offset_fsb) {
 		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 		/*
 		 * Truncate can race with writeback since writeback doesn't
@@ -432,10 +443,6 @@ xfs_map_blocks(
 	 * offset.  This will convert delayed allocations (including COW ones)
 	 * into real extents.
 	 */
-	if (offset > mp->m_super->s_maxbytes - count)
-		count = mp->m_super->s_maxbytes - offset;
-	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
-	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
 				&imap, &nimaps, XFS_BMAPI_ENTIRE);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index d0092c864458..ff76bc56ff3d 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -484,36 +484,6 @@ xfs_reflink_allocate_cow(
 	return error;
 }
 
-/*
- * Find the CoW reservation for a given byte offset of a file.
- */
-bool
-xfs_reflink_find_cow_mapping(
-	struct xfs_inode		*ip,
-	xfs_off_t			offset,
-	struct xfs_bmbt_irec		*imap)
-{
-	struct xfs_ifork		*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
-	xfs_fileoff_t			offset_fsb;
-	struct xfs_bmbt_irec		got;
-	struct xfs_iext_cursor		icur;
-
-	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED));
-
-	if (!xfs_is_reflink_inode(ip))
-		return false;
-	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
-	if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, &icur, &got))
-		return false;
-	if (got.br_startoff > offset_fsb)
-		return false;
-
-	trace_xfs_reflink_find_cow_mapping(ip, offset, 1, XFS_IO_OVERWRITE,
-			&got);
-	*imap = got;
-	return true;
-}
-
 /*
  * Cancel CoW reservations for some block range of an inode.
  *
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index c0a25609007b..e8d4d50c629f 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -32,8 +32,6 @@ extern int xfs_reflink_allocate_cow(struct xfs_inode *ip,
 		struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode);
 extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
-extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
-		struct xfs_bmbt_irec *imap);
 
 extern int xfs_reflink_cancel_cow_blocks(struct xfs_inode *ip,
 		struct xfs_trans **tpp, xfs_fileoff_t offset_fsb,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 041045a8d65c..666844532030 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3227,7 +3227,6 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_convert_cow);
 DEFINE_RW_EVENT(xfs_reflink_reserve_cow);
 
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_bounce_dio_write);
-DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
 
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cancel_cow_range);
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_end_cow);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 16/24] xfs: simplify xfs_map_blocks by using xfs_iext_lookup_extent directly
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (14 preceding siblings ...)
  2018-06-15 13:02 ` [PATCH 15/24] xfs: remove xfs_reflink_find_cow_mapping Christoph Hellwig
@ 2018-06-15 13:02 ` Christoph Hellwig
  2018-06-15 13:02 ` [PATCH 17/24] xfs: remove the imap_valid flag Christoph Hellwig
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:02 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

xfs_bmapi_read adds no value in xfs_map_blocks.  Replace it with a
direct call to the low-level extent lookup function.

Note that we now always pass a 0 length to the trace points as we ask
for an unspecified len.
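
Concretely, when the lookup lands in a gap (a sketch with made-up block
numbers, not taken from the patch): if writeback asks about file block 10
and the next data fork extent starts at block 14, we fake up a cacheable
hole mapping covering blocks 10-13:

	/* offset_fsb == 10, imap.br_startoff == 14 after the lookup */
	imap.br_blockcount = imap.br_startoff - offset_fsb;	/* 4 blocks */
	imap.br_startoff = offset_fsb;		/* hole starts at block 10 */
	imap.br_startblock = HOLESTARTBLOCK;
	wpc->io_type = XFS_IO_HOLE;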

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c | 19 +++++--------------
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 17aa9dc9a8c3..25239bc189d4 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -387,7 +387,6 @@ xfs_map_blocks(
 	int			whichfork = XFS_DATA_FORK;
 	struct xfs_iext_cursor	icur;
 	int			error = 0;
-	int			nimaps = 1;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
@@ -443,24 +442,16 @@ xfs_map_blocks(
 	 * offset.  This will convert delayed allocations (including COW ones)
 	 * into real extents.
 	 */
-	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
-				&imap, &nimaps, XFS_BMAPI_ENTIRE);
+	if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap))
+		imap.br_startoff = end_fsb;	/* fake a hole past EOF */
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-	if (error)
-		return error;
 
-	if (!nimaps) {
-		/*
-		 * Lookup returns no match? Beyond eof? regardless,
-		 * return it as a hole so we don't write it
-		 */
+	if (imap.br_startoff > offset_fsb) {
+		/* landed in a hole or beyond EOF */
+		imap.br_blockcount = imap.br_startoff - offset_fsb;
 		imap.br_startoff = offset_fsb;
-		imap.br_blockcount = end_fsb - offset_fsb;
 		imap.br_startblock = HOLESTARTBLOCK;
 		wpc->io_type = XFS_IO_HOLE;
-	} else if (imap.br_startblock == HOLESTARTBLOCK) {
-		/* landed in a hole */
-		wpc->io_type = XFS_IO_HOLE;
 	} else {
 		if (isnullstartblock(imap.br_startblock)) {
 			/* got a delalloc extent */
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 17/24] xfs: remove the imap_valid flag
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (15 preceding siblings ...)
  2018-06-15 13:02 ` [PATCH 16/24] xfs: simplify xfs_map_blocks by using xfs_iext_lookup_extent directly Christoph Hellwig
@ 2018-06-15 13:02 ` Christoph Hellwig
  2018-06-15 13:02 ` [PATCH 18/24] xfs: don't look at buffer heads in xfs_add_to_ioend Christoph Hellwig
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:02 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Simplify the way we check for a valid imap - we know we have a valid
mapping after xfs_map_blocks returned successfully, and we know we can
call xfs_imap_valid on any imap, as it will always fail on a
zero-initialized map.

We can also remove the xfs_imap_valid function and fold it into
xfs_map_blocks now.
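
The check that remains, condensed from the hunk below (a sketch; the
shutdown test and locking are left out):

	xfs_trim_extent_eof(&wpc->imap, ip);	/* guard against eofblocks trim */
	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
		     offset_fsb < wpc->imap.br_startoff +
				  wpc->imap.br_blockcount;
	if (imap_valid &&
	    (!xfs_is_reflink_inode(ip) || wpc->io_type == XFS_IO_COW))
		return 0;	/* the cached mapping is still good */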

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c | 75 ++++++++++++++++++++---------------------------
 1 file changed, 32 insertions(+), 43 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 25239bc189d4..7da3fe49b4de 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -42,7 +42,6 @@
  */
 struct xfs_writepage_ctx {
 	struct xfs_bmbt_irec    imap;
-	bool			imap_valid;
 	unsigned int		io_type;
 	struct xfs_ioend	*ioend;
 	sector_t		last_block;
@@ -382,15 +381,41 @@ xfs_map_blocks(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	ssize_t			count = i_blocksize(inode);
-	xfs_fileoff_t		offset_fsb, end_fsb;
+	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset), end_fsb;
 	struct xfs_bmbt_irec	imap;
 	int			whichfork = XFS_DATA_FORK;
 	struct xfs_iext_cursor	icur;
+	bool			imap_valid;
 	int			error = 0;
 
+	/*
+	 * We have to make sure the cached mapping is within EOF to protect
+	 * against eofblocks trimming on file release leaving us with a stale
+	 * mapping. Otherwise, a page for a subsequent file extending buffered
+	 * write could get picked up by this writeback cycle and written to the
+	 * wrong blocks.
+	 *
+	 * Note that what we really want here is a generic mapping invalidation
+	 * mechanism to protect us from arbitrary extent modifying contexts, not
+	 * just eofblocks.
+	 */
+	xfs_trim_extent_eof(&wpc->imap, ip);
+
+	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
+		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
+	if (imap_valid &&
+	    (!xfs_is_reflink_inode(ip) || wpc->io_type == XFS_IO_COW))
+		return 0;
+
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
+	/*
+	 * If we don't have a valid map, now it's time to get a new one for this
+	 * offset.  This will convert delayed allocations (including COW ones)
+	 * into real extents.  If we return without a valid map, it means we
+	 * landed in a hole and we skip the block.
+	 */
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
 	       (ip->i_df.if_flags & XFS_IFEXTENTS));
@@ -399,7 +424,6 @@ xfs_map_blocks(
 	if (offset > mp->m_super->s_maxbytes - count)
 		count = mp->m_super->s_maxbytes - offset;
 	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
-	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 
 	/*
 	 * Check if this offset is covered by a COW extent, and if so use
@@ -432,7 +456,7 @@ xfs_map_blocks(
 	/*
 	 * Map valid and no COW extent in the way?  We're done.
 	 */
-	if (wpc->imap_valid) {
+	if (imap_valid) {
 		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 		return 0;
 	}
@@ -477,31 +501,6 @@ xfs_map_blocks(
 	return 0;
 }
 
-STATIC bool
-xfs_imap_valid(
-	struct inode		*inode,
-	struct xfs_bmbt_irec	*imap,
-	xfs_off_t		offset)
-{
-	offset >>= inode->i_blkbits;
-
-	/*
-	 * We have to make sure the cached mapping is within EOF to protect
-	 * against eofblocks trimming on file release leaving us with a stale
-	 * mapping. Otherwise, a page for a subsequent file extending buffered
-	 * write could get picked up by this writeback cycle and written to the
-	 * wrong blocks.
-	 *
-	 * Note that what we really want here is a generic mapping invalidation
-	 * mechanism to protect us from arbitrary extent modifying contexts, not
-	 * just eofblocks.
-	 */
-	xfs_trim_extent_eof(imap, XFS_I(inode));
-
-	return offset >= imap->br_startoff &&
-		offset < imap->br_startoff + imap->br_blockcount;
-}
-
 STATIC void
 xfs_start_buffer_writeback(
 	struct buffer_head	*bh)
@@ -858,20 +857,11 @@ xfs_writepage_map(
 			continue;
 		}
 
-		if (wpc->imap_valid)
-			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
-							 file_offset);
-		if (!wpc->imap_valid ||
-		    (xfs_is_reflink_inode(XFS_I(inode)) &&
-		     wpc->io_type != XFS_IO_COW)) {
-			error = xfs_map_blocks(wpc, inode, file_offset);
-			if (error)
-				goto out;
-			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
-							 file_offset);
-		}
+		error = xfs_map_blocks(wpc, inode, file_offset);
+		if (error)
+			goto out;
 
-		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE)
+		if (wpc->io_type == XFS_IO_HOLE)
 			continue;
 
 		lock_buffer(bh);
@@ -881,7 +871,6 @@ xfs_writepage_map(
 	}
 
 	ASSERT(wpc->ioend || list_empty(&submit_list));
-
 out:
 	/*
 	 * On error, we have to fail the ioend here because we have locked
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 18/24] xfs: don't look at buffer heads in xfs_add_to_ioend
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (16 preceding siblings ...)
  2018-06-15 13:02 ` [PATCH 17/24] xfs: remove the imap_valid flag Christoph Hellwig
@ 2018-06-15 13:02 ` Christoph Hellwig
  2018-06-15 13:02 ` [PATCH 19/24] xfs: move all writeback buffer_head manipulation into xfs_map_at_offset Christoph Hellwig
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:02 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Calculate all information for the bio based on the passed in information
without requiring a buffer_head structure.
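
To illustrate the new sector arithmetic, here is a small stand-alone
userspace model (an illustrative sketch only: the struct, fsb_to_db()
and all numbers are made up, assuming 4k filesystem blocks and 512-byte
sectors):

	#include <stdint.h>
	#include <stdio.h>

	#define FSB_SHIFT	12	/* 4k filesystem blocks */

	struct irec {
		uint64_t br_startoff;	/* file offset, in fs blocks */
		uint64_t br_startblock;	/* disk block number */
	};

	/* fs block -> 512-byte sector, modeled as a linear mapping */
	static uint64_t fsb_to_db(uint64_t fsb)
	{
		return fsb << (FSB_SHIFT - 9);
	}

	int main(void)
	{
		struct irec imap = { .br_startoff = 16, .br_startblock = 1000 };
		uint64_t offset = 70000;	/* byte offset under writeback */
		uint64_t sector;

		/* start of the extent on disk plus the byte delta into it */
		sector = fsb_to_db(imap.br_startblock) +
			 ((offset - (imap.br_startoff << FSB_SHIFT)) >> 9);
		printf("sector %llu\n", (unsigned long long)sector); /* 8008 */
		return 0;
	}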

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c | 68 ++++++++++++++++++++++-------------------------
 1 file changed, 32 insertions(+), 36 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 7da3fe49b4de..d6bd38d4243e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -44,7 +44,6 @@ struct xfs_writepage_ctx {
 	struct xfs_bmbt_irec    imap;
 	unsigned int		io_type;
 	struct xfs_ioend	*ioend;
-	sector_t		last_block;
 };
 
 void
@@ -540,11 +539,6 @@ xfs_start_page_writeback(
 	unlock_page(page);
 }
 
-static inline int xfs_bio_add_buffer(struct bio *bio, struct buffer_head *bh)
-{
-	return bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
-}
-
 /*
  * Submit the bio for an ioend. We are passed an ioend with a bio attached to
  * it, and we submit that bio. The ioend may be used for multiple bio
@@ -599,27 +593,20 @@ xfs_submit_ioend(
 	return 0;
 }
 
-static void
-xfs_init_bio_from_bh(
-	struct bio		*bio,
-	struct buffer_head	*bh)
-{
-	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
-	bio_set_dev(bio, bh->b_bdev);
-}
-
 static struct xfs_ioend *
 xfs_alloc_ioend(
 	struct inode		*inode,
 	unsigned int		type,
 	xfs_off_t		offset,
-	struct buffer_head	*bh)
+	struct block_device	*bdev,
+	sector_t		sector)
 {
 	struct xfs_ioend	*ioend;
 	struct bio		*bio;
 
 	bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, &xfs_ioend_bioset);
-	xfs_init_bio_from_bh(bio, bh);
+	bio_set_dev(bio, bdev);
+	bio->bi_iter.bi_sector = sector;
 
 	ioend = container_of(bio, struct xfs_ioend, io_inline_bio);
 	INIT_LIST_HEAD(&ioend->io_list);
@@ -644,13 +631,14 @@ static void
 xfs_chain_bio(
 	struct xfs_ioend	*ioend,
 	struct writeback_control *wbc,
-	struct buffer_head	*bh)
+	struct block_device	*bdev,
+	sector_t		sector)
 {
 	struct bio *new;
 
 	new = bio_alloc(GFP_NOFS, BIO_MAX_PAGES);
-	xfs_init_bio_from_bh(new, bh);
-
+	bio_set_dev(new, bdev);
+	new->bi_iter.bi_sector = sector;
 	bio_chain(ioend->io_bio, new);
 	bio_get(ioend->io_bio);		/* for xfs_destroy_ioend */
 	ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
@@ -660,39 +648,45 @@ xfs_chain_bio(
 }
 
 /*
- * Test to see if we've been building up a completion structure for
- * earlier buffers -- if so, we try to append to this ioend if we
- * can, otherwise we finish off any current ioend and start another.
- * Return the ioend we finished off so that the caller can submit it
- * once it has finished processing the dirty page.
+ * Test to see if we have an existing ioend structure that we could append to
+ * first, otherwise finish off the current ioend and start another.
  */
 STATIC void
 xfs_add_to_ioend(
 	struct inode		*inode,
-	struct buffer_head	*bh,
 	xfs_off_t		offset,
+	struct page		*page,
 	struct xfs_writepage_ctx *wpc,
 	struct writeback_control *wbc,
 	struct list_head	*iolist)
 {
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct block_device	*bdev = xfs_find_bdev_for_inode(inode);
+	unsigned		len = i_blocksize(inode);
+	unsigned		poff = offset & (PAGE_SIZE - 1);
+	sector_t		sector;
+
+	sector = xfs_fsb_to_db(ip, wpc->imap.br_startblock) +
+		((offset - XFS_FSB_TO_B(mp, wpc->imap.br_startoff)) >> 9);
+
 	if (!wpc->ioend || wpc->io_type != wpc->ioend->io_type ||
-	    bh->b_blocknr != wpc->last_block + 1 ||
+	    sector != bio_end_sector(wpc->ioend->io_bio) ||
 	    offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
 		if (wpc->ioend)
 			list_add(&wpc->ioend->io_list, iolist);
-		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
+		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset,
+				bdev, sector);
 	}
 
 	/*
-	 * If the buffer doesn't fit into the bio we need to allocate a new
-	 * one.  This shouldn't happen more than once for a given buffer.
+	 * If the block doesn't fit into the bio we need to allocate a new
+	 * one.  This shouldn't happen more than once for a given block.
 	 */
-	while (xfs_bio_add_buffer(wpc->ioend->io_bio, bh) != bh->b_size)
-		xfs_chain_bio(wpc->ioend, wbc, bh);
+	while (bio_add_page(wpc->ioend->io_bio, page, len, poff) != len)
+		xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
 
-	wpc->ioend->io_size += bh->b_size;
-	wpc->last_block = bh->b_blocknr;
-	xfs_start_buffer_writeback(bh);
+	wpc->ioend->io_size += len;
 }
 
 STATIC void
@@ -866,7 +860,9 @@ xfs_writepage_map(
 
 		lock_buffer(bh);
 		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
-		xfs_add_to_ioend(inode, bh, file_offset, wpc, wbc, &submit_list);
+		xfs_add_to_ioend(inode, file_offset, page, wpc, wbc,
+				&submit_list);
+		xfs_start_buffer_writeback(bh);
 		count++;
 	}
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 19/24] xfs: move all writeback buffer_head manipulation into xfs_map_at_offset
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (17 preceding siblings ...)
  2018-06-15 13:02 ` [PATCH 18/24] xfs: don't look at buffer heads in xfs_add_to_ioend Christoph Hellwig
@ 2018-06-15 13:02 ` Christoph Hellwig
  2018-06-15 13:02 ` [PATCH 20/24] xfs: remove xfs_start_page_writeback Christoph Hellwig
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:02 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

This keeps it in a single place so it can be made optional more easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c | 22 +++++-----------------
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d6bd38d4243e..261fb011ef9d 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -500,21 +500,6 @@ xfs_map_blocks(
 	return 0;
 }
 
-STATIC void
-xfs_start_buffer_writeback(
-	struct buffer_head	*bh)
-{
-	ASSERT(buffer_mapped(bh));
-	ASSERT(buffer_locked(bh));
-	ASSERT(!buffer_delay(bh));
-	ASSERT(!buffer_unwritten(bh));
-
-	bh->b_end_io = NULL;
-	set_buffer_async_write(bh);
-	set_buffer_uptodate(bh);
-	clear_buffer_dirty(bh);
-}
-
 STATIC void
 xfs_start_page_writeback(
 	struct page		*page,
@@ -723,6 +708,7 @@ xfs_map_at_offset(
 	ASSERT(imap->br_startblock != HOLESTARTBLOCK);
 	ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
 
+	lock_buffer(bh);
 	xfs_map_buffer(inode, bh, imap, offset);
 	set_buffer_mapped(bh);
 	clear_buffer_delay(bh);
@@ -735,6 +721,10 @@ xfs_map_at_offset(
 	 * set the bdev now.
 	 */
 	bh->b_bdev = xfs_find_bdev_for_inode(inode);
+	bh->b_end_io = NULL;
+	set_buffer_async_write(bh);
+	set_buffer_uptodate(bh);
+	clear_buffer_dirty(bh);
 }
 
 STATIC void
@@ -858,11 +848,9 @@ xfs_writepage_map(
 		if (wpc->io_type == XFS_IO_HOLE)
 			continue;
 
-		lock_buffer(bh);
 		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
 		xfs_add_to_ioend(inode, file_offset, page, wpc, wbc,
 				&submit_list);
-		xfs_start_buffer_writeback(bh);
 		count++;
 	}
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 20/24] xfs: remove xfs_start_page_writeback
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (18 preceding siblings ...)
  2018-06-15 13:02 ` [PATCH 19/24] xfs: move all writeback buffer_head manipulation into xfs_map_at_offset Christoph Hellwig
@ 2018-06-15 13:02 ` Christoph Hellwig
  2018-06-15 13:02 ` [PATCH 21/24] xfs: refactor the tail of xfs_writepage_map Christoph Hellwig
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:02 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

This helper only has two callers, one of them passing a constant
argument.  Remove it to make pending changes to the code a little easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c | 47 +++++++++++++++++++++--------------------------
 1 file changed, 21 insertions(+), 26 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 261fb011ef9d..52d4f5503da7 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -500,30 +500,6 @@ xfs_map_blocks(
 	return 0;
 }
 
-STATIC void
-xfs_start_page_writeback(
-	struct page		*page,
-	int			clear_dirty)
-{
-	ASSERT(PageLocked(page));
-	ASSERT(!PageWriteback(page));
-
-	/*
-	 * if the page was not fully cleaned, we need to ensure that the higher
-	 * layers come back to it correctly. That means we need to keep the page
-	 * dirty, and for WB_SYNC_ALL writeback we need to ensure the
-	 * PAGECACHE_TAG_TOWRITE index mark is not removed so another attempt to
-	 * write this page in this writeback sweep will be made.
-	 */
-	if (clear_dirty) {
-		clear_page_dirty_for_io(page);
-		set_page_writeback(page);
-	} else
-		set_page_writeback_keepwrite(page);
-
-	unlock_page(page);
-}
-
 /*
  * Submit the bio for an ioend. We are passed an ioend with a bio attached to
  * it, and we submit that bio. The ioend may be used for multiple bio
@@ -856,6 +832,9 @@ xfs_writepage_map(
 
 	ASSERT(wpc->ioend || list_empty(&submit_list));
 out:
+	ASSERT(PageLocked(page));
+	ASSERT(!PageWriteback(page));
+
 	/*
 	 * On error, we have to fail the ioend here because we have locked
 	 * buffers in the ioend. If we don't do this, we'll deadlock
@@ -874,7 +853,21 @@ xfs_writepage_map(
 	 * treated correctly on error.
 	 */
 	if (count) {
-		xfs_start_page_writeback(page, !error);
+		/*
+		 * If the page was not fully cleaned, we need to ensure that the
+		 * higher layers come back to it correctly.  That means we need
+		 * to keep the page dirty, and for WB_SYNC_ALL writeback we need
+		 * to ensure the PAGECACHE_TAG_TOWRITE index mark is not removed
+		 * so another attempt to write this page in this writeback sweep
+		 * will be made.
+		 */
+		if (error) {
+			set_page_writeback_keepwrite(page);
+		} else {
+			clear_page_dirty_for_io(page);
+			set_page_writeback(page);
+		}
+		unlock_page(page);
 
 		/*
 		 * Preserve the original error if there was one, otherwise catch
@@ -899,7 +892,9 @@ xfs_writepage_map(
 		 * race with a partial page truncate on a sub-page block sized
 		 * filesystem. In that case we need to mark the page clean.
 		 */
-		xfs_start_page_writeback(page, 1);
+		clear_page_dirty_for_io(page);
+		set_page_writeback(page);
+		unlock_page(page);
 		end_page_writeback(page);
 	}
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 21/24] xfs: refactor the tail of xfs_writepage_map
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (19 preceding siblings ...)
  2018-06-15 13:02 ` [PATCH 20/24] xfs: remove xfs_start_page_writeback Christoph Hellwig
@ 2018-06-15 13:02 ` Christoph Hellwig
  2018-06-15 13:02 ` [PATCH 22/24] xfs: allow writeback on pages without buffer heads Christoph Hellwig
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:02 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Rejuggle how we deal with the error vs non-error and have-ioend vs
no-ioend cases to keep the fast path streamlined and the duplicate code
to a minimum.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c | 65 +++++++++++++++++++++++------------------------
 1 file changed, 32 insertions(+), 33 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 52d4f5503da7..6f8ac9cd3fb3 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -852,7 +852,14 @@ xfs_writepage_map(
 	 * submission of outstanding ioends on the writepage context so they are
 	 * treated correctly on error.
 	 */
-	if (count) {
+	if (unlikely(error)) {
+		if (!count) {
+			xfs_aops_discard_page(page);
+			ClearPageUptodate(page);
+			unlock_page(page);
+			goto done;
+		}
+
 		/*
 		 * If the page was not fully cleaned, we need to ensure that the
 		 * higher layers come back to it correctly.  That means we need
@@ -861,43 +868,35 @@ xfs_writepage_map(
 		 * so another attempt to write this page in this writeback sweep
 		 * will be made.
 		 */
-		if (error) {
-			set_page_writeback_keepwrite(page);
-		} else {
-			clear_page_dirty_for_io(page);
-			set_page_writeback(page);
-		}
-		unlock_page(page);
-
-		/*
-		 * Preserve the original error if there was one, otherwise catch
-		 * submission errors here and propagate into subsequent ioend
-		 * submissions.
-		 */
-		list_for_each_entry_safe(ioend, next, &submit_list, io_list) {
-			int error2;
-
-			list_del_init(&ioend->io_list);
-			error2 = xfs_submit_ioend(wbc, ioend, error);
-			if (error2 && !error)
-				error = error2;
-		}
-	} else if (error) {
-		xfs_aops_discard_page(page);
-		ClearPageUptodate(page);
-		unlock_page(page);
+		set_page_writeback_keepwrite(page);
 	} else {
-		/*
-		 * We can end up here with no error and nothing to write if we
-		 * race with a partial page truncate on a sub-page block sized
-		 * filesystem. In that case we need to mark the page clean.
-		 */
 		clear_page_dirty_for_io(page);
 		set_page_writeback(page);
-		unlock_page(page);
-		end_page_writeback(page);
 	}
 
+	unlock_page(page);
+
+	/*
+	 * Preserve the original error if there was one, otherwise catch
+	 * submission errors here and propagate into subsequent ioend
+	 * submissions.
+	 */
+	list_for_each_entry_safe(ioend, next, &submit_list, io_list) {
+		int error2;
+
+		list_del_init(&ioend->io_list);
+		error2 = xfs_submit_ioend(wbc, ioend, error);
+		if (error2 && !error)
+			error = error2;
+	}
+
+	/*
+	 * We can end up here with no error and nothing to write if we race with
+	 * a partial page truncate on a sub-page block sized filesystem.
+	 */
+	if (!count)
+		end_page_writeback(page);
+done:
 	mapping_set_error(page->mapping, error);
 	return error;
 }
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 22/24] xfs: allow writeback on pages without buffer heads
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (20 preceding siblings ...)
  2018-06-15 13:02 ` [PATCH 21/24] xfs: refactor the tail of xfs_writepage_map Christoph Hellwig
@ 2018-06-15 13:02 ` Christoph Hellwig
  2018-06-15 13:02 ` [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O " Christoph Hellwig
  2018-06-15 13:02 ` [PATCH 24/24] xfs: add support for sub-pagesize writeback without buffer_heads Christoph Hellwig
  23 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:02 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Disable the IOMAP_F_BUFFER_HEAD flag on file systems with a block size
equal to the page size, and deal with pages without buffer heads in
writeback.  Thanks to the previous refactoring this is basically trivial
now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c  | 50 +++++++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_iomap.c |  3 ++-
 2 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 6f8ac9cd3fb3..556aad821a17 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -91,6 +91,19 @@ xfs_find_daxdev_for_inode(
 		return mp->m_ddev_targp->bt_daxdev;
 }
 
+static void
+xfs_finish_page_writeback(
+	struct inode		*inode,
+	struct bio_vec		*bvec,
+	int			error)
+{
+	if (error) {
+		SetPageError(bvec->bv_page);
+		mapping_set_error(inode->i_mapping, -EIO);
+	}
+	end_page_writeback(bvec->bv_page);
+}
+
 /*
  * We're now finished for good with this page.  Update the page state via the
  * associated buffer_heads, paying attention to the start and end offsets that
@@ -103,7 +116,7 @@ xfs_find_daxdev_for_inode(
  * and buffers potentially freed after every call to end_buffer_async_write.
  */
 static void
-xfs_finish_page_writeback(
+xfs_finish_buffer_writeback(
 	struct inode		*inode,
 	struct bio_vec		*bvec,
 	int			error)
@@ -178,9 +191,12 @@ xfs_destroy_ioend(
 			next = bio->bi_private;
 
 		/* walk each page on bio, ending page IO on them */
-		bio_for_each_segment_all(bvec, bio, i)
-			xfs_finish_page_writeback(inode, bvec, error);
-
+		bio_for_each_segment_all(bvec, bio, i) {
+			if (page_has_buffers(bvec->bv_page))
+				xfs_finish_buffer_writeback(inode, bvec, error);
+			else
+				xfs_finish_page_writeback(inode, bvec, error);
+		}
 		bio_put(bio);
 	}
 
@@ -787,13 +803,16 @@ xfs_writepage_map(
 {
 	LIST_HEAD(submit_list);
 	struct xfs_ioend	*ioend, *next;
-	struct buffer_head	*bh;
+	struct buffer_head	*bh = NULL;
 	ssize_t			len = i_blocksize(inode);
 	uint64_t		file_offset;	/* file offset of page */
 	unsigned		poffset;	/* offset into page */
 	int			error = 0;
 	int			count = 0;
 
+	if (page_has_buffers(page))
+		bh = page_buffers(page);
+
 	/*
 	 * Walk the blocks on the page, and if we run off the end of the
 	 * current map or find the current map invalid, grab a new one.
@@ -802,18 +821,17 @@ xfs_writepage_map(
 	 * replace the bufferhead with some other state tracking mechanism in
 	 * future.
 	 */
-	file_offset = page_offset(page);
-	bh = page_buffers(page);
-	for (poffset = 0;
+	for (poffset = 0, file_offset = page_offset(page);
 	     poffset < PAGE_SIZE;
-	     poffset += len, file_offset += len, bh = bh->b_this_page) {
+	     poffset += len, file_offset += len) {
 		/* past the range we are writing, so nothing more to write. */
 		if (file_offset >= end_offset)
 			break;
 
-		if (!buffer_uptodate(bh)) {
+		if (bh && !buffer_uptodate(bh)) {
 			if (PageUptodate(page))
 				ASSERT(buffer_mapped(bh));
+			bh = bh->b_this_page;
 			continue;
 		}
 
@@ -821,10 +839,16 @@ xfs_writepage_map(
 		if (error)
 			goto out;
 
-		if (wpc->io_type == XFS_IO_HOLE)
+		if (wpc->io_type == XFS_IO_HOLE) {
+			if (bh)
+				bh = bh->b_this_page;
 			continue;
+		}
 
-		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
+		if (bh) {
+			xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
+			bh = bh->b_this_page;
+		}
 		xfs_add_to_ioend(inode, file_offset, page, wpc, wbc,
 				&submit_list);
 		count++;
@@ -923,8 +947,6 @@ xfs_do_writepage(
 
 	trace_xfs_writepage(inode, page, 0, 0);
 
-	ASSERT(page_has_buffers(page));
-
 	/*
 	 * Refuse to write the page out if we are called from reclaim context.
 	 *
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 758ad9572d04..a29c27e86f60 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1033,7 +1033,8 @@ xfs_file_iomap_begin(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	iomap->flags |= IOMAP_F_BUFFER_HEAD;
+	if (i_blocksize(inode) < PAGE_SIZE)
+		iomap->flags |= IOMAP_F_BUFFER_HEAD;
 
 	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
 			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O without buffer heads
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (21 preceding siblings ...)
  2018-06-15 13:02 ` [PATCH 22/24] xfs: allow writeback on pages without buffer heads Christoph Hellwig
@ 2018-06-15 13:02 ` Christoph Hellwig
  2018-06-19 16:52   ` Brian Foster
  2018-06-15 13:02 ` [PATCH 24/24] xfs: add support for sub-pagesize writeback without buffer_heads Christoph Hellwig
  23 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:02 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

After already supporting a simple implementation of buffered writes for
the blocksize == PAGE_SIZE case in the last commit, this adds full support
even for smaller block sizes.  There are three bits of per-block
information in the buffer_head structure that really matter for the iomap
read and write path:

 - uptodate status (BH_uptodate)
 - marked as currently under read I/O (BH_Async_Read)
 - marked as currently under write I/O (BH_Async_Write)

Instead of having new per-block structures this now adds a per-page
structure called struct iomap_page to track this information in a slightly
different form:

 - a bitmap for the per-block uptodate status.  For the worst case of a 64k
   page size system this bitmap needs to contain 128 bits.  For the
   typical 4k page size case it only needs 8 bits, although we still
   need a full unsigned long due to the way the atomic bitmap API works.
 - two atomic_t counters are used to track the outstanding read and write
   counts

There is quite a bit of boilerplate code as the buffered I/O path uses
various helper methods, but the actual code is very straightforward.
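
As a stand-alone illustration of the bitmap scheme (a userspace sketch
with made-up helpers, assuming 4k pages and 512-byte blocks, i.e. 8
uptodate bits per page):

	#include <stdbool.h>
	#include <stdio.h>

	#define PAGE_SIZE	4096
	#define BLOCK_SIZE	512
	#define BLOCKS_PER_PAGE	(PAGE_SIZE / BLOCK_SIZE)

	static unsigned long uptodate;	/* bit i set == block i is uptodate */

	/* mark all blocks touching [off, off + len) uptodate */
	static void set_range_uptodate(unsigned off, unsigned len)
	{
		unsigned first = off / BLOCK_SIZE;
		unsigned last = (off + len - 1) / BLOCK_SIZE;
		unsigned i;

		for (i = first; i <= last; i++)
			uptodate |= 1UL << i;
	}

	/* the page itself is uptodate once every block is */
	static bool page_uptodate(void)
	{
		return uptodate == (1UL << BLOCKS_PER_PAGE) - 1;
	}

	int main(void)
	{
		set_range_uptodate(0, 2048);		/* first four blocks */
		printf("%d\n", page_uptodate());	/* prints 0 */
		set_range_uptodate(2048, 2048);		/* remaining four */
		printf("%d\n", page_uptodate());	/* prints 1 */
		return 0;
	}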

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c            | 262 ++++++++++++++++++++++++++++++++++++++----
 include/linux/iomap.h |  31 +++++
 2 files changed, 272 insertions(+), 21 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index a504077bb38f..c59d1922991d 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -17,6 +17,7 @@
 #include <linux/iomap.h>
 #include <linux/uaccess.h>
 #include <linux/gfp.h>
+#include <linux/migrate.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
 #include <linux/swap.h>
@@ -104,6 +105,121 @@ iomap_sector(struct iomap *iomap, loff_t pos)
 	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
 }
 
+static struct iomap_page *
+iomap_page_create(struct inode *inode, struct page *page)
+{
+	struct iomap_page *iop = to_iomap_page(page);
+
+	if (iop || i_blocksize(inode) == PAGE_SIZE)
+		return iop;
+
+	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
+	atomic_set(&iop->read_count, 0);
+	atomic_set(&iop->write_count, 0);
+	bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
+	set_page_private(page, (unsigned long)iop);
+	SetPagePrivate(page);
+	return iop;
+}
+
+static void
+iomap_page_release(struct page *page)
+{
+	struct iomap_page *iop = to_iomap_page(page);
+
+	if (!iop)
+		return;
+	WARN_ON_ONCE(atomic_read(&iop->read_count));
+	WARN_ON_ONCE(atomic_read(&iop->write_count));
+	ClearPagePrivate(page);
+	set_page_private(page, 0);
+	kfree(iop);
+}
+
+/*
+ * Calculate the range inside the page that we actually need to read.
+ */
+static void
+iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
+		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
+{
+	unsigned poff = *pos & (PAGE_SIZE - 1);
+	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+
+	if (iop) {
+		unsigned block_size = i_blocksize(inode);
+		unsigned first = poff >> inode->i_blkbits;
+		unsigned last = (poff + plen - 1) >> inode->i_blkbits;
+		unsigned int i;
+
+		/* move forward for each leading block marked uptodate */
+		for (i = first; i <= last; i++) {
+			if (!test_bit(i, iop->uptodate))
+				break;
+			*pos += block_size;
+			poff += block_size;
+			plen -= block_size;
+		}
+
+		/* truncate len if we find any trailing uptodate block(s) */
+		for ( ; i <= last; i++) {
+			if (test_bit(i, iop->uptodate)) {
+				plen -= (last - i + 1) * block_size;
+				break;
+			}
+		}
+	}
+
+	*offp = poff;
+	*lenp = plen;
+}
+
+static void
+iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
+{
+	struct iomap_page *iop = to_iomap_page(page);
+	struct inode *inode = page->mapping->host;
+	unsigned first = off >> inode->i_blkbits;
+	unsigned last = (off + len - 1) >> inode->i_blkbits;
+	unsigned int i;
+	bool uptodate = true;
+
+	if (iop) {
+		for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
+			if (i >= first && i <= last)
+				set_bit(i, iop->uptodate);
+			else if (!test_bit(i, iop->uptodate))
+				uptodate = false;
+		}
+	}
+
+	if (uptodate && !PageError(page))
+		SetPageUptodate(page);
+}
+
+static void
+iomap_read_finish(struct iomap_page *iop, struct page *page)
+{
+	if (!iop || atomic_dec_and_test(&iop->read_count))
+		unlock_page(page);
+}
+
+static void
+iomap_read_page_end_io(struct bio_vec *bvec, int error)
+{
+	struct page *page = bvec->bv_page;
+	struct iomap_page *iop = to_iomap_page(page);
+
+	if (unlikely(error)) {
+		ClearPageUptodate(page);
+		SetPageError(page);
+	} else {
+		iomap_set_range_uptodate(page, bvec->bv_offset, bvec->bv_len);
+	}
+
+	iomap_read_finish(iop, page);
+}
+
 static void
 iomap_read_inline_data(struct inode *inode, struct page *page,
 		struct iomap *iomap)
@@ -132,7 +248,7 @@ iomap_read_end_io(struct bio *bio)
 	int i;
 
 	bio_for_each_segment_all(bvec, bio, i)
-		page_endio(bvec->bv_page, false, error);
+		iomap_read_page_end_io(bvec, error);
 	bio_put(bio);
 }
 
@@ -150,18 +266,19 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 {
 	struct iomap_readpage_ctx *ctx = data;
 	struct page *page = ctx->cur_page;
-	unsigned poff = pos & (PAGE_SIZE - 1);
-	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+	struct iomap_page *iop = iomap_page_create(inode, page);
 	bool is_contig = false;
+	loff_t orig_pos = pos;
+	unsigned poff, plen;
 	sector_t sector;
 
-	/* we don't support blocksize < PAGE_SIZE quite yet. */
-	WARN_ON_ONCE(pos != page_offset(page));
-	WARN_ON_ONCE(plen != PAGE_SIZE);
+	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
+	if (plen == 0)
+		goto done;
 
 	if (iomap->type != IOMAP_MAPPED || pos >= i_size_read(inode)) {
 		zero_user(page, poff, plen);
-		SetPageUptodate(page);
+		iomap_set_range_uptodate(page, poff, plen);
 		goto done;
 	}
 
@@ -177,6 +294,14 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		is_contig = true;
 	}
 
+	/*
+	 * If we start a new segment we need to increase the read count, and we
+	 * need to do so before submitting any previous full bio to make sure
+	 * that we don't prematurely unlock the page.
+	 */
+	if (iop)
+		atomic_inc(&iop->read_count);
+
 	if (!ctx->bio || !is_contig || bio_full(ctx->bio)) {
 		gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
 		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -197,7 +322,13 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 	__bio_add_page(ctx->bio, page, plen, poff);
 done:
-	return plen;
+	/*
+	 * Move the caller beyond our range so that it keeps making progress.
+	 * For that we have to include any leading non-uptodate ranges, but
+	 * we can skip trailing ones as they will be handled in the next
+	 * iteration.
+	 */
+	return pos - orig_pos + plen;
 }
 
 int
@@ -208,8 +339,6 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 	unsigned poff;
 	loff_t ret;
 
-	WARN_ON_ONCE(page_has_buffers(page));
-
 	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
 		ret = iomap_apply(inode, page_offset(page) + poff,
 				PAGE_SIZE - poff, 0, ops, &ctx,
@@ -335,6 +464,84 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
 }
 EXPORT_SYMBOL_GPL(iomap_readpages);
 
+int
+iomap_is_partially_uptodate(struct page *page, unsigned long from,
+		unsigned long count)
+{
+	struct iomap_page *iop = to_iomap_page(page);
+	struct inode *inode = page->mapping->host;
+	unsigned first = from >> inode->i_blkbits;
+	unsigned last = (from + count - 1) >> inode->i_blkbits;
+	unsigned i;
+
+	if (iop) {
+		for (i = first; i <= last; i++)
+			if (!test_bit(i, iop->uptodate))
+				return 0;
+		return 1;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
+
+int
+iomap_releasepage(struct page *page, gfp_t gfp_mask)
+{
+	/*
+	 * mm accommodates an old ext3 case where clean pages might not have had
+	 * the dirty bit cleared. Thus, it can send actual dirty pages to
+	 * ->releasepage() via shrink_active_list(), skip those here.
+	 */
+	if (PageDirty(page) || PageWriteback(page))
+		return 0;
+	iomap_page_release(page);
+	return 1;
+}
+EXPORT_SYMBOL_GPL(iomap_releasepage);
+
+void
+iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
+{
+	/*
+	 * If we are invalidating the entire page, clear the dirty state from it
+	 * and release it to avoid unnecessary buildup of the LRU.
+	 */
+	if (offset == 0 && len == PAGE_SIZE) {
+		WARN_ON_ONCE(PageWriteback(page));
+		cancel_dirty_page(page);
+		iomap_page_release(page);
+	}
+}
+EXPORT_SYMBOL_GPL(iomap_invalidatepage);
+
+#ifdef CONFIG_MIGRATION
+int
+iomap_migrate_page(struct address_space *mapping, struct page *newpage,
+		struct page *page, enum migrate_mode mode)
+{
+	int ret;
+
+	ret = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0);
+	if (ret != MIGRATEPAGE_SUCCESS)
+		return ret;
+
+	if (page_has_private(page)) {
+		ClearPagePrivate(page);
+		set_page_private(newpage, page_private(page));
+		set_page_private(page, 0);
+		SetPagePrivate(newpage);
+	}
+
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
+	return MIGRATEPAGE_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(iomap_migrate_page);
+#endif /* CONFIG_MIGRATION */
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -358,6 +565,7 @@ iomap_read_page_sync(struct inode *inode, loff_t block_start, struct page *page,
 
 	if (iomap->type != IOMAP_MAPPED || block_start >= i_size_read(inode)) {
 		zero_user_segments(page, poff, from, to, poff + plen);
+		iomap_set_range_uptodate(page, poff, plen);
 		return 0;
 	}
 
@@ -373,21 +581,33 @@ static int
 __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
 		struct page *page, struct iomap *iomap)
 {
+	struct iomap_page *iop = iomap_page_create(inode, page);
 	loff_t block_size = i_blocksize(inode);
 	loff_t block_start = pos & ~(block_size - 1);
 	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
-	unsigned poff = block_start & (PAGE_SIZE - 1);
-	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, block_end - block_start);
-	unsigned from = pos & (PAGE_SIZE - 1), to = from + len;
-
-	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
+	unsigned from = pos & (PAGE_SIZE - 1), to = from + len, poff, plen;
+	int status = 0;
 
 	if (PageUptodate(page))
 		return 0;
-	if (from <= poff && to >= poff + plen)
-		return 0;
-	return iomap_read_page_sync(inode, block_start, page,
-			poff, plen, from, to, iomap);
+
+	do {
+		iomap_adjust_read_range(inode, iop, &block_start,
+				block_end - block_start, &poff, &plen);
+		if (plen == 0)
+			break;
+
+		if ((from > poff && from < poff + plen) ||
+		    (to > poff && to < poff + plen)) {
+			status = iomap_read_page_sync(inode, block_start, page,
+					poff, plen, from, to, iomap);
+			if (status)
+				break;
+		}
+
+	} while ((block_start += plen) < block_end);
+
+	return status;
 }
 
 static int
@@ -470,7 +690,7 @@ __iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 	if (unlikely(copied < len && !PageUptodate(page))) {
 		copied = 0;
 	} else {
-		SetPageUptodate(page);
+		iomap_set_range_uptodate(page, pos & (PAGE_SIZE - 1), len);
 		iomap_set_page_dirty(page);
 	}
 	return __generic_write_end(inode, pos, copied, page);
@@ -806,7 +1026,7 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
 		block_commit_write(page, 0, length);
 	} else {
 		WARN_ON_ONCE(!PageUptodate(page));
-		WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
+		iomap_page_create(inode, page);
 	}
 
 	return length;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index c2706cfb27c7..60b196c54dd6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -2,6 +2,9 @@
 #ifndef LINUX_IOMAP_H
 #define LINUX_IOMAP_H 1
 
+#include <linux/atomic.h>
+#include <linux/bitmap.h>
+#include <linux/mm.h>
 #include <linux/types.h>
 
 struct address_space;
@@ -99,12 +102,40 @@ struct iomap_ops {
 			ssize_t written, unsigned flags, struct iomap *iomap);
 };
 
+/*
+ * Structure allocated for each page when block size < PAGE_SIZE to track
+ * sub-page uptodate status and I/O completions.
+ */
+struct iomap_page {
+	atomic_t		read_count;
+	atomic_t		write_count;
+	DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
+};
+
+static inline struct iomap_page *to_iomap_page(struct page *page)
+{
+	if (page_has_private(page))
+		return (struct iomap_page *)page_private(page);
+	return NULL;
+}
+
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
 int iomap_readpage(struct page *page, const struct iomap_ops *ops);
 int iomap_readpages(struct address_space *mapping, struct list_head *pages,
 		unsigned nr_pages, const struct iomap_ops *ops);
 int iomap_set_page_dirty(struct page *page);
+int iomap_is_partially_uptodate(struct page *page, unsigned long from,
+		unsigned long count);
+int iomap_releasepage(struct page *page, gfp_t gfp_mask);
+void iomap_invalidatepage(struct page *page, unsigned int offset,
+		unsigned int len);
+#ifdef CONFIG_MIGRATION
+int iomap_migrate_page(struct address_space *mapping, struct page *newpage,
+		struct page *page, enum migrate_mode mode);
+#else
+#define iomap_migrate_page NULL
+#endif
 int iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread
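
A worked example of the actor's return convention (a sketch, not part of
the patch; numbers assume 1k blocks on a 4k page, with block 0 already
uptodate):

	loff_t orig_pos = pos;		/* pos = 0, length = 4096 */

	/* the leading uptodate block is trimmed off the range */
	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
	/* now pos == 1024, poff == 1024, plen == 3072 */

	atomic_inc(&iop->read_count);	/* one reference per range added */
	__bio_add_page(ctx->bio, page, plen, poff);

	return pos - orig_pos + plen;	/* 4096: caller steps past the page */

Trailing uptodate blocks shrink plen instead, so the caller comes back
for them on the next iteration, exactly as the comment in the actor
describes.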

* [PATCH 24/24] xfs: add support for sub-pagesize writeback without buffer_heads
  2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
                   ` (22 preceding siblings ...)
  2018-06-15 13:02 ` [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O " Christoph Hellwig
@ 2018-06-15 13:02 ` Christoph Hellwig
  2018-06-19  6:15   ` Darrick J. Wong
  23 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-15 13:02 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Switch to using the iomap_page structure for checking sub-page uptodate
status and tracking sub-page I/O completion status, and remove large
quantities of boilerplate code working around buffer heads.
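
A minimal sketch of the accounting that replaces the per-buffer state
(condensed from the patch below; bio chaining and error handling
elided):

	/* xfs_add_to_ioend(): one write_count reference per bio segment */
	if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
		if (iop)
			atomic_inc(&iop->write_count);
		__bio_add_page(wpc->ioend->io_bio, page, len, poff);
	}

	/* xfs_finish_page_writeback(): the last completion ends writeback */
	if (!iop || atomic_dec_and_test(&iop->write_count))
		end_page_writeback(bvec->bv_page);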

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c  | 496 ++++++---------------------------------------
 fs/xfs/xfs_buf.h   |   1 -
 fs/xfs/xfs_iomap.c |   3 -
 fs/xfs/xfs_super.c |   2 +-
 fs/xfs/xfs_trace.h |  18 +-
 5 files changed, 62 insertions(+), 458 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 556aad821a17..3fc54c34f1bc 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -32,9 +32,6 @@
 #include "xfs_bmap_util.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
-#include <linux/gfp.h>
-#include <linux/mpage.h>
-#include <linux/pagevec.h>
 #include <linux/writeback.h>
 
 /*
@@ -46,25 +43,6 @@ struct xfs_writepage_ctx {
 	struct xfs_ioend	*ioend;
 };
 
-void
-xfs_count_page_state(
-	struct page		*page,
-	int			*delalloc,
-	int			*unwritten)
-{
-	struct buffer_head	*bh, *head;
-
-	*delalloc = *unwritten = 0;
-
-	bh = head = page_buffers(page);
-	do {
-		if (buffer_unwritten(bh))
-			(*unwritten) = 1;
-		else if (buffer_delay(bh))
-			(*delalloc) = 1;
-	} while ((bh = bh->b_this_page) != head);
-}
-
 struct block_device *
 xfs_find_bdev_for_inode(
 	struct inode		*inode)
@@ -97,67 +75,17 @@ xfs_finish_page_writeback(
 	struct bio_vec		*bvec,
 	int			error)
 {
+	struct iomap_page	*iop = to_iomap_page(bvec->bv_page);
+
 	if (error) {
 		SetPageError(bvec->bv_page);
 		mapping_set_error(inode->i_mapping, -EIO);
 	}
-	end_page_writeback(bvec->bv_page);
-}
 
-/*
- * We're now finished for good with this page.  Update the page state via the
- * associated buffer_heads, paying attention to the start and end offsets that
- * we need to process on the page.
- *
- * Note that we open code the action in end_buffer_async_write here so that we
- * only have to iterate over the buffers attached to the page once.  This is not
- * only more efficient, but also ensures that we only calls end_page_writeback
- * at the end of the iteration, and thus avoids the pitfall of having the page
- * and buffers potentially freed after every call to end_buffer_async_write.
- */
-static void
-xfs_finish_buffer_writeback(
-	struct inode		*inode,
-	struct bio_vec		*bvec,
-	int			error)
-{
-	struct buffer_head	*head = page_buffers(bvec->bv_page), *bh = head;
-	bool			busy = false;
-	unsigned int		off = 0;
-	unsigned long		flags;
-
-	ASSERT(bvec->bv_offset < PAGE_SIZE);
-	ASSERT((bvec->bv_offset & (i_blocksize(inode) - 1)) == 0);
-	ASSERT(bvec->bv_offset + bvec->bv_len <= PAGE_SIZE);
-	ASSERT((bvec->bv_len & (i_blocksize(inode) - 1)) == 0);
-
-	local_irq_save(flags);
-	bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
-	do {
-		if (off >= bvec->bv_offset &&
-		    off < bvec->bv_offset + bvec->bv_len) {
-			ASSERT(buffer_async_write(bh));
-			ASSERT(bh->b_end_io == NULL);
-
-			if (error) {
-				mark_buffer_write_io_error(bh);
-				clear_buffer_uptodate(bh);
-				SetPageError(bvec->bv_page);
-			} else {
-				set_buffer_uptodate(bh);
-			}
-			clear_buffer_async_write(bh);
-			unlock_buffer(bh);
-		} else if (buffer_async_write(bh)) {
-			ASSERT(buffer_locked(bh));
-			busy = true;
-		}
-		off += bh->b_size;
-	} while ((bh = bh->b_this_page) != head);
-	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
-	local_irq_restore(flags);
+	ASSERT(iop || i_blocksize(inode) == PAGE_SIZE);
+	ASSERT(!iop || atomic_read(&iop->write_count) > 0);
 
-	if (!busy)
+	if (!iop || atomic_dec_and_test(&iop->write_count))
 		end_page_writeback(bvec->bv_page);
 }
 
@@ -191,12 +119,8 @@ xfs_destroy_ioend(
 			next = bio->bi_private;
 
 		/* walk each page on bio, ending page IO on them */
-		bio_for_each_segment_all(bvec, bio, i) {
-			if (page_has_buffers(bvec->bv_page))
-				xfs_finish_buffer_writeback(inode, bvec, error);
-			else
-				xfs_finish_page_writeback(inode, bvec, error);
-		}
+		bio_for_each_segment_all(bvec, bio, i)
+			xfs_finish_page_writeback(inode, bvec, error);
 		bio_put(bio);
 	}
 
@@ -633,6 +557,7 @@ xfs_add_to_ioend(
 	struct inode		*inode,
 	xfs_off_t		offset,
 	struct page		*page,
+	struct iomap_page	*iop,
 	struct xfs_writepage_ctx *wpc,
 	struct writeback_control *wbc,
 	struct list_head	*iolist)
@@ -656,100 +581,37 @@ xfs_add_to_ioend(
 				bdev, sector);
 	}
 
-	/*
-	 * If the block doesn't fit into the bio we need to allocate a new
-	 * one.  This shouldn't happen more than once for a given block.
-	 */
-	while (bio_add_page(wpc->ioend->io_bio, page, len, poff) != len)
-		xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
+	if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
+		if (iop)
+			atomic_inc(&iop->write_count);
+		if (bio_full(wpc->ioend->io_bio))
+			xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
+		__bio_add_page(wpc->ioend->io_bio, page, len, poff);
+	}
 
 	wpc->ioend->io_size += len;
 }
 
-STATIC void
-xfs_map_buffer(
-	struct inode		*inode,
-	struct buffer_head	*bh,
-	struct xfs_bmbt_irec	*imap,
-	xfs_off_t		offset)
-{
-	sector_t		bn;
-	struct xfs_mount	*m = XFS_I(inode)->i_mount;
-	xfs_off_t		iomap_offset = XFS_FSB_TO_B(m, imap->br_startoff);
-	xfs_daddr_t		iomap_bn = xfs_fsb_to_db(XFS_I(inode), imap->br_startblock);
-
-	ASSERT(imap->br_startblock != HOLESTARTBLOCK);
-	ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
-
-	bn = (iomap_bn >> (inode->i_blkbits - BBSHIFT)) +
-	      ((offset - iomap_offset) >> inode->i_blkbits);
-
-	ASSERT(bn || XFS_IS_REALTIME_INODE(XFS_I(inode)));
-
-	bh->b_blocknr = bn;
-	set_buffer_mapped(bh);
-}
-
-STATIC void
-xfs_map_at_offset(
-	struct inode		*inode,
-	struct buffer_head	*bh,
-	struct xfs_bmbt_irec	*imap,
-	xfs_off_t		offset)
-{
-	ASSERT(imap->br_startblock != HOLESTARTBLOCK);
-	ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
-
-	lock_buffer(bh);
-	xfs_map_buffer(inode, bh, imap, offset);
-	set_buffer_mapped(bh);
-	clear_buffer_delay(bh);
-	clear_buffer_unwritten(bh);
-
-	/*
-	 * If this is a realtime file, data may be on a different device
-	 * to that pointed to from the buffer_head b_bdev currently. We can't
-	 * trust that the bufferhead has already been mapped correctly, so
-	 * set the bdev now.
-	 */
-	bh->b_bdev = xfs_find_bdev_for_inode(inode);
-	bh->b_end_io = NULL;
-	set_buffer_async_write(bh);
-	set_buffer_uptodate(bh);
-	clear_buffer_dirty(bh);
-}
-
 STATIC void
 xfs_vm_invalidatepage(
 	struct page		*page,
 	unsigned int		offset,
 	unsigned int		length)
 {
-	trace_xfs_invalidatepage(page->mapping->host, page, offset,
-				 length);
-
-	/*
-	 * If we are invalidating the entire page, clear the dirty state from it
-	 * so that we can check for attempts to release dirty cached pages in
-	 * xfs_vm_releasepage().
-	 */
-	if (offset == 0 && length >= PAGE_SIZE)
-		cancel_dirty_page(page);
-	block_invalidatepage(page, offset, length);
+	trace_xfs_invalidatepage(page->mapping->host, page, offset, length);
+	iomap_invalidatepage(page, offset, length);
 }
 
 /*
- * If the page has delalloc buffers on it, we need to punch them out before we
- * invalidate the page. If we don't, we leave a stale delalloc mapping on the
- * inode that can trip a BUG() in xfs_get_blocks() later on if a direct IO read
- * is done on that same region - the delalloc extent is returned when none is
- * supposed to be there.
+ * If the page has delalloc blocks on it, we need to punch them out before we
+ * invalidate the page.  If we don't, we leave a stale delalloc mapping on the
+ * inode that can trip up a later direct I/O read operation on the same region.
  *
- * We prevent this by truncating away the delalloc regions on the page before
- * invalidating it. Because they are delalloc, we can do this without needing a
- * transaction. Indeed - if we get ENOSPC errors, we have to be able to do this
- * truncation without a transaction as there is no space left for block
- * reservation (typically why we see a ENOSPC in writeback).
+ * We prevent this by truncating away the delalloc regions on the page.  Because
+ * they are delalloc, we can do this without needing a transaction. Indeed - if
+ * we get ENOSPC errors, we have to be able to do this truncation without a
+ * transaction as there is no space left for block reservation (typically why we
+ * see an ENOSPC in writeback).
  */
 STATIC void
 xfs_aops_discard_page(
@@ -781,7 +643,7 @@ xfs_aops_discard_page(
  * We implement an immediate ioend submission policy here to avoid needing to
  * chain multiple ioends and hence nest mempool allocations which can violate
  * forward progress guarantees we need to provide. The current ioend we are
- * adding buffers to is cached on the writepage context, and if the new buffer
+ * adding blocks to is cached on the writepage context, and if the new block
  * does not append to the cached ioend it will create a new ioend and cache that
  * instead.
  *
@@ -802,79 +664,53 @@ xfs_writepage_map(
 	uint64_t		end_offset)
 {
 	LIST_HEAD(submit_list);
+	struct iomap_page	*iop = to_iomap_page(page);
+	unsigned		len = i_blocksize(inode);
 	struct xfs_ioend	*ioend, *next;
-	struct buffer_head	*bh = NULL;
-	ssize_t			len = i_blocksize(inode);
 	uint64_t		file_offset;	/* file offset of page */
-	unsigned		poffset;	/* offset into page */
-	int			error = 0;
-	int			count = 0;
+	int			error = 0, count = 0, i;
 
-	if (page_has_buffers(page))
-		bh = page_buffers(page);
+	ASSERT(iop || i_blocksize(inode) == PAGE_SIZE);
+	ASSERT(!iop || atomic_read(&iop->write_count) == 0);
 
 	/*
-	 * Walk the blocks on the page, and if we run off the end of the
-	 * current map or find the current map invalid, grab a new one.
-	 * We only use bufferheads here to check per-block state - they no
-	 * longer control the iteration through the page. This allows us to
-	 * replace the bufferhead with some other state tracking mechanism in
-	 * future.
+	 * Walk through the page to find areas to write back. If we run off the
+	 * end of the current map or find the current map invalid, grab a new
+	 * one.
 	 */
-	for (poffset = 0, file_offset = page_offset(page);
-	     poffset < PAGE_SIZE;
-	     poffset += len, file_offset += len) {
-		/* past the range we are writing, so nothing more to write. */
-		if (file_offset >= end_offset)
-			break;
-
-		if (bh && !buffer_uptodate(bh)) {
-			if (PageUptodate(page))
-				ASSERT(buffer_mapped(bh));
-			bh = bh->b_this_page;
+	for (i = 0, file_offset = page_offset(page);
+	     i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
+	     i++, file_offset += len) {
+		if (iop && !test_bit(i, iop->uptodate))
 			continue;
-		}
 
 		error = xfs_map_blocks(wpc, inode, file_offset);
 		if (error)
-			goto out;
-
-		if (wpc->io_type == XFS_IO_HOLE) {
-			if (bh)
-				bh = bh->b_this_page;
+			break;
+		if (wpc->io_type == XFS_IO_HOLE)
 			continue;
-		}
-
-		if (bh) {
-			xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
-			bh = bh->b_this_page;
-		}
-		xfs_add_to_ioend(inode, file_offset, page, wpc, wbc,
-				&submit_list);
+		xfs_add_to_ioend(inode, file_offset, page, iop, wpc, wbc,
+				 &submit_list);
 		count++;
 	}
 
 	ASSERT(wpc->ioend || list_empty(&submit_list));
-out:
 	ASSERT(PageLocked(page));
 	ASSERT(!PageWriteback(page));
 
 	/*
-	 * On error, we have to fail the ioend here because we have locked
-	 * buffers in the ioend. If we don't do this, we'll deadlock
-	 * invalidating the page as that tries to lock the buffers on the page.
-	 * Also, because we may have set pages under writeback, we have to make
-	 * sure we run IO completion to mark the error state of the IO
-	 * appropriately, so we can't cancel the ioend directly here. That means
-	 * we have to mark this page as under writeback if we included any
-	 * buffers from it in the ioend chain so that completion treats it
-	 * correctly.
+	 * On error, we have to fail the ioend here because we may have set
+	 * pages under writeback; we have to make sure we run IO completion to
+	 * mark the error state of the IO appropriately, so we can't cancel the
+	 * ioend directly here.  That means we have to mark this page as under
+	 * writeback if we included any blocks from it in the ioend chain so
+	 * that completion treats it correctly.
 	 *
 	 * If we didn't include the page in the ioend, then on error we can
 	 * simply discard and unlock it as there are no other users of the page
-	 * or it's buffers right now. The caller will still need to trigger
-	 * submission of outstanding ioends on the writepage context so they are
-	 * treated correctly on error.
+	 * now.  The caller will still need to trigger submission of outstanding
+	 * ioends on the writepage context so they are treated correctly on
+	 * error.
 	 */
 	if (unlikely(error)) {
 		if (!count) {
@@ -915,8 +751,8 @@ xfs_writepage_map(
 	}
 
 	/*
-	 * We can end up here with no error and nothing to write if we race with
-	 * a partial page truncate on a sub-page block sized filesystem.
+	 * We can end up here with no error and nothing to write only if we race
+	 * with a partial page truncate on a sub-page block sized filesystem.
 	 */
 	if (!count)
 		end_page_writeback(page);
@@ -931,7 +767,6 @@ xfs_writepage_map(
  * For delalloc space on the page we need to allocate space and flush it.
  * For unwritten space on the page we need to start the conversion to
  * regular allocated space.
- * For any other dirty buffer heads on the page we should flush them.
  */
 STATIC int
 xfs_do_writepage(
@@ -1085,166 +920,13 @@ xfs_dax_writepages(
 			xfs_find_bdev_for_inode(mapping->host), wbc);
 }
 
-/*
- * Called to move a page into cleanable state - and from there
- * to be released. The page should already be clean. We always
- * have buffer heads in this call.
- *
- * Returns 1 if the page is ok to release, 0 otherwise.
- */
 STATIC int
 xfs_vm_releasepage(
 	struct page		*page,
 	gfp_t			gfp_mask)
 {
-	int			delalloc, unwritten;
-
 	trace_xfs_releasepage(page->mapping->host, page, 0, 0);
-
-	/*
-	 * mm accommodates an old ext3 case where clean pages might not have had
-	 * the dirty bit cleared. Thus, it can send actual dirty pages to
-	 * ->releasepage() via shrink_active_list(). Conversely,
-	 * block_invalidatepage() can send pages that are still marked dirty but
-	 * otherwise have invalidated buffers.
-	 *
-	 * We want to release the latter to avoid unnecessary buildup of the
-	 * LRU, so xfs_vm_invalidatepage() clears the page dirty flag on pages
-	 * that are entirely invalidated and need to be released.  Hence the
-	 * only time we should get dirty pages here is through
-	 * shrink_active_list() and so we can simply skip those now.
-	 *
-	 * warn if we've left any lingering delalloc/unwritten buffers on clean
-	 * or invalidated pages we are about to release.
-	 */
-	if (PageDirty(page))
-		return 0;
-
-	xfs_count_page_state(page, &delalloc, &unwritten);
-
-	if (WARN_ON_ONCE(delalloc))
-		return 0;
-	if (WARN_ON_ONCE(unwritten))
-		return 0;
-
-	return try_to_free_buffers(page);
-}
-
-/*
- * If this is O_DIRECT or the mpage code calling tell them how large the mapping
- * is, so that we can avoid repeated get_blocks calls.
- *
- * If the mapping spans EOF, then we have to break the mapping up as the mapping
- * for blocks beyond EOF must be marked new so that sub block regions can be
- * correctly zeroed. We can't do this for mappings within EOF unless the mapping
- * was just allocated or is unwritten, otherwise the callers would overwrite
- * existing data with zeros. Hence we have to split the mapping into a range up
- * to and including EOF, and a second mapping for beyond EOF.
- */
-static void
-xfs_map_trim_size(
-	struct inode		*inode,
-	sector_t		iblock,
-	struct buffer_head	*bh_result,
-	struct xfs_bmbt_irec	*imap,
-	xfs_off_t		offset,
-	ssize_t			size)
-{
-	xfs_off_t		mapping_size;
-
-	mapping_size = imap->br_startoff + imap->br_blockcount - iblock;
-	mapping_size <<= inode->i_blkbits;
-
-	ASSERT(mapping_size > 0);
-	if (mapping_size > size)
-		mapping_size = size;
-	if (offset < i_size_read(inode) &&
-	    (xfs_ufsize_t)offset + mapping_size >= i_size_read(inode)) {
-		/* limit mapping to block that spans EOF */
-		mapping_size = roundup_64(i_size_read(inode) - offset,
-					  i_blocksize(inode));
-	}
-	if (mapping_size > LONG_MAX)
-		mapping_size = LONG_MAX;
-
-	bh_result->b_size = mapping_size;
-}
-
-static int
-xfs_get_blocks(
-	struct inode		*inode,
-	sector_t		iblock,
-	struct buffer_head	*bh_result,
-	int			create)
-{
-	struct xfs_inode	*ip = XFS_I(inode);
-	struct xfs_mount	*mp = ip->i_mount;
-	xfs_fileoff_t		offset_fsb, end_fsb;
-	int			error = 0;
-	int			lockmode = 0;
-	struct xfs_bmbt_irec	imap;
-	int			nimaps = 1;
-	xfs_off_t		offset;
-	ssize_t			size;
-
-	BUG_ON(create);
-
-	if (XFS_FORCED_SHUTDOWN(mp))
-		return -EIO;
-
-	offset = (xfs_off_t)iblock << inode->i_blkbits;
-	ASSERT(bh_result->b_size >= i_blocksize(inode));
-	size = bh_result->b_size;
-
-	if (offset >= i_size_read(inode))
-		return 0;
-
-	/*
-	 * Direct I/O is usually done on preallocated files, so try getting
-	 * a block mapping without an exclusive lock first.
-	 */
-	lockmode = xfs_ilock_data_map_shared(ip);
-
-	ASSERT(offset <= mp->m_super->s_maxbytes);
-	if (offset > mp->m_super->s_maxbytes - size)
-		size = mp->m_super->s_maxbytes - offset;
-	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + size);
-	offset_fsb = XFS_B_TO_FSBT(mp, offset);
-
-	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
-			&nimaps, 0);
-	if (error)
-		goto out_unlock;
-	if (!nimaps) {
-		trace_xfs_get_blocks_notfound(ip, offset, size);
-		goto out_unlock;
-	}
-
-	trace_xfs_get_blocks_found(ip, offset, size,
-		imap.br_state == XFS_EXT_UNWRITTEN ?
-			XFS_IO_UNWRITTEN : XFS_IO_OVERWRITE, &imap);
-	xfs_iunlock(ip, lockmode);
-
-	/* trim mapping down to size requested */
-	xfs_map_trim_size(inode, iblock, bh_result, &imap, offset, size);
-
-	/*
-	 * For unwritten extents do not report a disk address in the buffered
-	 * read case (treat as if we're reading into a hole).
-	 */
-	if (xfs_bmap_is_real_extent(&imap))
-		xfs_map_buffer(inode, bh_result, &imap, offset);
-
-	/*
-	 * If this is a realtime file, data may be on a different device
-	 * to that pointed to from the buffer_head b_bdev currently.
-	 */
-	bh_result->b_bdev = xfs_find_bdev_for_inode(inode);
-	return 0;
-
-out_unlock:
-	xfs_iunlock(ip, lockmode);
-	return error;
+	return iomap_releasepage(page, gfp_mask);
 }
 
 STATIC sector_t
@@ -1276,9 +958,7 @@ xfs_vm_readpage(
 	struct page		*page)
 {
 	trace_xfs_vm_readpage(page->mapping->host, 1);
-	if (i_blocksize(page->mapping->host) == PAGE_SIZE)
-		return iomap_readpage(page, &xfs_iomap_ops);
-	return mpage_readpage(page, xfs_get_blocks);
+	return iomap_readpage(page, &xfs_iomap_ops);
 }
 
 STATIC int
@@ -1289,65 +969,7 @@ xfs_vm_readpages(
 	unsigned		nr_pages)
 {
 	trace_xfs_vm_readpages(mapping->host, nr_pages);
-	if (i_blocksize(mapping->host) == PAGE_SIZE)
-		return iomap_readpages(mapping, pages, nr_pages, &xfs_iomap_ops);
-	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
-}
-
-/*
- * This is basically a copy of __set_page_dirty_buffers() with one
- * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
- * dirty, we'll never be able to clean them because we don't write buffers
- * beyond EOF, and that means we can't invalidate pages that span EOF
- * that have been marked dirty. Further, the dirty state can leak into
- * the file interior if the file is extended, resulting in all sorts of
- * bad things happening as the state does not match the underlying data.
- *
- * XXX: this really indicates that bufferheads in XFS need to die. Warts like
- * this only exist because of bufferheads and how the generic code manages them.
- */
-STATIC int
-xfs_vm_set_page_dirty(
-	struct page		*page)
-{
-	struct address_space	*mapping = page->mapping;
-	struct inode		*inode = mapping->host;
-	loff_t			end_offset;
-	loff_t			offset;
-	int			newly_dirty;
-
-	if (unlikely(!mapping))
-		return !TestSetPageDirty(page);
-
-	end_offset = i_size_read(inode);
-	offset = page_offset(page);
-
-	spin_lock(&mapping->private_lock);
-	if (page_has_buffers(page)) {
-		struct buffer_head *head = page_buffers(page);
-		struct buffer_head *bh = head;
-
-		do {
-			if (offset < end_offset)
-				set_buffer_dirty(bh);
-			bh = bh->b_this_page;
-			offset += i_blocksize(inode);
-		} while (bh != head);
-	}
-	/*
-	 * Lock out page->mem_cgroup migration to keep PageDirty
-	 * synchronized with per-memcg dirty page counters.
-	 */
-	lock_page_memcg(page);
-	newly_dirty = !TestSetPageDirty(page);
-	spin_unlock(&mapping->private_lock);
-
-	if (newly_dirty)
-		__set_page_dirty(page, mapping, 1);
-	unlock_page_memcg(page);
-	if (newly_dirty)
-		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-	return newly_dirty;
+	return iomap_readpages(mapping, pages, nr_pages, &xfs_iomap_ops);
 }
 
 static int
@@ -1365,13 +987,13 @@ const struct address_space_operations xfs_address_space_operations = {
 	.readpages		= xfs_vm_readpages,
 	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
-	.set_page_dirty		= xfs_vm_set_page_dirty,
+	.set_page_dirty		= iomap_set_page_dirty,
 	.releasepage		= xfs_vm_releasepage,
 	.invalidatepage		= xfs_vm_invalidatepage,
 	.bmap			= xfs_vm_bmap,
 	.direct_IO		= noop_direct_IO,
-	.migratepage		= buffer_migrate_page,
-	.is_partially_uptodate  = block_is_partially_uptodate,
+	.migratepage		= iomap_migrate_page,
+	.is_partially_uptodate  = iomap_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.swap_activate		= xfs_iomap_swapfile_activate,
 };
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index f5f2b71c2fde..f3fa197bd272 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -24,7 +24,6 @@
 #include <linux/mm.h>
 #include <linux/fs.h>
 #include <linux/dax.h>
-#include <linux/buffer_head.h>
 #include <linux/uio.h>
 #include <linux/list_lru.h>
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index a29c27e86f60..bcd3064bbf9c 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1033,9 +1033,6 @@ xfs_file_iomap_begin(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	if (i_blocksize(inode) < PAGE_SIZE)
-		iomap->flags |= IOMAP_F_BUFFER_HEAD;
-
 	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
 			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
 		/* Reserve delalloc blocks for regular writeback. */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ed67389f4948..cd077c05e09c 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1872,7 +1872,7 @@ MODULE_ALIAS_FS("xfs");
 STATIC int __init
 xfs_init_zones(void)
 {
-	if (bioset_init(&xfs_ioend_bioset, 4 * MAX_BUF_PER_PAGE,
+	if (bioset_init(&xfs_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
 			offsetof(struct xfs_ioend, io_inline_bio),
 			BIOSET_NEED_BVECS))
 		goto out;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 666844532030..41aaaf432ff4 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1165,33 +1165,23 @@ DECLARE_EVENT_CLASS(xfs_page_class,
 		__field(loff_t, size)
 		__field(unsigned long, offset)
 		__field(unsigned int, length)
-		__field(int, delalloc)
-		__field(int, unwritten)
 	),
 	TP_fast_assign(
-		int delalloc = -1, unwritten = -1;
-
-		if (page_has_buffers(page))
-			xfs_count_page_state(page, &delalloc, &unwritten);
 		__entry->dev = inode->i_sb->s_dev;
 		__entry->ino = XFS_I(inode)->i_ino;
 		__entry->pgoff = page_offset(page);
 		__entry->size = i_size_read(inode);
 		__entry->offset = off;
 		__entry->length = len;
-		__entry->delalloc = delalloc;
-		__entry->unwritten = unwritten;
 	),
 	TP_printk("dev %d:%d ino 0x%llx pgoff 0x%lx size 0x%llx offset %lx "
-		  "length %x delalloc %d unwritten %d",
+		  "length %x",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->pgoff,
 		  __entry->size,
 		  __entry->offset,
-		  __entry->length,
-		  __entry->delalloc,
-		  __entry->unwritten)
+		  __entry->length)
 )
 
 #define DEFINE_PAGE_EVENT(name)		\
@@ -1275,9 +1265,6 @@ DEFINE_EVENT(xfs_imap_class, name,	\
 	TP_ARGS(ip, offset, count, type, irec))
 DEFINE_IOMAP_EVENT(xfs_map_blocks_found);
 DEFINE_IOMAP_EVENT(xfs_map_blocks_alloc);
-DEFINE_IOMAP_EVENT(xfs_get_blocks_found);
-DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
-DEFINE_IOMAP_EVENT(xfs_get_blocks_map_direct);
 DEFINE_IOMAP_EVENT(xfs_iomap_alloc);
 DEFINE_IOMAP_EVENT(xfs_iomap_found);
 
@@ -1316,7 +1303,6 @@ DEFINE_EVENT(xfs_simple_io_class, name,	\
 	TP_ARGS(ip, offset, count))
 DEFINE_SIMPLE_IO_EVENT(xfs_delalloc_enospc);
 DEFINE_SIMPLE_IO_EVENT(xfs_unwritten_convert);
-DEFINE_SIMPLE_IO_EVENT(xfs_get_blocks_notfound);
 DEFINE_SIMPLE_IO_EVENT(xfs_setfilesize);
 DEFINE_SIMPLE_IO_EVENT(xfs_zero_eof);
 DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread
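
Restated without the diff markers, the per-block walk that replaces the
old bufferhead loop in xfs_writepage_map() above reduces to:

	for (i = 0, file_offset = page_offset(page);
	     i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
	     i++, file_offset += len) {
		if (iop && !test_bit(i, iop->uptodate))
			continue;	/* block never written into: skip */
		error = xfs_map_blocks(wpc, inode, file_offset);
		if (error)
			break;
		if (wpc->io_type == XFS_IO_HOLE)
			continue;	/* nothing allocated here: skip */
		xfs_add_to_ioend(inode, file_offset, page, iop, wpc, wbc,
				 &submit_list);
		count++;
	}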

* Re: [PATCH 11/24] xfs: remove xfs_map_cow
  2018-06-15 13:01 ` [PATCH 11/24] xfs: remove xfs_map_cow Christoph Hellwig
@ 2018-06-18 17:38   ` Brian Foster
  2018-06-19  5:35     ` Darrick J. Wong
  0 siblings, 1 reply; 54+ messages in thread
From: Brian Foster @ 2018-06-18 17:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Fri, Jun 15, 2018 at 03:01:56PM +0200, Christoph Hellwig wrote:
> We can handle the existing cow mapping case as a special case directly
> in xfs_writepage_map, and share code for allocating delalloc blocks
> with regular I/O in xfs_map_blocks.  This means we need to always
> call xfs_map_blocks for reflink inodes, but we can still skip most of
> the work if it turns out that there is no COW mapping overlapping the
> current block.
> 
> As a subtle detail we need to start caching holes in the wpc to deal
> with the case of COW reservations beyond EOF.  But we'll need that
> infrastructure later anyway, so this is no big deal.
> 
> Based on a patch from Dave Chinner.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_aops.c | 186 ++++++++++++++++++++++------------------------
>  fs/xfs/xfs_aops.h |   4 +-
>  2 files changed, 92 insertions(+), 98 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index c8fc6786b06c..ff27bdcc49a1 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
...
> @@ -887,22 +877,24 @@ xfs_writepage_map(
>  		if (wpc->imap_valid)
>  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
>  							 offset);
> -		if (!wpc->imap_valid) {
> -			error = xfs_map_blocks(inode, offset, &wpc->imap,
> -					     wpc->io_type);

A comment would be nice here just so the purpose of the logic is clear
at a glance. E.g.:

                /*
                 * COW fork blocks can overlap data fork blocks even if the
                 * blocks aren't shared. COW I/O always takes precedence, so we
                 * must always check for overlap on reflink inodes unless the
                 * mapping is already COW.
                 */
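
(For reference, the condition that comment would document, taken from
the hunk quoted below, condenses to:

	if (!wpc->imap_valid ||
	    (xfs_is_reflink_inode(XFS_I(inode)) &&
	     wpc->io_type != XFS_IO_COW))
		error = xfs_map_blocks(wpc, inode, offset);

i.e. a valid cached map is only trusted as-is on non-reflink inodes or
when it is already a COW mapping.)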

With something like that, this now looks pretty good to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

Note: I see that this code ends up shifted into xfs_map_blocks(). If
it's easier to add the comment in whatever patch touches that last,
that's fine too.

Brian

> +		if (!wpc->imap_valid ||
> +		    (xfs_is_reflink_inode(XFS_I(inode)) &&
> +		     wpc->io_type != XFS_IO_COW)) {
> +			error = xfs_map_blocks(wpc, inode, offset);
>  			if (error)
>  				goto out;
>  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
>  							 offset);
>  		}
> -		if (wpc->imap_valid) {
> -			lock_buffer(bh);
> -			if (wpc->io_type != XFS_IO_OVERWRITE)
> -				xfs_map_at_offset(inode, bh, &wpc->imap, offset);
> -			xfs_add_to_ioend(inode, bh, offset, wpc, wbc, &submit_list);
> -			count++;
> -		}
>  
> +		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE)
> +			continue;
> +
> +		lock_buffer(bh);
> +		if (wpc->io_type != XFS_IO_OVERWRITE)
> +			xfs_map_at_offset(inode, bh, &wpc->imap, offset);
> +		xfs_add_to_ioend(inode, bh, offset, wpc, wbc, &submit_list);
> +		count++;
>  	} while (offset += len, ((bh = bh->b_this_page) != head));
>  
>  	ASSERT(wpc->ioend || list_empty(&submit_list));
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index 694c85b03813..466d22e5bc36 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -29,6 +29,7 @@ enum {
>  	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
>  	XFS_IO_OVERWRITE,	/* covers already allocated extent */
>  	XFS_IO_COW,		/* covers copy-on-write extent */
> +	XFS_IO_HOLE,		/* covers region without any block allocation */
>  };
>  
>  #define XFS_IO_TYPES \
> @@ -36,7 +37,8 @@ enum {
>  	{ XFS_IO_DELALLOC,		"delalloc" }, \
>  	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
>  	{ XFS_IO_OVERWRITE,		"overwrite" }, \
> -	{ XFS_IO_COW,			"CoW" }
> +	{ XFS_IO_COW,			"CoW" }, \
> +	{ XFS_IO_HOLE,			"hole" }
>  
>  /*
>   * Structure for buffered I/O completions.
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 13/24] xfs: make xfs_writepage_map extent map centric
  2018-06-15 13:01 ` [PATCH 13/24] xfs: make xfs_writepage_map extent map centric Christoph Hellwig
@ 2018-06-18 17:38   ` Brian Foster
  2018-06-19  5:43   ` Darrick J. Wong
  1 sibling, 0 replies; 54+ messages in thread
From: Brian Foster @ 2018-06-18 17:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, Dave Chinner

On Fri, Jun 15, 2018 at 03:01:58PM +0200, Christoph Hellwig wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_writepage_map() iterates over the bufferheads on a page to decide
> what sort of IO to do and what actions to take.  However, when it comes
> to reflink and deciding when it needs to execute a COW operation, we no
> longer look at the bufferhead state but instead ignore it and look
> up internal state held in the COW fork extent list.
> 
> This means xfs_writepage_map() is somewhat confused. It does stuff, then
> ignores it, then tries to handle the impedance mismatch by shovelling the
> results inside the existing mapping code.  It works, but it's a bit of a
> mess and it makes it hard to fix the cached map bug that the writepage
> code currently has.
> 
> To unify the two different mechanisms, we first have to choose a direction.
> That's already been set - we're de-emphasising bufferheads so they are no
> longer a control structure as we need to do that to allow for eventual
> removal.  Hence we need to move away from looking at bufferhead state to
> determine what operations we need to perform.
> 
> We can't completely get rid of bufferheads yet - they do contain some
> state that is absolutely necessary, such as whether that part of the page
> contains valid data or not (buffer_uptodate()).  Other state in the
> bufferhead is redundant:
> 
> 	BH_dirty - the page is dirty, so we can ignore this and just
> 		write it
> 	BH_delay - we have delalloc extent info in the DATA fork extent
> 		tree
> 	BH_unwritten - same as BH_delay
> 	BH_mapped - indicates we've already used it once for IO and it is
> 		mapped to a disk address. Needs to be ignored for COW
> 		blocks.
> 
> The BH_mapped flag is an interesting case - it's supposed to indicate that
> it's already mapped to disk and so we can just use it "as is".  In theory,
> we don't even have to do an extent lookup to find where to write it too,
> but we have to do that anyway to determine we are actually writing over a
> valid extent.  Hence it's not even serving the purpose of avoiding a an
> extent lookup during writeback, and so we can pretty much ignore it.
> Especially as we have to ignore it for COW operations...
> 
> Therefore, use the extent map as the source of information to tell us
> what actions we need to take and what sort of IO we should perform.  The
> first step is to have xfs_map_blocks() set the io type according to what
> it looks up.  This means it can easily handle both normal overwrite and
> COW cases.  The only other thing we need to add is the ability to return
> hole mappings.
> 
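Condensed from the hunk below, the io type selection in xfs_map_blocks()
boils down to the following (the COW case is classified earlier in the
function, outside this hunk):

	if (imap.br_startblock == HOLESTARTBLOCK) {
		wpc->io_type = XFS_IO_HOLE;
	} else if (isnullstartblock(imap.br_startblock)) {
		wpc->io_type = XFS_IO_DELALLOC;	/* goes on to allocate */
	} else if (imap.br_state == XFS_EXT_UNWRITTEN) {
		wpc->io_type = XFS_IO_UNWRITTEN;
	} else {
		wpc->io_type = XFS_IO_OVERWRITE;
	}
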
> We need to return and cache hole mappings now for the case of multiple
> blocks per page.  We no longer use the BH_mapped to indicate a block over
> a hole, so we have to get that info from xfs_map_blocks().  We cache it so
> that holes that span two pages don't need separate lookups.  This allows us
> to avoid ever doing write IO over a hole, too.
> 
> Now that we have xfs_map_blocks() returning both a cached map and the type
> of IO we need to perform, we can rewrite xfs_writepage_map() to drop all
> the bufferhead control. It's also much simplified because it doesn't need
> to explicitly handle COW operations.  Instead of iterating bufferheads, it
> iterates blocks within the page and then looks up what per-block state is
> required from the appropriate bufferhead.  It then validates the cached
> map, and if it's not valid, we get a new map.  If we don't get a valid map
> or it's over a hole, we skip the block.
> 
> At this point, we have to remap the bufferhead via xfs_map_at_offset().
> As previously noted, we had to do this even if the buffer was already
> mapped as the mapping would be stale for XFS_IO_DELALLOC, XFS_IO_UNWRITTEN
> and XFS_IO_COW IO types.  With xfs_map_blocks() now controlling the type,
> even XFS_IO_OVERWRITE types need remapping, as converted-but-not-yet-
> written delalloc extents beyond EOF can be reported as XFS_IO_OVERWRITE.
> Bufferheads that span such regions still need their BH_Delay flags cleared
> and their block numbers calculated, so we now unconditionally map each
> bufferhead before submission.
> 
> But wait! There's more - remember the old "treat unwritten extents as
> holes on read" hack?  Yeah, that means we can have a dirty page with
> unmapped, unwritten bufferheads that contain data!  What makes these so
> special is that the unwritten "hole" bufferheads do not have a valid block
> device pointer, so if we attempt to write them xfs_add_to_ioend() blows
> up. So we make xfs_map_at_offset() do the "realtime or data device"
> lookup from the inode and ignore what was or wasn't put into the
> bufferhead when the buffer was instantiated.
> 
> The astute reader will have realised by now that this code treats
> unwritten extents in multiple-blocks-per-page situations differently.
> If we get any combination of unwritten blocks on a dirty page that contain
> valid data in the page, we're going to convert them to real extents.  This
> can actually be a win, because it means that pages with interleaving
> unwritten and written blocks will get converted to a single written extent
> with zeros replacing the interspersed unwritten blocks.  This is actually
> good for reducing extent list and conversion overhead, and it means we
> issue a contiguous IO instead of lots of little ones.  The downside is
> that we use up a little extra IO bandwidth.  Neither of these seems like a
> bad thing given that spinning disks are seek sensitive, and SSDs/pmem have
> bandwidth to burn and the lower IO latency/CPU overhead of fewer, larger
> IOs will result in better performance on them...
> 
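Concretely: a dirty 4k page with 1k blocks laid out as
written/unwritten/written/unwritten used to generate several small IOs;
with this change the page goes out as one contiguous 4k write, the
interspersed unwritten blocks are converted to written, and their zeroed
in-memory contents hit the disk alongside the real data.
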
> As a result of all this, the only state we actually care about from the
> bufferhead is a single flag - BH_Uptodate. We still use the bufferhead to
> pass some information to the bio via xfs_add_to_ioend(), but that is
> trivial to separate and pass explicitly.  This means we really only need
> 1 bit of state per block per page from the buffered write path in the
> writeback path.  Everything else we do with the bufferhead is purely to
> make the buffered IO front end continue to work correctly. i.e we've
> pretty much marginalised bufferheads in the writeback path completely.
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> [hch: forward port + slight refactoring]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_aops.c | 89 ++++++++++++++++++++---------------------------
>  1 file changed, 37 insertions(+), 52 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 8c1a28f39197..165891ecb4ba 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -454,19 +454,19 @@ xfs_map_blocks(
>  	} else if (imap.br_startblock == HOLESTARTBLOCK) {
>  		/* landed in a hole */
>  		wpc->io_type = XFS_IO_HOLE;
> -	}
> -
> -	if (wpc->io_type == XFS_IO_DELALLOC &&
> -	    (!nimaps || isnullstartblock(imap.br_startblock)))
> -		goto allocate_blocks;
> +	} else {
> +		if (isnullstartblock(imap.br_startblock)) {
> +			/* got a delalloc extent */
> +			wpc->io_type = XFS_IO_DELALLOC;
> +			goto allocate_blocks;
> +		}
>  
> -#ifdef DEBUG
> -	if (wpc->io_type == XFS_IO_UNWRITTEN) {
> -		ASSERT(nimaps);
> -		ASSERT(imap.br_startblock != HOLESTARTBLOCK);
> -		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> +		if (imap.br_state == XFS_EXT_UNWRITTEN)
> +			wpc->io_type = XFS_IO_UNWRITTEN;
> +		else
> +			wpc->io_type = XFS_IO_OVERWRITE;
>  	}
> -#endif
> +
>  	wpc->imap = imap;
>  	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
>  	return 0;
> @@ -736,6 +736,14 @@ xfs_map_at_offset(
>  	set_buffer_mapped(bh);
>  	clear_buffer_delay(bh);
>  	clear_buffer_unwritten(bh);
> +
> +	/*
> +	 * If this is a realtime file, data may be on a different device
> +	 * to that pointed to from the buffer_head b_bdev currently. We can't
> +	 * trust that the bufferhead has already been mapped correctly, so
> +	 * set the bdev now.
> +	 */
> +	bh->b_bdev = xfs_find_bdev_for_inode(inode);
>  }
>  
>  STATIC void
> @@ -822,58 +830,36 @@ xfs_writepage_map(
>  {
>  	LIST_HEAD(submit_list);
>  	struct xfs_ioend	*ioend, *next;
> -	struct buffer_head	*bh, *head;
> +	struct buffer_head	*bh;
>  	ssize_t			len = i_blocksize(inode);
>  	uint64_t		file_offset;	/* file offset of page */
> +	unsigned		poffset;	/* offset into page */
>  	int			error = 0;
>  	int			count = 0;
> -	unsigned int		new_type;
>  
> -	bh = head = page_buffers(page);
> +	/*
> +	 * Walk the blocks on the page, and if we run off the end of the
> +	 * current map or find the current map invalid, grab a new one.
> +	 * We only use bufferheads here to check per-block state - they no
> +	 * longer control the iteration through the page. This allows us to
> +	 * replace the bufferhead with some other state tracking mechanism in
> +	 * future.
> +	 */
>  	file_offset = page_offset(page);
> -	do {
> +	bh = page_buffers(page);
> +	for (poffset = 0;
> +	     poffset < PAGE_SIZE;
> +	     poffset += len, file_offset += len, bh = bh->b_this_page) {
> +		/* past the range we are writing, so nothing more to write. */
>  		if (file_offset >= end_offset)
>  			break;
>  
> -		/*
> -		 * set_page_dirty dirties all buffers in a page, independent
> -		 * of their state.  The dirty state however is entirely
> -		 * meaningless for holes (!mapped && uptodate), so skip
> -		 * buffers covering holes here.
> -		 */
> -		if (!buffer_mapped(bh) && buffer_uptodate(bh))
> -			continue;
> -
> -		if (buffer_unwritten(bh))
> -			new_type = XFS_IO_UNWRITTEN;
> -		else if (buffer_delay(bh))
> -			new_type = XFS_IO_DELALLOC;
> -		else if (buffer_uptodate(bh))
> -			new_type = XFS_IO_OVERWRITE;
> -		else {
> +		if (!buffer_uptodate(bh)) {
>  			if (PageUptodate(page))
>  				ASSERT(buffer_mapped(bh));
> -			/*
> -			 * This buffer is not uptodate and will not be
> -			 * written to disk.
> -			 */
>  			continue;
>  		}
>  
> -		/*
> -		 * If we already have a valid COW mapping keep using it.
> -		 */
> -		if (wpc->io_type == XFS_IO_COW &&
> -		    xfs_imap_valid(inode, &wpc->imap, file_offset)) {
> -			wpc->imap_valid = true;
> -			new_type = XFS_IO_COW;
> -		}
> -
> -		if (wpc->io_type != new_type) {
> -			wpc->io_type = new_type;
> -			wpc->imap_valid = false;
> -		}
> -
>  		if (wpc->imap_valid)
>  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
>  							 file_offset);
> @@ -891,11 +877,10 @@ xfs_writepage_map(
>  			continue;
>  
>  		lock_buffer(bh);
> -		if (wpc->io_type != XFS_IO_OVERWRITE)
> -			xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
> +		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
>  		xfs_add_to_ioend(inode, bh, file_offset, wpc, wbc, &submit_list);
>  		count++;
> -	} while (file_offset += len, ((bh = bh->b_this_page) != head));
> +	}
>  
>  	ASSERT(wpc->ioend || list_empty(&submit_list));
>  
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/24] xfs: move locking into xfs_bmap_punch_delalloc_range
  2018-06-15 13:01 ` [PATCH 06/24] xfs: move locking into xfs_bmap_punch_delalloc_range Christoph Hellwig
@ 2018-06-19  5:26   ` Darrick J. Wong
  0 siblings, 0 replies; 54+ messages in thread
From: Darrick J. Wong @ 2018-06-19  5:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Fri, Jun 15, 2018 at 03:01:51PM +0200, Christoph Hellwig wrote:
> Both callers want the same locking, so do it only once.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_aops.c      | 2 --
>  fs/xfs/xfs_bmap_util.c | 9 +++++----
>  fs/xfs/xfs_iomap.c     | 3 ---
>  3 files changed, 5 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d94dd6fe5127..026721c4991e 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -761,10 +761,8 @@ xfs_aops_discard_page(
>  		"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
>  			page, ip->i_ino, offset);
>  
> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
>  			PAGE_SIZE / i_blocksize(inode));
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	if (error && !XFS_FORCED_SHUTDOWN(mp))
>  		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
>  out_invalidate:
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 0070b877ed94..6690a038892d 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -712,16 +712,15 @@ xfs_bmap_punch_delalloc_range(
>  	struct xfs_iext_cursor	icur;
>  	int			error = 0;
>  
> -	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> -
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
>  		error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
>  		if (error)
> -			return error;
> +			goto out_unlock;
>  	}
>  
>  	if (!xfs_iext_lookup_extent_before(ip, ifp, &end_fsb, &icur, &got))
> -		return 0;
> +		goto out_unlock;
>  
>  	while (got.br_startoff + got.br_blockcount > start_fsb) {
>  		del = got;
> @@ -745,6 +744,8 @@ xfs_bmap_punch_delalloc_range(
>  			break;
>  	}
>  
> +out_unlock:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 85c66e78f9e8..758ad9572d04 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1205,11 +1205,8 @@ xfs_file_iomap_end_delalloc(
>  		truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
>  					 XFS_FSB_TO_B(mp, end_fsb) - 1);
>  
> -		xfs_ilock(ip, XFS_ILOCK_EXCL);
>  		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
>  					       end_fsb - start_fsb);
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -
>  		if (error && !XFS_FORCED_SHUTDOWN(mp)) {
>  			xfs_alert(mp, "%s: unable to clean up ino %lld",
>  				__func__, ip->i_ino);
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 09/24] xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocks
  2018-06-15 13:01 ` [PATCH 09/24] xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocks Christoph Hellwig
@ 2018-06-19  5:27   ` Darrick J. Wong
  0 siblings, 0 replies; 54+ messages in thread
From: Darrick J. Wong @ 2018-06-19  5:27 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Fri, Jun 15, 2018 at 03:01:54PM +0200, Christoph Hellwig wrote:
> We want to be able to use the extent state as a reliable indicator for
> the type of I/O, and stop using the buffer head state.  For this we
> need to stop using the XFS_BMAPI_IGSTATE flag so that we don't see merged
> extents of different types.
> 
> Based on a patch from Dave Chinner.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_aops.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 307ca66879b3..39b933e17192 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -385,7 +385,6 @@ xfs_map_blocks(
>  	ssize_t			count = i_blocksize(inode);
>  	xfs_fileoff_t		offset_fsb, end_fsb;
>  	int			error = 0;
> -	int			bmapi_flags = XFS_BMAPI_ENTIRE;
>  	int			nimaps = 1;
>  
>  	if (XFS_FORCED_SHUTDOWN(mp))
> @@ -405,8 +404,6 @@ xfs_map_blocks(
>  		return 0;
>  
>  	ASSERT(type != XFS_IO_COW);
> -	if (type == XFS_IO_UNWRITTEN)
> -		bmapi_flags |= XFS_BMAPI_IGSTATE;
>  
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
>  	ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
> @@ -418,7 +415,7 @@ xfs_map_blocks(
>  	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
>  	offset_fsb = XFS_B_TO_FSBT(mp, offset);
>  	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
> -				imap, &nimaps, bmapi_flags);
> +				imap, &nimaps, XFS_BMAPI_ENTIRE);
>  	/*
>  	 * Truncate an overwrite extent if there's a pending CoW
>  	 * reservation before the end of this extent.  This forces us
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 10/24] xfs: remove xfs_reflink_trim_irec_to_next_cow
  2018-06-15 13:01 ` [PATCH 10/24] xfs: remove xfs_reflink_trim_irec_to_next_cow Christoph Hellwig
@ 2018-06-19  5:30   ` Darrick J. Wong
  0 siblings, 0 replies; 54+ messages in thread
From: Darrick J. Wong @ 2018-06-19  5:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Fri, Jun 15, 2018 at 03:01:55PM +0200, Christoph Hellwig wrote:
> We already have to check for overlapping COW extents every time we
> come back to a page in xfs_writepage_map / xfs_map_cow, so this
> additional trim is not required.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_aops.c    |  7 -------
>  fs/xfs/xfs_reflink.c | 33 ---------------------------------
>  fs/xfs/xfs_reflink.h |  2 --
>  fs/xfs/xfs_trace.h   |  1 -
>  4 files changed, 43 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 39b933e17192..c8fc6786b06c 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -416,13 +416,6 @@ xfs_map_blocks(
>  	offset_fsb = XFS_B_TO_FSBT(mp, offset);
>  	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
>  				imap, &nimaps, XFS_BMAPI_ENTIRE);
> -	/*
> -	 * Truncate an overwrite extent if there's a pending CoW
> -	 * reservation before the end of this extent.  This forces us
> -	 * to come back to writepage to take care of the CoW.
> -	 */
> -	if (nimaps && type == XFS_IO_OVERWRITE)
> -		xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
>  	if (error)
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 713e857d9ffa..d0092c864458 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -514,39 +514,6 @@ xfs_reflink_find_cow_mapping(
>  	return true;
>  }
>  
> -/*
> - * Trim an extent to end at the next CoW reservation past offset_fsb.
> - */
> -void
> -xfs_reflink_trim_irec_to_next_cow(
> -	struct xfs_inode		*ip,
> -	xfs_fileoff_t			offset_fsb,
> -	struct xfs_bmbt_irec		*imap)
> -{
> -	struct xfs_ifork		*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> -	struct xfs_bmbt_irec		got;
> -	struct xfs_iext_cursor		icur;
> -
> -	if (!xfs_is_reflink_inode(ip))
> -		return;
> -
> -	/* Find the extent in the CoW fork. */
> -	if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, &icur, &got))
> -		return;
> -
> -	/* This is the extent before; try sliding up one. */
> -	if (got.br_startoff < offset_fsb) {
> -		if (!xfs_iext_next_extent(ifp, &icur, &got))
> -			return;
> -	}
> -
> -	if (got.br_startoff >= imap->br_startoff + imap->br_blockcount)
> -		return;
> -
> -	imap->br_blockcount = got.br_startoff - imap->br_startoff;
> -	trace_xfs_reflink_trim_irec(ip, imap);
> -}
> -
>  /*
>   * Cancel CoW reservations for some block range of an inode.
>   *
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index 701487bab468..c0a25609007b 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -34,8 +34,6 @@ extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
>  		xfs_off_t count);
>  extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
>  		struct xfs_bmbt_irec *imap);
> -extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> -		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
>  
>  extern int xfs_reflink_cancel_cow_blocks(struct xfs_inode *ip,
>  		struct xfs_trans **tpp, xfs_fileoff_t offset_fsb,
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 9d4c4ca24fe6..041045a8d65c 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -3228,7 +3228,6 @@ DEFINE_RW_EVENT(xfs_reflink_reserve_cow);
>  
>  DEFINE_SIMPLE_IO_EVENT(xfs_reflink_bounce_dio_write);
>  DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
> -DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_irec);
>  
>  DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cancel_cow_range);
>  DEFINE_SIMPLE_IO_EVENT(xfs_reflink_end_cow);
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/24] xfs: remove xfs_map_cow
  2018-06-18 17:38   ` Brian Foster
@ 2018-06-19  5:35     ` Darrick J. Wong
  2018-06-19 16:53       ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Darrick J. Wong @ 2018-06-19  5:35 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel

On Mon, Jun 18, 2018 at 01:38:38PM -0400, Brian Foster wrote:
> On Fri, Jun 15, 2018 at 03:01:56PM +0200, Christoph Hellwig wrote:
> > We can handle the existing cow mapping case as a special case directly
> > in xfs_writepage_map, and share code for allocating delalloc blocks
> > with regular I/O in xfs_map_blocks.  This means we need to always
> > call xfs_map_blocks for reflink inodes, but we can still skip most of
> > the work if it turns out that there is no COW mapping overlapping the
> > current block.
> > 
> > As a subtle detail we need to start caching holes in the wpc to deal
> > with the case of COW reservations beyond EOF.  But we'll need that
> > infrastructure later anyway, so this is no big deal.
> > 
> > Based on a patch from Dave Chinner.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/xfs/xfs_aops.c | 186 ++++++++++++++++++++++------------------------
> >  fs/xfs/xfs_aops.h |   4 +-
> >  2 files changed, 92 insertions(+), 98 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index c8fc6786b06c..ff27bdcc49a1 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> ...
> > @@ -887,22 +877,24 @@ xfs_writepage_map(
> >  		if (wpc->imap_valid)
> >  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> >  							 offset);
> > -		if (!wpc->imap_valid) {
> > -			error = xfs_map_blocks(inode, offset, &wpc->imap,
> > -					     wpc->io_type);
> 
> A comment would be nice here just so the purpose of the logic is clear
> at a glance. E.g.:
> 
>                 /*
>                  * COW fork blocks can overlap data fork blocks even if the
>                  * blocks aren't shared. COW I/O always takes precedence, so we
>                  * must always check for overlap on reflink inodes unless the
>                  * mapping is already COW.
>                  */
> 
> With something like that, this now looks pretty good to me:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>

I concur, it would help to have a comment to remind everyone that we
sometimes upgrade to a cow write even when it's not strictly required.
:)

With that comment added in,

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> Note: I see that this code ends up shifted into xfs_map_blocks(). If
> it's easier to add the comment in whatever patch touches that last,
> that's fine too.
> 
> Brian
> 
> > +		if (!wpc->imap_valid ||
> > +		    (xfs_is_reflink_inode(XFS_I(inode)) &&
> > +		     wpc->io_type != XFS_IO_COW)) {
> > +			error = xfs_map_blocks(wpc, inode, offset);
> >  			if (error)
> >  				goto out;
> >  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> >  							 offset);
> >  		}
> > -		if (wpc->imap_valid) {
> > -			lock_buffer(bh);
> > -			if (wpc->io_type != XFS_IO_OVERWRITE)
> > -				xfs_map_at_offset(inode, bh, &wpc->imap, offset);
> > -			xfs_add_to_ioend(inode, bh, offset, wpc, wbc, &submit_list);
> > -			count++;
> > -		}
> >  
> > +		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE)
> > +			continue;
> > +
> > +		lock_buffer(bh);
> > +		if (wpc->io_type != XFS_IO_OVERWRITE)
> > +			xfs_map_at_offset(inode, bh, &wpc->imap, offset);
> > +		xfs_add_to_ioend(inode, bh, offset, wpc, wbc, &submit_list);
> > +		count++;
> >  	} while (offset += len, ((bh = bh->b_this_page) != head));
> >  
> >  	ASSERT(wpc->ioend || list_empty(&submit_list));
> > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> > index 694c85b03813..466d22e5bc36 100644
> > --- a/fs/xfs/xfs_aops.h
> > +++ b/fs/xfs/xfs_aops.h
> > @@ -29,6 +29,7 @@ enum {
> >  	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
> >  	XFS_IO_OVERWRITE,	/* covers already allocated extent */
> >  	XFS_IO_COW,		/* covers copy-on-write extent */
> > +	XFS_IO_HOLE,		/* covers region without any block allocation */
> >  };
> >  
> >  #define XFS_IO_TYPES \
> > @@ -36,7 +37,8 @@ enum {
> >  	{ XFS_IO_DELALLOC,		"delalloc" }, \
> >  	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
> >  	{ XFS_IO_OVERWRITE,		"overwrite" }, \
> > -	{ XFS_IO_COW,			"CoW" }
> > +	{ XFS_IO_COW,			"CoW" }, \
> > +	{ XFS_IO_HOLE,			"hole" }
> >  
> >  /*
> >   * Structure for buffered I/O completions.
> > -- 
> > 2.17.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 12/24] xfs: rename the offset variable in xfs_writepage_map
  2018-06-15 13:01 ` [PATCH 12/24] xfs: rename the offset variable in xfs_writepage_map Christoph Hellwig
@ 2018-06-19  5:37   ` Darrick J. Wong
  0 siblings, 0 replies; 54+ messages in thread
From: Darrick J. Wong @ 2018-06-19  5:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Fri, Jun 15, 2018 at 03:01:57PM +0200, Christoph Hellwig wrote:
> Calling it file_offset makes the usage clearer, especially with
> a new poffset variable that will be added soon for the offset inside
> the page.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_aops.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index ff27bdcc49a1..8c1a28f39197 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -824,15 +824,15 @@ xfs_writepage_map(
>  	struct xfs_ioend	*ioend, *next;
>  	struct buffer_head	*bh, *head;
>  	ssize_t			len = i_blocksize(inode);
> -	uint64_t		offset;
> +	uint64_t		file_offset;	/* file offset of page */
>  	int			error = 0;
>  	int			count = 0;
>  	unsigned int		new_type;
>  
>  	bh = head = page_buffers(page);
> -	offset = page_offset(page);
> +	file_offset = page_offset(page);
>  	do {
> -		if (offset >= end_offset)
> +		if (file_offset >= end_offset)
>  			break;
>  
>  		/*
> @@ -864,7 +864,7 @@ xfs_writepage_map(
>  		 * If we already have a valid COW mapping keep using it.
>  		 */
>  		if (wpc->io_type == XFS_IO_COW &&
> -		    xfs_imap_valid(inode, &wpc->imap, offset)) {
> +		    xfs_imap_valid(inode, &wpc->imap, file_offset)) {
>  			wpc->imap_valid = true;
>  			new_type = XFS_IO_COW;
>  		}
> @@ -876,15 +876,15 @@ xfs_writepage_map(
>  
>  		if (wpc->imap_valid)
>  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> -							 offset);
> +							 file_offset);
>  		if (!wpc->imap_valid ||
>  		    (xfs_is_reflink_inode(XFS_I(inode)) &&
>  		     wpc->io_type != XFS_IO_COW)) {
> -			error = xfs_map_blocks(wpc, inode, offset);
> +			error = xfs_map_blocks(wpc, inode, file_offset);
>  			if (error)
>  				goto out;
>  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> -							 offset);
> +							 file_offset);
>  		}
>  
>  		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE)
> @@ -892,10 +892,10 @@ xfs_writepage_map(
>  
>  		lock_buffer(bh);
>  		if (wpc->io_type != XFS_IO_OVERWRITE)
> -			xfs_map_at_offset(inode, bh, &wpc->imap, offset);
> -		xfs_add_to_ioend(inode, bh, offset, wpc, wbc, &submit_list);
> +			xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
> +		xfs_add_to_ioend(inode, bh, file_offset, wpc, wbc, &submit_list);
>  		count++;
> -	} while (offset += len, ((bh = bh->b_this_page) != head));
> +	} while (file_offset += len, ((bh = bh->b_this_page) != head));
>  
>  	ASSERT(wpc->ioend || list_empty(&submit_list));
>  
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 13/24] xfs: make xfs_writepage_map extent map centric
  2018-06-15 13:01 ` [PATCH 13/24] xfs: make xfs_writepage_map extent map centric Christoph Hellwig
  2018-06-18 17:38   ` Brian Foster
@ 2018-06-19  5:43   ` Darrick J. Wong
  2018-06-19 16:52     ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: Darrick J. Wong @ 2018-06-19  5:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, Dave Chinner

On Fri, Jun 15, 2018 at 03:01:58PM +0200, Christoph Hellwig wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_writepage_map() iterates over the bufferheads on a page to decide
> what sort of IO to do and what actions to take.  However, when it comes
> to reflink and deciding when it needs to execute a COW operation, we no
> longer look at the bufferhead state but instead we ignore that and look
> up internal state held in the COW fork extent list.
> 
> This means xfs_writepage_map() is somewhat confused. It does stuff, then
> ignores it, then tries to handle the impedance mismatch by shovelling the
> results inside the existing mapping code.  It works, but it's a bit of a
> mess and it makes it hard to fix the cached map bug that the writepage
> code currently has.
> 
> To unify the two different mechanisms, we first have to choose a direction.
> That's already been set - we're de-emphasising bufferheads so they are no
> longer a control structure as we need to do taht to allow for eventual
> removal.  Hence we need to move away from looking at bufferhead state to
> determine what operations we need to perform.
> 
> We can't completely get rid of bufferheads yet - they do contain some
> state that is absolutely necessary, such as whether that part of the page
> contains valid data or not (buffer_uptodate()).  Other state in the
> bufferhead is redundant:
> 
> 	BH_dirty - the page is dirty, so we can ignore this and just
> 		write it
> 	BH_delay - we have delalloc extent info in the DATA fork extent
> 		tree
> 	BH_unwritten - same as BH_delay
> 	BH_mapped - indicates we've already used it once for IO and it is
> 		mapped to a disk address. Needs to be ignored for COW
> 		blocks.
> 
> The BH_mapped flag is an interesting case - it's supposed to indicate that
> it's already mapped to disk and so we can just use it "as is".  In theory,
> we don't even have to do an extent lookup to find where to write it too,
> but we have to do that anyway to determine we are actually writing over a
> valid extent.  Hence it's not even serving the purpose of avoiding a an
> extent lookup during writeback, and so we can pretty much ignore it.
> Especially as we have to ignore it for COW operations...
> 
> Therefore, use the extent map as the source of information to tell us
> what actions we need to take and what sort of IO we should perform.  The
> first step is to have xfs_map_blocks() set the io type according to what
> it looks up.  This means it can easily handle both normal overwrite and
> COW cases.  The only thing we need to add is the ability to return
> hole mappings.
> 
> We need to return and cache hole mappings now for the case of multiple
> blocks per page.  We no longer use BH_mapped to indicate a block over
> a hole, so we have to get that info from xfs_map_blocks().  We cache it so
> that holes that span two pages don't need separate lookups.  This allows us
> to avoid ever doing write IO over a hole, too.
> 
> Now that we have xfs_map_blocks() returning both a cached map and the type
> of IO we need to perform, we can rewrite xfs_writepage_map() to drop all
> the bufferhead control. It's also much simplified because it doesn't need
> to explicitly handle COW operations.  Instead of iterating bufferheads, it
> iterates blocks within the page and then looks up what per-block state is
> required from the appropriate bufferhead.  It then validates the cached
> map, and if it's not valid, we get a new map.  If we don't get a valid map
> or it's over a hole, we skip the block.
> 
> At this point, we have to remap the bufferhead via xfs_map_at_offset().
> As previously noted, we had to do this even if the buffer was already
> mapped as the mapping would be stale for XFS_IO_DELALLOC, XFS_IO_UNWRITTEN
> and XFS_IO_COW IO types.  With xfs_map_blocks() now controlling the type,
> even XFS_IO_OVERWRITE types need remapping, as converted-but-not-yet-
> written delalloc extents beyond EOF can be reported as XFS_IO_OVERWRITE.
> Bufferheads that span such regions still need their BH_Delay flags cleared
> and their block numbers calculated, so we now unconditionally map each
> bufferhead before submission.
> 
> But wait! There's more - remember the old "treat unwritten extents as
> holes on read" hack?  Yeah, that means we can have a dirty page with
> unmapped, unwritten bufferheads that contain data!  What makes these so
> special is that the unwritten "hole" bufferheads do not have a valid block
> device pointer, so if we attempt to write them xfs_add_to_ioend() blows
> up. So we make xfs_map_at_offset() do the "realtime or data device"
> lookup from the inode and ignore what was or wasn't put into the
> bufferhead when the buffer was instantiated.
> 
> The astute reader will have realised by now that this code treats
> unwritten extents in multiple-blocks-per-page situations differently.
> If we get any combination of unwritten blocks on a dirty page that contain
> valid data in the page, we're going to convert them to real extents.  This
> can actually be a win, because it means that pages with interleaving
> unwritten and written blocks will get converted to a single written extent
> with zeros replacing the interspersed unwritten blocks.  This is actually
> good for reducing extent list and conversion overhead, and it means we
> issue a contiguous IO instead of lots of little ones.  The downside is
> that we use up a little extra IO bandwidth.  Neither of these seems like a
> bad thing given that spinning disks are seek sensitive, and SSDs/pmem have
> bandwidth to burn and the lower IO latency/CPU overhead of fewer, larger
> IOs will result in better performance on them...
> 
> As a result of all this, the only state we actually care about from the
> bufferhead is a single flag - BH_Uptodate. We still use the bufferhead to
> pass some information to the bio via xfs_add_to_ioend(), but that is
> trivial to separate and pass explicitly.  This means we really only need
> 1 bit of state per block per page from the buffered write path in the
> writeback path.  Everything else we do with the bufferhead is purely to
> make the buffered IO front end continue to work correctly. i.e. we've
> pretty much marginalised bufferheads in the writeback path completely.
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> [hch: forward port + slight refactoring]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_aops.c | 89 ++++++++++++++++++++---------------------------
>  1 file changed, 37 insertions(+), 52 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 8c1a28f39197..165891ecb4ba 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -454,19 +454,19 @@ xfs_map_blocks(
>  	} else if (imap.br_startblock == HOLESTARTBLOCK) {
>  		/* landed in a hole */
>  		wpc->io_type = XFS_IO_HOLE;
> -	}
> -
> -	if (wpc->io_type == XFS_IO_DELALLOC &&
> -	    (!nimaps || isnullstartblock(imap.br_startblock)))
> -		goto allocate_blocks;
> +	} else {
> +		if (isnullstartblock(imap.br_startblock)) {
> +			/* got a delalloc extent */
> +			wpc->io_type = XFS_IO_DELALLOC;
> +			goto allocate_blocks;
> +		}
>  
> -#ifdef DEBUG
> -	if (wpc->io_type == XFS_IO_UNWRITTEN) {
> -		ASSERT(nimaps);
> -		ASSERT(imap.br_startblock != HOLESTARTBLOCK);
> -		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> +		if (imap.br_state == XFS_EXT_UNWRITTEN)
> +			wpc->io_type = XFS_IO_UNWRITTEN;
> +		else
> +			wpc->io_type = XFS_IO_OVERWRITE;
>  	}
> -#endif
> +
>  	wpc->imap = imap;
>  	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
>  	return 0;
> @@ -736,6 +736,14 @@ xfs_map_at_offset(
>  	set_buffer_mapped(bh);
>  	clear_buffer_delay(bh);
>  	clear_buffer_unwritten(bh);
> +
> +	/*
> +	 * If this is a realtime file, data may be on a different device
> +	 * to that pointed to from the buffer_head b_bdev currently. We can't
> +	 * trust that the bufferhead has already been mapped correctly, so
> +	 * set the bdev now.
> +	 */
> +	bh->b_bdev = xfs_find_bdev_for_inode(inode);
>  }
>  
>  STATIC void
> @@ -822,58 +830,36 @@ xfs_writepage_map(
>  {
>  	LIST_HEAD(submit_list);
>  	struct xfs_ioend	*ioend, *next;
> -	struct buffer_head	*bh, *head;
> +	struct buffer_head	*bh;
>  	ssize_t			len = i_blocksize(inode);
>  	uint64_t		file_offset;	/* file offset of page */
> +	unsigned		poffset;	/* offset into page */
>  	int			error = 0;
>  	int			count = 0;
> -	unsigned int		new_type;
>  
> -	bh = head = page_buffers(page);
> +	/*
> +	 * Walk the blocks on the page, and we we run off then end of the
> +	 * current map or find the current map invalid, grab a new one.

Will rework this comment to read:

"...and if we run off the end of the current map or..."

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> +	 * We only use bufferheads here to check per-block state - they no
> +	 * longer control the iteration through the page. This allows us to
> +	 * replace the bufferhead with some other state tracking mechanism in
> +	 * future.
> +	 */
>  	file_offset = page_offset(page);
> -	do {
> +	bh = page_buffers(page);
> +	for (poffset = 0;
> +	     poffset < PAGE_SIZE;
> +	     poffset += len, file_offset += len, bh = bh->b_this_page) {
> +		/* past the range we are writing, so nothing more to write. */
>  		if (file_offset >= end_offset)
>  			break;
>  
> -		/*
> -		 * set_page_dirty dirties all buffers in a page, independent
> -		 * of their state.  The dirty state however is entirely
> -		 * meaningless for holes (!mapped && uptodate), so skip
> -		 * buffers covering holes here.
> -		 */
> -		if (!buffer_mapped(bh) && buffer_uptodate(bh))
> -			continue;
> -
> -		if (buffer_unwritten(bh))
> -			new_type = XFS_IO_UNWRITTEN;
> -		else if (buffer_delay(bh))
> -			new_type = XFS_IO_DELALLOC;
> -		else if (buffer_uptodate(bh))
> -			new_type = XFS_IO_OVERWRITE;
> -		else {
> +		if (!buffer_uptodate(bh)) {
>  			if (PageUptodate(page))
>  				ASSERT(buffer_mapped(bh));
> -			/*
> -			 * This buffer is not uptodate and will not be
> -			 * written to disk.
> -			 */
>  			continue;
>  		}
>  
> -		/*
> -		 * If we already have a valid COW mapping keep using it.
> -		 */
> -		if (wpc->io_type == XFS_IO_COW &&
> -		    xfs_imap_valid(inode, &wpc->imap, file_offset)) {
> -			wpc->imap_valid = true;
> -			new_type = XFS_IO_COW;
> -		}
> -
> -		if (wpc->io_type != new_type) {
> -			wpc->io_type = new_type;
> -			wpc->imap_valid = false;
> -		}
> -
>  		if (wpc->imap_valid)
>  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
>  							 file_offset);
> @@ -891,11 +877,10 @@ xfs_writepage_map(
>  			continue;
>  
>  		lock_buffer(bh);
> -		if (wpc->io_type != XFS_IO_OVERWRITE)
> -			xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
> +		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
>  		xfs_add_to_ioend(inode, bh, file_offset, wpc, wbc, &submit_list);
>  		count++;
> -	} while (file_offset += len, ((bh = bh->b_this_page) != head));
> +	}
>  
>  	ASSERT(wpc->ioend || list_empty(&submit_list));
>  
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 24/24] xfs: add support for sub-pagesize writeback without buffer_heads
  2018-06-15 13:02 ` [PATCH 24/24] xfs: add support for sub-pagesize writeback without buffer_heads Christoph Hellwig
@ 2018-06-19  6:15   ` Darrick J. Wong
  0 siblings, 0 replies; 54+ messages in thread
From: Darrick J. Wong @ 2018-06-19  6:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Fri, Jun 15, 2018 at 03:02:09PM +0200, Christoph Hellwig wrote:
> Switch to using the iomap_page structure for checking sub-page uptodate
> status and tracking sub-page I/O completion status, and remove large
> quantities of boilerplate code working around buffer heads.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Need to go over these last two with a finer-toothed comb in the morning,
will run this through testing and see what happens.
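
One detail that does stand out already, reading just the quoted hunks (the
names below are simply lifted from the patch, nothing beyond it): page
writeback completion is now reference counted per page instead of per
buffer.  Each bio segment added for a page holds one count, and the new
completion helper only ends writeback when the last one drains:

	/* per-segment completion, as in the quoted xfs_finish_page_writeback() */
	if (!iop || atomic_dec_and_test(&iop->write_count))
		end_page_writeback(bvec->bv_page);

The matching atomic_inc() in xfs_add_to_ioend() only happens when
__bio_try_merge_page() fails and a new segment is started, so contiguous
blocks merged into an existing segment share a single count.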

--D

> ---
>  fs/xfs/xfs_aops.c  | 496 ++++++---------------------------------------
>  fs/xfs/xfs_buf.h   |   1 -
>  fs/xfs/xfs_iomap.c |   3 -
>  fs/xfs/xfs_super.c |   2 +-
>  fs/xfs/xfs_trace.h |  18 +-
>  5 files changed, 62 insertions(+), 458 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 556aad821a17..3fc54c34f1bc 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -32,9 +32,6 @@
>  #include "xfs_bmap_util.h"
>  #include "xfs_bmap_btree.h"
>  #include "xfs_reflink.h"
> -#include <linux/gfp.h>
> -#include <linux/mpage.h>
> -#include <linux/pagevec.h>
>  #include <linux/writeback.h>
>  
>  /*
> @@ -46,25 +43,6 @@ struct xfs_writepage_ctx {
>  	struct xfs_ioend	*ioend;
>  };
>  
> -void
> -xfs_count_page_state(
> -	struct page		*page,
> -	int			*delalloc,
> -	int			*unwritten)
> -{
> -	struct buffer_head	*bh, *head;
> -
> -	*delalloc = *unwritten = 0;
> -
> -	bh = head = page_buffers(page);
> -	do {
> -		if (buffer_unwritten(bh))
> -			(*unwritten) = 1;
> -		else if (buffer_delay(bh))
> -			(*delalloc) = 1;
> -	} while ((bh = bh->b_this_page) != head);
> -}
> -
>  struct block_device *
>  xfs_find_bdev_for_inode(
>  	struct inode		*inode)
> @@ -97,67 +75,17 @@ xfs_finish_page_writeback(
>  	struct bio_vec		*bvec,
>  	int			error)
>  {
> +	struct iomap_page	*iop = to_iomap_page(bvec->bv_page);
> +
>  	if (error) {
>  		SetPageError(bvec->bv_page);
>  		mapping_set_error(inode->i_mapping, -EIO);
>  	}
> -	end_page_writeback(bvec->bv_page);
> -}
>  
> -/*
> - * We're now finished for good with this page.  Update the page state via the
> - * associated buffer_heads, paying attention to the start and end offsets that
> - * we need to process on the page.
> - *
> - * Note that we open code the action in end_buffer_async_write here so that we
> - * only have to iterate over the buffers attached to the page once.  This is not
> - * only more efficient, but also ensures that we only calls end_page_writeback
> - * at the end of the iteration, and thus avoids the pitfall of having the page
> - * and buffers potentially freed after every call to end_buffer_async_write.
> - */
> -static void
> -xfs_finish_buffer_writeback(
> -	struct inode		*inode,
> -	struct bio_vec		*bvec,
> -	int			error)
> -{
> -	struct buffer_head	*head = page_buffers(bvec->bv_page), *bh = head;
> -	bool			busy = false;
> -	unsigned int		off = 0;
> -	unsigned long		flags;
> -
> -	ASSERT(bvec->bv_offset < PAGE_SIZE);
> -	ASSERT((bvec->bv_offset & (i_blocksize(inode) - 1)) == 0);
> -	ASSERT(bvec->bv_offset + bvec->bv_len <= PAGE_SIZE);
> -	ASSERT((bvec->bv_len & (i_blocksize(inode) - 1)) == 0);
> -
> -	local_irq_save(flags);
> -	bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
> -	do {
> -		if (off >= bvec->bv_offset &&
> -		    off < bvec->bv_offset + bvec->bv_len) {
> -			ASSERT(buffer_async_write(bh));
> -			ASSERT(bh->b_end_io == NULL);
> -
> -			if (error) {
> -				mark_buffer_write_io_error(bh);
> -				clear_buffer_uptodate(bh);
> -				SetPageError(bvec->bv_page);
> -			} else {
> -				set_buffer_uptodate(bh);
> -			}
> -			clear_buffer_async_write(bh);
> -			unlock_buffer(bh);
> -		} else if (buffer_async_write(bh)) {
> -			ASSERT(buffer_locked(bh));
> -			busy = true;
> -		}
> -		off += bh->b_size;
> -	} while ((bh = bh->b_this_page) != head);
> -	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
> -	local_irq_restore(flags);
> +	ASSERT(iop || i_blocksize(inode) == PAGE_SIZE);
> +	ASSERT(!iop || atomic_read(&iop->write_count) > 0);
>  
> -	if (!busy)
> +	if (!iop || atomic_dec_and_test(&iop->write_count))
>  		end_page_writeback(bvec->bv_page);
>  }
>  
> @@ -191,12 +119,8 @@ xfs_destroy_ioend(
>  			next = bio->bi_private;
>  
>  		/* walk each page on bio, ending page IO on them */
> -		bio_for_each_segment_all(bvec, bio, i) {
> -			if (page_has_buffers(bvec->bv_page))
> -				xfs_finish_buffer_writeback(inode, bvec, error);
> -			else
> -				xfs_finish_page_writeback(inode, bvec, error);
> -		}
> +		bio_for_each_segment_all(bvec, bio, i)
> +			xfs_finish_page_writeback(inode, bvec, error);
>  		bio_put(bio);
>  	}
>  
> @@ -633,6 +557,7 @@ xfs_add_to_ioend(
>  	struct inode		*inode,
>  	xfs_off_t		offset,
>  	struct page		*page,
> +	struct iomap_page	*iop,
>  	struct xfs_writepage_ctx *wpc,
>  	struct writeback_control *wbc,
>  	struct list_head	*iolist)
> @@ -656,100 +581,37 @@ xfs_add_to_ioend(
>  				bdev, sector);
>  	}
>  
> -	/*
> -	 * If the block doesn't fit into the bio we need to allocate a new
> -	 * one.  This shouldn't happen more than once for a given block.
> -	 */
> -	while (bio_add_page(wpc->ioend->io_bio, page, len, poff) != len)
> -		xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
> +	if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
> +		if (iop)
> +			atomic_inc(&iop->write_count);
> +		if (bio_full(wpc->ioend->io_bio))
> +			xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
> +		__bio_add_page(wpc->ioend->io_bio, page, len, poff);
> +	}
>  
>  	wpc->ioend->io_size += len;
>  }
>  
> -STATIC void
> -xfs_map_buffer(
> -	struct inode		*inode,
> -	struct buffer_head	*bh,
> -	struct xfs_bmbt_irec	*imap,
> -	xfs_off_t		offset)
> -{
> -	sector_t		bn;
> -	struct xfs_mount	*m = XFS_I(inode)->i_mount;
> -	xfs_off_t		iomap_offset = XFS_FSB_TO_B(m, imap->br_startoff);
> -	xfs_daddr_t		iomap_bn = xfs_fsb_to_db(XFS_I(inode), imap->br_startblock);
> -
> -	ASSERT(imap->br_startblock != HOLESTARTBLOCK);
> -	ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
> -
> -	bn = (iomap_bn >> (inode->i_blkbits - BBSHIFT)) +
> -	      ((offset - iomap_offset) >> inode->i_blkbits);
> -
> -	ASSERT(bn || XFS_IS_REALTIME_INODE(XFS_I(inode)));
> -
> -	bh->b_blocknr = bn;
> -	set_buffer_mapped(bh);
> -}
> -
> -STATIC void
> -xfs_map_at_offset(
> -	struct inode		*inode,
> -	struct buffer_head	*bh,
> -	struct xfs_bmbt_irec	*imap,
> -	xfs_off_t		offset)
> -{
> -	ASSERT(imap->br_startblock != HOLESTARTBLOCK);
> -	ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
> -
> -	lock_buffer(bh);
> -	xfs_map_buffer(inode, bh, imap, offset);
> -	set_buffer_mapped(bh);
> -	clear_buffer_delay(bh);
> -	clear_buffer_unwritten(bh);
> -
> -	/*
> -	 * If this is a realtime file, data may be on a different device
> -	 * to that pointed to from the buffer_head b_bdev currently. We can't
> -	 * trust that the bufferhead has already been mapped correctly, so
> -	 * set the bdev now.
> -	 */
> -	bh->b_bdev = xfs_find_bdev_for_inode(inode);
> -	bh->b_end_io = NULL;
> -	set_buffer_async_write(bh);
> -	set_buffer_uptodate(bh);
> -	clear_buffer_dirty(bh);
> -}
> -
>  STATIC void
>  xfs_vm_invalidatepage(
>  	struct page		*page,
>  	unsigned int		offset,
>  	unsigned int		length)
>  {
> -	trace_xfs_invalidatepage(page->mapping->host, page, offset,
> -				 length);
> -
> -	/*
> -	 * If we are invalidating the entire page, clear the dirty state from it
> -	 * so that we can check for attempts to release dirty cached pages in
> -	 * xfs_vm_releasepage().
> -	 */
> -	if (offset == 0 && length >= PAGE_SIZE)
> -		cancel_dirty_page(page);
> -	block_invalidatepage(page, offset, length);
> +	trace_xfs_invalidatepage(page->mapping->host, page, offset, length);
> +	iomap_invalidatepage(page, offset, length);
>  }
>  
>  /*
> - * If the page has delalloc buffers on it, we need to punch them out before we
> - * invalidate the page. If we don't, we leave a stale delalloc mapping on the
> - * inode that can trip a BUG() in xfs_get_blocks() later on if a direct IO read
> - * is done on that same region - the delalloc extent is returned when none is
> - * supposed to be there.
> + * If the page has delalloc blocks on it, we need to punch them out before we
> + * invalidate the page.  If we don't, we leave a stale delalloc mapping on the
> + * inode that can trip up a later direct I/O read operation on the same region.
>   *
> - * We prevent this by truncating away the delalloc regions on the page before
> - * invalidating it. Because they are delalloc, we can do this without needing a
> - * transaction. Indeed - if we get ENOSPC errors, we have to be able to do this
> - * truncation without a transaction as there is no space left for block
> - * reservation (typically why we see a ENOSPC in writeback).
> + * We prevent this by truncating away the delalloc regions on the page.  Because
> + * they are delalloc, we can do this without needing a transaction. Indeed - if
> + * we get ENOSPC errors, we have to be able to do this truncation without a
> + * transaction as there is no space left for block reservation (typically why we
> + * see an ENOSPC in writeback).
>   */
>  STATIC void
>  xfs_aops_discard_page(
> @@ -781,7 +643,7 @@ xfs_aops_discard_page(
>   * We implement an immediate ioend submission policy here to avoid needing to
>   * chain multiple ioends and hence nest mempool allocations which can violate
>   * forward progress guarantees we need to provide. The current ioend we are
> - * adding buffers to is cached on the writepage context, and if the new buffer
> + * adding blocks to is cached on the writepage context, and if the new block
>   * does not append to the cached ioend it will create a new ioend and cache that
>   * instead.
>   *
> @@ -802,79 +664,53 @@ xfs_writepage_map(
>  	uint64_t		end_offset)
>  {
>  	LIST_HEAD(submit_list);
> +	struct iomap_page	*iop = to_iomap_page(page);
> +	unsigned		len = i_blocksize(inode);
>  	struct xfs_ioend	*ioend, *next;
> -	struct buffer_head	*bh = NULL;
> -	ssize_t			len = i_blocksize(inode);
>  	uint64_t		file_offset;	/* file offset of page */
> -	unsigned		poffset;	/* offset into page */
> -	int			error = 0;
> -	int			count = 0;
> +	int			error = 0, count = 0, i;
>  
> -	if (page_has_buffers(page))
> -		bh = page_buffers(page);
> +	ASSERT(iop || i_blocksize(inode) == PAGE_SIZE);
> +	ASSERT(!iop || atomic_read(&iop->write_count) == 0);
>  
>  	/*
> -	 * Walk the blocks on the page, and we we run off then end of the
> -	 * current map or find the current map invalid, grab a new one.
> -	 * We only use bufferheads here to check per-block state - they no
> -	 * longer control the iteration through the page. This allows us to
> -	 * replace the bufferhead with some other state tracking mechanism in
> -	 * future.
> +	 * Walk through the page to find areas to write back. If we run off the
> +	 * end of the current map or find the current map invalid, grab a new
> +	 * one.
>  	 */
> -	for (poffset = 0, file_offset = page_offset(page);
> -	     poffset < PAGE_SIZE;
> -	     poffset += len, file_offset += len) {
> -		/* past the range we are writing, so nothing more to write. */
> -		if (file_offset >= end_offset)
> -			break;
> -
> -		if (bh && !buffer_uptodate(bh)) {
> -			if (PageUptodate(page))
> -				ASSERT(buffer_mapped(bh));
> -			bh = bh->b_this_page;
> +	for (i = 0, file_offset = page_offset(page);
> +	     i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
> +	     i++, file_offset += len) {
> +		if (iop && !test_bit(i, iop->uptodate))
>  			continue;
> -		}
>  
>  		error = xfs_map_blocks(wpc, inode, file_offset);
>  		if (error)
> -			goto out;
> -
> -		if (wpc->io_type == XFS_IO_HOLE) {
> -			if (bh)
> -				bh = bh->b_this_page;
> +			break;
> +		if (wpc->io_type == XFS_IO_HOLE)
>  			continue;
> -		}
> -
> -		if (bh) {
> -			xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
> -			bh = bh->b_this_page;
> -		}
> -		xfs_add_to_ioend(inode, file_offset, page, wpc, wbc,
> -				&submit_list);
> +		xfs_add_to_ioend(inode, file_offset, page, iop, wpc, wbc,
> +				 &submit_list);
>  		count++;
>  	}
>  
>  	ASSERT(wpc->ioend || list_empty(&submit_list));
> -out:
>  	ASSERT(PageLocked(page));
>  	ASSERT(!PageWriteback(page));
>  
>  	/*
> -	 * On error, we have to fail the ioend here because we have locked
> -	 * buffers in the ioend. If we don't do this, we'll deadlock
> -	 * invalidating the page as that tries to lock the buffers on the page.
> -	 * Also, because we may have set pages under writeback, we have to make
> -	 * sure we run IO completion to mark the error state of the IO
> -	 * appropriately, so we can't cancel the ioend directly here. That means
> -	 * we have to mark this page as under writeback if we included any
> -	 * buffers from it in the ioend chain so that completion treats it
> -	 * correctly.
> +	 * On error, we have to fail the ioend here because we may have set
> +	 * pages under writeback.  We have to make sure we run IO completion to
> +	 * mark the error state of the IO appropriately, so we can't cancel the
> +	 * ioend directly here.  That means we have to mark this page as under
> +	 * writeback if we included any blocks from it in the ioend chain so
> +	 * that completion treats it correctly.
>  	 *
>  	 * If we didn't include the page in the ioend, then on error we can
>  	 * simply discard and unlock it as there are no other users of the page
> -	 * or it's buffers right now. The caller will still need to trigger
> -	 * submission of outstanding ioends on the writepage context so they are
> -	 * treated correctly on error.
> +	 * now.  The caller will still need to trigger submission of outstanding
> +	 * ioends on the writepage context so they are treated correctly on
> +	 * error.
>  	 */
>  	if (unlikely(error)) {
>  		if (!count) {
> @@ -915,8 +751,8 @@ xfs_writepage_map(
>  	}
>  
>  	/*
> -	 * We can end up here with no error and nothing to write if we race with
> -	 * a partial page truncate on a sub-page block sized filesystem.
> +	 * We can end up here with no error and nothing to write only if we race
> +	 * with a partial page truncate on a sub-page block sized filesystem.
>  	 */
>  	if (!count)
>  		end_page_writeback(page);
> @@ -931,7 +767,6 @@ xfs_writepage_map(
>   * For delalloc space on the page we need to allocate space and flush it.
>   * For unwritten space on the page we need to start the conversion to
>   * regular allocated space.
> - * For any other dirty buffer heads on the page we should flush them.
>   */
>  STATIC int
>  xfs_do_writepage(
> @@ -1085,166 +920,13 @@ xfs_dax_writepages(
>  			xfs_find_bdev_for_inode(mapping->host), wbc);
>  }
>  
> -/*
> - * Called to move a page into cleanable state - and from there
> - * to be released. The page should already be clean. We always
> - * have buffer heads in this call.
> - *
> - * Returns 1 if the page is ok to release, 0 otherwise.
> - */
>  STATIC int
>  xfs_vm_releasepage(
>  	struct page		*page,
>  	gfp_t			gfp_mask)
>  {
> -	int			delalloc, unwritten;
> -
>  	trace_xfs_releasepage(page->mapping->host, page, 0, 0);
> -
> -	/*
> -	 * mm accommodates an old ext3 case where clean pages might not have had
> -	 * the dirty bit cleared. Thus, it can send actual dirty pages to
> -	 * ->releasepage() via shrink_active_list(). Conversely,
> -	 * block_invalidatepage() can send pages that are still marked dirty but
> -	 * otherwise have invalidated buffers.
> -	 *
> -	 * We want to release the latter to avoid unnecessary buildup of the
> -	 * LRU, so xfs_vm_invalidatepage() clears the page dirty flag on pages
> -	 * that are entirely invalidated and need to be released.  Hence the
> -	 * only time we should get dirty pages here is through
> -	 * shrink_active_list() and so we can simply skip those now.
> -	 *
> -	 * warn if we've left any lingering delalloc/unwritten buffers on clean
> -	 * or invalidated pages we are about to release.
> -	 */
> -	if (PageDirty(page))
> -		return 0;
> -
> -	xfs_count_page_state(page, &delalloc, &unwritten);
> -
> -	if (WARN_ON_ONCE(delalloc))
> -		return 0;
> -	if (WARN_ON_ONCE(unwritten))
> -		return 0;
> -
> -	return try_to_free_buffers(page);
> -}
> -
> -/*
> - * If this is O_DIRECT or the mpage code calling tell them how large the mapping
> - * is, so that we can avoid repeated get_blocks calls.
> - *
> - * If the mapping spans EOF, then we have to break the mapping up as the mapping
> - * for blocks beyond EOF must be marked new so that sub block regions can be
> - * correctly zeroed. We can't do this for mappings within EOF unless the mapping
> - * was just allocated or is unwritten, otherwise the callers would overwrite
> - * existing data with zeros. Hence we have to split the mapping into a range up
> - * to and including EOF, and a second mapping for beyond EOF.
> - */
> -static void
> -xfs_map_trim_size(
> -	struct inode		*inode,
> -	sector_t		iblock,
> -	struct buffer_head	*bh_result,
> -	struct xfs_bmbt_irec	*imap,
> -	xfs_off_t		offset,
> -	ssize_t			size)
> -{
> -	xfs_off_t		mapping_size;
> -
> -	mapping_size = imap->br_startoff + imap->br_blockcount - iblock;
> -	mapping_size <<= inode->i_blkbits;
> -
> -	ASSERT(mapping_size > 0);
> -	if (mapping_size > size)
> -		mapping_size = size;
> -	if (offset < i_size_read(inode) &&
> -	    (xfs_ufsize_t)offset + mapping_size >= i_size_read(inode)) {
> -		/* limit mapping to block that spans EOF */
> -		mapping_size = roundup_64(i_size_read(inode) - offset,
> -					  i_blocksize(inode));
> -	}
> -	if (mapping_size > LONG_MAX)
> -		mapping_size = LONG_MAX;
> -
> -	bh_result->b_size = mapping_size;
> -}
> -
> -static int
> -xfs_get_blocks(
> -	struct inode		*inode,
> -	sector_t		iblock,
> -	struct buffer_head	*bh_result,
> -	int			create)
> -{
> -	struct xfs_inode	*ip = XFS_I(inode);
> -	struct xfs_mount	*mp = ip->i_mount;
> -	xfs_fileoff_t		offset_fsb, end_fsb;
> -	int			error = 0;
> -	int			lockmode = 0;
> -	struct xfs_bmbt_irec	imap;
> -	int			nimaps = 1;
> -	xfs_off_t		offset;
> -	ssize_t			size;
> -
> -	BUG_ON(create);
> -
> -	if (XFS_FORCED_SHUTDOWN(mp))
> -		return -EIO;
> -
> -	offset = (xfs_off_t)iblock << inode->i_blkbits;
> -	ASSERT(bh_result->b_size >= i_blocksize(inode));
> -	size = bh_result->b_size;
> -
> -	if (offset >= i_size_read(inode))
> -		return 0;
> -
> -	/*
> -	 * Direct I/O is usually done on preallocated files, so try getting
> -	 * a block mapping without an exclusive lock first.
> -	 */
> -	lockmode = xfs_ilock_data_map_shared(ip);
> -
> -	ASSERT(offset <= mp->m_super->s_maxbytes);
> -	if (offset > mp->m_super->s_maxbytes - size)
> -		size = mp->m_super->s_maxbytes - offset;
> -	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + size);
> -	offset_fsb = XFS_B_TO_FSBT(mp, offset);
> -
> -	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
> -			&nimaps, 0);
> -	if (error)
> -		goto out_unlock;
> -	if (!nimaps) {
> -		trace_xfs_get_blocks_notfound(ip, offset, size);
> -		goto out_unlock;
> -	}
> -
> -	trace_xfs_get_blocks_found(ip, offset, size,
> -		imap.br_state == XFS_EXT_UNWRITTEN ?
> -			XFS_IO_UNWRITTEN : XFS_IO_OVERWRITE, &imap);
> -	xfs_iunlock(ip, lockmode);
> -
> -	/* trim mapping down to size requested */
> -	xfs_map_trim_size(inode, iblock, bh_result, &imap, offset, size);
> -
> -	/*
> -	 * For unwritten extents do not report a disk address in the buffered
> -	 * read case (treat as if we're reading into a hole).
> -	 */
> -	if (xfs_bmap_is_real_extent(&imap))
> -		xfs_map_buffer(inode, bh_result, &imap, offset);
> -
> -	/*
> -	 * If this is a realtime file, data may be on a different device.
> -	 * to that pointed to from the buffer_head b_bdev currently.
> -	 */
> -	bh_result->b_bdev = xfs_find_bdev_for_inode(inode);
> -	return 0;
> -
> -out_unlock:
> -	xfs_iunlock(ip, lockmode);
> -	return error;
> +	return iomap_releasepage(page, gfp_mask);
>  }
>  
>  STATIC sector_t
> @@ -1276,9 +958,7 @@ xfs_vm_readpage(
>  	struct page		*page)
>  {
>  	trace_xfs_vm_readpage(page->mapping->host, 1);
> -	if (i_blocksize(page->mapping->host) == PAGE_SIZE)
> -		return iomap_readpage(page, &xfs_iomap_ops);
> -	return mpage_readpage(page, xfs_get_blocks);
> +	return iomap_readpage(page, &xfs_iomap_ops);
>  }
>  
>  STATIC int
> @@ -1289,65 +969,7 @@ xfs_vm_readpages(
>  	unsigned		nr_pages)
>  {
>  	trace_xfs_vm_readpages(mapping->host, nr_pages);
> -	if (i_blocksize(mapping->host) == PAGE_SIZE)
> -		return iomap_readpages(mapping, pages, nr_pages, &xfs_iomap_ops);
> -	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
> -}
> -
> -/*
> - * This is basically a copy of __set_page_dirty_buffers() with one
> - * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
> - * dirty, we'll never be able to clean them because we don't write buffers
> - * beyond EOF, and that means we can't invalidate pages that span EOF
> - * that have been marked dirty. Further, the dirty state can leak into
> - * the file interior if the file is extended, resulting in all sorts of
> - * bad things happening as the state does not match the underlying data.
> - *
> - * XXX: this really indicates that bufferheads in XFS need to die. Warts like
> - * this only exist because of bufferheads and how the generic code manages them.
> - */
> -STATIC int
> -xfs_vm_set_page_dirty(
> -	struct page		*page)
> -{
> -	struct address_space	*mapping = page->mapping;
> -	struct inode		*inode = mapping->host;
> -	loff_t			end_offset;
> -	loff_t			offset;
> -	int			newly_dirty;
> -
> -	if (unlikely(!mapping))
> -		return !TestSetPageDirty(page);
> -
> -	end_offset = i_size_read(inode);
> -	offset = page_offset(page);
> -
> -	spin_lock(&mapping->private_lock);
> -	if (page_has_buffers(page)) {
> -		struct buffer_head *head = page_buffers(page);
> -		struct buffer_head *bh = head;
> -
> -		do {
> -			if (offset < end_offset)
> -				set_buffer_dirty(bh);
> -			bh = bh->b_this_page;
> -			offset += i_blocksize(inode);
> -		} while (bh != head);
> -	}
> -	/*
> -	 * Lock out page->mem_cgroup migration to keep PageDirty
> -	 * synchronized with per-memcg dirty page counters.
> -	 */
> -	lock_page_memcg(page);
> -	newly_dirty = !TestSetPageDirty(page);
> -	spin_unlock(&mapping->private_lock);
> -
> -	if (newly_dirty)
> -		__set_page_dirty(page, mapping, 1);
> -	unlock_page_memcg(page);
> -	if (newly_dirty)
> -		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> -	return newly_dirty;
> +	return iomap_readpages(mapping, pages, nr_pages, &xfs_iomap_ops);
>  }
>  
>  static int
> @@ -1365,13 +987,13 @@ const struct address_space_operations xfs_address_space_operations = {
>  	.readpages		= xfs_vm_readpages,
>  	.writepage		= xfs_vm_writepage,
>  	.writepages		= xfs_vm_writepages,
> -	.set_page_dirty		= xfs_vm_set_page_dirty,
> +	.set_page_dirty		= iomap_set_page_dirty,
>  	.releasepage		= xfs_vm_releasepage,
>  	.invalidatepage		= xfs_vm_invalidatepage,
>  	.bmap			= xfs_vm_bmap,
>  	.direct_IO		= noop_direct_IO,
> -	.migratepage		= buffer_migrate_page,
> -	.is_partially_uptodate  = block_is_partially_uptodate,
> +	.migratepage		= iomap_migrate_page,
> +	.is_partially_uptodate  = iomap_is_partially_uptodate,
>  	.error_remove_page	= generic_error_remove_page,
>  	.swap_activate		= xfs_iomap_swapfile_activate,
>  };
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index f5f2b71c2fde..f3fa197bd272 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -24,7 +24,6 @@
>  #include <linux/mm.h>
>  #include <linux/fs.h>
>  #include <linux/dax.h>
> -#include <linux/buffer_head.h>
>  #include <linux/uio.h>
>  #include <linux/list_lru.h>
>  
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index a29c27e86f60..bcd3064bbf9c 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1033,9 +1033,6 @@ xfs_file_iomap_begin(
>  	if (XFS_FORCED_SHUTDOWN(mp))
>  		return -EIO;
>  
> -	if (i_blocksize(inode) < PAGE_SIZE)
> -		iomap->flags |= IOMAP_F_BUFFER_HEAD;
> -
>  	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
>  			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>  		/* Reserve delalloc blocks for regular writeback. */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index ed67389f4948..cd077c05e09c 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1872,7 +1872,7 @@ MODULE_ALIAS_FS("xfs");
>  STATIC int __init
>  xfs_init_zones(void)
>  {
> -	if (bioset_init(&xfs_ioend_bioset, 4 * MAX_BUF_PER_PAGE,
> +	if (bioset_init(&xfs_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
>  			offsetof(struct xfs_ioend, io_inline_bio),
>  			BIOSET_NEED_BVECS))
>  		goto out;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 666844532030..41aaaf432ff4 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1165,33 +1165,23 @@ DECLARE_EVENT_CLASS(xfs_page_class,
>  		__field(loff_t, size)
>  		__field(unsigned long, offset)
>  		__field(unsigned int, length)
> -		__field(int, delalloc)
> -		__field(int, unwritten)
>  	),
>  	TP_fast_assign(
> -		int delalloc = -1, unwritten = -1;
> -
> -		if (page_has_buffers(page))
> -			xfs_count_page_state(page, &delalloc, &unwritten);
>  		__entry->dev = inode->i_sb->s_dev;
>  		__entry->ino = XFS_I(inode)->i_ino;
>  		__entry->pgoff = page_offset(page);
>  		__entry->size = i_size_read(inode);
>  		__entry->offset = off;
>  		__entry->length = len;
> -		__entry->delalloc = delalloc;
> -		__entry->unwritten = unwritten;
>  	),
>  	TP_printk("dev %d:%d ino 0x%llx pgoff 0x%lx size 0x%llx offset %lx "
> -		  "length %x delalloc %d unwritten %d",
> +		  "length %x",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->pgoff,
>  		  __entry->size,
>  		  __entry->offset,
> -		  __entry->length,
> -		  __entry->delalloc,
> -		  __entry->unwritten)
> +		  __entry->length)
>  )
>  
>  #define DEFINE_PAGE_EVENT(name)		\
> @@ -1275,9 +1265,6 @@ DEFINE_EVENT(xfs_imap_class, name,	\
>  	TP_ARGS(ip, offset, count, type, irec))
>  DEFINE_IOMAP_EVENT(xfs_map_blocks_found);
>  DEFINE_IOMAP_EVENT(xfs_map_blocks_alloc);
> -DEFINE_IOMAP_EVENT(xfs_get_blocks_found);
> -DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
> -DEFINE_IOMAP_EVENT(xfs_get_blocks_map_direct);
>  DEFINE_IOMAP_EVENT(xfs_iomap_alloc);
>  DEFINE_IOMAP_EVENT(xfs_iomap_found);
>  
> @@ -1316,7 +1303,6 @@ DEFINE_EVENT(xfs_simple_io_class, name,	\
>  	TP_ARGS(ip, offset, count))
>  DEFINE_SIMPLE_IO_EVENT(xfs_delalloc_enospc);
>  DEFINE_SIMPLE_IO_EVENT(xfs_unwritten_convert);
> -DEFINE_SIMPLE_IO_EVENT(xfs_get_blocks_notfound);
>  DEFINE_SIMPLE_IO_EVENT(xfs_setfilesize);
>  DEFINE_SIMPLE_IO_EVENT(xfs_zero_eof);
>  DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write);
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 13/24] xfs: make xfs_writepage_map extent map centric
  2018-06-19  5:43   ` Darrick J. Wong
@ 2018-06-19 16:52     ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-19 16:52 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Dave Chinner

On Mon, Jun 18, 2018 at 10:43:25PM -0700, Darrick J. Wong wrote:
> > +	/*
> > +	 * Walk the blocks on the page, and we we run off then end of the
> > +	 * current map or find the current map invalid, grab a new one.
> 
> Will rework this comment to read:
> 
> "...and if we run off the end of the current map or..."

Ok.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O without buffer heads
  2018-06-15 13:02 ` [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O " Christoph Hellwig
@ 2018-06-19 16:52   ` Brian Foster
  2018-06-20  7:56     ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Brian Foster @ 2018-06-19 16:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Fri, Jun 15, 2018 at 03:02:08PM +0200, Christoph Hellwig wrote:
> After already supporting a simple implementation of buffered writes for
> the blocksize == PAGE_SIZE case in the last commit, this adds full support
> even for smaller block sizes.  There are three bits of per-block
> information in the buffer_head structure that really matter for the iomap
> read and write path:
> 
>  - uptodate status (BH_uptodate)
>  - marked as currently under read I/O (BH_Async_Read)
>  - marked as currently under write I/O (BH_Async_Write)
> 
> Instead of having new per-block structures, this now adds a per-page
> structure called struct iomap_page to track this information in a slightly
> different form:
> 
>  - a bitmap for the per-block uptodate status.  For the worst case of a 64k
>    page size system this bitmap needs to contain 128 bits.  For the
>    typical 4k page size case it only needs 8 bits, although we still
>    need a full unsigned long due to the way the atomic bitmap API works.
>  - two atomic_t counters are used to track the outstanding read and write
>    counts
> 
> There is quite a bit of boilerplate code as the buffered I/O path uses
> various helper methods, but the actual code is very straightforward.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap.c            | 262 ++++++++++++++++++++++++++++++++++++++----
>  include/linux/iomap.h |  31 +++++
>  2 files changed, 272 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index a504077bb38f..c59d1922991d 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
...
> @@ -197,7 +322,13 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  
>  	__bio_add_page(ctx->bio, page, plen, poff);
>  done:
> -	return plen;
> +	/*
> +	 * Move the caller beyond our range so that it keeps making progress.
> +	 * For that we have to include any leading non-uptodate ranges, but

Do you mean "leading uptodate ranges" here? E.g., pos is pushed forward
past those ranges we don't have to read, so (pos - orig_pos) reflects
the initial uptodate range while plen reflects the length we have to
read..?

> +	 * we can skip trailing ones as they will be handled in the next
> +	 * iteration.
> +	 */
> +	return pos - orig_pos + plen;
>  }
>  
>  int
...
> @@ -373,21 +581,33 @@ static int
>  __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
>  		struct page *page, struct iomap *iomap)
>  {
> +	struct iomap_page *iop = iomap_page_create(inode, page);
>  	loff_t block_size = i_blocksize(inode);
>  	loff_t block_start = pos & ~(block_size - 1);
>  	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
> -	unsigned poff = block_start & (PAGE_SIZE - 1);
> -	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, block_end - block_start);
> -	unsigned from = pos & (PAGE_SIZE - 1), to = from + len;
> -
> -	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
> +	unsigned from = pos & (PAGE_SIZE - 1), to = from + len, poff, plen;
> +	int status = 0;
>  
>  	if (PageUptodate(page))
>  		return 0;
> -	if (from <= poff && to >= poff + plen)
> -		return 0;
> -	return iomap_read_page_sync(inode, block_start, page,
> -			poff, plen, from, to, iomap);
> +
> +	do {

Kind of a nit, but this catches my eye and manages to confuse me every
time I look at it. A comment along the lines of:

                /*
		 * Pass in the block aligned start/end so we get back block
		 * aligned/adjusted poff/plen and can compare with unaligned
		 * from/to below.
                 */

... would be nice here, IMO.

> +		iomap_adjust_read_range(inode, iop, &block_start,
> +				block_end - block_start, &poff, &plen);
> +		if (plen == 0)
> +			break;
> +
> +		if ((from > poff && from < poff + plen) ||
> +		    (to > poff && to < poff + plen)) {
> +			status = iomap_read_page_sync(inode, block_start, page,
> +					poff, plen, from, to, iomap);

After taking another look at the buffer head path, it does look like we
have slightly different behavior here. IIUC, the former reads only the
!uptodate blocks that fall along the from/to boundaries. Here, if say
from = 1, to = PAGE_SIZE and the page is fully !uptodate, it looks like
we'd read the entire page worth of blocks (assuming contiguous 512b
blocks, for example). Intentional? Doesn't seem like a big deal, but
could be worth a followup fix.

Brian

> +			if (status)
> +				break;
> +		}
> +
> +	} while ((block_start += plen) < block_end);
> +
> +	return status;
>  }
>  
>  static int
> @@ -470,7 +690,7 @@ __iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
>  	if (unlikely(copied < len && !PageUptodate(page))) {
>  		copied = 0;
>  	} else {
> -		SetPageUptodate(page);
> +		iomap_set_range_uptodate(page, pos & (PAGE_SIZE - 1), len);
>  		iomap_set_page_dirty(page);
>  	}
>  	return __generic_write_end(inode, pos, copied, page);
> @@ -806,7 +1026,7 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
>  		block_commit_write(page, 0, length);
>  	} else {
>  		WARN_ON_ONCE(!PageUptodate(page));
> -		WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
> +		iomap_page_create(inode, page);
>  	}
>  
>  	return length;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index c2706cfb27c7..60b196c54dd6 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -2,6 +2,9 @@
>  #ifndef LINUX_IOMAP_H
>  #define LINUX_IOMAP_H 1
>  
> +#include <linux/atomic.h>
> +#include <linux/bitmap.h>
> +#include <linux/mm.h>
>  #include <linux/types.h>
>  
>  struct address_space;
> @@ -99,12 +102,40 @@ struct iomap_ops {
>  			ssize_t written, unsigned flags, struct iomap *iomap);
>  };
>  
> +/*
> + * Structure allocated for each page when block size < PAGE_SIZE to track
> + * sub-page uptodate status and I/O completions.
> + */
> +struct iomap_page {
> +	atomic_t		read_count;
> +	atomic_t		write_count;
> +	DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
> +};
> +
> +static inline struct iomap_page *to_iomap_page(struct page *page)
> +{
> +	if (page_has_private(page))
> +		return (struct iomap_page *)page_private(page);
> +	return NULL;
> +}
> +
>  ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
>  		const struct iomap_ops *ops);
>  int iomap_readpage(struct page *page, const struct iomap_ops *ops);
>  int iomap_readpages(struct address_space *mapping, struct list_head *pages,
>  		unsigned nr_pages, const struct iomap_ops *ops);
>  int iomap_set_page_dirty(struct page *page);
> +int iomap_is_partially_uptodate(struct page *page, unsigned long from,
> +		unsigned long count);
> +int iomap_releasepage(struct page *page, gfp_t gfp_mask);
> +void iomap_invalidatepage(struct page *page, unsigned int offset,
> +		unsigned int len);
> +#ifdef CONFIG_MIGRATION
> +int iomap_migrate_page(struct address_space *mapping, struct page *newpage,
> +		struct page *page, enum migrate_mode mode);
> +#else
> +#define iomap_migrate_page NULL
> +#endif
>  int iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
>  		const struct iomap_ops *ops);
>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/24] xfs: remove xfs_map_cow
  2018-06-19  5:35     ` Darrick J. Wong
@ 2018-06-19 16:53       ` Christoph Hellwig
  2018-06-20  0:37         ` Darrick J. Wong
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-19 16:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, Christoph Hellwig, linux-xfs, linux-fsdevel

On Mon, Jun 18, 2018 at 10:35:37PM -0700, Darrick J. Wong wrote:
> > A comment would be nice here just so the purpose of the logic is clear
> > at a glance. E.g:
> > 
> >                 /*
> >                  * COW fork blocks can overlap data fork blocks even if the
> >                  * blocks aren't shared. COW I/O always takes precedence, so we
> >                  * must always check for overlap on reflink inodes unless the
> >                  * mapping is already COW.
> >                  */
> > 
> > With something like that, this now looks pretty good to me:
> > 
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> I concur, it would help to have a comment to remind everyone that we
> sometimes upgrade to a cow write even when it's not strictly required.
> :)
> 
> With that comment added in,

Do you just want to throw it after the series?  Or should I or Brian
resend it on top of the series?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/24] xfs: remove xfs_map_cow
  2018-06-19 16:53       ` Christoph Hellwig
@ 2018-06-20  0:37         ` Darrick J. Wong
  0 siblings, 0 replies; 54+ messages in thread
From: Darrick J. Wong @ 2018-06-20  0:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, linux-xfs, linux-fsdevel

On Tue, Jun 19, 2018 at 06:53:03PM +0200, Christoph Hellwig wrote:
> On Mon, Jun 18, 2018 at 10:35:37PM -0700, Darrick J. Wong wrote:
> > > A comment would be nice here just so the purpose of the logic is clear
> > > at a glance. E.g.:
> > > 
> > >                 /*
> > >                  * COW fork blocks can overlap data fork blocks even if the
> > >                  * blocks aren't shared. COW I/O always takes precedence, so we
> > >                  * must always check for overlap on reflink inodes unless the
> > >                  * mapping is already COW.
> > >                  */
> > > 
> > > With something like that, this now looks pretty good to me:
> > > 
> > > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > 
> > I concur, it would help to have a comment to remind everyone that we
> > sometimes upgrade to a cow write even when it's not strictly required.
> > :)
> > 
> > With that comment added in,
> 
> Do you just want to throw it after the series?  Or should I or Brian
> resend it on top of the series?

I just stuck it in the patch and had it follow the code in the
subsequent reorganizations.  Testing is running now...

--D

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O without buffer heads
  2018-06-19 16:52   ` Brian Foster
@ 2018-06-20  7:56     ` Christoph Hellwig
  2018-06-20 14:32       ` Brian Foster
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-20  7:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel

On Tue, Jun 19, 2018 at 12:52:11PM -0400, Brian Foster wrote:
> > +	/*
> > +	 * Move the caller beyond our range so that it keeps making progress.
> > +	 * For that we have to include any leading non-uptodate ranges, but
> 
> Do you mean "leading uptodate ranges" here? E.g., pos is pushed forward
> past those ranges we don't have to read, so (pos - orig_pos) reflects
> the initial uptodate range while plen reflects the length we have to
> read..?

Yes.

> > +
> > +	do {
> 
> Kind of a nit, but this catches my eye and manages to confuse me every
> time I look at it. A comment along the lines of:
> 
>                 /*
> 		 * Pass in the block aligned start/end so we get back block
> 		 * aligned/adjusted poff/plen and can compare with unaligned
> 		 * from/to below.
>                  */
> 
> ... would be nice here, IMO.

Fine with me.

> > +		iomap_adjust_read_range(inode, iop, &block_start,
> > +				block_end - block_start, &poff, &plen);
> > +		if (plen == 0)
> > +			break;
> > +
> > +		if ((from > poff && from < poff + plen) ||
> > +		    (to > poff && to < poff + plen)) {
> > +			status = iomap_read_page_sync(inode, block_start, page,
> > +					poff, plen, from, to, iomap);
> 
> After taking another look at the buffer head path, it does look like we
> have slightly different behavior here. IIUC, the former reads only the
> !uptodate blocks that fall along the from/to boundaries. Here, if say
> from = 1, to = PAGE_SIZE and the page is fully !uptodate, it looks like
> we'd read the entire page worth of blocks (assuming contiguous 512b
> blocks, for example). Intentional? Doesn't seem like a big deal, but
> could be worth a followup fix.

It wasn't actually intentional, but I think it is the right thing
in the end, as it means we'll often do a single read instead of two
separate ones.
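
(To make that concrete with assumed numbers: on a fully !uptodate 4k page
of 512-byte blocks, a write with from = 1, to = PAGE_SIZE only partially
covers block 0, so the buffer head path reads just that one block; the
range check quoted above instead sees from fall inside the single
non-uptodate range [0, PAGE_SIZE) and issues one read for all eight
blocks.)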

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O without buffer heads
  2018-06-20  7:56     ` Christoph Hellwig
@ 2018-06-20 14:32       ` Brian Foster
  2018-06-20 16:08         ` Darrick J. Wong
  2018-06-21  7:53         ` Christoph Hellwig
  0 siblings, 2 replies; 54+ messages in thread
From: Brian Foster @ 2018-06-20 14:32 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

Sending again without the attachment... Christoph, let me know if it
didn't hit your mbox at least.

On Wed, Jun 20, 2018 at 09:56:55AM +0200, Christoph Hellwig wrote:
> On Tue, Jun 19, 2018 at 12:52:11PM -0400, Brian Foster wrote:
> > > +	/*
> > > +	 * Move the caller beyond our range so that it keeps making progress.
> > > +	 * For that we have to include any leading non-uptodate ranges, but
> > 
> > Do you mean "leading uptodate ranges" here? E.g., pos is pushed forward
> > past those ranges we don't have to read, so (pos - orig_pos) reflects
> > the initial uptodate range while plen reflects the length we have to
> > read..?
> 
> Yes.
> 
> > > +
> > > +	do {
> > 
> > Kind of a nit, but this catches my eye and manages to confuse me every
> > time I look at it. A comment along the lines of:
> > 
> >                 /*
> > 		 * Pass in the block aligned start/end so we get back block
> > 		 * aligned/adjusted poff/plen and can compare with unaligned
> > 		 * from/to below.
> >                  */
> > 
> > ... would be nice here, IMO.
> 
> Fine with me.
> 
> > > +		iomap_adjust_read_range(inode, iop, &block_start,
> > > +				block_end - block_start, &poff, &plen);
> > > +		if (plen == 0)
> > > +			break;
> > > +
> > > +		if ((from > poff && from < poff + plen) ||
> > > +		    (to > poff && to < poff + plen)) {
> > > +			status = iomap_read_page_sync(inode, block_start, page,
> > > +					poff, plen, from, to, iomap);
> > 
> > After taking another look at the buffer head path, it does look like we
> > have slightly different behavior here. IIUC, the former reads only the
> > !uptodate blocks that fall along the from/to boundaries. Here, if say
> > from = 1, to = PAGE_SIZE and the page is fully !uptodate, it looks like
> > we'd read the entire page worth of blocks (assuming contiguous 512b
> > blocks, for example). Intentional? Doesn't seem like a big deal, but
> > could be worth a followup fix.
> 
> It wasn't actually intentional, but I think it is the right thing
> in the end, as it means we'll often do a single read instead of two
> separate ones.

Ok, but if that's the argument, then shouldn't we also avoid doing two
separate I/Os if the middle range of a write happens to be already
uptodate? Or, for that matter, if the page happens to be sparsely
uptodate for whatever reason?

OTOH, I also do wonder a bit whether that may always be the right thing
if we consider cases like 64k page size arches and whatnot. It seems
like we could end up consuming more bandwidth for reads than we
typically have in the past. That said, unless there's a functional
reason to change this I think it's fine to optimize this path for these
kinds of corner cases in follow on patches.
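
(For instance, with assumed numbers on a 64k page / 4k block setup, an
overwrite of an uncached page that misses only the first byte would now
read the full 64k page rather than the single 4k boundary block the
buffer head path would have read.)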

Finally, this survived xfstests on a sub-page block size fs but I
managed to hit an fsx error:

Mapped Read: non-zero data past EOF (0x21a1f) page offset 0xc00 is
0xc769

It repeats 100% of the time for me using the attached fsxops file (with
--replay-ops) on XFS w/ -bsize=1k. It doesn't occur without the final
patch to enable sub-page block iomap on XFS.

Brian


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O without buffer heads
  2018-06-20 14:32       ` Brian Foster
@ 2018-06-20 16:08         ` Darrick J. Wong
  2018-06-20 18:12           ` Brian Foster
  2018-06-21  7:53         ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: Darrick J. Wong @ 2018-06-20 16:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel

On Wed, Jun 20, 2018 at 10:32:53AM -0400, Brian Foster wrote:
> Sending again without the attachment... Christoph, let me know if it
> didn't hit your mbox at least.
> 
> On Wed, Jun 20, 2018 at 09:56:55AM +0200, Christoph Hellwig wrote:
> > On Tue, Jun 19, 2018 at 12:52:11PM -0400, Brian Foster wrote:
> > > > +	/*
> > > > +	 * Move the caller beyond our range so that it keeps making progress.
> > > > +	 * For that we have to include any leading non-uptodate ranges, but
> > > 
> > > Do you mean "leading uptodate ranges" here? E.g., pos is pushed forward
> > > past those ranges we don't have to read, so (pos - orig_pos) reflects
> > > the initial uptodate range while plen reflects the length we have to
> > > read..?
> > 
> > Yes.
> > 
> > > > +
> > > > +	do {
> > > 
> > > Kind of a nit, but this catches my eye and manages to confuse me every
> > > time I look at it. A comment along the lines of:
> > > 
> > >                 /*
> > > 		 * Pass in the block aligned start/end so we get back block
> > > 		 * aligned/adjusted poff/plen and can compare with unaligned
> > > 		 * from/to below.
> > >                  */
> > > 
> > > ... would be nice here, IMO.
> > 
> > Fine with me.
> > 
> > > > +		iomap_adjust_read_range(inode, iop, &block_start,
> > > > +				block_end - block_start, &poff, &plen);
> > > > +		if (plen == 0)
> > > > +			break;
> > > > +
> > > > +		if ((from > poff && from < poff + plen) ||
> > > > +		    (to > poff && to < poff + plen)) {
> > > > +			status = iomap_read_page_sync(inode, block_start, page,
> > > > +					poff, plen, from, to, iomap);
> > > 
> > > After taking another look at the buffer head path, it does look like we
> > > have slightly different behavior here. IIUC, the former reads only the
> > > !uptodate blocks that fall along the from/to boundaries. Here, if say
> > > from = 1, to = PAGE_SIZE and the page is fully !uptodate, it looks like
> > > we'd read the entire page worth of blocks (assuming contiguous 512b
> > > blocks, for example). Intentional? Doesn't seem like a big deal, but
> > > could be worth a followup fix.
> > 
> > It wasn't actually intentional, but I think it is the right thing
> > in the end, as it means we'll often do a single read instead of two
> > separate ones.
> 
> Ok, but if that's the argument, then shouldn't we also avoid doing two
> separate I/Os if the middle range of a write happens to be already
> uptodate? Or, for that matter, if the page happens to be sparsely
> uptodate for whatever reason?
> 
> OTOH, I also do wonder a bit whether that may always be the right thing
> if we consider cases like 64k page size arches and whatnot. It seems
> like we could end up consuming more bandwidth for reads than we
> typically have in the past. That said, unless there's a functional
> reason to change this I think it's fine to optimize this path for these
> kinds of corner cases in follow on patches.
> 
> Finally, this survived xfstests on a sub-page block size fs but I
> managed to hit an fsx error:
> 
> Mapped Read: non-zero data past EOF (0x21a1f) page offset 0xc00 is
> 0xc769
> 
> It repeats 100% of the time for me using the attached fsxops file (with
> --replay-ops) on XFS w/ -bsize=1k. It doesn't occur without the final
> patch to enable sub-page block iomap on XFS.

Funny, because I saw the exact same complaint from generic/127 last
night on my development tree that doesn't include hch's patches and was
going to see if I could figure out what's going on.

FWIW it's been happening sporadically for a few weeks now but every time
I've tried to analyze it I (of course) couldn't get it to reproduce. :)

I also ran this series (all of it, including the subpagesize config)
last night, and aside from stumbling over an unrelated locking problem
it seemed fine....

--D

> Brian
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O without buffer heads
  2018-06-20 16:08         ` Darrick J. Wong
@ 2018-06-20 18:12           ` Brian Foster
  2018-06-20 19:02             ` Darrick J. Wong
  0 siblings, 1 reply; 54+ messages in thread
From: Brian Foster @ 2018-06-20 18:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel

On Wed, Jun 20, 2018 at 09:08:03AM -0700, Darrick J. Wong wrote:
> On Wed, Jun 20, 2018 at 10:32:53AM -0400, Brian Foster wrote:
> > Sending again without the attachment... Christoph, let me know if it
> > didn't hit your mbox at least.
> > 
> > On Wed, Jun 20, 2018 at 09:56:55AM +0200, Christoph Hellwig wrote:
> > > On Tue, Jun 19, 2018 at 12:52:11PM -0400, Brian Foster wrote:
> > > > > +	/*
> > > > > +	 * Move the caller beyond our range so that it keeps making progress.
> > > > > +	 * For that we have to include any leading non-uptodate ranges, but
> > > > 
> > > > Do you mean "leading uptodate ranges" here? E.g., pos is pushed forward
> > > > past those ranges we don't have to read, so (pos - orig_pos) reflects
> > > > the initial uptodate range while plen reflects the length we have to
> > > > read..?
> > > 
> > > Yes.
> > > 
> > > > > +
> > > > > +	do {
> > > > 
> > > > Kind of a nit, but this catches my eye and manages to confuse me every
> > > > time I look at it. A comment along the lines of:
> > > > 
> > > >                 /*
> > > > 		 * Pass in the block aligned start/end so we get back block
> > > > 		 * aligned/adjusted poff/plen and can compare with unaligned
> > > > 		 * from/to below.
> > > >                  */
> > > > 
> > > > ... would be nice here, IMO.
> > > 
> > > Fine with me.
> > > 
> > > > > +		iomap_adjust_read_range(inode, iop, &block_start,
> > > > > +				block_end - block_start, &poff, &plen);
> > > > > +		if (plen == 0)
> > > > > +			break;
> > > > > +
> > > > > +		if ((from > poff && from < poff + plen) ||
> > > > > +		    (to > poff && to < poff + plen)) {
> > > > > +			status = iomap_read_page_sync(inode, block_start, page,
> > > > > +					poff, plen, from, to, iomap);
> > > > 
> > > > After taking another look at the buffer head path, it does look like we
> > > > have slightly different behavior here. IIUC, the former reads only the
> > > > !uptodate blocks that fall along the from/to boundaries. Here, if say
> > > > from = 1, to = PAGE_SIZE and the page is fully !uptodate, it looks like
> > > > we'd read the entire page worth of blocks (assuming contiguous 512b
> > > > blocks, for example). Intentional? Doesn't seem like a big deal, but
> > > > could be worth a followup fix.
> > > 
> > > It wasn't actually intentional, but I think it is the right thing
> > > in the end, as it means we'll often do a single read instead of two
> > > separate ones.
> > 
> > Ok, but if that's the argument, then shouldn't we also avoid doing two
> > separate I/Os if the middle range of a write happens to be already
> > uptodate? Or, for that matter, if the page happens to be sparsely
> > uptodate for whatever reason?
> > 
> > OTOH, I also do wonder a bit whether that may always be the right thing
> > if we consider cases like 64k page size arches and whatnot. It seems
> > like we could end up consuming more bandwidth for reads than we
> > typically have in the past. That said, unless there's a functional
> > reason to change this I think it's fine to optimize this path for these
> > kinds of corner cases in follow on patches.
> > 
> > Finally, this survived xfstests on a sub-page block size fs but I
> > managed to hit an fsx error:
> > 
> > Mapped Read: non-zero data past EOF (0x21a1f) page offset 0xc00 is
> > 0xc769
> > 
> > It repeats 100% of the time for me using the attached fsxops file (with
> > --replay-ops) on XFS w/ -bsize=1k. It doesn't occur without the final
> > patch to enable sub-page block iomap on XFS.
> 
> Funny, because I saw the exact same complaint from generic/127 last
> night on my development tree that doesn't include hch's patches and was
> going to see if I could figure out what's going on.
> 
> FWIW it's been happening sporadically for a few weeks now but every time
> I've tried to analyze it I (of course) couldn't get it to reproduce. :)
> 
> I also ran this series (all of it, including the subpagesize config)
> last night, and aside from stumbling over an unrelated locking problem
> it seemed fine....
> 

That's interesting. Perhaps it's a pre-existing issue in that case and
the iomap stuff just changes the timing to make it reliably reproducible
on this particular system.

I only ran it a handful of times in both cases and now have lost access
to the server. Once I regain access, I'll try running for longer on
for-next to see if the same thing eventually triggers.

Brian

> --D
> 
> > Brian
> > 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O without buffer heads
  2018-06-20 18:12           ` Brian Foster
@ 2018-06-20 19:02             ` Darrick J. Wong
  2018-06-21  8:46               ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Darrick J. Wong @ 2018-06-20 19:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel

On Wed, Jun 20, 2018 at 02:12:59PM -0400, Brian Foster wrote:
> On Wed, Jun 20, 2018 at 09:08:03AM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 20, 2018 at 10:32:53AM -0400, Brian Foster wrote:
> > > Sending again without the attachment... Christoph, let me know if it
> > > didn't hit your mbox at least.
> > > 
> > > On Wed, Jun 20, 2018 at 09:56:55AM +0200, Christoph Hellwig wrote:
> > > > On Tue, Jun 19, 2018 at 12:52:11PM -0400, Brian Foster wrote:
> > > > > > +	/*
> > > > > > +	 * Move the caller beyond our range so that it keeps making progress.
> > > > > > +	 * For that we have to include any leading non-uptodate ranges, but
> > > > > 
> > > > > Do you mean "leading uptodate ranges" here? E.g., pos is pushed forward
> > > > > past those ranges we don't have to read, so (pos - orig_pos) reflects
> > > > > the initial uptodate range while plen reflects the length we have to
> > > > > read..?
> > > > 
> > > > Yes.
> > > > 
> > > > > > +
> > > > > > +	do {
> > > > > 
> > > > > Kind of a nit, but this catches my eye and manages to confuse me every
> > > > > time I look at it. A comment along the lines of:
> > > > > 
> > > > >                 /*
> > > > > 		 * Pass in the block aligned start/end so we get back block
> > > > > 		 * aligned/adjusted poff/plen and can compare with unaligned
> > > > > 		 * from/to below.
> > > > >                  */
> > > > > 
> > > > > ... would be nice here, IMO.
> > > > 
> > > > Fine with me.
> > > > 
> > > > > > +		iomap_adjust_read_range(inode, iop, &block_start,
> > > > > > +				block_end - block_start, &poff, &plen);
> > > > > > +		if (plen == 0)
> > > > > > +			break;
> > > > > > +
> > > > > > +		if ((from > poff && from < poff + plen) ||
> > > > > > +		    (to > poff && to < poff + plen)) {
> > > > > > +			status = iomap_read_page_sync(inode, block_start, page,
> > > > > > +					poff, plen, from, to, iomap);
> > > > > 
> > > > > After taking another look at the buffer head path, it does look like we
> > > > > have slightly different behavior here. IIUC, the former reads only the
> > > > > !uptodate blocks that fall along the from/to boundaries. Here, if say
> > > > > from = 1, to = PAGE_SIZE and the page is fully !uptodate, it looks like
> > > > > we'd read the entire page worth of blocks (assuming contiguous 512b
> > > > > blocks, for example). Intentional? Doesn't seem like a big deal, but
> > > > > could be worth a followup fix.
> > > > 
> > > > It wasn't actually intentional, but I think it is the right thing
> > > > in the end, as it means we'll often do a single read instead of two
> > > > separate ones.
> > > 
> > > Ok, but if that's the argument, then shouldn't we also avoid doing two
> > > separate I/Os if the middle range of a write happens to be already
> > > uptodate? Or, for that matter, if the page happens to be sparsely
> > > uptodate for whatever reason?
> > > 
> > > OTOH, I also do wonder a bit whether that may always be the right thing
> > > if we consider cases like 64k page size arches and whatnot. It seems
> > > like we could end up consuming more bandwidth for reads than we
> > > typically have in the past. That said, unless there's a functional
> > > reason to change this I think it's fine to optimize this path for these
> > > kinds of corner cases in follow on patches.
> > > 
> > > Finally, this survived xfstests on a sub-page block size fs but I
> > > managed to hit an fsx error:
> > > 
> > > Mapped Read: non-zero data past EOF (0x21a1f) page offset 0xc00 is
> > > 0xc769
> > > 
> > > It repeats 100% of the time for me using the attached fsxops file (with
> > > --replay-ops) on XFS w/ -bsize=1k. It doesn't occur without the final
> > > patch to enable sub-page block iomap on XFS.
> > 
> > Funny, because I saw the exact same complaint from generic/127 last
> > night on my development tree that doesn't include hch's patches and was
> > going to see if I could figure out what's going on.
> > 
> > FWIW it's been happening sporadically for a few weeks now but every time
> > I've tried to analyze it I (of course) couldn't get it to reproduce. :)
> > 
> > I also ran this series (all of it, including the subpagesize config)
> > last night, and aside from stumbling over an unrelated locking problem
> > it seemed fine....
> > 
> 
> That's interesting. Perhaps it's a pre-existing issue in that case and
> the iomap stuff just changes the timing to make it reliably reproducible
> on this particular system.
> 
> I only ran it a handful of times in both cases and now have lost access
> to the server. Once I regain access, I'll try running for longer on
> for-next to see if the same thing eventually triggers.

I managed to cut the testcase down to a nine-line fsx script and so
turned it into a fstests regression case.  It seems to reproduce 100% on
scsi disks and doesn't at all on pmem.

Note that changing the second to last line of the fsxops script to call
punch_hole instead of zero_range triggers it too.

I've also narrowed it down to something going wrong w.r.t. handling the
page cache somewhere under xfs_free_file_space.

(See attached diff...)

--D

generic: mread past eof shows nonzero contents

Certain sequences of generic/127 invocations complain about being able
to mread nonzero contents past eof.  Replicate that here as a regression
test.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 tests/generic/708     |   54 +++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/708.out |    2 ++
 tests/generic/group   |    1 +
 3 files changed, 57 insertions(+)
 create mode 100755 tests/generic/708
 create mode 100644 tests/generic/708.out

diff --git a/tests/generic/708 b/tests/generic/708
new file mode 100755
index 00000000..fa5584f5
--- /dev/null
+++ b/tests/generic/708
@@ -0,0 +1,54 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 708
+#
+# Test a specific sequence of fsx operations that causes an mmap read past
+# eof to return nonzero contents.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+cat >> $tmp.fsxops << ENDL
+fallocate 0x77e2 0x5f06 0x269a2 keep_size
+mapwrite 0x2e7fc 0x42ba 0x3f989
+write 0x67a9 0x714e 0x3f989
+write 0x39f96 0x185a 0x3f989
+collapse_range 0x36000 0x8000 0x3f989
+mapread 0x74c0 0x1bb3 0x3e2d0
+truncate 0x0 0x8aa2 0x3e2d0
+zero_range 0x1265 0x783d 0x8aa2
+mapread 0x7bd8 0xeca 0x8aa2
+ENDL
+
+victim=$SCRATCH_MNT/a
+touch $victim
+$here/ltp/fsx --replay-ops $tmp.fsxops $victim > $tmp.output || cat $tmp.output
+
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/generic/708.out b/tests/generic/708.out
new file mode 100644
index 00000000..33c478ad
--- /dev/null
+++ b/tests/generic/708.out
@@ -0,0 +1,2 @@
+QA output created by 708
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index 83a6fdab..1a1a0a6e 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -501,3 +501,4 @@
 496 auto quick swap
 497 auto quick swap collapse
 498 auto quick log
+708 auto quick rw collapse
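
For anyone wanting to replay this standalone, the usual fstests
invocation should work, assuming a configured SCRATCH_DEV/SCRATCH_MNT
and, per the reports above, a 1k block size:

	MKFS_OPTIONS='-b size=1k' ./check generic/708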

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O without buffer heads
  2018-06-20 14:32       ` Brian Foster
  2018-06-20 16:08         ` Darrick J. Wong
@ 2018-06-21  7:53         ` Christoph Hellwig
  1 sibling, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-21  7:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel

On Wed, Jun 20, 2018 at 10:32:53AM -0400, Brian Foster wrote:
> Sending again without the attachment... Christoph, let me know if it
> didn't hit your mbox at least.

I did get the previous one as well as this one.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O without buffer heads
  2018-06-20 19:02             ` Darrick J. Wong
@ 2018-06-21  8:46               ` Christoph Hellwig
  2018-06-23 13:06                 ` Brian Foster
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-21  8:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, Christoph Hellwig, linux-xfs, linux-fsdevel

On Wed, Jun 20, 2018 at 12:02:30PM -0700, Darrick J. Wong wrote:
> I managed to cut the testcase down to a nine-line fsx script and so
> turned it into a fstests regression case.  It seems to reproduce 100% on
> scsi disks and doesn't at all on pmem.
> 
> Note that changing the second to last line of the fsxops script to call
> punch_hole instead of zero_range triggers it too.
> 
> I've also narrowed it down to something going wrong w.r.t. handling the
> page cache somewhere under xfs_free_file_space.

So far this doesn't reproduce for me with or without the iomap code.
But I've only run it a few times in a VM on my laptop.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O without buffer heads
  2018-06-21  8:46               ` Christoph Hellwig
@ 2018-06-23 13:06                 ` Brian Foster
  2018-06-29 15:59                   ` Christoph Hellwig
  2018-07-02 12:50                   ` Christoph Hellwig
  0 siblings, 2 replies; 54+ messages in thread
From: Brian Foster @ 2018-06-23 13:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs, linux-fsdevel

On Thu, Jun 21, 2018 at 10:46:46AM +0200, Christoph Hellwig wrote:
> On Wed, Jun 20, 2018 at 12:02:30PM -0700, Darrick J. Wong wrote:
> > I managed to cut the testcase down to a nine-line fsx script and so
> > turned it into a fstests regression case.  It seems to reproduce 100% on
> > scsi disks and doesn't at all on pmem.
> > 
> > Note that changing the second to last line of the fsxops script to call
> > punch_hole instead of zero_range triggers it too.
> > 
> > I've also narrowed it down to something going wrong w.r.t. handling the
> > page cache somewhere under xfs_free_file_space.
> 
> So far this doesn't reproduce for me with or without the iomap code.
> But I've only run it a few times in a VM on my laptop.

After playing around a bit more, I observed that this became much harder
to reproduce after creating a new fs or using a largish fs. I think it's
hit or miss based on the content of free space in the filesystem. The
appended diff applied on top of Darrick's recent xfstests test[1] makes
this reproduce reliably for me (with -bsize=1k).

The problem, I think, is essentially that readpage doesn't zero out
real/allocated post-eof blocks on the current page. For example, once
this test runs at least once, the following command is all that is
required to observe the difference in behavior between the buffer head
path and the new iomap path:

  xfs_io -c "mmap 0x21000 0x1000" -c "mread -v 0x21c00 0x100" <scratch_mnt>/a

The former returns zeroes for the post-eof (i_size == 0x21a20) portion
of the page due to the (iblock < lblock) check in
block_read_full_page(). If the block is post eof, get_block() isn't
called, the bh is left unmapped and the portion of the page is
explicitly zeroed in the same loop. The iomap code appears to read and
expose whatever is on disk if the block is allocated. I'm guessing we at
least need to consider doing something similar in iomap_readpage_actor()
(or perhaps somewhere in that path that also covers the write path)..?
I'm not sure it's sufficient to rely on writeback to zero the page and
otherwise let it expose garbage until that point.
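
For reference, the buffer head behavior described above boils down to
this (condensed from block_read_full_page() in fs/buffer.c; error
handling and the uptodate/boundary details elided):

	iblock = (sector_t)page->index << (PAGE_SHIFT - bbits);
	lblock = (i_size_read(inode) + blocksize - 1) >> bbits;

	do {
		if (buffer_uptodate(bh))
			continue;
		if (!buffer_mapped(bh)) {
			/* only ask the fs to map blocks below EOF */
			if (iblock < lblock)
				get_block(inode, iblock, bh, 0);
			/* still unmapped (hole or post-EOF): zero, don't read */
			if (!buffer_mapped(bh)) {
				zero_user(page, i * blocksize, blocksize);
				set_buffer_uptodate(bh);
				continue;
			}
		}
		...
	} while (i++, iblock++, (bh = bh->b_this_page) != head);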

Brian

[1] https://marc.info/?l=linux-xfs&m=152960597620340&w=2

--- 8< ---

diff --git a/tests/generic/708 b/tests/generic/708
index d380053f..355cc6b1 100755
--- a/tests/generic/708
+++ b/tests/generic/708
@@ -30,19 +30,17 @@ _require_scratch
 
 rm -f $seqres.full
 
-_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mkfs_sized $((1024 * 1024 * 100)) >>$seqres.full 2>&1
 _scratch_mount
 
+$XFS_IO_PROG -fc "pwrite 0 100m" -c fsync $SCRATCH_MNT/file >>$seqres.full 2>&1
+rm -f $SCRATCH_MNT/file
+
 cat >> $tmp.fsxops << ENDL
-fallocate 0x77e2 0x5f06 0x269a2 keep_size
-mapwrite 0x2e7fc 0x42ba 0x3f989
-write 0x67a9 0x714e 0x3f989
-write 0x39f96 0x185a 0x3f989
-collapse_range 0x36000 0x8000 0x3f989
-mapread 0x74c0 0x1bb3 0x3e2d0
-truncate 0x0 0x8aa2 0x3e2d0
-zero_range 0x1265 0x783d 0x8aa2
-mapread 0x7bd8 0xeca 0x8aa2
+truncate 0x0 0x1f0d6 0x380e1
+write 0x1ad87 0x6c99 0x180d6
+zero_range 0x14426 0xd3aa 0x21a20 keep_size
+mapread 0x1f69a 0x2386 0x21a20
 ENDL
 
 victim=$SCRATCH_MNT/a

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH] iomap: Add inline data support to iomap_readpage_actor
  2018-06-15 13:01 ` [PATCH 01/24] iomap: add an iomap-based readpage and readpages implementation Christoph Hellwig
@ 2018-06-29 14:44   ` Andreas Gruenbacher
  2018-07-01  6:21     ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Andreas Gruenbacher @ 2018-06-29 14:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: cluster-devel, linux-ext4, linux-fsdevel, Andreas Gruenbacher

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 4f10c6b..d393bb0 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -155,6 +155,12 @@ struct iomap_readpage_ctx {
 	bool is_contig = false;
 	sector_t sector;
 
+	if (iomap->type == IOMAP_INLINE) {
+		iomap_read_inline_data(inode, page, iomap);
+		plen = PAGE_SIZE - poff;
+		goto done;
+	}
+
 	/* we don't support blocksize < PAGE_SIZE quite yet. */
 	WARN_ON_ONCE(pos != page_offset(page));
 	WARN_ON_ONCE(plen != PAGE_SIZE);
-- 
1.8.3.1
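
For context, the existing helper used above copies the inline data into
the page and zero-fills the remainder; in rough outline (a sketch,
details may differ from the tree):

	static void
	iomap_read_inline_data(struct inode *inode, struct page *page,
			struct iomap *iomap)
	{
		size_t size = i_size_read(inode);
		void *addr;

		if (PageUptodate(page))
			return;

		addr = kmap_atomic(page);
		memcpy(addr, iomap->inline_data, size);   /* inline payload */
		memset(addr + size, 0, PAGE_SIZE - size); /* zero the tail  */
		kunmap_atomic(addr);
		SetPageUptodate(page);
	}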

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O without buffer heads
  2018-06-23 13:06                 ` Brian Foster
@ 2018-06-29 15:59                   ` Christoph Hellwig
  2018-07-02 12:50                   ` Christoph Hellwig
  1 sibling, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2018-06-29 15:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel

On Sat, Jun 23, 2018 at 09:06:24AM -0400, Brian Foster wrote:
> to reproduce after creating a new fs or using a largish fs. I think it's
> hit or miss based on the content of free space in the filesystem. The
> appended diff applied on top of Darrick's recent xfstests test[1] makes
> this reproduce reliably for me (with -bsize=1k).

Yes, I can reproduce the issue with that patch.

> The former returns zeroes for the post-eof (i_size == 0x21a20) portion
> of the page due to the (iblock < lblock) check in
> block_read_full_page(). If the block is post eof, get_block() isn't
> called, the bh is left unmapped and the portion of the page is
> explicitly zeroed in the same loop. The iomap code appears to read and
> expose whatever is on disk if the block is allocated. I'm guessing we at
> least need to consider doing something similar in iomap_readpage_actor()
> (or perhaps somewhere in that path that also covers the write path)..?
> I'm not sure it's sufficient to rely on writeback to zero the page and
> otherwise let it expose garbage until that point.

This:

	if (iomap->type != IOMAP_MAPPED || pos >= i_size_read(inode)) {
		zero_user(page, poff, plen);
		iomap_set_range_uptodate(page, poff, plen);
		goto done;
	}

is supposed to do the equivalent zeroing.  I put a little tracing
code in and we get back a mapped extent from 0x21400:0x22000, which
makes sense.  It seems like we have an actual problem with the
zero_page_range code somehow.  I'll need some more time to investigate
that.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] iomap: Add inline data support to iomap_readpage_actor
  2018-06-29 14:44   ` [PATCH] iomap: Add inline data support to iomap_readpage_actor Andreas Gruenbacher
@ 2018-07-01  6:21     ` Christoph Hellwig
  2018-07-01 21:43       ` Andreas Gruenbacher
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-07-01  6:21 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, cluster-devel, linux-ext4, linux-fsdevel

Looks sensible, although I think it could use some more sanity checking
and a changelog.  Does the version below work for you?

---
>From e4603b7acde9b77e7a4aad6038c7ac42ca35772d Mon Sep 17 00:00:00 2001
From: Andreas Gruenbacher <agruenba@redhat.com>
Date: Fri, 29 Jun 2018 16:44:44 +0200
Subject: iomap: Add inline data support to iomap_readpage_actor

Just copy the inline data into the page using the existing helper.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 4d8ff0f5ecc9..966ffad9706a 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -155,6 +155,13 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	bool is_contig = false;
 	sector_t sector;
 
+	if (iomap->type == IOMAP_INLINE) {
+		WARN_ON_ONCE(poff);
+		WARN_ON_ONCE(plen != PAGE_SIZE);
+		iomap_read_inline_data(inode, page, iomap);
+		return PAGE_SIZE;
+	}
+
 	/* we don't support blocksize < PAGE_SIZE quite yet. */
 	WARN_ON_ONCE(pos != page_offset(page));
 	WARN_ON_ONCE(plen != PAGE_SIZE);
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH] iomap: Add inline data support to iomap_readpage_actor
  2018-07-01  6:21     ` Christoph Hellwig
@ 2018-07-01 21:43       ` Andreas Gruenbacher
  2018-07-02 12:52         ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Andreas Gruenbacher @ 2018-07-01 21:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: cluster-devel, linux-ext4, linux-fsdevel

On 1 July 2018 at 08:21, Christoph Hellwig <hch@lst.de> wrote:
> Looks sensible, although I think it could use some more sanity checking
> and a changelog.  Does the version below work for you?

Almost.

> ---
> From e4603b7acde9b77e7a4aad6038c7ac42ca35772d Mon Sep 17 00:00:00 2001
> From: Andreas Gruenbacher <agruenba@redhat.com>
> Date: Fri, 29 Jun 2018 16:44:44 +0200
> Subject: iomap: Add inline data support to iomap_readpage_actor
>
> Just copy the inline data into the page using the existing helper.
>
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 4d8ff0f5ecc9..966ffad9706a 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -155,6 +155,13 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>         bool is_contig = false;
>         sector_t sector;
>
> +       if (iomap->type == IOMAP_INLINE) {
> +               WARN_ON_ONCE(poff);

This is ok.

> +               WARN_ON_ONCE(plen != PAGE_SIZE);

Inline mappings only extend to the end of the file (iomap->offset == 0
&& iomap->length == inode->i_size), so length and plen will be
inode->i_size here. This assertion should just be removed.

> +               iomap_read_inline_data(inode, page, iomap);
> +               return PAGE_SIZE;
> +       }
> +
>         /* we don't support blocksize < PAGE_SIZE quite yet. */
>         WARN_ON_ONCE(pos != page_offset(page));
>         WARN_ON_ONCE(plen != PAGE_SIZE);
> --
> 2.18.0
>

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O without buffer heads
  2018-06-23 13:06                 ` Brian Foster
  2018-06-29 15:59                   ` Christoph Hellwig
@ 2018-07-02 12:50                   ` Christoph Hellwig
  2018-07-02 18:16                     ` Brian Foster
  1 sibling, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-07-02 12:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel

The problem is that we need to split extents at the EOF block so that
the existing zeroing actually takes effect: an extent that starts below
EOF but extends past it is processed with pos still inside i_size, so
the blocks entirely beyond EOF get read from disk instead of zeroed.
The patch below fixes the test case for me:

diff --git a/fs/iomap.c b/fs/iomap.c
index e8f1bcdc95cf..9c88b8736de0 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -143,13 +143,20 @@ static void
 iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
 {
+	unsigned block_bits = inode->i_blkbits;
+	unsigned block_size = (1 << block_bits);
 	unsigned poff = *pos & (PAGE_SIZE - 1);
 	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+	unsigned first = poff >> block_bits;
+	unsigned last = (poff + plen - 1) >> block_bits;
+	unsigned end = (i_size_read(inode) & (PAGE_SIZE - 1)) >> block_bits;
 
+	/*
+	 * If the block size is smaller than the page size we need to check the
+	 * per-block uptodate status and adjust the offset and length if needed
+	 * to avoid reading in already uptodate ranges.
+	 */
 	if (iop) {
-		unsigned block_size = i_blocksize(inode);
-		unsigned first = poff >> inode->i_blkbits;
-		unsigned last = (poff + plen - 1) >> inode->i_blkbits;
 		unsigned int i;
 
 		/* move forward for each leading block marked uptodate */
@@ -159,17 +166,27 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 			*pos += block_size;
 			poff += block_size;
 			plen -= block_size;
+			first++;
 		}
 
 		/* truncate len if we find any trailing uptodate block(s) */
 		for ( ; i <= last; i++) {
 			if (test_bit(i, iop->uptodate)) {
 				plen -= (last - i + 1) * block_size;
+				last = i - 1;
 				break;
 			}
 		}
 	}
 
+	/*
+	 * If the extent spans the block that contains the i_size we need to
+	 * handle both halves separately so that we properly zero data in the
+	 * page cache for blocks that are entirely outside of i_size.
+	 */
+	if (first <= end && last > end)
+		plen -= (last - end) * block_size;
+
 	*offp = poff;
 	*lenp = plen;
 }
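
Walking the earlier trace through the new code (1k blocks, 4k pages,
i_size == 0x21a20, mapped extent 0x21400:0x22000): for the chunk at pos
== 0x21400 we get poff == 0x400, plen == 0xc00, first == 1, last == 3,
and end == (0x21a20 & 0xfff) >> 10 == 2, so plen shrinks to 0x800 and
the read stops at the EOF block.  The next iteration then starts at pos
== 0x21c00 >= i_size and takes the existing zeroing branch, so the
mapped read at page offset 0xc00 now sees zeroes instead of stale disk
contents.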

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH] iomap: Add inline data support to iomap_readpage_actor
  2018-07-01 21:43       ` Andreas Gruenbacher
@ 2018-07-02 12:52         ` Christoph Hellwig
  2018-07-02 15:05           ` Andreas Gruenbacher
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2018-07-02 12:52 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, cluster-devel, linux-ext4, linux-fsdevel

On Sun, Jul 01, 2018 at 11:43:43PM +0200, Andreas Gruenbacher wrote:
> > +               WARN_ON_ONCE(plen != PAGE_SIZE);
> 
> Inline mappings only extend to the end of the file (iomap->offset == 0
> && iomap->length == inode->i_size), so length and plen will be
> inode->i_size here. This assertion should just be removed.

But we should generally still return PAGE_SIZE from iomap_begin
because we cover the whole page and zero the rest, don't we?

Either way, I can remove the assert for now.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] iomap: Add inline data support to iomap_readpage_actor
  2018-07-02 12:52         ` Christoph Hellwig
@ 2018-07-02 15:05           ` Andreas Gruenbacher
  0 siblings, 0 replies; 54+ messages in thread
From: Andreas Gruenbacher @ 2018-07-02 15:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: cluster-devel, linux-ext4, linux-fsdevel

On 2 July 2018 at 14:52, Christoph Hellwig <hch@lst.de> wrote:
> On Sun, Jul 01, 2018 at 11:43:43PM +0200, Andreas Gruenbacher wrote:
>> > +               WARN_ON_ONCE(plen != PAGE_SIZE);
>>
>> Inline mappings only extend to the end of the file (iomap->offset == 0
>> && iomap->length == inode->i_size), so length and plen will be
>> inode->i_size here. This assertion should just be removed.
>
> But we should generally still return PAGE_SIZE from iomap_begin
> because we cover the whole page and zero the rest, don't we?

iomap_begin doesn't know when it's being called for readpage. Direct
I/O reads don't pad to the page size, for example, so handling the
padding in the caller seems reasonable. If you compare with regular
block-based mappings, those are also block aligned but not necessarily
page aligned.

For stuffed files, iomap_begin will return an IOMAP_HOLE mapping for
offsets beyond the end of the file, so iterating beyond the end of the
file is fine (even if readpage doesn't do that).

> Either way, I can remove the assert for now.

Yes, thanks.

Andreas

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O without buffer heads
  2018-07-02 12:50                   ` Christoph Hellwig
@ 2018-07-02 18:16                     ` Brian Foster
  0 siblings, 0 replies; 54+ messages in thread
From: Brian Foster @ 2018-07-02 18:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs, linux-fsdevel

On Mon, Jul 02, 2018 at 02:50:36PM +0200, Christoph Hellwig wrote:
> The problem is that we need to split extents at the eof block
> so that the existing zeroing actually takes effect.  The patch below
> fixes the test case for me:
> 

Looks sane at a glance. I'll take a closer look at v7 and run some more
testing..

Brian

> diff --git a/fs/iomap.c b/fs/iomap.c
> index e8f1bcdc95cf..9c88b8736de0 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -143,13 +143,20 @@ static void
>  iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
>  		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
>  {
> +	unsigned block_bits = inode->i_blkbits;
> +	unsigned block_size = (1 << block_bits);
>  	unsigned poff = *pos & (PAGE_SIZE - 1);
>  	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
> +	unsigned first = poff >> block_bits;
> +	unsigned last = (poff + plen - 1) >> block_bits;
> +	unsigned end = (i_size_read(inode) & (PAGE_SIZE - 1)) >> block_bits;
>  
> +	/*
> +	 * If the block size is smaller than the page size we need to check the
> +	 * per-block uptodate status and adjust the offset and length if needed
> +	 * to avoid reading in already uptodate ranges.
> +	 */
>  	if (iop) {
> -		unsigned block_size = i_blocksize(inode);
> -		unsigned first = poff >> inode->i_blkbits;
> -		unsigned last = (poff + plen - 1) >> inode->i_blkbits;
>  		unsigned int i;
>  
>  		/* move forward for each leading block marked uptodate */
> @@ -159,17 +166,27 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
>  			*pos += block_size;
>  			poff += block_size;
>  			plen -= block_size;
> +			first++;
>  		}
>  
>  		/* truncate len if we find any trailing uptodate block(s) */
>  		for ( ; i <= last; i++) {
>  			if (test_bit(i, iop->uptodate)) {
>  				plen -= (last - i + 1) * block_size;
> +				last = i - 1;
>  				break;
>  			}
>  		}
>  	}
>  
> +	/*
> +	 * If the extent spans the block that contains the i_size we need to
> +	 * handle both halves separately so that we properly zero data in the
> +	 * page cache for blocks that are entirely outside of i_size.
> +	 */
> +	if (first <= end && last > end)
> +		plen -= (last - end) * block_size;
> +
>  	*offp = poff;
>  	*lenp = plen;
>  }

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread

Thread overview: 54+ messages
2018-06-15 13:01 stop using buffer heads in xfs v6 Christoph Hellwig
2018-06-15 13:01 ` [PATCH 01/24] iomap: add an iomap-based readpage and readpages implementation Christoph Hellwig
2018-06-29 14:44   ` [PATCH] iomap: Add inline data support to iomap_readpage_actor Andreas Gruenbacher
2018-07-01  6:21     ` Christoph Hellwig
2018-07-01 21:43       ` Andreas Gruenbacher
2018-07-02 12:52         ` Christoph Hellwig
2018-07-02 15:05           ` Andreas Gruenbacher
2018-06-15 13:01 ` [PATCH 02/24] xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages Christoph Hellwig
2018-06-15 13:01 ` [PATCH 03/24] iomap: add initial support for writes without buffer heads Christoph Hellwig
2018-06-15 13:01 ` [PATCH 04/24] xfs: simplify xfs_bmap_punch_delalloc_range Christoph Hellwig
2018-06-15 13:01 ` [PATCH 05/24] xfs: simplify xfs_aops_discard_page Christoph Hellwig
2018-06-15 13:01 ` [PATCH 06/24] xfs: move locking into xfs_bmap_punch_delalloc_range Christoph Hellwig
2018-06-19  5:26   ` Darrick J. Wong
2018-06-15 13:01 ` [PATCH 07/24] xfs: do not set the page uptodate in xfs_writepage_map Christoph Hellwig
2018-06-15 13:01 ` [PATCH 08/24] xfs: don't clear imap_valid for a non-uptodate buffers Christoph Hellwig
2018-06-15 13:01 ` [PATCH 09/24] xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocks Christoph Hellwig
2018-06-19  5:27   ` Darrick J. Wong
2018-06-15 13:01 ` [PATCH 10/24] xfs: remove xfs_reflink_trim_irec_to_next_cow Christoph Hellwig
2018-06-19  5:30   ` Darrick J. Wong
2018-06-15 13:01 ` [PATCH 11/24] xfs: remove xfs_map_cow Christoph Hellwig
2018-06-18 17:38   ` Brian Foster
2018-06-19  5:35     ` Darrick J. Wong
2018-06-19 16:53       ` Christoph Hellwig
2018-06-20  0:37         ` Darrick J. Wong
2018-06-15 13:01 ` [PATCH 12/24] xfs: rename the offset variable in xfs_writepage_map Christoph Hellwig
2018-06-19  5:37   ` Darrick J. Wong
2018-06-15 13:01 ` [PATCH 13/24] xfs: make xfs_writepage_map extent map centric Christoph Hellwig
2018-06-18 17:38   ` Brian Foster
2018-06-19  5:43   ` Darrick J. Wong
2018-06-19 16:52     ` Christoph Hellwig
2018-06-15 13:01 ` [PATCH 14/24] xfs: remove the now unused XFS_BMAPI_IGSTATE flag Christoph Hellwig
2018-06-15 13:02 ` [PATCH 15/24] xfs: remove xfs_reflink_find_cow_mapping Christoph Hellwig
2018-06-15 13:02 ` [PATCH 16/24] xfs: simplify xfs_map_blocks by using xfs_iext_lookup_extent directly Christoph Hellwig
2018-06-15 13:02 ` [PATCH 17/24] xfs: remove the imap_valid flag Christoph Hellwig
2018-06-15 13:02 ` [PATCH 18/24] xfs: don't look at buffer heads in xfs_add_to_ioend Christoph Hellwig
2018-06-15 13:02 ` [PATCH 19/24] xfs: move all writeback buffer_head manipulation into xfs_map_at_offset Christoph Hellwig
2018-06-15 13:02 ` [PATCH 20/24] xfs: remove xfs_start_page_writeback Christoph Hellwig
2018-06-15 13:02 ` [PATCH 21/24] xfs: refactor the tail of xfs_writepage_map Christoph Hellwig
2018-06-15 13:02 ` [PATCH 22/24] xfs: allow writeback on pages without buffer heads Christoph Hellwig
2018-06-15 13:02 ` [PATCH 23/24] iomap: add support for sub-pagesize buffered I/O " Christoph Hellwig
2018-06-19 16:52   ` Brian Foster
2018-06-20  7:56     ` Christoph Hellwig
2018-06-20 14:32       ` Brian Foster
2018-06-20 16:08         ` Darrick J. Wong
2018-06-20 18:12           ` Brian Foster
2018-06-20 19:02             ` Darrick J. Wong
2018-06-21  8:46               ` Christoph Hellwig
2018-06-23 13:06                 ` Brian Foster
2018-06-29 15:59                   ` Christoph Hellwig
2018-07-02 12:50                   ` Christoph Hellwig
2018-07-02 18:16                     ` Brian Foster
2018-06-21  7:53         ` Christoph Hellwig
2018-06-15 13:02 ` [PATCH 24/24] xfs: add support for sub-pagesize writeback without buffer_heads Christoph Hellwig
2018-06-19  6:15   ` Darrick J. Wong
