* iomap infrastructure and multipage writes V5
From: Christoph Hellwig @ 2016-06-01 14:44 UTC
  To: xfs; +Cc: rpeterso, linux-fsdevel

This series adds a new file system I/O path that uses the iomap structure
introduced for pNFS support, and adds support for multi-page buffered writes.

This was first started by Dave Chinner a long time ago; I then beat
it into shape for production runs in a very constrained ARM NAS
environment for Tuxera almost as long ago, and now, half a dozen
rewrites later, it's back.

The basic idea is to avoid the per-block get_blocks overhead
and make use of extents in the buffered write path by iterating over
them instead.
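
A rough sketch of the difference (simplified pseudo-C, not taken
verbatim from any patch in this series): the old path asks the
filesystem for a mapping once per block, while the iomap path asks
once per extent and then operates on all pages the extent covers:

	/* old model: one filesystem call per block */
	for (each block in the range being written)
		get_block(inode, block, bh, 1 /* create */);

	/* iomap model: one mapping call per extent */
	while (length > 0) {
		ops->iomap_begin(inode, pos, length, flags, &iomap);
		/* copy into all page cache pages the mapping covers */
		written = actor(inode, pos, length, data, &iomap);
		ops->iomap_end(inode, pos, length, written, flags, &iomap);
		pos += written;
		length -= written;
	}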

Note that patch 1 conflicts with Vishal's DAX error handling series.
It would be great to have a stable branch with it so that both the
XFS and nvdimm trees could pull it in before the other changes in this
area.


Changes since V4:
 - rebase to Linux 4.7-rc1
 - fixed an incorrect BUG_ON statement

Changes since V3:
 - fix DAX based zeroing
 - Reviews and trivial fixes from Bob

Changes since V2:
 - fix the range for delalloc punches after failed writes
 - updated some changelogs

Changes since V1:
 - add support for fiemap
 - fix a test failure with 1k block sizes
 - prepare for 64-bit lengths; this will be used in a follow-on patchset


* [PATCH 01/14] fs: move struct iomap from exportfs.h to a separate header
From: Christoph Hellwig @ 2016-06-01 14:44 UTC
  To: xfs; +Cc: rpeterso, linux-fsdevel

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/nfsd/blocklayout.c    |  1 +
 fs/nfsd/blocklayoutxdr.c |  1 +
 fs/xfs/xfs_pnfs.c        |  1 +
 include/linux/exportfs.h | 16 +---------------
 include/linux/iomap.h    | 21 +++++++++++++++++++++
 5 files changed, 25 insertions(+), 15 deletions(-)
 create mode 100644 include/linux/iomap.h

diff --git a/fs/nfsd/blocklayout.c b/fs/nfsd/blocklayout.c
index e55b524..4df16ae 100644
--- a/fs/nfsd/blocklayout.c
+++ b/fs/nfsd/blocklayout.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2014-2016 Christoph Hellwig.
  */
 #include <linux/exportfs.h>
+#include <linux/iomap.h>
 #include <linux/genhd.h>
 #include <linux/slab.h>
 #include <linux/pr.h>
diff --git a/fs/nfsd/blocklayoutxdr.c b/fs/nfsd/blocklayoutxdr.c
index 6c3b316..4ebaaf4 100644
--- a/fs/nfsd/blocklayoutxdr.c
+++ b/fs/nfsd/blocklayoutxdr.c
@@ -3,6 +3,7 @@
  */
 #include <linux/sunrpc/svc.h>
 #include <linux/exportfs.h>
+#include <linux/iomap.h>
 #include <linux/nfs4.h>
 
 #include "nfsd.h"
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index d5b7566..db3c7df 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2014 Christoph Hellwig.
  */
+#include <linux/iomap.h>
 #include "xfs.h"
 #include "xfs_format.h"
 #include "xfs_log_format.h"
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index d841450..b03c062 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -6,6 +6,7 @@
 struct dentry;
 struct iattr;
 struct inode;
+struct iomap;
 struct super_block;
 struct vfsmount;
 
@@ -187,21 +188,6 @@ struct fid {
  *    get_name is not (which is possibly inconsistent)
  */
 
-/* types of block ranges for multipage write mappings. */
-#define IOMAP_HOLE	0x01	/* no blocks allocated, need allocation */
-#define IOMAP_DELALLOC	0x02	/* delayed allocation blocks */
-#define IOMAP_MAPPED	0x03	/* blocks allocated @blkno */
-#define IOMAP_UNWRITTEN	0x04	/* blocks allocated @blkno in unwritten state */
-
-#define IOMAP_NULL_BLOCK -1LL	/* blkno is not valid */
-
-struct iomap {
-	sector_t	blkno;	/* first sector of mapping */
-	loff_t		offset;	/* file offset of mapping, bytes */
-	u64		length;	/* length of mapping, bytes */
-	int		type;	/* type of mapping */
-};
-
 struct export_operations {
 	int (*encode_fh)(struct inode *inode, __u32 *fh, int *max_len,
 			struct inode *parent);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
new file mode 100644
index 0000000..1b22197
--- /dev/null
+++ b/include/linux/iomap.h
@@ -0,0 +1,21 @@
+#ifndef LINUX_IOMAP_H
+#define LINUX_IOMAP_H 1
+
+#include <linux/types.h>
+
+/* types of block ranges for multipage write mappings. */
+#define IOMAP_HOLE	0x01	/* no blocks allocated, need allocation */
+#define IOMAP_DELALLOC	0x02	/* delayed allocation blocks */
+#define IOMAP_MAPPED	0x03	/* blocks allocated @blkno */
+#define IOMAP_UNWRITTEN	0x04	/* blocks allocated @blkno in unwritten state */
+
+#define IOMAP_NULL_BLOCK -1LL	/* blkno is not valid */
+
+struct iomap {
+	sector_t	blkno;	/* first sector of mapping */
+	loff_t		offset;	/* file offset of mapping, bytes */
+	u64		length;	/* length of mapping, bytes */
+	int		type;	/* type of mapping */
+};
+
+#endif /* LINUX_IOMAP_H */
-- 
2.1.4


* [PATCH 02/14] fs: introduce iomap infrastructure
From: Christoph Hellwig @ 2016-06-01 14:44 UTC
  To: xfs; +Cc: rpeterso, linux-fsdevel

Add infrastructure for multipage buffered writes.  This is implemented
using a main iterator that applies an actor function to a range that
can be written.

This infrastructure is used to implement a buffered write helper, one
to zero file ranges and one to implement the ->page_mkwrite VM
operation.  All of them borrow a fair amount of code from fs/buffer.c
for now, by using an internal version of __block_write_begin that
gets passed an iomap and builds the corresponding buffer head.

The file system gets a set of paired ->iomap_begin and ->iomap_end
calls which allow it to map/reserve a range and get a notification
once the write code is finished with it.
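
As a rough illustration of the calling convention (a minimal sketch
for a hypothetical "foofs"; the foofs_* names are invented here, only
struct iomap_ops and iomap_file_buffered_write come from this patch):

	static int
	foofs_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
			unsigned flags, struct iomap *iomap)
	{
		/*
		 * Look up or reserve a single extent covering pos, then
		 * fill in iomap->type, ->blkno, ->offset, ->length and
		 * ->bdev.  Any resources needed for the write are taken
		 * here and held until ->iomap_end.
		 */
		return 0;
	}

	static int
	foofs_iomap_end(struct inode *inode, loff_t pos, loff_t length,
			ssize_t written, unsigned flags, struct iomap *iomap)
	{
		/*
		 * Release the resources taken in ->iomap_begin; any
		 * reservation beyond the written bytes gets unreserved.
		 */
		return 0;
	}

	static struct iomap_ops foofs_iomap_ops = {
		.iomap_begin	= foofs_iomap_begin,
		.iomap_end	= foofs_iomap_end,
	};

	/* and in the filesystem's ->write_iter implementation: */
	ret = iomap_file_buffered_write(iocb, from, &foofs_iomap_ops);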

Based on earlier code from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/Kconfig            |   3 +
 fs/Makefile           |   1 +
 fs/buffer.c           |  76 +++++++++-
 fs/internal.h         |   3 +
 fs/iomap.c            | 394 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/iomap.h |  56 ++++++-
 6 files changed, 523 insertions(+), 10 deletions(-)
 create mode 100644 fs/iomap.c

diff --git a/fs/Kconfig b/fs/Kconfig
index b8fcb41..4524916 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -10,6 +10,9 @@ config DCACHE_WORD_ACCESS
 
 if BLOCK
 
+config FS_IOMAP
+	bool
+
 source "fs/ext2/Kconfig"
 source "fs/ext4/Kconfig"
 source "fs/jbd2/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 85b6e13..ed2b632 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -49,6 +49,7 @@ obj-$(CONFIG_COREDUMP)		+= coredump.o
 obj-$(CONFIG_SYSCTL)		+= drop_caches.o
 
 obj-$(CONFIG_FHANDLE)		+= fhandle.o
+obj-$(CONFIG_FS_IOMAP)		+= iomap.o
 
 obj-y				+= quota/
 
diff --git a/fs/buffer.c b/fs/buffer.c
index 754813a..228288a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -21,6 +21,7 @@
 #include <linux/kernel.h>
 #include <linux/syscalls.h>
 #include <linux/fs.h>
+#include <linux/iomap.h>
 #include <linux/mm.h>
 #include <linux/percpu.h>
 #include <linux/slab.h>
@@ -1891,8 +1892,62 @@ void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
 }
 EXPORT_SYMBOL(page_zero_new_buffers);
 
-int __block_write_begin(struct page *page, loff_t pos, unsigned len,
-		get_block_t *get_block)
+static void
+iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
+		struct iomap *iomap)
+{
+	loff_t offset = block << inode->i_blkbits;
+
+	bh->b_bdev = iomap->bdev;
+
+	/*
+	 * Block points to offset in file we need to map, iomap contains
+	 * the offset at which the map starts. If the map ends before the
+	 * current block, then do not map the buffer and let the caller
+	 * handle it.
+	 */
+	BUG_ON(offset >= iomap->offset + iomap->length);
+
+	switch (iomap->type) {
+	case IOMAP_HOLE:
+		/*
+		 * If the buffer is not up to date or beyond the current EOF,
+		 * we need to mark it as new to ensure sub-block zeroing is
+		 * executed if necessary.
+		 */
+		if (!buffer_uptodate(bh) ||
+		    (offset >= i_size_read(inode)))
+			set_buffer_new(bh);
+		break;
+	case IOMAP_DELALLOC:
+		if (!buffer_uptodate(bh) ||
+		    (offset >= i_size_read(inode)))
+			set_buffer_new(bh);
+		set_buffer_uptodate(bh);
+		set_buffer_mapped(bh);
+		set_buffer_delay(bh);
+		break;
+	case IOMAP_UNWRITTEN:
+		/*
+		 * For unwritten regions we always need to ensure that the
+		 * regions of the block that we are not writing to are zeroed
+		 * by sub-block writes. Set the buffer as new to ensure this.
+		 */
+		set_buffer_new(bh);
+		set_buffer_unwritten(bh);
+		/* FALLTHRU */
+	case IOMAP_MAPPED:
+		if (offset >= i_size_read(inode))
+			set_buffer_new(bh);
+		bh->b_blocknr = (iomap->blkno >> (inode->i_blkbits - 9)) +
+				((offset - iomap->offset) >> inode->i_blkbits);
+		set_buffer_mapped(bh);
+		break;
+	}
+}
+
+int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
+		get_block_t *get_block, struct iomap *iomap)
 {
 	unsigned from = pos & (PAGE_SIZE - 1);
 	unsigned to = from + len;
@@ -1928,9 +1983,14 @@ int __block_write_begin(struct page *page, loff_t pos, unsigned len,
 			clear_buffer_new(bh);
 		if (!buffer_mapped(bh)) {
 			WARN_ON(bh->b_size != blocksize);
-			err = get_block(inode, block, bh, 1);
-			if (err)
-				break;
+			if (get_block) {
+				err = get_block(inode, block, bh, 1);
+				if (err)
+					break;
+			} else {
+				iomap_to_bh(inode, block, bh, iomap);
+			}
+
 			if (buffer_new(bh)) {
 				unmap_underlying_metadata(bh->b_bdev,
 							bh->b_blocknr);
@@ -1971,6 +2031,12 @@ int __block_write_begin(struct page *page, loff_t pos, unsigned len,
 		page_zero_new_buffers(page, from, to);
 	return err;
 }
+
+int __block_write_begin(struct page *page, loff_t pos, unsigned len,
+		get_block_t *get_block)
+{
+	return __block_write_begin_int(page, pos, len, get_block, NULL);
+}
 EXPORT_SYMBOL(__block_write_begin);
 
 static int __block_commit_write(struct inode *inode, struct page *page,
diff --git a/fs/internal.h b/fs/internal.h
index b71deee..c0c6f49 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -11,6 +11,7 @@
 
 struct super_block;
 struct file_system_type;
+struct iomap;
 struct linux_binprm;
 struct path;
 struct mount;
@@ -39,6 +40,8 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
  * buffer.c
  */
 extern void guard_bio_eod(int rw, struct bio *bio);
+extern int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
+		get_block_t *get_block, struct iomap *iomap);
 
 /*
  * char_dev.c
diff --git a/fs/iomap.c b/fs/iomap.c
new file mode 100644
index 0000000..5cc6d48
--- /dev/null
+++ b/fs/iomap.c
@@ -0,0 +1,394 @@
+/*
+ * Copyright (C) 2010 Red Hat, Inc.
+ * Copyright (c) 2016 Christoph Hellwig.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/module.h>
+#include <linux/compiler.h>
+#include <linux/fs.h>
+#include <linux/iomap.h>
+#include <linux/uaccess.h>
+#include <linux/gfp.h>
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/uio.h>
+#include <linux/backing-dev.h>
+#include <linux/buffer_head.h>
+#include "internal.h"
+
+typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
+		void *data, struct iomap *iomap);
+
+/*
+ * Execute an iomap write on a segment of the mapping that spans a
+ * contiguous range of pages that have identical block mapping state.
+ *
+ * This avoids the need to map pages individually, do individual allocations
+ * for each page and most importantly avoid the need for filesystem specific
+ * locking per page. Instead, all the operations are amortised over the entire
+ * range of pages. It is assumed that the filesystems will lock whatever
+ * resources they require in the iomap_begin call, and release them in the
+ * iomap_end call.
+ */
+static loff_t
+iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
+		struct iomap_ops *ops, void *data, iomap_actor_t actor)
+{
+	struct iomap iomap = { 0 };
+	loff_t written = 0, ret;
+
+	/*
+	 * Need to map a range from start position for count bytes. This can
+	 * span multiple pages - it is only guaranteed to return a range of a
+	 * single type of pages (e.g. all into a hole, all mapped or all
+	 * unwritten). Failure at this point has nothing to undo.
+	 *
+	 * If allocation is required for this range, reserve the space now so
+	 * that the allocation is guaranteed to succeed later on. Once we copy
+	 * the data into the page cache pages, then we cannot fail otherwise we
+	 * expose transient stale data. If the reserve fails, we can safely
+	 * back out at this point as there is nothing to undo.
+	 */
+	ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
+	if (ret)
+		return ret;
+	if (WARN_ON(iomap.offset > pos))
+		return -EIO;
+
+	/*
+	 * Cut down the length to the one actually provided by the filesystem,
+	 * as it might not be able to give us the whole size that we requested.
+	 */
+	if (iomap.offset + iomap.length < pos + length)
+		length = iomap.offset + iomap.length - pos;
+
+	/*
+	 * Now that we have guaranteed that the space allocation will succeed,
+	 * we can do the copy-in page by page without having to worry about
+	 * failures exposing transient data.
+	 */
+	written = actor(inode, pos, length, data, &iomap);
+
+	/*
+	 * Now the data has been copied, commit the range we've copied.  This
+	 * should not fail unless the filesystem has had a fatal error.
+	 */
+	ret = ops->iomap_end(inode, pos, length, written > 0 ? written : 0,
+			flags, &iomap);
+
+	return written ? written : ret;
+}
+
+static void
+iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
+{
+	loff_t i_size = i_size_read(inode);
+
+	/*
+	 * Only truncate newly allocated pages beyond EOF, even if the
+	 * write started inside the existing inode size.
+	 */
+	if (pos + len > i_size)
+		truncate_pagecache_range(inode, max(pos, i_size), pos + len);
+}
+
+static int
+iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, struct iomap *iomap)
+{
+	pgoff_t index = pos >> PAGE_SHIFT;
+	struct page *page;
+	int status = 0;
+
+	BUG_ON(pos + len > iomap->offset + iomap->length);
+
+	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
+	if (!page)
+		return -ENOMEM;
+
+	status = __block_write_begin_int(page, pos, len, NULL, iomap);
+	if (unlikely(status)) {
+		unlock_page(page);
+		put_page(page);
+		page = NULL;
+
+		iomap_write_failed(inode, pos, len);
+	}
+
+	*pagep = page;
+	return status;
+}
+
+static int
+iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
+		unsigned copied, struct page *page)
+{
+	int ret;
+
+	ret = generic_write_end(NULL, inode->i_mapping, pos, len,
+			copied, page, NULL);
+	if (ret < len)
+		iomap_write_failed(inode, pos, len);
+	return ret;
+}
+
+static loff_t
+iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+		struct iomap *iomap)
+{
+	struct iov_iter *i = data;
+	long status = 0;
+	ssize_t written = 0;
+	unsigned int flags = AOP_FLAG_NOFS;
+
+	/*
+	 * Copies from kernel address space cannot fail (NFSD is a big user).
+	 */
+	if (!iter_is_iovec(i))
+		flags |= AOP_FLAG_UNINTERRUPTIBLE;
+
+	do {
+		struct page *page;
+		unsigned long offset;	/* Offset into pagecache page */
+		unsigned long bytes;	/* Bytes to write to page */
+		size_t copied;		/* Bytes copied from user */
+
+		offset = (pos & (PAGE_SIZE - 1));
+		bytes = min_t(unsigned long, PAGE_SIZE - offset,
+						iov_iter_count(i));
+again:
+		if (bytes > length)
+			bytes = length;
+
+		/*
+		 * Bring in the user page that we will copy from _first_.
+		 * Otherwise there's a nasty deadlock on copying from the
+		 * same page as we're writing to, without it being marked
+		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
+		 */
+		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+			status = -EFAULT;
+			break;
+		}
+
+		status = iomap_write_begin(inode, pos, bytes, flags, &page,
+				iomap);
+		if (unlikely(status))
+			break;
+
+		if (mapping_writably_mapped(inode->i_mapping))
+			flush_dcache_page(page);
+
+		pagefault_disable();
+		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+		pagefault_enable();
+
+		flush_dcache_page(page);
+		mark_page_accessed(page);
+
+		status = iomap_write_end(inode, pos, bytes, copied, page);
+		if (unlikely(status < 0))
+			break;
+		copied = status;
+
+		cond_resched();
+
+		iov_iter_advance(i, copied);
+		if (unlikely(copied == 0)) {
+			/*
+			 * If we were unable to copy any data at all, we must
+			 * fall back to a single segment length write.
+			 *
+			 * If we didn't fallback here, we could livelock
+			 * because not all segments in the iov can be copied at
+			 * once without a pagefault.
+			 */
+			bytes = min_t(unsigned long, PAGE_SIZE - offset,
+						iov_iter_single_seg_count(i));
+			goto again;
+		}
+		pos += copied;
+		written += copied;
+		length -= copied;
+
+		balance_dirty_pages_ratelimited(inode->i_mapping);
+	} while (iov_iter_count(i) && length);
+
+	return written ? written : status;
+}
+
+ssize_t
+iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
+		struct iomap_ops *ops)
+{
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
+
+	while (iov_iter_count(iter)) {
+		ret = iomap_apply(inode, pos, iov_iter_count(iter),
+				IOMAP_WRITE, ops, iter, iomap_write_actor);
+		if (ret <= 0)
+			break;
+		pos += ret;
+		written += ret;
+	}
+
+	return written ? written : ret;
+}
+EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
+
+static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
+		unsigned bytes, struct iomap *iomap)
+{
+	struct page *page;
+	int status;
+
+	status = iomap_write_begin(inode, pos, bytes,
+			AOP_FLAG_UNINTERRUPTIBLE | AOP_FLAG_NOFS, &page, iomap);
+	if (status)
+		return status;
+
+	zero_user(page, offset, bytes);
+	mark_page_accessed(page);
+
+	return iomap_write_end(inode, pos, bytes, bytes, page);
+}
+
+static loff_t
+iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
+		void *data, struct iomap *iomap)
+{
+	bool *did_zero = data;
+	loff_t written = 0;
+	int status;
+
+	/* already zeroed?  we're done. */
+	if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN)
+		return count;
+
+	do {
+		unsigned offset, bytes;
+
+		offset = pos & (PAGE_SIZE - 1); /* Within page */
+		bytes = min_t(unsigned, PAGE_SIZE - offset, count);
+
+		status = iomap_zero(inode, pos, offset, bytes, iomap);
+		if (status < 0)
+			return status;
+
+		pos += bytes;
+		count -= bytes;
+		written += bytes;
+		if (did_zero)
+			*did_zero = true;
+	} while (count > 0);
+
+	return written;
+}
+
+int
+iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
+		struct iomap_ops *ops)
+{
+	loff_t ret;
+
+	while (len > 0) {
+		ret = iomap_apply(inode, pos, len, IOMAP_ZERO,
+				ops, did_zero, iomap_zero_range_actor);
+		if (ret <= 0)
+			return ret;
+
+		pos += ret;
+		len -= ret;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iomap_zero_range);
+
+int
+iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
+		struct iomap_ops *ops)
+{
+	unsigned blocksize = (1 << inode->i_blkbits);
+	unsigned off = pos & (blocksize - 1);
+
+	/* Block boundary? Nothing to do */
+	if (!off)
+		return 0;
+	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
+}
+EXPORT_SYMBOL_GPL(iomap_truncate_page);
+
+static loff_t
+iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
+		void *data, struct iomap *iomap)
+{
+	struct page *page = data;
+	int ret;
+
+	ret = __block_write_begin_int(page, pos & ~PAGE_MASK, length,
+			NULL, iomap);
+	if (ret)
+		return ret;
+
+	block_commit_write(page, 0, length);
+	return length;
+}
+
+int iomap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+		struct iomap_ops *ops)
+{
+	struct page *page = vmf->page;
+	struct inode *inode = file_inode(vma->vm_file);
+	unsigned long length;
+	loff_t offset, size;
+	ssize_t ret;
+
+	lock_page(page);
+	size = i_size_read(inode);
+	if ((page->mapping != inode->i_mapping) ||
+	    (page_offset(page) > size)) {
+		/* We overload EFAULT to mean page got truncated */
+		ret = -EFAULT;
+		goto out_unlock;
+	}
+
+	/* page is wholly or partially inside EOF */
+	if (((page->index + 1) << PAGE_SHIFT) > size)
+		length = size & ~PAGE_MASK;
+	else
+		length = PAGE_SIZE;
+
+	offset = page_offset(page);
+	while (length > 0) {
+		ret = iomap_apply(inode, offset, length, IOMAP_WRITE,
+				ops, page, iomap_page_mkwrite_actor);
+		if (unlikely(ret <= 0))
+			goto out_unlock;
+		offset += ret;
+		length -= ret;
+	}
+
+	set_page_dirty(page);
+	wait_for_stable_page(page);
+	return 0;
+out_unlock:
+	unlock_page(page);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 1b22197..854766f 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -3,19 +3,65 @@
 
 #include <linux/types.h>
 
-/* types of block ranges for multipage write mappings. */
+struct inode;
+struct iov_iter;
+struct kiocb;
+struct vm_area_struct;
+struct vm_fault;
+
+/*
+ * Types of block ranges for iomap mappings:
+ */
 #define IOMAP_HOLE	0x01	/* no blocks allocated, need allocation */
 #define IOMAP_DELALLOC	0x02	/* delayed allocation blocks */
 #define IOMAP_MAPPED	0x03	/* blocks allocated @blkno */
 #define IOMAP_UNWRITTEN	0x04	/* blocks allocated @blkno in unwritten state */
 
+/*
+ * Magic value for blkno:
+ */
 #define IOMAP_NULL_BLOCK -1LL	/* blkno is not valid */
 
 struct iomap {
-	sector_t	blkno;	/* first sector of mapping */
-	loff_t		offset;	/* file offset of mapping, bytes */
-	u64		length;	/* length of mapping, bytes */
-	int		type;	/* type of mapping */
+	sector_t		blkno;	/* first sector of mapping, fs blocks */
+	loff_t			offset;	/* file offset of mapping, bytes */
+	u64			length;	/* length of mapping, bytes */
+	int			type;	/* type of mapping */
+	struct block_device	*bdev; /* block device for I/O */
+};
+
+/*
+ * Flags for iomap_begin / iomap_end.  No flag implies a read.
+ */
+#define IOMAP_WRITE		(1 << 0)
+#define IOMAP_ZERO		(1 << 1)
+
+struct iomap_ops {
+	/*
+	 * Return the existing mapping at pos, or reserve space starting at
+	 * pos for up to length, as long as we can do it as a single mapping.
+	 * The actual length is returned in iomap->length.
+	 */
+	int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
+			unsigned flags, struct iomap *iomap);
+
+	/*
+	 * Commit and/or unreserve space previously allocated using iomap_begin.
+	 * Written indicates the length of the successful write operation which
+	 * needs to be committed, while the rest needs to be unreserved.
+	 * Written might be zero if no data was written.
+	 */
+	int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
+			ssize_t written, unsigned flags, struct iomap *iomap);
 };
 
+ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
+		struct iomap_ops *ops);
+int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
+		bool *did_zero, struct iomap_ops *ops);
+int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
+		struct iomap_ops *ops);
+int iomap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+		struct iomap_ops *ops);
+
 #endif /* LINUX_IOMAP_H */
-- 
2.1.4


* [PATCH 03/14] fs: support DAX based iomap zeroing
From: Christoph Hellwig @ 2016-06-01 14:44 UTC
  To: xfs; +Cc: rpeterso, linux-fsdevel

This avoids needing a separate, inefficient get_block based DAX zero_range
implementation in file systems.
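
For reference, the sector arithmetic in the new iomap_dax_zero helper
translates the page-aligned file position into the 512-byte sector
passed to __dax_zero_page_range.  A worked example with invented
numbers: for 4k pages, iomap->offset = 0, iomap->blkno = 8 and
pos = 12288,

	sector = 8 + (((12288 & ~4095) - 0) >> 9) = 8 + 24 = 32

i.e. the first sector backing the fourth page of the mapping.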

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 5cc6d48..e51adb6 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -24,6 +24,7 @@
 #include <linux/uio.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/dax.h>
 #include "internal.h"
 
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
@@ -268,6 +269,15 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
 	return iomap_write_end(inode, pos, bytes, bytes, page);
 }
 
+static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
+		struct iomap *iomap)
+{
+	sector_t sector = iomap->blkno +
+		(((pos & ~(PAGE_SIZE - 1)) - iomap->offset) >> 9);
+
+	return __dax_zero_page_range(iomap->bdev, sector, offset, bytes);
+}
+
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		void *data, struct iomap *iomap)
@@ -286,7 +296,10 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		offset = pos & (PAGE_SIZE - 1); /* Within page */
 		bytes = min_t(unsigned, PAGE_SIZE - offset, count);
 
-		status = iomap_zero(inode, pos, offset, bytes, iomap);
+		if (IS_DAX(inode))
+			status = iomap_dax_zero(pos, offset, bytes, iomap);
+		else
+			status = iomap_zero(inode, pos, offset, bytes, iomap);
 		if (status < 0)
 			return status;
 
-- 
2.1.4


* [PATCH 04/14] xfs: make xfs_bmbt_to_iomap available outside of xfs_pnfs.c
From: Christoph Hellwig @ 2016-06-01 14:44 UTC
  To: xfs; +Cc: rpeterso, linux-fsdevel

Also ensure it works for RT subvolume files and set the block device,
both of which will be needed to use the function in the buffered
write path.
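
As a sketch of the intended use by a later patch (the caller shown
here is hypothetical, and locking and error handling are elided):

	struct xfs_bmbt_irec	imap;
	struct iomap		iomap;
	int			nimaps = 1;

	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
			       &imap, &nimaps, 0);
	if (error)
		return error;

	/* translate the XFS extent to the generic iomap representation */
	xfs_bmbt_to_iomap(ip, &iomap, &imap);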

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/xfs/xfs_iomap.c | 27 +++++++++++++++++++++++++++
 fs/xfs/xfs_iomap.h |  4 ++++
 fs/xfs/xfs_pnfs.c  | 26 --------------------------
 3 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5839135..2f37194 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -15,6 +15,7 @@
  * along with this program; if not, write the Free Software Foundation,
  * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
  */
+#include <linux/iomap.h>
 #include "xfs.h"
 #include "xfs_fs.h"
 #include "xfs_shared.h"
@@ -940,3 +941,29 @@ error_on_bmapi_transaction:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
+
+void
+xfs_bmbt_to_iomap(
+	struct xfs_inode	*ip,
+	struct iomap		*iomap,
+	struct xfs_bmbt_irec	*imap)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (imap->br_startblock == HOLESTARTBLOCK) {
+		iomap->blkno = IOMAP_NULL_BLOCK;
+		iomap->type = IOMAP_HOLE;
+	} else if (imap->br_startblock == DELAYSTARTBLOCK) {
+		iomap->blkno = IOMAP_NULL_BLOCK;
+		iomap->type = IOMAP_DELALLOC;
+	} else {
+		iomap->blkno = xfs_fsb_to_db(ip, imap->br_startblock);
+		if (imap->br_state == XFS_EXT_UNWRITTEN)
+			iomap->type = IOMAP_UNWRITTEN;
+		else
+			iomap->type = IOMAP_MAPPED;
+	}
+	iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
+	iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
+	iomap->bdev = xfs_find_bdev_for_inode(VFS_I(ip));
+}
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 8688e66..718f07c 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -18,6 +18,7 @@
 #ifndef __XFS_IOMAP_H__
 #define __XFS_IOMAP_H__
 
+struct iomap;
 struct xfs_inode;
 struct xfs_bmbt_irec;
 
@@ -29,4 +30,7 @@ int xfs_iomap_write_allocate(struct xfs_inode *, xfs_off_t,
 			struct xfs_bmbt_irec *);
 int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t);
 
+void xfs_bmbt_to_iomap(struct xfs_inode *, struct iomap *,
+		struct xfs_bmbt_irec *);
+
 #endif /* __XFS_IOMAP_H__*/
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index db3c7df..0f14b2e 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -80,32 +80,6 @@ xfs_fs_get_uuid(
 	return 0;
 }
 
-static void
-xfs_bmbt_to_iomap(
-	struct xfs_inode	*ip,
-	struct iomap		*iomap,
-	struct xfs_bmbt_irec	*imap)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-
-	if (imap->br_startblock == HOLESTARTBLOCK) {
-		iomap->blkno = IOMAP_NULL_BLOCK;
-		iomap->type = IOMAP_HOLE;
-	} else if (imap->br_startblock == DELAYSTARTBLOCK) {
-		iomap->blkno = IOMAP_NULL_BLOCK;
-		iomap->type = IOMAP_DELALLOC;
-	} else {
-		iomap->blkno =
-			XFS_FSB_TO_DADDR(ip->i_mount, imap->br_startblock);
-		if (imap->br_state == XFS_EXT_UNWRITTEN)
-			iomap->type = IOMAP_UNWRITTEN;
-		else
-			iomap->type = IOMAP_MAPPED;
-	}
-	iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
-	iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
-}
-
 /*
  * Get a layout for the pNFS client.
  */
-- 
2.1.4


* [PATCH 05/14] xfs: reorder zeroing and flushing sequence in truncate
  2016-06-01 14:44 ` Christoph Hellwig
@ 2016-06-01 14:44   ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-01 14:44 UTC (permalink / raw)
  To: xfs; +Cc: rpeterso, linux-fsdevel

Zeroing out blocks and waiting for writeout are currently a bit of a mess
in truncate.  This patch imposes a clear order in preparation for the iomap
path, condensed in the sketch after the list:

 (1) we first wait for any direct I/O to complete to prevent races
     with it
 (2) we then perform the actual zeroing, and only use the truncate_page
     helpers for truncating down.  The truncate-up case is already
     handled by the separate call to xfs_zero_eof.
 (3) only then do we write back dirty data, as zeroing blocks may create
     dirty pages when using either xfs_zero_eof or the new iomap
     infrastructure.
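
In condensed form the new order in xfs_setattr_size() looks roughly like
this (an editorial sketch with the surrounding conditions trimmed; the
diff below has the real code):

	inode_dio_wait(inode);			/* (1) quiesce direct I/O */

	if (newsize > oldsize)			/* (2) zero exposed blocks */
		error = xfs_zero_eof(ip, newsize, oldsize, &did_zeroing);
	else if (IS_DAX(inode))
		error = dax_truncate_page(inode, newsize, xfs_get_blocks_direct);
	else
		error = block_truncate_page(inode->i_mapping, newsize,
					    xfs_get_blocks);
	if (error)
		return error;

	/* (3) only now write back dirty data, including the zeroing above */
	error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
					     ip->i_d.di_size, newsize);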

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/xfs/xfs_iops.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index c5d4eba..1a5ca4b 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -801,20 +801,35 @@ xfs_setattr_size(
 		return error;
 
 	/*
+	 * Wait for all direct I/O to complete.
+	 */
+	inode_dio_wait(inode);
+
+	/*
 	 * File data changes must be complete before we start the transaction to
 	 * modify the inode.  This needs to be done before joining the inode to
 	 * the transaction because the inode cannot be unlocked once it is a
 	 * part of the transaction.
 	 *
-	 * Start with zeroing any data block beyond EOF that we may expose on
-	 * file extension.
+	 * Start with zeroing any data beyond EOF that we may expose on file
+	 * extension, or zeroing out the rest of the block on a downward
+	 * truncate.
 	 */
 	if (newsize > oldsize) {
 		error = xfs_zero_eof(ip, newsize, oldsize, &did_zeroing);
-		if (error)
-			return error;
+	} else {
+		if (IS_DAX(inode)) {
+			error = dax_truncate_page(inode, newsize,
+					xfs_get_blocks_direct);
+		} else {
+			error = block_truncate_page(inode->i_mapping, newsize,
+					xfs_get_blocks);
+		}
 	}
 
+	if (error)
+		return error;
+
 	/*
 	 * We are going to log the inode size change in this transaction so
 	 * any previous writes that are beyond the on disk EOF and the new
@@ -831,9 +846,6 @@ xfs_setattr_size(
 			return error;
 	}
 
-	/* Now wait for all direct I/O to complete. */
-	inode_dio_wait(inode);
-
 	/*
 	 * We've already locked out new page faults, so now we can safely remove
 	 * pages from the page cache knowing they won't get refaulted until we
@@ -851,13 +863,6 @@ xfs_setattr_size(
 	 * to hope that the caller sees ENOMEM and retries the truncate
 	 * operation.
 	 */
-	if (IS_DAX(inode))
-		error = dax_truncate_page(inode, newsize, xfs_get_blocks_direct);
-	else
-		error = block_truncate_page(inode->i_mapping, newsize,
-					    xfs_get_blocks);
-	if (error)
-		return error;
 	truncate_setsize(inode, newsize);
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 06/14] xfs: implement iomap based buffered write path
  2016-06-01 14:44 ` Christoph Hellwig
@ 2016-06-01 14:44   ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-01 14:44 UTC (permalink / raw)
  To: xfs; +Cc: rpeterso, linux-fsdevel

Convert XFS to use the new iomap based multipage write path. This involves
implementing the ->iomap_begin and ->iomap_end methods, and switching the
buffered file write, page_mkwrite and xfs_iozero paths to the new iomap
helpers.

With this change __xfs_get_blocks will never be used for buffered writes,
and the code handling them can be removed.

Based on earlier code from Dave Chinner.
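
The shape of the hook-up, condensed from the diff below: ->iomap_begin
maps (and, for writes, allocates) an extent covering the requested range,
->iomap_end trims back any delalloc reservation that was not used, and
the generic helpers are called with the ops structure:

	struct iomap_ops xfs_iomap_ops = {
		.iomap_begin		= xfs_file_iomap_begin,
		.iomap_end		= xfs_file_iomap_end,
	};

	/* buffered writes */
	ret = iomap_file_buffered_write(iocb, from, &xfs_iomap_ops);
	/* page faults */
	ret = iomap_page_mkwrite(vma, vmf, &xfs_iomap_ops);
	/* zeroing in xfs_iozero() */
	ret = iomap_zero_range(inode, pos, count, NULL, &xfs_iomap_ops);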

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/xfs/Kconfig     |   1 +
 fs/xfs/xfs_aops.c  | 212 -----------------------------------------------------
 fs/xfs/xfs_file.c  |  71 ++++++++----------
 fs/xfs/xfs_iomap.c | 144 ++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iomap.h |   5 +-
 fs/xfs/xfs_iops.c  |   9 ++-
 fs/xfs/xfs_trace.h |   3 +
 7 files changed, 187 insertions(+), 258 deletions(-)

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 5d47b4d..35faf12 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -4,6 +4,7 @@ config XFS_FS
 	depends on (64BIT || LBDAF)
 	select EXPORTFS
 	select LIBCRC32C
+	select FS_IOMAP
 	help
 	  XFS is a high performance journaling filesystem which originated
 	  on the SGI IRIX platform.  It is completely multi-threaded, can
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 4c463b9..2ac9f7e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1427,216 +1427,6 @@ xfs_vm_direct_IO(
 			xfs_get_blocks_direct, endio, NULL, flags);
 }
 
-/*
- * Punch out the delalloc blocks we have already allocated.
- *
- * Don't bother with xfs_setattr given that nothing can have made it to disk yet
- * as the page is still locked at this point.
- */
-STATIC void
-xfs_vm_kill_delalloc_range(
-	struct inode		*inode,
-	loff_t			start,
-	loff_t			end)
-{
-	struct xfs_inode	*ip = XFS_I(inode);
-	xfs_fileoff_t		start_fsb;
-	xfs_fileoff_t		end_fsb;
-	int			error;
-
-	start_fsb = XFS_B_TO_FSB(ip->i_mount, start);
-	end_fsb = XFS_B_TO_FSB(ip->i_mount, end);
-	if (end_fsb <= start_fsb)
-		return;
-
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-						end_fsb - start_fsb);
-	if (error) {
-		/* something screwed, just bail */
-		if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
-			xfs_alert(ip->i_mount,
-		"xfs_vm_write_failed: unable to clean up ino %lld",
-					ip->i_ino);
-		}
-	}
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-}
-
-STATIC void
-xfs_vm_write_failed(
-	struct inode		*inode,
-	struct page		*page,
-	loff_t			pos,
-	unsigned		len)
-{
-	loff_t			block_offset;
-	loff_t			block_start;
-	loff_t			block_end;
-	loff_t			from = pos & (PAGE_SIZE - 1);
-	loff_t			to = from + len;
-	struct buffer_head	*bh, *head;
-	struct xfs_mount	*mp = XFS_I(inode)->i_mount;
-
-	/*
-	 * The request pos offset might be 32 or 64 bit, this is all fine
-	 * on 64-bit platform.  However, for 64-bit pos request on 32-bit
-	 * platform, the high 32-bit will be masked off if we evaluate the
-	 * block_offset via (pos & PAGE_MASK) because the PAGE_MASK is
-	 * 0xfffff000 as an unsigned long, hence the result is incorrect
-	 * which could cause the following ASSERT failed in most cases.
-	 * In order to avoid this, we can evaluate the block_offset of the
-	 * start of the page by using shifts rather than masks the mismatch
-	 * problem.
-	 */
-	block_offset = (pos >> PAGE_SHIFT) << PAGE_SHIFT;
-
-	ASSERT(block_offset + from == pos);
-
-	head = page_buffers(page);
-	block_start = 0;
-	for (bh = head; bh != head || !block_start;
-	     bh = bh->b_this_page, block_start = block_end,
-				   block_offset += bh->b_size) {
-		block_end = block_start + bh->b_size;
-
-		/* skip buffers before the write */
-		if (block_end <= from)
-			continue;
-
-		/* if the buffer is after the write, we're done */
-		if (block_start >= to)
-			break;
-
-		/*
-		 * Process delalloc and unwritten buffers beyond EOF. We can
-		 * encounter unwritten buffers in the event that a file has
-		 * post-EOF unwritten extents and an extending write happens to
-		 * fail (e.g., an unaligned write that also involves a delalloc
-		 * to the same page).
-		 */
-		if (!buffer_delay(bh) && !buffer_unwritten(bh))
-			continue;
-
-		if (!xfs_mp_fail_writes(mp) && !buffer_new(bh) &&
-		    block_offset < i_size_read(inode))
-			continue;
-
-		if (buffer_delay(bh))
-			xfs_vm_kill_delalloc_range(inode, block_offset,
-						   block_offset + bh->b_size);
-
-		/*
-		 * This buffer does not contain data anymore. make sure anyone
-		 * who finds it knows that for certain.
-		 */
-		clear_buffer_delay(bh);
-		clear_buffer_uptodate(bh);
-		clear_buffer_mapped(bh);
-		clear_buffer_new(bh);
-		clear_buffer_dirty(bh);
-		clear_buffer_unwritten(bh);
-	}
-
-}
-
-/*
- * This used to call block_write_begin(), but it unlocks and releases the page
- * on error, and we need that page to be able to punch stale delalloc blocks out
- * on failure. hence we copy-n-waste it here and call xfs_vm_write_failed() at
- * the appropriate point.
- */
-STATIC int
-xfs_vm_write_begin(
-	struct file		*file,
-	struct address_space	*mapping,
-	loff_t			pos,
-	unsigned		len,
-	unsigned		flags,
-	struct page		**pagep,
-	void			**fsdata)
-{
-	pgoff_t			index = pos >> PAGE_SHIFT;
-	struct page		*page;
-	int			status;
-	struct xfs_mount	*mp = XFS_I(mapping->host)->i_mount;
-
-	ASSERT(len <= PAGE_SIZE);
-
-	page = grab_cache_page_write_begin(mapping, index, flags);
-	if (!page)
-		return -ENOMEM;
-
-	status = __block_write_begin(page, pos, len, xfs_get_blocks);
-	if (xfs_mp_fail_writes(mp))
-		status = -EIO;
-	if (unlikely(status)) {
-		struct inode	*inode = mapping->host;
-		size_t		isize = i_size_read(inode);
-
-		xfs_vm_write_failed(inode, page, pos, len);
-		unlock_page(page);
-
-		/*
-		 * If the write is beyond EOF, we only want to kill blocks
-		 * allocated in this write, not blocks that were previously
-		 * written successfully.
-		 */
-		if (xfs_mp_fail_writes(mp))
-			isize = 0;
-		if (pos + len > isize) {
-			ssize_t start = max_t(ssize_t, pos, isize);
-
-			truncate_pagecache_range(inode, start, pos + len);
-		}
-
-		put_page(page);
-		page = NULL;
-	}
-
-	*pagep = page;
-	return status;
-}
-
-/*
- * On failure, we only need to kill delalloc blocks beyond EOF in the range of
- * this specific write because they will never be written. Previous writes
- * beyond EOF where block allocation succeeded do not need to be trashed, so
- * only new blocks from this write should be trashed. For blocks within
- * EOF, generic_write_end() zeros them so they are safe to leave alone and be
- * written with all the other valid data.
- */
-STATIC int
-xfs_vm_write_end(
-	struct file		*file,
-	struct address_space	*mapping,
-	loff_t			pos,
-	unsigned		len,
-	unsigned		copied,
-	struct page		*page,
-	void			*fsdata)
-{
-	int			ret;
-
-	ASSERT(len <= PAGE_SIZE);
-
-	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
-	if (unlikely(ret < len)) {
-		struct inode	*inode = mapping->host;
-		size_t		isize = i_size_read(inode);
-		loff_t		to = pos + len;
-
-		if (to > isize) {
-			/* only kill blocks in this write beyond EOF */
-			if (pos > isize)
-				isize = pos;
-			xfs_vm_kill_delalloc_range(inode, isize, to);
-			truncate_pagecache_range(inode, isize, to);
-		}
-	}
-	return ret;
-}
-
 STATIC sector_t
 xfs_vm_bmap(
 	struct address_space	*mapping,
@@ -1747,8 +1537,6 @@ const struct address_space_operations xfs_address_space_operations = {
 	.set_page_dirty		= xfs_vm_set_page_dirty,
 	.releasepage		= xfs_vm_releasepage,
 	.invalidatepage		= xfs_vm_invalidatepage,
-	.write_begin		= xfs_vm_write_begin,
-	.write_end		= xfs_vm_write_end,
 	.bmap			= xfs_vm_bmap,
 	.direct_IO		= xfs_vm_direct_IO,
 	.migratepage		= buffer_migrate_page,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 47fc632..7316d38 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -37,6 +37,7 @@
 #include "xfs_log.h"
 #include "xfs_icache.h"
 #include "xfs_pnfs.h"
+#include "xfs_iomap.h"
 
 #include <linux/dcache.h>
 #include <linux/falloc.h>
@@ -79,57 +80,27 @@ xfs_rw_ilock_demote(
 		inode_unlock(VFS_I(ip));
 }
 
-/*
- * xfs_iozero clears the specified range supplied via the page cache (except in
- * the DAX case). Writes through the page cache will allocate blocks over holes,
- * though the callers usually map the holes first and avoid them. If a block is
- * not completely zeroed, then it will be read from disk before being partially
- * zeroed.
- *
- * In the DAX case, we can just directly write to the underlying pages. This
- * will not allocate blocks, but will avoid holes and unwritten extents and so
- * not do unnecessary work.
- */
-int
-xfs_iozero(
-	struct xfs_inode	*ip,	/* inode			*/
-	loff_t			pos,	/* offset in file		*/
-	size_t			count)	/* size of data to zero		*/
+static int
+xfs_dax_zero_range(
+	struct inode		*inode,
+	loff_t			pos,
+	size_t			count)
 {
-	struct page		*page;
-	struct address_space	*mapping;
 	int			status = 0;
 
-
-	mapping = VFS_I(ip)->i_mapping;
 	do {
 		unsigned offset, bytes;
-		void *fsdata;
 
 		offset = (pos & (PAGE_SIZE -1)); /* Within page */
 		bytes = PAGE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
 
-		if (IS_DAX(VFS_I(ip))) {
-			status = dax_zero_page_range(VFS_I(ip), pos, bytes,
-						     xfs_get_blocks_direct);
-			if (status)
-				break;
-		} else {
-			status = pagecache_write_begin(NULL, mapping, pos, bytes,
-						AOP_FLAG_UNINTERRUPTIBLE,
-						&page, &fsdata);
-			if (status)
-				break;
-
-			zero_user(page, offset, bytes);
+		status = dax_zero_page_range(inode, pos, bytes,
+					     xfs_get_blocks_direct);
+		if (status)
+			break;
 
-			status = pagecache_write_end(NULL, mapping, pos, bytes,
-						bytes, page, fsdata);
-			WARN_ON(status <= 0); /* can't return less than zero! */
-			status = 0;
-		}
 		pos += bytes;
 		count -= bytes;
 	} while (count);
@@ -137,6 +108,24 @@ xfs_iozero(
 	return status;
 }
 
+/*
+ * Clear the specified range to zero through either the pagecache or DAX.
+ * Holes and unwritten extents are left as-is since they are already zeroed.
+ */
+int
+xfs_iozero(
+	struct xfs_inode	*ip,
+	loff_t			pos,
+	size_t			count)
+{
+	struct inode		*inode = VFS_I(ip);
+
+	if (IS_DAX(VFS_I(ip)))
+		return xfs_dax_zero_range(inode, pos, count);
+	else
+		return iomap_zero_range(inode, pos, count, NULL, &xfs_iomap_ops);
+}
+
 int
 xfs_update_prealloc_flags(
 	struct xfs_inode	*ip,
@@ -841,7 +830,7 @@ xfs_file_buffered_aio_write(
 write_retry:
 	trace_xfs_file_buffered_write(ip, iov_iter_count(from),
 				      iocb->ki_pos, 0);
-	ret = generic_perform_write(file, from, iocb->ki_pos);
+	ret = iomap_file_buffered_write(iocb, from, &xfs_iomap_ops);
 	if (likely(ret >= 0))
 		iocb->ki_pos += ret;
 
@@ -1553,7 +1542,7 @@ xfs_filemap_page_mkwrite(
 	if (IS_DAX(inode)) {
 		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault);
 	} else {
-		ret = block_page_mkwrite(vma, vmf, xfs_get_blocks);
+		ret = iomap_page_mkwrite(vma, vmf, &xfs_iomap_ops);
 		ret = block_page_mkwrite_return(ret);
 	}
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 2f37194..620fc91 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -967,3 +967,147 @@ xfs_bmbt_to_iomap(
 	iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
 	iomap->bdev = xfs_find_bdev_for_inode(VFS_I(ip));
 }
+
+static inline bool imap_needs_alloc(struct xfs_bmbt_irec *imap, int nimaps)
+{
+	return !nimaps ||
+		imap->br_startblock == HOLESTARTBLOCK ||
+		imap->br_startblock == DELAYSTARTBLOCK;
+}
+
+static int
+xfs_file_iomap_begin(
+	struct inode		*inode,
+	loff_t			offset,
+	loff_t			length,
+	unsigned		flags,
+	struct iomap		*iomap)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_bmbt_irec	imap;
+	xfs_fileoff_t		offset_fsb, end_fsb;
+	int			nimaps = 1, error = 0;
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+
+	ASSERT(offset <= mp->m_super->s_maxbytes);
+	if ((xfs_fsize_t)offset + length > mp->m_super->s_maxbytes)
+		length = mp->m_super->s_maxbytes - offset;
+	offset_fsb = XFS_B_TO_FSBT(mp, offset);
+	end_fsb = XFS_B_TO_FSB(mp, offset + length);
+
+	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
+			       &nimaps, XFS_BMAPI_ENTIRE);
+	if (error) {
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		return error;
+	}
+
+	if ((flags & IOMAP_WRITE) && imap_needs_alloc(&imap, nimaps)) {
+		/*
+		 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
+		 * pages to keep the chunks of work done here somewhat symmetric
+		 * with the work writeback does. This is a completely arbitrary
+		 * number pulled out of thin air as a best guess for initial
+		 * testing.
+		 *
+		 * Note that the value needs to be less than 32 bits wide until
+		 * the lower level functions are updated.
+		 */
+		length = min_t(loff_t, length, 1024 * PAGE_SIZE);
+		if (xfs_get_extsz_hint(ip)) {
+			/*
+			 * xfs_iomap_write_direct() expects the shared lock. It
+			 * is unlocked on return.
+			 */
+			xfs_ilock_demote(ip, XFS_ILOCK_EXCL);
+			error = xfs_iomap_write_direct(ip, offset, length, &imap,
+					nimaps);
+		} else {
+			error = xfs_iomap_write_delay(ip, offset, length, &imap);
+			xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		}
+
+		if (error)
+			return error;
+
+		trace_xfs_iomap_alloc(ip, offset, length, 0, &imap);
+		xfs_bmbt_to_iomap(ip, iomap, &imap);
+	} else if (nimaps) {
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		trace_xfs_iomap_found(ip, offset, length, 0, &imap);
+		xfs_bmbt_to_iomap(ip, iomap, &imap);
+	} else {
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		trace_xfs_iomap_not_found(ip, offset, length, 0, &imap);
+		iomap->blkno = IOMAP_NULL_BLOCK;
+		iomap->type = IOMAP_HOLE;
+		iomap->offset = offset;
+		iomap->length = length;
+	}
+
+	return 0;
+}
+
+static int
+xfs_file_iomap_end_delalloc(
+	struct xfs_inode	*ip,
+	loff_t			offset,
+	loff_t			length,
+	ssize_t			written)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fileoff_t		start_fsb;
+	xfs_fileoff_t		end_fsb;
+	int			error = 0;
+
+	start_fsb = XFS_B_TO_FSB(mp, offset + written);
+	end_fsb = XFS_B_TO_FSB(mp, offset + length);
+
+	/*
+	 * Trim back delalloc blocks if we didn't manage to write the whole
+	 * range reserved.
+	 *
+	 * We don't need to care about racing delalloc as we hold i_mutex
+	 * across the reserve/allocate/unreserve calls. If there are delalloc
+	 * blocks in the range, they are ours.
+	 */
+	if (start_fsb < end_fsb) {
+		xfs_ilock(ip, XFS_ILOCK_EXCL);
+		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
+					       end_fsb - start_fsb);
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+		if (error && !XFS_FORCED_SHUTDOWN(mp)) {
+			xfs_alert(mp, "%s: unable to clean up ino %lld",
+				__func__, ip->i_ino);
+			return error;
+		}
+	}
+
+	return 0;
+}
+
+static int
+xfs_file_iomap_end(
+	struct inode		*inode,
+	loff_t			offset,
+	loff_t			length,
+	ssize_t			written,
+	unsigned		flags,
+	struct iomap		*iomap)
+{
+	if ((flags & IOMAP_WRITE) && iomap->type == IOMAP_DELALLOC)
+		return xfs_file_iomap_end_delalloc(XFS_I(inode), offset,
+				length, written);
+	return 0;
+}
+
+struct iomap_ops xfs_iomap_ops = {
+	.iomap_begin		= xfs_file_iomap_begin,
+	.iomap_end		= xfs_file_iomap_end,
+};
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 718f07c..e066d04 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -18,7 +18,8 @@
 #ifndef __XFS_IOMAP_H__
 #define __XFS_IOMAP_H__
 
-struct iomap;
+#include <linux/iomap.h>
+
 struct xfs_inode;
 struct xfs_bmbt_irec;
 
@@ -33,4 +34,6 @@ int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t);
 void xfs_bmbt_to_iomap(struct xfs_inode *, struct iomap *,
 		struct xfs_bmbt_irec *);
 
+extern struct iomap_ops xfs_iomap_ops;
+
 #endif /* __XFS_IOMAP_H__*/
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 1a5ca4b..5d1fdae 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -38,6 +38,7 @@
 #include "xfs_dir2.h"
 #include "xfs_trans_space.h"
 #include "xfs_pnfs.h"
+#include "xfs_iomap.h"
 
 #include <linux/capability.h>
 #include <linux/xattr.h>
@@ -822,8 +823,8 @@ xfs_setattr_size(
 			error = dax_truncate_page(inode, newsize,
 					xfs_get_blocks_direct);
 		} else {
-			error = block_truncate_page(inode->i_mapping, newsize,
-					xfs_get_blocks);
+			error = iomap_truncate_page(inode, newsize,
+					&did_zeroing, &xfs_iomap_ops);
 		}
 	}
 
@@ -838,8 +839,8 @@ xfs_setattr_size(
 	 * problem. Note that this includes any block zeroing we did above;
 	 * otherwise those blocks may not be zeroed after a crash.
 	 */
-	if (newsize > ip->i_d.di_size &&
-	    (oldsize != ip->i_d.di_size || did_zeroing)) {
+	if (did_zeroing ||
+	    (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
 		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
 						      ip->i_d.di_size, newsize);
 		if (error)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index ea94ee0..bb24ce7 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1295,6 +1295,9 @@ DEFINE_IOMAP_EVENT(xfs_map_blocks_alloc);
 DEFINE_IOMAP_EVENT(xfs_get_blocks_found);
 DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
 DEFINE_IOMAP_EVENT(xfs_get_blocks_map_direct);
+DEFINE_IOMAP_EVENT(xfs_iomap_alloc);
+DEFINE_IOMAP_EVENT(xfs_iomap_found);
+DEFINE_IOMAP_EVENT(xfs_iomap_not_found);
 
 DECLARE_EVENT_CLASS(xfs_simple_io_class,
 	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 07/14] xfs: remove buffered write support from __xfs_get_blocks
  2016-06-01 14:44 ` Christoph Hellwig
@ 2016-06-01 14:44   ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-01 14:44 UTC (permalink / raw)
  To: xfs; +Cc: rpeterso, linux-fsdevel

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/xfs/xfs_aops.c | 71 +++++++++++++++----------------------------------------
 1 file changed, 19 insertions(+), 52 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2ac9f7e..80714eb 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1143,6 +1143,8 @@ __xfs_get_blocks(
 	ssize_t			size;
 	int			new = 0;
 
+	BUG_ON(create && !direct);
+
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
@@ -1150,22 +1152,14 @@ __xfs_get_blocks(
 	ASSERT(bh_result->b_size >= (1 << inode->i_blkbits));
 	size = bh_result->b_size;
 
-	if (!create && direct && offset >= i_size_read(inode))
+	if (!create && offset >= i_size_read(inode))
 		return 0;
 
 	/*
 	 * Direct I/O is usually done on preallocated files, so try getting
-	 * a block mapping without an exclusive lock first.  For buffered
-	 * writes we already have the exclusive iolock anyway, so avoiding
-	 * a lock roundtrip here by taking the ilock exclusive from the
-	 * beginning is a useful micro optimization.
+	 * a block mapping without an exclusive lock first.
 	 */
-	if (create && !direct) {
-		lockmode = XFS_ILOCK_EXCL;
-		xfs_ilock(ip, lockmode);
-	} else {
-		lockmode = xfs_ilock_data_map_shared(ip);
-	}
+	lockmode = xfs_ilock_data_map_shared(ip);
 
 	ASSERT(offset <= mp->m_super->s_maxbytes);
 	if (offset + size > mp->m_super->s_maxbytes)
@@ -1184,37 +1178,19 @@ __xfs_get_blocks(
 	     (imap.br_startblock == HOLESTARTBLOCK ||
 	      imap.br_startblock == DELAYSTARTBLOCK) ||
 	     (IS_DAX(inode) && ISUNWRITTEN(&imap)))) {
-		if (direct || xfs_get_extsz_hint(ip)) {
-			/*
-			 * xfs_iomap_write_direct() expects the shared lock. It
-			 * is unlocked on return.
-			 */
-			if (lockmode == XFS_ILOCK_EXCL)
-				xfs_ilock_demote(ip, lockmode);
-
-			error = xfs_iomap_write_direct(ip, offset, size,
-						       &imap, nimaps);
-			if (error)
-				return error;
-			new = 1;
+		/*
+		 * xfs_iomap_write_direct() expects the shared lock. It
+		 * is unlocked on return.
+		 */
+		if (lockmode == XFS_ILOCK_EXCL)
+			xfs_ilock_demote(ip, lockmode);
 
-		} else {
-			/*
-			 * Delalloc reservations do not require a transaction,
-			 * we can go on without dropping the lock here. If we
-			 * are allocating a new delalloc block, make sure that
-			 * we set the new flag so that we mark the buffer new so
-			 * that we know that it is newly allocated if the write
-			 * fails.
-			 */
-			if (nimaps && imap.br_startblock == HOLESTARTBLOCK)
-				new = 1;
-			error = xfs_iomap_write_delay(ip, offset, size, &imap);
-			if (error)
-				goto out_unlock;
+		error = xfs_iomap_write_direct(ip, offset, size,
+					       &imap, nimaps);
+		if (error)
+			return error;
+		new = 1;
 
-			xfs_iunlock(ip, lockmode);
-		}
 		trace_xfs_get_blocks_alloc(ip, offset, size,
 				ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN
 						   : XFS_IO_DELALLOC, &imap);
@@ -1235,9 +1211,7 @@ __xfs_get_blocks(
 	}
 
 	/* trim mapping down to size requested */
-	if (direct || size > (1 << inode->i_blkbits))
-		xfs_map_trim_size(inode, iblock, bh_result,
-				  &imap, offset, size);
+	xfs_map_trim_size(inode, iblock, bh_result, &imap, offset, size);
 
 	/*
 	 * For unwritten extents do not report a disk address in the buffered
@@ -1250,7 +1224,7 @@ __xfs_get_blocks(
 		if (ISUNWRITTEN(&imap))
 			set_buffer_unwritten(bh_result);
 		/* direct IO needs special help */
-		if (create && direct) {
+		if (create) {
 			if (dax_fault)
 				ASSERT(!ISUNWRITTEN(&imap));
 			else
@@ -1279,14 +1253,7 @@ __xfs_get_blocks(
 	     (new || ISUNWRITTEN(&imap))))
 		set_buffer_new(bh_result);
 
-	if (imap.br_startblock == DELAYSTARTBLOCK) {
-		BUG_ON(direct);
-		if (create) {
-			set_buffer_uptodate(bh_result);
-			set_buffer_mapped(bh_result);
-			set_buffer_delay(bh_result);
-		}
-	}
+	BUG_ON(direct && imap.br_startblock == DELAYSTARTBLOCK);
 
 	return 0;
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 08/14] fs: iomap based fiemap implementation
  2016-06-01 14:44 ` Christoph Hellwig
@ 2016-06-01 14:44   ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-01 14:44 UTC (permalink / raw)
  To: xfs; +Cc: rpeterso, linux-fsdevel

Add a simple fiemap implementation on top of iomap_ops, partially derived
from a previous implementation by Bob Peterson <rpeterso@redhat.com>.
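
To show how a filesystem would consume this (a hypothetical sketch;
example_fiemap() is a made-up name, and the XFS wiring is expected in a
later patch of this series):

	static int
	example_fiemap(
		struct inode			*inode,
		struct fiemap_extent_info	*fieinfo,
		u64				start,
		u64				length)
	{
		/*
		 * iomap_fiemap() validates FIEMAP_FLAG_SYNC and writes back
		 * the page cache before walking the extent mappings.
		 */
		return iomap_fiemap(inode, fieinfo, start, length,
				&xfs_iomap_ops);
	}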

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c            | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/iomap.h |  3 ++
 2 files changed, 93 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index e51adb6..e874c78 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -405,3 +405,93 @@ out_unlock:
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
+
+struct fiemap_ctx {
+	struct fiemap_extent_info *fi;
+	struct iomap prev;
+};
+
+static int iomap_to_fiemap(struct fiemap_extent_info *fi,
+		struct iomap *iomap, u32 flags)
+{
+	switch (iomap->type) {
+	case IOMAP_HOLE:
+		/* skip holes */
+		return 0;
+	case IOMAP_DELALLOC:
+		flags |= FIEMAP_EXTENT_DELALLOC | FIEMAP_EXTENT_UNKNOWN;
+		break;
+	case IOMAP_UNWRITTEN:
+		flags |= FIEMAP_EXTENT_UNWRITTEN;
+		break;
+	case IOMAP_MAPPED:
+		break;
+	}
+
+	return fiemap_fill_next_extent(fi, iomap->offset,
+			iomap->blkno != IOMAP_NULL_BLOCK ? iomap->blkno << 9 : 0,
+			iomap->length, flags | FIEMAP_EXTENT_MERGED);
+
+}
+
+static loff_t
+iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+		struct iomap *iomap)
+{
+	struct fiemap_ctx *ctx = data;
+	loff_t ret = length;
+
+	if (iomap->type == IOMAP_HOLE)
+		return length;
+
+	ret = iomap_to_fiemap(ctx->fi, &ctx->prev, 0);
+	ctx->prev = *iomap;
+	switch (ret) {
+	case 0:		/* success */
+		return length;
+	case 1:		/* extent array full */
+		return 0;
+	default:
+		return ret;
+	}
+}
+
+int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
+		loff_t start, loff_t len, struct iomap_ops *ops)
+{
+	struct fiemap_ctx ctx;
+	loff_t ret;
+
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.fi = fi;
+	ctx.prev.type = IOMAP_HOLE;
+
+	ret = fiemap_check_flags(fi, FIEMAP_FLAG_SYNC);
+	if (ret)
+		return ret;
+
+	ret = filemap_write_and_wait(inode->i_mapping);
+	if (ret)
+		return ret;
+
+	while (len > 0) {
+		ret = iomap_apply(inode, start, len, 0, ops, &ctx,
+				iomap_fiemap_actor);
+		if (ret < 0)
+			return ret;
+		if (ret == 0)
+			break;
+
+		start += ret;
+		len -= ret;
+	}
+
+	if (ctx.prev.type != IOMAP_HOLE) {
+		ret = iomap_to_fiemap(fi, &ctx.prev, FIEMAP_EXTENT_LAST);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iomap_fiemap);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 854766f..b3deee1 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -3,6 +3,7 @@
 
 #include <linux/types.h>
 
+struct fiemap_extent_info;
 struct inode;
 struct iov_iter;
 struct kiocb;
@@ -63,5 +64,7 @@ int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 		struct iomap_ops *ops);
 int iomap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 		struct iomap_ops *ops);
+int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+		loff_t start, loff_t len, struct iomap_ops *ops);
 
 #endif /* LINUX_IOMAP_H */
-- 
2.1.4



* [PATCH 09/14] xfs: use iomap fiemap implementation
  2016-06-01 14:44 ` Christoph Hellwig
@ 2016-06-01 14:44   ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-01 14:44 UTC (permalink / raw)
  To: xfs; +Cc: rpeterso, linux-fsdevel

Note that this removes support for the untested FIEMAP_FLAG_XATTR.  It
could be added relatively easily with iomap ops for the attr fork, but
without test coverage I don't feel safe doing this.
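
The conversion itself is a one-liner per filesystem.  As a hedged sketch
for a hypothetical filesystem (examplefs_iomap_ops is assumed to exist
and fill in struct iomap for the data fork):

	static int
	examplefs_fiemap(
		struct inode			*inode,
		struct fiemap_extent_info	*fieinfo,
		u64				start,
		u64				len)
	{
		/* any fs-private locking would go here, as XFS does below */
		return iomap_fiemap(inode, fieinfo, start, len,
				&examplefs_iomap_ops);
	}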

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iops.c | 80 ++++---------------------------------------------------
 1 file changed, 5 insertions(+), 75 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 5d1fdae..985a263 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -44,7 +44,7 @@
 #include <linux/xattr.h>
 #include <linux/posix_acl.h>
 #include <linux/security.h>
-#include <linux/fiemap.h>
+#include <linux/iomap.h>
 #include <linux/slab.h>
 
 /*
@@ -1004,51 +1004,6 @@ xfs_vn_update_time(
 	return xfs_trans_commit(tp);
 }
 
-#define XFS_FIEMAP_FLAGS	(FIEMAP_FLAG_SYNC|FIEMAP_FLAG_XATTR)
-
-/*
- * Call fiemap helper to fill in user data.
- * Returns positive errors to xfs_getbmap.
- */
-STATIC int
-xfs_fiemap_format(
-	void			**arg,
-	struct getbmapx		*bmv,
-	int			*full)
-{
-	int			error;
-	struct fiemap_extent_info *fieinfo = *arg;
-	u32			fiemap_flags = 0;
-	u64			logical, physical, length;
-
-	/* Do nothing for a hole */
-	if (bmv->bmv_block == -1LL)
-		return 0;
-
-	logical = BBTOB(bmv->bmv_offset);
-	physical = BBTOB(bmv->bmv_block);
-	length = BBTOB(bmv->bmv_length);
-
-	if (bmv->bmv_oflags & BMV_OF_PREALLOC)
-		fiemap_flags |= FIEMAP_EXTENT_UNWRITTEN;
-	else if (bmv->bmv_oflags & BMV_OF_DELALLOC) {
-		fiemap_flags |= (FIEMAP_EXTENT_DELALLOC |
-				 FIEMAP_EXTENT_UNKNOWN);
-		physical = 0;   /* no block yet */
-	}
-	if (bmv->bmv_oflags & BMV_OF_LAST)
-		fiemap_flags |= FIEMAP_EXTENT_LAST;
-
-	error = fiemap_fill_next_extent(fieinfo, logical, physical,
-					length, fiemap_flags);
-	if (error > 0) {
-		error = 0;
-		*full = 1;	/* user array now full */
-	}
-
-	return error;
-}
-
 STATIC int
 xfs_vn_fiemap(
 	struct inode		*inode,
@@ -1056,38 +1011,13 @@ xfs_vn_fiemap(
 	u64			start,
 	u64			length)
 {
-	xfs_inode_t		*ip = XFS_I(inode);
-	struct getbmapx		bm;
 	int			error;
 
-	error = fiemap_check_flags(fieinfo, XFS_FIEMAP_FLAGS);
-	if (error)
-		return error;
-
-	/* Set up bmap header for xfs internal routine */
-	bm.bmv_offset = BTOBBT(start);
-	/* Special case for whole file */
-	if (length == FIEMAP_MAX_OFFSET)
-		bm.bmv_length = -1LL;
-	else
-		bm.bmv_length = BTOBB(start + length) - bm.bmv_offset;
-
-	/* We add one because in getbmap world count includes the header */
-	bm.bmv_count = !fieinfo->fi_extents_max ? MAXEXTNUM :
-					fieinfo->fi_extents_max + 1;
-	bm.bmv_count = min_t(__s32, bm.bmv_count,
-			     (PAGE_SIZE * 16 / sizeof(struct getbmapx)));
-	bm.bmv_iflags = BMV_IF_PREALLOC | BMV_IF_NO_HOLES;
-	if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR)
-		bm.bmv_iflags |= BMV_IF_ATTRFORK;
-	if (!(fieinfo->fi_flags & FIEMAP_FLAG_SYNC))
-		bm.bmv_iflags |= BMV_IF_DELALLOC;
-
-	error = xfs_getbmap(ip, &bm, xfs_fiemap_format, fieinfo);
-	if (error)
-		return error;
+	xfs_ilock(XFS_I(inode), XFS_IOLOCK_SHARED);
+	error = iomap_fiemap(inode, fieinfo, start, length, &xfs_iomap_ops);
+	xfs_iunlock(XFS_I(inode), XFS_IOLOCK_SHARED);
 
-	return 0;
+	return error;
 }
 
 STATIC int
-- 
2.1.4



* [PATCH 10/14] xfs: use iomap infrastructure for DAX zeroing
  2016-06-01 14:44 ` Christoph Hellwig
@ 2016-06-01 14:44   ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-01 14:44 UTC (permalink / raw)
  To: xfs; +Cc: rpeterso, linux-fsdevel

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c | 35 +----------------------------------
 fs/xfs/xfs_iops.c |  9 ++-------
 2 files changed, 3 insertions(+), 41 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7316d38..090a90f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -80,34 +80,6 @@ xfs_rw_ilock_demote(
 		inode_unlock(VFS_I(ip));
 }
 
-static int
-xfs_dax_zero_range(
-	struct inode		*inode,
-	loff_t			pos,
-	size_t			count)
-{
-	int			status = 0;
-
-	do {
-		unsigned offset, bytes;
-
-		offset = (pos & (PAGE_SIZE -1)); /* Within page */
-		bytes = PAGE_SIZE - offset;
-		if (bytes > count)
-			bytes = count;
-
-		status = dax_zero_page_range(inode, pos, bytes,
-					     xfs_get_blocks_direct);
-		if (status)
-			break;
-
-		pos += bytes;
-		count -= bytes;
-	} while (count);
-
-	return status;
-}
-
 /*
  * Clear the specified ranges to zero through either the pagecache or DAX.
  * Holes and unwritten extents will be left as-is as they already are zeroed.
@@ -118,12 +90,7 @@ xfs_iozero(
 	loff_t			pos,
 	size_t			count)
 {
-	struct inode		*inode = VFS_I(ip);
-
-	if (IS_DAX(VFS_I(ip)))
-		return xfs_dax_zero_range(inode, pos, count);
-	else
-		return iomap_zero_range(inode, pos, count, NULL, &xfs_iomap_ops);
+	return iomap_zero_range(VFS_I(ip), pos, count, NULL, &xfs_iomap_ops);
 }
 
 int
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 985a263..ab820f8 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -819,13 +819,8 @@ xfs_setattr_size(
 	if (newsize > oldsize) {
 		error = xfs_zero_eof(ip, newsize, oldsize, &did_zeroing);
 	} else {
-		if (IS_DAX(inode)) {
-			error = dax_truncate_page(inode, newsize,
-					xfs_get_blocks_direct);
-		} else {
-			error = iomap_truncate_page(inode, newsize,
-					&did_zeroing, &xfs_iomap_ops);
-		}
+		error = iomap_truncate_page(inode, newsize, &did_zeroing,
+				&xfs_iomap_ops);
 	}
 
 	if (error)
-- 
2.1.4



* [PATCH 11/14] xfs: handle 64-bit length in xfs_iozero
  2016-06-01 14:44 ` Christoph Hellwig
@ 2016-06-01 14:44   ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-01 14:44 UTC (permalink / raw)
  To: xfs; +Cc: rpeterso, linux-fsdevel

We'll want to use this code for large offsets now that we're skipping holes
and unwritten extents efficiently.  Also rename it to xfs_zero_range to be
a bit more descriptive, and tell the caller if we actually did any zeroing.
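
A hedged sketch of the new calling convention (ip, pos and count are
invented for illustration; callers that don't care simply pass NULL):

	bool	did_zero = false;
	int	error;

	/* zero the byte range [pos, pos + count), reporting back */
	error = xfs_zero_range(ip, pos, count, &did_zero);
	if (error)
		return error;
	if (did_zero) {
		/* zeroes were written, e.g. timestamps may need updating */
	}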

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c  | 11 ++++++-----
 fs/xfs/xfs_inode.h |  3 ++-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 090a90f..294e5f4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -85,10 +85,11 @@ xfs_rw_ilock_demote(
  * Holes and unwritten extents will be left as-is as they already are zeroed.
  */
 int
-xfs_iozero(
+xfs_zero_range(
 	struct xfs_inode	*ip,
-	loff_t			pos,
-	size_t			count)
+	xfs_off_t		pos,
+	xfs_off_t		count,
+	bool			*did_zero)
 {
 	return iomap_zero_range(VFS_I(ip), pos, count, did_zero, &xfs_iomap_ops);
 }
@@ -419,7 +420,7 @@ xfs_zero_last_block(
 	if (isize + zero_len > offset)
 		zero_len = offset - isize;
 	*did_zeroing = true;
-	return xfs_iozero(ip, isize, zero_len);
+	return xfs_zero_range(ip, isize, zero_len, NULL);
 }
 
 /*
@@ -518,7 +519,7 @@ xfs_zero_eof(
 		if ((zero_off + zero_len) > offset)
 			zero_len = offset - zero_off;
 
-		error = xfs_iozero(ip, zero_off, zero_len);
+		error = xfs_zero_range(ip, zero_off, zero_len, NULL);
 		if (error)
 			return error;
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index e52d7c7..dbb0bcf 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -434,7 +434,8 @@ int	xfs_update_prealloc_flags(struct xfs_inode *ip,
 				  enum xfs_prealloc_flags flags);
 int	xfs_zero_eof(struct xfs_inode *ip, xfs_off_t offset,
 		     xfs_fsize_t isize, bool *did_zeroing);
-int	xfs_iozero(struct xfs_inode *ip, loff_t pos, size_t count);
+int	xfs_zero_range(struct xfs_inode *ip, xfs_off_t pos, xfs_off_t count,
+		bool *did_zero);
 loff_t	__xfs_seek_hole_data(struct inode *inode, loff_t start,
 			     loff_t eof, int whence);
 
-- 
2.1.4



* [PATCH 12/14] xfs: use xfs_zero_range in xfs_zero_eof
  2016-06-01 14:44 ` Christoph Hellwig
@ 2016-06-01 14:44   ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-01 14:44 UTC (permalink / raw)
  To: xfs; +Cc: rpeterso, linux-fsdevel

xfs_zero_range now skips holes itself, so there is no need to have the
caller do it as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c | 128 +-----------------------------------------------------
 1 file changed, 1 insertion(+), 127 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 294e5f4..713991c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -381,49 +381,6 @@ out:
 }
 
 /*
- * This routine is called to handle zeroing any space in the last block of the
- * file that is beyond the EOF.  We do this since the size is being increased
- * without writing anything to that block and we don't want to read the
- * garbage on the disk.
- */
-STATIC int				/* error (positive) */
-xfs_zero_last_block(
-	struct xfs_inode	*ip,
-	xfs_fsize_t		offset,
-	xfs_fsize_t		isize,
-	bool			*did_zeroing)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	xfs_fileoff_t		last_fsb = XFS_B_TO_FSBT(mp, isize);
-	int			zero_offset = XFS_B_FSB_OFFSET(mp, isize);
-	int			zero_len;
-	int			nimaps = 1;
-	int			error = 0;
-	struct xfs_bmbt_irec	imap;
-
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	error = xfs_bmapi_read(ip, last_fsb, 1, &imap, &nimaps, 0);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	if (error)
-		return error;
-
-	ASSERT(nimaps > 0);
-
-	/*
-	 * If the block underlying isize is just a hole, then there
-	 * is nothing to zero.
-	 */
-	if (imap.br_startblock == HOLESTARTBLOCK)
-		return 0;
-
-	zero_len = mp->m_sb.sb_blocksize - zero_offset;
-	if (isize + zero_len > offset)
-		zero_len = offset - isize;
-	*did_zeroing = true;
-	return xfs_zero_range(ip, isize, zero_len, NULL);
-}
-
-/*
  * Zero any on disk space between the current EOF and the new, larger EOF.
  *
  * This handles the normal case of zeroing the remainder of the last block in
@@ -441,94 +398,11 @@ xfs_zero_eof(
 	xfs_fsize_t		isize,		/* current inode size */
 	bool			*did_zeroing)
 {
-	struct xfs_mount	*mp = ip->i_mount;
-	xfs_fileoff_t		start_zero_fsb;
-	xfs_fileoff_t		end_zero_fsb;
-	xfs_fileoff_t		zero_count_fsb;
-	xfs_fileoff_t		last_fsb;
-	xfs_fileoff_t		zero_off;
-	xfs_fsize_t		zero_len;
-	int			nimaps;
-	int			error = 0;
-	struct xfs_bmbt_irec	imap;
-
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
 	ASSERT(offset > isize);
 
 	trace_xfs_zero_eof(ip, isize, offset - isize);
-
-	/*
-	 * First handle zeroing the block on which isize resides.
-	 *
-	 * We only zero a part of that block so it is handled specially.
-	 */
-	if (XFS_B_FSB_OFFSET(mp, isize) != 0) {
-		error = xfs_zero_last_block(ip, offset, isize, did_zeroing);
-		if (error)
-			return error;
-	}
-
-	/*
-	 * Calculate the range between the new size and the old where blocks
-	 * needing to be zeroed may exist.
-	 *
-	 * To get the block where the last byte in the file currently resides,
-	 * we need to subtract one from the size and truncate back to a block
-	 * boundary.  We subtract 1 in case the size is exactly on a block
-	 * boundary.
-	 */
-	last_fsb = isize ? XFS_B_TO_FSBT(mp, isize - 1) : (xfs_fileoff_t)-1;
-	start_zero_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)isize);
-	end_zero_fsb = XFS_B_TO_FSBT(mp, offset - 1);
-	ASSERT((xfs_sfiloff_t)last_fsb < (xfs_sfiloff_t)start_zero_fsb);
-	if (last_fsb == end_zero_fsb) {
-		/*
-		 * The size was only incremented on its last block.
-		 * We took care of that above, so just return.
-		 */
-		return 0;
-	}
-
-	ASSERT(start_zero_fsb <= end_zero_fsb);
-	while (start_zero_fsb <= end_zero_fsb) {
-		nimaps = 1;
-		zero_count_fsb = end_zero_fsb - start_zero_fsb + 1;
-
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		error = xfs_bmapi_read(ip, start_zero_fsb, zero_count_fsb,
-					  &imap, &nimaps, 0);
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		if (error)
-			return error;
-
-		ASSERT(nimaps > 0);
-
-		if (imap.br_state == XFS_EXT_UNWRITTEN ||
-		    imap.br_startblock == HOLESTARTBLOCK) {
-			start_zero_fsb = imap.br_startoff + imap.br_blockcount;
-			ASSERT(start_zero_fsb <= (end_zero_fsb + 1));
-			continue;
-		}
-
-		/*
-		 * There are blocks we need to zero.
-		 */
-		zero_off = XFS_FSB_TO_B(mp, start_zero_fsb);
-		zero_len = XFS_FSB_TO_B(mp, imap.br_blockcount);
-
-		if ((zero_off + zero_len) > offset)
-			zero_len = offset - zero_off;
-
-		error = xfs_zero_range(ip, zero_off, zero_len, NULL);
-		if (error)
-			return error;
-
-		*did_zeroing = true;
-		start_zero_fsb = imap.br_startoff + imap.br_blockcount;
-		ASSERT(start_zero_fsb <= (end_zero_fsb + 1));
-	}
-
-	return 0;
+	return xfs_zero_range(ip, isize, offset - isize, did_zeroing);
 }
 
 /*
-- 
2.1.4



* [PATCH 13/14] xfs: split xfs_free_file_space in manageable pieces
  2016-06-01 14:44 ` Christoph Hellwig
@ 2016-06-01 14:44   ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-01 14:44 UTC (permalink / raw)
  To: xfs; +Cc: rpeterso, linux-fsdevel

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_bmap_util.c | 252 +++++++++++++++++++++++++++----------------------
 1 file changed, 137 insertions(+), 115 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 586bb64..e36664f 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1184,30 +1184,132 @@ xfs_zero_remaining_bytes(
 	return error;
 }
 
+static int
+xfs_unmap_extent(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		startoffset_fsb,
+	xfs_filblks_t		len_fsb,
+	int			*done)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	struct xfs_bmap_free	free_list;
+	xfs_fsblock_t		firstfsb;
+	uint			resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
+	int			error;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
+	if (error) {
+		ASSERT(error == -ENOSPC || XFS_FORCED_SHUTDOWN(mp));
+		return error;
+	}
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	error = xfs_trans_reserve_quota(tp, mp, ip->i_udquot, ip->i_gdquot,
+			ip->i_pdquot, resblks, 0, XFS_QMOPT_RES_REGBLKS);
+	if (error)
+		goto out_trans_cancel;
+
+	xfs_trans_ijoin(tp, ip, 0);
+
+	xfs_bmap_init(&free_list, &firstfsb);
+	error = xfs_bunmapi(tp, ip, startoffset_fsb, len_fsb, 0, 2, &firstfsb,
+			&free_list, done);
+	if (error)
+		goto out_bmap_cancel;
+
+	error = xfs_bmap_finish(&tp, &free_list, NULL);
+	if (error)
+		goto out_bmap_cancel;
+
+	error = xfs_trans_commit(tp);
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+
+out_bmap_cancel:
+	xfs_bmap_cancel(&free_list);
+out_trans_cancel:
+	xfs_trans_cancel(tp);
+	goto out_unlock;
+}
+
+static int
+xfs_adjust_extent_unmap_boundaries(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		*startoffset_fsb,
+	xfs_fileoff_t		*endoffset_fsb)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_bmbt_irec	imap;
+	int			nimap, error;
+	xfs_extlen_t		mod = 0;
+
+	nimap = 1;
+	error = xfs_bmapi_read(ip, *startoffset_fsb, 1, &imap, &nimap, 0);
+	if (error)
+		return error;
+
+	if (nimap && imap.br_startblock != HOLESTARTBLOCK) {
+		xfs_daddr_t	block;
+
+		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
+		block = imap.br_startblock;
+		mod = do_div(block, mp->m_sb.sb_rextsize);
+		if (mod)
+			*startoffset_fsb += mp->m_sb.sb_rextsize - mod;
+	}
+
+	nimap = 1;
+	error = xfs_bmapi_read(ip, *endoffset_fsb - 1, 1, &imap, &nimap, 0);
+	if (error)
+		return error;
+
+	if (nimap && imap.br_startblock != HOLESTARTBLOCK) {
+		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
+		mod++;
+		if (mod && mod != mp->m_sb.sb_rextsize)
+			*endoffset_fsb -= mod;
+	}
+
+	return 0;
+}
+
+static int
+xfs_flush_unmap_range(
+	struct xfs_inode	*ip,
+	xfs_off_t		offset,
+	xfs_off_t		len)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct inode		*inode = VFS_I(ip);
+	xfs_off_t		rounding, start, end;
+	int			error;
+
+	/* wait for the completion of any pending DIOs */
+	inode_dio_wait(inode);
+
+	rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_SIZE);
+	start = round_down(offset, rounding);
+	end = round_up(offset + len, rounding) - 1;
+
+	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
+	if (error)
+		return error;
+	truncate_pagecache_range(inode, start, end);
+	return 0;
+}
+
 int
 xfs_free_file_space(
 	struct xfs_inode	*ip,
 	xfs_off_t		offset,
 	xfs_off_t		len)
 {
-	int			done;
-	xfs_fileoff_t		endoffset_fsb;
-	int			error;
-	xfs_fsblock_t		firstfsb;
-	xfs_bmap_free_t		free_list;
-	xfs_bmbt_irec_t		imap;
-	xfs_off_t		ioffset;
-	xfs_off_t		iendoffset;
-	xfs_extlen_t		mod=0;
-	xfs_mount_t		*mp;
-	int			nimap;
-	uint			resblks;
-	xfs_off_t		rounding;
-	int			rt;
+	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		startoffset_fsb;
-	xfs_trans_t		*tp;
-
-	mp = ip->i_mount;
+	xfs_fileoff_t		endoffset_fsb;
+	int			done, error;
 
 	trace_xfs_free_file_space(ip);
 
@@ -1215,60 +1317,30 @@ xfs_free_file_space(
 	if (error)
 		return error;
 
-	error = 0;
 	if (len <= 0)	/* if nothing being freed */
-		return error;
-	rt = XFS_IS_REALTIME_INODE(ip);
-	startoffset_fsb	= XFS_B_TO_FSB(mp, offset);
-	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
-
-	/* wait for the completion of any pending DIOs */
-	inode_dio_wait(VFS_I(ip));
+		return 0;
 
-	rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_SIZE);
-	ioffset = round_down(offset, rounding);
-	iendoffset = round_up(offset + len, rounding) - 1;
-	error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, ioffset,
-					     iendoffset);
+	error = xfs_flush_unmap_range(ip, offset, len);
 	if (error)
-		goto out;
-	truncate_pagecache_range(VFS_I(ip), ioffset, iendoffset);
+		return error;
+
+	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
+	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
 
 	/*
-	 * Need to zero the stuff we're not freeing, on disk.
-	 * If it's a realtime file & can't use unwritten extents then we
-	 * actually need to zero the extent edges.  Otherwise xfs_bunmapi
-	 * will take care of it for us.
+	 * Need to zero the stuff we're not freeing, on disk.  If it's a RT file
+	 * and we can't use unwritten extents then we actually need to zero
+	 * the whole extent, otherwise we just need to take care of block
+	 * boundaries, and xfs_bunmapi will handle the rest.
 	 */
-	if (rt && !xfs_sb_version_hasextflgbit(&mp->m_sb)) {
-		nimap = 1;
-		error = xfs_bmapi_read(ip, startoffset_fsb, 1,
-					&imap, &nimap, 0);
+	if (XFS_IS_REALTIME_INODE(ip) &&
+	    !xfs_sb_version_hasextflgbit(&mp->m_sb)) {
+		error = xfs_adjust_extent_unmap_boundaries(ip, &startoffset_fsb,
+				&endoffset_fsb);
 		if (error)
-			goto out;
-		ASSERT(nimap == 0 || nimap == 1);
-		if (nimap && imap.br_startblock != HOLESTARTBLOCK) {
-			xfs_daddr_t	block;
-
-			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
-			block = imap.br_startblock;
-			mod = do_div(block, mp->m_sb.sb_rextsize);
-			if (mod)
-				startoffset_fsb += mp->m_sb.sb_rextsize - mod;
-		}
-		nimap = 1;
-		error = xfs_bmapi_read(ip, endoffset_fsb - 1, 1,
-					&imap, &nimap, 0);
-		if (error)
-			goto out;
-		ASSERT(nimap == 0 || nimap == 1);
-		if (nimap && imap.br_startblock != HOLESTARTBLOCK) {
-			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
-			mod++;
-			if (mod && (mod != mp->m_sb.sb_rextsize))
-				endoffset_fsb -= mod;
-		}
+			return error;
 	}
+
 	if ((done = (endoffset_fsb <= startoffset_fsb)))
 		/*
 		 * One contiguous piece to clear
@@ -1288,62 +1360,12 @@ xfs_free_file_space(
 				offset + len - 1);
 	}
 
-	/*
-	 * free file space until done or until there is an error
-	 */
-	resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
 	while (!error && !done) {
-
-		/*
-		 * allocate and setup the transaction. Allow this
-		 * transaction to dip into the reserve blocks to ensure
-		 * the freeing of the space succeeds at ENOSPC.
-		 */
-		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0,
-				&tp);
-		if (error) {
-			ASSERT(error == -ENOSPC || XFS_FORCED_SHUTDOWN(mp));
-			break;
-		}
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		error = xfs_trans_reserve_quota(tp, mp,
-				ip->i_udquot, ip->i_gdquot, ip->i_pdquot,
-				resblks, 0, XFS_QMOPT_RES_REGBLKS);
-		if (error)
-			goto error1;
-
-		xfs_trans_ijoin(tp, ip, 0);
-
-		/*
-		 * issue the bunmapi() call to free the blocks
-		 */
-		xfs_bmap_init(&free_list, &firstfsb);
-		error = xfs_bunmapi(tp, ip, startoffset_fsb,
-				  endoffset_fsb - startoffset_fsb,
-				  0, 2, &firstfsb, &free_list, &done);
-		if (error)
-			goto error0;
-
-		/*
-		 * complete the transaction
-		 */
-		error = xfs_bmap_finish(&tp, &free_list, NULL);
-		if (error)
-			goto error0;
-
-		error = xfs_trans_commit(tp);
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		error = xfs_unmap_extent(ip, startoffset_fsb,
+				endoffset_fsb - startoffset_fsb, &done);
 	}
 
- out:
 	return error;
-
- error0:
-	xfs_bmap_cancel(&free_list);
- error1:
-	xfs_trans_cancel(tp);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	goto out;
 }
 
 /*
-- 
2.1.4



* [PATCH 14/14] xfs: kill xfs_zero_remaining_bytes
  2016-06-01 14:44 ` Christoph Hellwig
@ 2016-06-01 14:44   ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-01 14:44 UTC (permalink / raw)
  To: xfs; +Cc: rpeterso, linux-fsdevel

Instead punch out the hole first, and then use our zeroing helper
to zero the partial blocks at the edges.
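
To make the rounding concrete (hypothetical numbers, 4096-byte blocks):
freeing offset 1000, length 8001 covers bytes 1000..9000.  The start is
rounded up to block 1 (byte 4096) and the end rounded down to block 2
(byte 8192), so xfs_unmap_extent() punches exactly block 1, and
xfs_zero_range() then zeroes the sub-block edges 1000..4095 and
8192..9000, skipping the hole it just created in between.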

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_bmap_util.c | 133 ++++++-------------------------------------------
 1 file changed, 14 insertions(+), 119 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index e36664f..569f03d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1089,101 +1089,6 @@ error1:	/* Just cancel transaction */
 	return error;
 }
 
-/*
- * Zero file bytes between startoff and endoff inclusive.
- * The iolock is held exclusive and no blocks are buffered.
- *
- * This function is used by xfs_free_file_space() to zero
- * partial blocks when the range to free is not block aligned.
- * When unreserving space with boundaries that are not block
- * aligned we round up the start and round down the end
- * boundaries and then use this function to zero the parts of
- * the blocks that got dropped during the rounding.
- */
-STATIC int
-xfs_zero_remaining_bytes(
-	xfs_inode_t		*ip,
-	xfs_off_t		startoff,
-	xfs_off_t		endoff)
-{
-	xfs_bmbt_irec_t		imap;
-	xfs_fileoff_t		offset_fsb;
-	xfs_off_t		lastoffset;
-	xfs_off_t		offset;
-	xfs_buf_t		*bp;
-	xfs_mount_t		*mp = ip->i_mount;
-	int			nimap;
-	int			error = 0;
-
-	/*
-	 * Avoid doing I/O beyond eof - it's not necessary
-	 * since nothing can read beyond eof.  The space will
-	 * be zeroed when the file is extended anyway.
-	 */
-	if (startoff >= XFS_ISIZE(ip))
-		return 0;
-
-	if (endoff > XFS_ISIZE(ip))
-		endoff = XFS_ISIZE(ip);
-
-	for (offset = startoff; offset <= endoff; offset = lastoffset + 1) {
-		uint lock_mode;
-
-		offset_fsb = XFS_B_TO_FSBT(mp, offset);
-		nimap = 1;
-
-		lock_mode = xfs_ilock_data_map_shared(ip);
-		error = xfs_bmapi_read(ip, offset_fsb, 1, &imap, &nimap, 0);
-		xfs_iunlock(ip, lock_mode);
-
-		if (error || nimap < 1)
-			break;
-		ASSERT(imap.br_blockcount >= 1);
-		ASSERT(imap.br_startoff == offset_fsb);
-		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
-
-		if (imap.br_startblock == HOLESTARTBLOCK ||
-		    imap.br_state == XFS_EXT_UNWRITTEN) {
-			/* skip the entire extent */
-			lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff +
-						      imap.br_blockcount) - 1;
-			continue;
-		}
-
-		lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff + 1) - 1;
-		if (lastoffset > endoff)
-			lastoffset = endoff;
-
-		/* DAX can just zero the backing device directly */
-		if (IS_DAX(VFS_I(ip))) {
-			error = dax_zero_page_range(VFS_I(ip), offset,
-						    lastoffset - offset + 1,
-						    xfs_get_blocks_direct);
-			if (error)
-				return error;
-			continue;
-		}
-
-		error = xfs_buf_read_uncached(XFS_IS_REALTIME_INODE(ip) ?
-				mp->m_rtdev_targp : mp->m_ddev_targp,
-				xfs_fsb_to_db(ip, imap.br_startblock),
-				BTOBB(mp->m_sb.sb_blocksize),
-				0, &bp, NULL);
-		if (error)
-			return error;
-
-		memset(bp->b_addr +
-				(offset - XFS_FSB_TO_B(mp, imap.br_startoff)),
-		       0, lastoffset - offset + 1);
-
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
-		if (error)
-			return error;
-	}
-	return error;
-}
-
 static int
 xfs_unmap_extent(
 	struct xfs_inode	*ip,
@@ -1309,7 +1214,7 @@ xfs_free_file_space(
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		startoffset_fsb;
 	xfs_fileoff_t		endoffset_fsb;
-	int			done, error;
+	int			done = 0, error;
 
 	trace_xfs_free_file_space(ip);
 
@@ -1341,31 +1246,21 @@ xfs_free_file_space(
 			return error;
 	}
 
-	if ((done = (endoffset_fsb <= startoffset_fsb)))
-		/*
-		 * One contiguous piece to clear
-		 */
-		error = xfs_zero_remaining_bytes(ip, offset, offset + len - 1);
-	else {
-		/*
-		 * Some full blocks, possibly two pieces to clear
-		 */
-		if (offset < XFS_FSB_TO_B(mp, startoffset_fsb))
-			error = xfs_zero_remaining_bytes(ip, offset,
-				XFS_FSB_TO_B(mp, startoffset_fsb) - 1);
-		if (!error &&
-		    XFS_FSB_TO_B(mp, endoffset_fsb) < offset + len)
-			error = xfs_zero_remaining_bytes(ip,
-				XFS_FSB_TO_B(mp, endoffset_fsb),
-				offset + len - 1);
-	}
-
-	while (!error && !done) {
-		error = xfs_unmap_extent(ip, startoffset_fsb,
-				endoffset_fsb - startoffset_fsb, &done);
+	if (endoffset_fsb > startoffset_fsb) {
+		while (!done) {
+			error = xfs_unmap_extent(ip, startoffset_fsb,
+					endoffset_fsb - startoffset_fsb, &done);
+			if (error)
+				return error;
+		}
 	}
 
-	return error;
+	/*
+	 * Now that we've unmapped all full blocks we'll have to zero out any
+	 * partial block at the beginning and/or end.  xfs_zero_range is
+	 * smart enough to skip any holes, including those we just created.
+	 */
+	return xfs_zero_range(ip, offset, len, NULL);
 }
 
 /*
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 14/14] xfs: kill xfs_zero_remaining_bytes
@ 2016-06-01 14:44   ` Christoph Hellwig
  0 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-01 14:44 UTC (permalink / raw)
  To: xfs; +Cc: rpeterso, linux-fsdevel

Instead punch the whole first, and the use the our zeroing helper
to punch out the edge blocks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_bmap_util.c | 133 ++++++-------------------------------------------
 1 file changed, 14 insertions(+), 119 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index e36664f..569f03d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1089,101 +1089,6 @@ error1:	/* Just cancel transaction */
 	return error;
 }
 
-/*
- * Zero file bytes between startoff and endoff inclusive.
- * The iolock is held exclusive and no blocks are buffered.
- *
- * This function is used by xfs_free_file_space() to zero
- * partial blocks when the range to free is not block aligned.
- * When unreserving space with boundaries that are not block
- * aligned we round up the start and round down the end
- * boundaries and then use this function to zero the parts of
- * the blocks that got dropped during the rounding.
- */
-STATIC int
-xfs_zero_remaining_bytes(
-	xfs_inode_t		*ip,
-	xfs_off_t		startoff,
-	xfs_off_t		endoff)
-{
-	xfs_bmbt_irec_t		imap;
-	xfs_fileoff_t		offset_fsb;
-	xfs_off_t		lastoffset;
-	xfs_off_t		offset;
-	xfs_buf_t		*bp;
-	xfs_mount_t		*mp = ip->i_mount;
-	int			nimap;
-	int			error = 0;
-
-	/*
-	 * Avoid doing I/O beyond eof - it's not necessary
-	 * since nothing can read beyond eof.  The space will
-	 * be zeroed when the file is extended anyway.
-	 */
-	if (startoff >= XFS_ISIZE(ip))
-		return 0;
-
-	if (endoff > XFS_ISIZE(ip))
-		endoff = XFS_ISIZE(ip);
-
-	for (offset = startoff; offset <= endoff; offset = lastoffset + 1) {
-		uint lock_mode;
-
-		offset_fsb = XFS_B_TO_FSBT(mp, offset);
-		nimap = 1;
-
-		lock_mode = xfs_ilock_data_map_shared(ip);
-		error = xfs_bmapi_read(ip, offset_fsb, 1, &imap, &nimap, 0);
-		xfs_iunlock(ip, lock_mode);
-
-		if (error || nimap < 1)
-			break;
-		ASSERT(imap.br_blockcount >= 1);
-		ASSERT(imap.br_startoff == offset_fsb);
-		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
-
-		if (imap.br_startblock == HOLESTARTBLOCK ||
-		    imap.br_state == XFS_EXT_UNWRITTEN) {
-			/* skip the entire extent */
-			lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff +
-						      imap.br_blockcount) - 1;
-			continue;
-		}
-
-		lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff + 1) - 1;
-		if (lastoffset > endoff)
-			lastoffset = endoff;
-
-		/* DAX can just zero the backing device directly */
-		if (IS_DAX(VFS_I(ip))) {
-			error = dax_zero_page_range(VFS_I(ip), offset,
-						    lastoffset - offset + 1,
-						    xfs_get_blocks_direct);
-			if (error)
-				return error;
-			continue;
-		}
-
-		error = xfs_buf_read_uncached(XFS_IS_REALTIME_INODE(ip) ?
-				mp->m_rtdev_targp : mp->m_ddev_targp,
-				xfs_fsb_to_db(ip, imap.br_startblock),
-				BTOBB(mp->m_sb.sb_blocksize),
-				0, &bp, NULL);
-		if (error)
-			return error;
-
-		memset(bp->b_addr +
-				(offset - XFS_FSB_TO_B(mp, imap.br_startoff)),
-		       0, lastoffset - offset + 1);
-
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
-		if (error)
-			return error;
-	}
-	return error;
-}
-
 static int
 xfs_unmap_extent(
 	struct xfs_inode	*ip,
@@ -1309,7 +1214,7 @@ xfs_free_file_space(
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		startoffset_fsb;
 	xfs_fileoff_t		endoffset_fsb;
-	int			done, error;
+	int			done = 0, error;
 
 	trace_xfs_free_file_space(ip);
 
@@ -1341,31 +1246,21 @@ xfs_free_file_space(
 			return error;
 	}
 
-	if ((done = (endoffset_fsb <= startoffset_fsb)))
-		/*
-		 * One contiguous piece to clear
-		 */
-		error = xfs_zero_remaining_bytes(ip, offset, offset + len - 1);
-	else {
-		/*
-		 * Some full blocks, possibly two pieces to clear
-		 */
-		if (offset < XFS_FSB_TO_B(mp, startoffset_fsb))
-			error = xfs_zero_remaining_bytes(ip, offset,
-				XFS_FSB_TO_B(mp, startoffset_fsb) - 1);
-		if (!error &&
-		    XFS_FSB_TO_B(mp, endoffset_fsb) < offset + len)
-			error = xfs_zero_remaining_bytes(ip,
-				XFS_FSB_TO_B(mp, endoffset_fsb),
-				offset + len - 1);
-	}
-
-	while (!error && !done) {
-		error = xfs_unmap_extent(ip, startoffset_fsb,
-				endoffset_fsb - startoffset_fsb, &done);
+	if (endoffset_fsb > startoffset_fsb) {
+		while (!done) {
+			error = xfs_unmap_extent(ip, startoffset_fsb,
+					endoffset_fsb - startoffset_fsb, &done);
+			if (error)
+				return error;
+		}
 	}
 
-	return error;
+	/*
+	 * Now that we've unmap all full blocks we'll have to zero out any
+	 * partial block at the beginning and/or end.  xfs_zero_range is
+	 * smart enough to skip any holes, including those we just created.
+	 */
+	return xfs_zero_range(ip, offset, len, NULL);
 }
 
 /*
-- 
2.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 67+ messages in thread
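
The rewritten xfs_free_file_space() above splits a punched range into whole
blocks, which get unmapped, and sub-block edges, which are only zeroed via
xfs_zero_range().  A minimal user-space sketch of that split follows; the
block size and offsets are invented, and plain integer rounding stands in
for the XFS_B_TO_FSB()/XFS_B_TO_FSBT() conversions:

#include <stdio.h>

int main(void)
{
	unsigned long long blksz = 1024;	/* 1k blocks, as in the test setups */
	unsigned long long offset = 1500, len = 10000;

	/* first whole block inside the range: round the start up */
	unsigned long long start_fsb = (offset + blksz - 1) / blksz;
	/* one past the last whole block: round the end down */
	unsigned long long end_fsb = (offset + len) / blksz;

	if (end_fsb > start_fsb)
		printf("unmap fsb [%llu, %llu)\n", start_fsb, end_fsb);

	/* xfs_zero_range() then zeroes the partial edges, skipping holes */
	printf("zero bytes [%llu, %llu)\n", offset, offset + len);
	return 0;
}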

* Re: iomap infrastructure and multipage writes V5
  2016-06-01 14:44 ` Christoph Hellwig
@ 2016-06-01 14:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-01 14:46 UTC (permalink / raw)
  To: xfs; +Cc: rpeterso, linux-fsdevel

On Wed, Jun 01, 2016 at 04:44:43PM +0200, Christoph Hellwig wrote:
> Note that patch 1 conflicts with Vishals dax error handling series.
> It would be great to have a stable branch with it so that both the
> XFS and nvdimm tree could pull it in before the other changes in this
> area.

Please ignore this note - that former patch 1 has been merged in
4.7-rc1.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 02/14] fs: introduce iomap infrastructure
  2016-06-01 14:44   ` Christoph Hellwig
@ 2016-06-16 16:12     ` Jan Kara
  -1 siblings, 0 replies; 67+ messages in thread
From: Jan Kara @ 2016-06-16 16:12 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs, rpeterso, linux-fsdevel

On Wed 01-06-16 16:44:45, Christoph Hellwig wrote:
> Add infrastructure for multipage buffered writes.  This is implemented
> using a main iterator that applies an actor function to a range that
> can be written.
> 
> This infrastructure is used to implement a buffered write helper, one
> to zero file ranges and one to implement the ->page_mkwrite VM
> operations.  All of them borrow a fair amount of code from fs/buffer.c
> for now by using an internal version of __block_write_begin that
> gets passed an iomap and builds the corresponding buffer head.
> 
> The file system gets a set of paired ->iomap_begin and ->iomap_end
> calls which allow it to map/reserve a range and get a notification
> once the write code is finished with it.
> 
> Based on earlier code from Dave Chinner.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Bob Peterson <rpeterso@redhat.com>

When reading through the patch I've noticed two minor errors in the
comments:

...
> +static loff_t
> +iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> +		struct iomap_ops *ops, void *data, iomap_actor_t actor)
> +{
> +	struct iomap iomap = { 0 };
> +	loff_t written = 0, ret;
> +
> +	/*
> +	 * Need to map a range from start position for count bytes. This can
						       ^^^^ length
...

>  struct iomap {
> -	sector_t	blkno;	/* first sector of mapping */
> -	loff_t		offset;	/* file offset of mapping, bytes */
> -	u64		length;	/* length of mapping, bytes */
> -	int		type;	/* type of mapping */
> +	sector_t		blkno;	/* first sector of mapping, fs blocks */

This is actually in 512-byte units, not in fs blocks as the comment
suggests, right?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 67+ messages in thread
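
The unit confusion Jan points out is easy to make concrete: blkno is a
sector_t counted in 512-byte units, so turning it into a byte address is a
shift by 9, not a multiplication by the filesystem block size.  A trimmed
user-space stand-in (not the real kernel struct definition):

#include <stdint.h>
#include <stdio.h>

/* trimmed stand-in for struct iomap; fields as quoted above */
struct iomap_sketch {
	uint64_t blkno;		/* first sector of mapping, 512-byte units */
	int64_t  offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of mapping, bytes */
	int      type;		/* type of mapping */
};

int main(void)
{
	struct iomap_sketch m = { .blkno = 8, .offset = 0, .length = 4096 };

	/* byte address on disk: shift by 9, not by the fs block size */
	printf("disk byte offset = %llu\n", (unsigned long long)m.blkno << 9);
	return 0;
}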

* Re: [PATCH 02/14] fs: introduce iomap infrastructure
  2016-06-16 16:12     ` Jan Kara
@ 2016-06-17 12:01       ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-17 12:01 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, xfs, rpeterso, linux-fsdevel

On Thu, Jun 16, 2016 at 06:12:42PM +0200, Jan Kara wrote:
> This is actually in 512-byte units, not in fs blocks as the comment
> suggests, right?

Indeed, thanks for spotting this!

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 02/14] fs: introduce iomap infrastructure
  2016-06-17 12:01       ` Christoph Hellwig
@ 2016-06-20  2:29         ` Dave Chinner
  -1 siblings, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2016-06-20  2:29 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, rpeterso, linux-fsdevel, xfs

On Fri, Jun 17, 2016 at 02:01:44PM +0200, Christoph Hellwig wrote:
> On Thu, Jun 16, 2016 at 06:12:42PM +0200, Jan Kara wrote:
> > This is actually in 512-byte units, not in fs blocks as the comment
> > suggests, right?
> 
> Indeed, thanks for spotting this!

I'll update the patches I've already tested, Christoph, so you don't
have to post an update just for this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 02/14] fs: introduce iomap infrastructure
  2016-06-20  2:29         ` Dave Chinner
@ 2016-06-20 12:22           ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-20 12:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, Jan Kara, rpeterso, linux-fsdevel, xfs

On Mon, Jun 20, 2016 at 12:29:52PM +1000, Dave Chinner wrote:
> On Fri, Jun 17, 2016 at 02:01:44PM +0200, Christoph Hellwig wrote:
> > On Thu, Jun 16, 2016 at 06:12:42PM +0200, Jan Kara wrote:
> > > This is actually in 512-byte units, not in fs blocks as the comment
> > > suggests, right?
> > 
> > Indeed, thanks for spotting this!
> 
> I'll update the patches I've already tested, Christoph, so you don't
> have to post an update just for this.

Ok, thanks.  Can't wait for the code to finally go in..

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: iomap infrastructure and multipage writes V5
  2016-06-01 14:44 ` Christoph Hellwig
@ 2016-06-28  0:26   ` Dave Chinner
  -1 siblings, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2016-06-28  0:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs, rpeterso, linux-fsdevel

On Wed, Jun 01, 2016 at 04:44:43PM +0200, Christoph Hellwig wrote:
> This series add a new file system I/O path that uses the iomap structure
> introduced for the pNFS support and support multi-page buffered writes.
> 
> This was first started by Dave Chinner a long time ago, then I did beat
> it into shape for production runs in a very constrained ARM NAS
> enviroment for Tuxera almost as long ago, and now half a dozen rewrites
> later it's back.
> 
> The basic idea is to avoid the per-block get_blocks overhead
> and make use of extents in the buffered write path by iterating over
> them instead.

Christoph, it looks like there's an ENOSPC+ENOMEM behavioural regression here.
generic/224 on my 1p/1GB RAM VM using a 1k block size filesystem has
significantly different behaviour once ENOSPC is hit with this patchset.

It ends up with an endless stream of errors like this:

[  687.530641] XFS (sdc): page discard on page ffffea0000197700, inode 0xb52c, offset 400338944.
[  687.539828] XFS (sdc): page discard on page ffffea0000197740, inode 0xb52c, offset 400343040.
[  687.549035] XFS (sdc): page discard on page ffffea0000197780, inode 0xb52c, offset 400347136.
[  687.558222] XFS (sdc): page discard on page ffffea00001977c0, inode 0xb52c, offset 400351232.
[  687.567391] XFS (sdc): page discard on page ffffea0000846000, inode 0xb52c, offset 400355328.
[  687.576602] XFS (sdc): page discard on page ffffea0000846040, inode 0xb52c, offset 400359424.
[  687.585794] XFS (sdc): page discard on page ffffea0000846080, inode 0xb52c, offset 400363520.
[  687.595005] XFS (sdc): page discard on page ffffea00008460c0, inode 0xb52c, offset 400367616.

Yeah, it's been going for ten minutes already, and it's reporting
this for every page on inode 0xb52c. df reports:

Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sdc         1038336 1038334         2 100% /mnt/scratch


So it's looking very much like the iomap write is allowing pages
through the write side rather than giving ENOSPC. i.e. the initial
alloc fails, it triggers a flush, which triggers a page discard,
which then allows the next write to proceed because it's freed up
blocks. The trace looks like this:

              dd-12284 [000]   983.936583: xfs_file_buffered_write: dev 8:32 ino 0xb52c size 0xa00000 offset 0x1f37a000 count 0x1000 ioflags 
              dd-12284 [000]   983.936584: xfs_ilock:            dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff8150bb47s
              dd-12284 [000]   983.936585: xfs_iomap_prealloc_size: dev 8:32 ino 0xb52c prealloc blocks 64 shift 6 m_writeio_blocks 64
              dd-12284 [000]   983.936585: xfs_delalloc_enospc:  dev 8:32 ino 0xb52c isize 0x1f37a000 disize 0xa00000 offset 0x1f37a000 count 4096
              dd-12284 [000]   983.936586: xfs_delalloc_enospc:  dev 8:32 ino 0xb52c isize 0x1f37a000 disize 0xa00000 offset 0x1f37a000 count 4096
              dd-12284 [000]   983.936586: xfs_iunlock:          dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff8150bc61s
    kworker/u2:0-6     [000]   983.946649: xfs_writepage:        dev 8:32 ino 0xb52c pgoff 0x1f379000 size 0x1f37a000 offset 0 length 0 delalloc 1 unwritten 0
    kworker/u2:0-6     [000]   983.946650: xfs_ilock:            dev 8:32 ino 0xb52c flags ILOCK_SHARED caller 0xffffffff814f4b49s
    kworker/u2:0-6     [000]   983.946651: xfs_iunlock:          dev 8:32 ino 0xb52c flags ILOCK_SHARED caller 0xffffffff814f4bf5s
    kworker/u2:0-6     [000]   983.948093: xfs_ilock:            dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff814f555fs
    kworker/u2:0-6     [000]   983.948095: xfs_bunmap:           dev 8:32 ino 0xb52c size 0xa00000 bno 0x7cde4 len 0x1flags  caller 0xffffffff814f9123s
    kworker/u2:0-6     [000]   983.948096: xfs_bmap_pre_update:  dev 8:32 ino 0xb52c state  idx 0 offset 511460 block 4503599627239431 count 4 flag 0 caller 0xffffffff814c6ea3s
    kworker/u2:0-6     [000]   983.948097: xfs_bmap_post_update: dev 8:32 ino 0xb52c state  idx 0 offset 511461 block 4503599627239431 count 3 flag 0 caller 0xffffffff814c6f1es
    kworker/u2:0-6     [000]   983.948097: xfs_bunmap:           dev 8:32 ino 0xb52c size 0xa00000 bno 0x7cde5 len 0x1flags  caller 0xffffffff814f9123s
    kworker/u2:0-6     [000]   983.948098: xfs_bmap_pre_update:  dev 8:32 ino 0xb52c state  idx 0 offset 511461 block 4503599627239431 count 3 flag 0 caller 0xffffffff814c6ea3s
    kworker/u2:0-6     [000]   983.948098: xfs_bmap_post_update: dev 8:32 ino 0xb52c state  idx 0 offset 511462 block 4503599627239431 count 2 flag 0 caller 0xffffffff814c6f1es
    kworker/u2:0-6     [000]   983.948099: xfs_bunmap:           dev 8:32 ino 0xb52c size 0xa00000 bno 0x7cde6 len 0x1flags  caller 0xffffffff814f9123s
    kworker/u2:0-6     [000]   983.948099: xfs_bmap_pre_update:  dev 8:32 ino 0xb52c state  idx 0 offset 511462 block 4503599627239431 count 2 flag 0 caller 0xffffffff814c6ea3s
    kworker/u2:0-6     [000]   983.948100: xfs_bmap_post_update: dev 8:32 ino 0xb52c state  idx 0 offset 511463 block 4503599627239431 count 1 flag 0 caller 0xffffffff814c6f1es
    kworker/u2:0-6     [000]   983.948100: xfs_bunmap:           dev 8:32 ino 0xb52c size 0xa00000 bno 0x7cde7 len 0x1flags  caller 0xffffffff814f9123s
    kworker/u2:0-6     [000]   983.948101: xfs_iext_remove:      dev 8:32 ino 0xb52c state  idx 0 offset 511463 block 4503599627239431 count 1 flag 0 caller 0xffffffff814c6d2as
    kworker/u2:0-6     [000]   983.948101: xfs_iunlock:          dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff814f55e8s
    kworker/u2:0-6     [000]   983.948102: xfs_invalidatepage:   dev 8:32 ino 0xb52c pgoff 0x1f379000 size 0x1f37a000 offset 0 length 1000 delalloc 1 unwritten 0
    kworker/u2:0-6     [000]   983.948102: xfs_releasepage:      dev 8:32 ino 0xb52c pgoff 0x1f379000 size 0x1f37a000 offset 0 length 0 delalloc 0 unwritten 0

[snip eof block scan locking]

              dd-12284 [000]   983.948239: xfs_file_buffered_write: dev 8:32 ino 0xb52c size 0xa00000 offset 0x1f37a000 count 0x1000 ioflags 
              dd-12284 [000]   983.948239: xfs_ilock:            dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff8150bb47s
              dd-12284 [000]   983.948240: xfs_iomap_prealloc_size: dev 8:32 ino 0xb52c prealloc blocks 64 shift 0 m_writeio_blocks 64
              dd-12284 [000]   983.948242: xfs_delalloc_enospc:  dev 8:32 ino 0xb52c isize 0x1f37a000 disize 0xa00000 offset 0x1f37a000 count 4096
              dd-12284 [000]   983.948243: xfs_iext_insert:      dev 8:32 ino 0xb52c state  idx 0 offset 511464 block 4503599627239431 count 4 flag 0 caller 0xffffffff814c265as
              dd-12284 [000]   983.948243: xfs_iunlock:          dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff8150bc61s
              dd-12284 [000]   983.948244: xfs_iomap_alloc:      dev 8:32 ino 0xb52c size 0xa00000 offset 0x1f37a000 count 4096 type invalid startoff 0x7cde8 startblock -1 blockcount 0x4
              dd-12284 [000]   983.948250: xfs_iunlock:          dev 8:32 ino 0xb52c flags IOLOCK_EXCL caller 0xffffffff81502378s
              dd-12284 [000]   983.948254: xfs_ilock:            dev 8:32 ino 0xb52c flags IOLOCK_EXCL caller 0xffffffff81502352s
              dd-12284 [000]   983.948256: xfs_update_time:      dev 8:32 ino 0xb52c
              dd-12284 [000]   983.948257: xfs_log_reserve:      dev 8:32 t_ocnt 0 t_cnt 0 t_curr_res 2860 t_unit_res 2860 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 956212 grant_write_cycle 1 grant_write_bytes 956212 curr_cycle 1 curr_block 1863 tail_cycle 1 tail_block 1861
              dd-12284 [000]   983.948257: xfs_log_reserve_exit: dev 8:32 t_ocnt 0 t_cnt 0 t_curr_res 2860 t_unit_res 2860 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 959072 grant_write_cycle 1 grant_write_bytes 959072 curr_cycle 1 curr_block 1863 tail_cycle 1 tail_block 1861
              dd-12284 [000]   983.948258: xfs_ilock:            dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff8150c4dfs
              dd-12284 [000]   983.948260: xfs_log_done_nonperm: dev 8:32 t_ocnt 0 t_cnt 0 t_curr_res 2860 t_unit_res 2860 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 959072 grant_write_cycle 1 grant_write_bytes 959072 curr_cycle 1 curr_block 1863 tail_cycle 1 tail_block 1861
              dd-12284 [000]   983.948260: xfs_log_ungrant_enter: dev 8:32 t_ocnt 0 t_cnt 0 t_curr_res 2860 t_unit_res 2860 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 959072 grant_write_cycle 1 grant_write_bytes 959072 curr_cycle 1 curr_block 1863 tail_cycle 1 tail_block 1861
              dd-12284 [000]   983.948260: xfs_log_ungrant_sub:  dev 8:32 t_ocnt 0 t_cnt 0 t_curr_res 2860 t_unit_res 2860 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 959072 grant_write_cycle 1 grant_write_bytes 959072 curr_cycle 1 curr_block 1863 tail_cycle 1 tail_block 1861
              dd-12284 [000]   983.948261: xfs_log_ungrant_exit: dev 8:32 t_ocnt 0 t_cnt 0 t_curr_res 2860 t_unit_res 2860 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 956212 grant_write_cycle 1 grant_write_bytes 956212 curr_cycle 1 curr_block 1863 tail_cycle 1 tail_block 1861
              dd-12284 [000]   983.948261: xfs_iunlock:          dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff81526d6cs

And so the cycle goes. The next page at offset 0x1f37b000 fails
allocation, triggers a flush, which results in the write at
0x1f37a000 failing and freeing its delalloc blocks, which then
allows the retry of the write at 0x1f37b000 to succeed....

I can see that mapping errors (i.e. the AS_ENOSPC error) are not
propagated into the write() path, but that's the same as the old
code. What I don't quite understand is how delalloc has blown
through the XFS_ALLOC_SET_ASIDE() limits which are supposed to
trigger ENOSPC on the write() side much earlier than just one or two
blocks free....

Something doesn't quite add up here, and I haven't been able to put
my finger on it yet.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 67+ messages in thread
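
A toy user-space model of the cycle described above may help; the figures
are invented (a 4k page of 1k blocks needing 4 blocks per write, a
nearly-full filesystem, and no metadata reservation at all).  The point is
only that discarding the previous page hands back exactly enough blocks for
the retry to succeed, so the loop spins instead of ever returning ENOSPC:

#include <stdbool.h>
#include <stdio.h>

static long free_blocks = 4;		/* nearly-full filesystem */
#define BLOCKS_PER_PAGE 4		/* one 4k page of 1k blocks */

static bool reserve(void)
{
	if (free_blocks < BLOCKS_PER_PAGE)
		return false;
	free_blocks -= BLOCKS_PER_PAGE;
	return true;
}

int main(void)
{
	for (int page = 0; page < 5; page++) {
		if (!reserve()) {
			/* flush -> page discard frees the previous page */
			printf("page %d: ENOSPC, discard page %d\n",
			       page, page - 1);
			free_blocks += BLOCKS_PER_PAGE;
			(void)reserve();	/* the retry now succeeds */
		}
		printf("page %d: write accepted, free=%ld\n",
		       page, free_blocks);
	}
	return 0;
}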

* Re: iomap infrastructure and multipage writes V5
  2016-06-28  0:26   ` Dave Chinner
@ 2016-06-28 13:28     ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-28 13:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs, rpeterso, linux-fsdevel

On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote:
> Christoph, it looks like there's an ENOSPC+ENOMEM behavioural regression here.
> generic/224 on my 1p/1GB RAM VM using a 1k block size filesystem has
> significantly different behaviour once ENOSPC is hit with this patchset.

Works fine on my 1k test setup with 4 CPUs and 2GB RAM.  1 CPU and 1GB
RAM runs into the OOM killer, although I haven't checked if that was
the case with the old code as well.  I'll look into this more later
today or tomorrow.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: iomap infrastructure and multipage writes V5
  2016-06-28 13:28     ` Christoph Hellwig
@ 2016-06-28 13:38       ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-28 13:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs, rpeterso, linux-fsdevel

On Tue, Jun 28, 2016 at 03:28:39PM +0200, Christoph Hellwig wrote:
> > generic/224 on my 1p/1GB RAM VM using a 1k block size filesystem has
> > significantly different behaviour once ENOSPC is hit with this patchset.
> 
> Works fine on my 1k test setup with 4 CPUs and 2GB RAM.  1 CPU and 1GB
> RAM runs into the OOM killer, although I haven't checked if that was
> the case with the old code as well.  I'll look into this more later
> today or tomorrow.

It was doing fine before my changes - I'm going to investigate what
caused the change in behavior.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: iomap infrastructure and multipage writes V5
  2016-06-28  0:26   ` Dave Chinner
@ 2016-06-30 17:22     ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-06-30 17:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, rpeterso, linux-fsdevel

On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote:
> Christoph, it looks like there's an ENOSPC+ENOMEM behavioural regression here.
> generic/224 on my 1p/1GB RAM VM using a 1k block size filesystem has
> significantly different behaviour once ENOSPC is hit with this patchset.
> 
> It ends up with an endless stream of errors like this:

I've spent some time trying to reproduce this.  I'm actually getting
the OOM killer almost reproducibly for for-next without the iomap
patches as well when just using 1GB of mem.  1400 MB is the minimum
I can reproducibly finish the test with either code base.

But with the 1400 MB setup I see a few interesting things.  Even
with the baseline, no-iomap case I see a few errors in the log:

[   70.407465] Filesystem "vdc": reserve blocks depleted! Consider increasing reserve pool size.
[   70.195645] XFS (vdc): page discard on page ffff88005682a988, inode 0xd3, offset 761856.
[   70.408079] Buffer I/O error on dev vdc, logical block 1048513, lost async page write
[   70.408598] Buffer I/O error on dev vdc, logical block 1048514, lost async page write

With iomap I also see the spew of page discard errors you see, but while
I see a lot of them, the rest still finishes after a reasonable time,
just a few seconds more than the pre-iomap baseline.  I also see the
reserve block depleted message in this case.

Digging into the reserve block depleted message - it seems we have
too many parallel iomap_allocate transactions going on.  I suspect
this might be because the writeback code will not finish a writeback
context if we have multiple blocks inside a page, which can
happen easily for this 1k ENOSPC setup.  I've not had time to fully
check if this is what really happens, but I did a quick hack (see below)
to only allocate 1k at a time in iomap_begin, and with that generic/224
finishes without the warning spew.  Of course this isn't a real fix,
and I need to fully understand what's going on in writeback due to
different allocation / dirtying patterns from the iomap change.


diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 620fc91..d9afba2 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1018,7 +1018,7 @@ xfs_file_iomap_begin(
 		 * Note that the values needs to be less than 32-bits wide until
 		 * the lower level functions are updated.
 		 */
-		length = min_t(loff_t, length, 1024 * PAGE_SIZE);
+		length = min_t(loff_t, length, 1024);
 		if (xfs_get_extsz_hint(ip)) {
 			/*
 			 * xfs_iomap_write_direct() expects the shared lock. It

^ permalink raw reply related	[flat|nested] 67+ messages in thread
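
To put a number on what the hack changes, here is a back-of-the-envelope
comparison of how many 1k blocks a single ->iomap_begin call may cover
under the old and the hacked length clamp (illustrative arithmetic only,
assuming 4k pages):

#include <stdio.h>

int main(void)
{
	long long page_size = 4096, blksz = 1024;
	long long clamp_old = 1024 * page_size;	/* pre-hack length clamp */
	long long clamp_new = 1024;		/* the hack above */

	printf("max blocks per mapping, old clamp: %lld\n", clamp_old / blksz);
	printf("max blocks per mapping, new clamp: %lld\n", clamp_new / blksz);
	return 0;
}

A worst case of 4096 blocks per mapping versus a single block is consistent
with far larger concurrent delalloc reservations, and hence with the reserve
pool draining, in the unclamped case.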

* Re: iomap infrastructure and multipage writes V5
  2016-06-30 17:22     ` Christoph Hellwig
@ 2016-06-30 23:16       ` Dave Chinner
  -1 siblings, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2016-06-30 23:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs, rpeterso, linux-fsdevel

On Thu, Jun 30, 2016 at 07:22:39PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote:
> > Christoph, it looks like there's an ENOSPC+ENOMEM behavioural regression here.
> > generic/224 on my 1p/1GB RAM VM using a 1k block size filesystem has
> > significantly different behaviour once ENOSPC is hit with this patchset.
> > 
> > It ends up with an endless stream of errors like this:
> 
> I've spent some time trying to reproduce this.  I'm actually getting
> the OOM killer almost reproducibly for for-next without the iomap
> patches as well when just using 1GB of mem.  1400 MB is the minimum
> I can reproducibly finish the test with either code base.
> 
> But with the 1400 MB setup I see a few interesting things.  Even
> with the baseline, no-iomap case I see a few errors in the log:
> 
> [   70.407465] Filesystem "vdc": reserve blocks depleted! Consider increasing reserve pool size.
> [   70.195645] XFS (vdc): page discard on page ffff88005682a988, inode 0xd3, offset 761856.
> [   70.408079] Buffer I/O error on dev vdc, logical block 1048513, lost async page write
> [   70.408598] Buffer I/O error on dev vdc, logical block 1048514, lost async page write
> 
> With iomap I also see the spew of page discard errors you see, but while
> I see a lot of them, the rest still finishes after a reasonable time,
> just a few seconds more than the pre-iomap baseline.  I also see the
> reserve block depleted message in this case.

The reserve block pool depleted message is normal for me in this
test.  We're throwing a thousand concurrent processes at the
filesystem at ENOSPC, and so the metadata reservation for the
delayed allocation totals quite a lot. We only reserve 8192 blocks
for the reserve pool, so a delalloc reservation for one page on each
file (4 blocks per page, which means a couple of blocks for the
metadata reservation via the indlen calculation) is going to consume
the reserve pool quite quickly if the up front reservation
overshoots the XFS_ALLOC_SET_ASIDE() ENOSPC threshold.

> Digging into the reserve block depleted message - it seems we have
> too many parallel iomap_allocate transactions going on.  I suspect
> this might be because the writeback code will not finish a writeback
> context if we have multiple blocks inside a page, which can
> happen easily for this 1k ENOSPC setup.

Right - this test has regularly triggered that warning on this
particular test setup for me - it's not something new to the iomap
patchset.

> I've not had time to fully
> check if this is what really happens, but I did a quick hack (see below)
> to only allocate 1k at a time in iomap_begin, and with that generic/224
> finishes without the warning spew.  Of course this isn't a real fix,
> and I need to fully understand what's going on in writeback due to
> different allocation / dirtying patterns from the iomap change.

Which tends to indicate that the multi-block allocation has a larger
indlen reservation, and that's what is causing the code to hit
whatever edge case is leading to it not recovering. However,
I'm still wondering how we are not throwing ENOSPC back to userspace
at XFS_ALLOC_SET_ASIDE limits.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 67+ messages in thread
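
Putting rough numbers on that estimate - all figures are taken from or
implied by the paragraph above (a thousand writers, 4 data blocks per 4k
page of 1k blocks, roughly two blocks of indlen reservation each; the real
indlen worst case depends on the bmap btree geometry):

#include <stdio.h>

int main(void)
{
	long writers = 1000;
	long data_blocks = 4;	/* one 4k page of 1k blocks */
	long indlen = 2;	/* rough metadata reservation per write */
	long reserve_pool = 8192;

	long demand = writers * (data_blocks + indlen);
	printf("concurrent reservation ~%ld blocks vs %ld block reserve pool\n",
	       demand, reserve_pool);
	return 0;
}

Roughly 6000 blocks of concurrent reservation against an 8192-block pool
leaves little headroom once the up-front reservations overshoot the
set-aside threshold.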

* Re: iomap infrastructure and multipage writes V5
  2016-06-30 17:22     ` Christoph Hellwig
@ 2016-07-18 11:14       ` Dave Chinner
  -1 siblings, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2016-07-18 11:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs, rpeterso, linux-fsdevel

On Thu, Jun 30, 2016 at 07:22:39PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote:
> > Christoph, it looks like there's an ENOSPC+ENOMEM behavioural regression here.
> > generic/224 on my 1p/1GB RAM VM using a 1k block size filesystem has
> > significantly different behaviour once ENOSPC is hit with this patchset.
> > 
> > It ends up with an endless stream of errors like this:
> 
> I've spent some time trying to reproduce this.  I'm actually getting
> the OOM killer almost reproducible for for-next without the iomap
> patches as well when just using 1GB of mem.  1400 MB is the minimum
> I can reproducibly finish the test with either code base.
> 
> But with the 1400 MB setup I see a few interesting things.  Even
> with the baseline, no-iomap case I see a few errors in the log:
> 
> [   70.407465] Filesystem "vdc": reserve blocks depleted! Consider increasing reserve pool size.
> [   70.195645] XFS (vdc): page discard on page ffff88005682a988, inode 0xd3, offset 761856.
> [   70.408079] Buffer I/O error on dev vdc, logical block 1048513, lost async page write
> [   70.408598] Buffer I/O error on dev vdc, logical block 1048514, lost async page write
> 
> With iomap I also see the spew of page discard errors you see, but while
> I see a lot of them, the rest still finishes after a reasonable time,
> just a few seconds more than the pre-iomap baseline.  I also see the
> reserve block depleted message in this case.
> 
> Digging into the reserve block depleted message - it seems we have
> too many parallel iomap_allocate transactions going on.  I suspect
> this might be because the writeback code will not finish a writeback
> context if we have multiple blocks inside a page, which can
> happen easily for this 1k ENOSPC setup.  I've not had time to fully
> check if this is what really happens, but I did a quick hack (see below)
> to only allocate 1k at a time in iomap_begin, and with that generic/224
> finishes without the warning spew.  Of course this isn't a real fix,
> and I need to fully understand what's going on in writeback due to
> different allocation / dirtying patterns from the iomap change.

Any progress here, Christoph? The current test run has been running
generic/224 on the 1GB mem test VM for almost 6 hours now, and it's
still discarding pages. This doesn't always happen - sometimes it
takes the normal amount of time to run, but every so often it falls
into this "discard every page" loop and it takes hours to
complete...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: iomap infrastructure and multipage writes V5
  2016-07-18 11:14       ` Dave Chinner
@ 2016-07-18 11:18         ` Dave Chinner
  -1 siblings, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2016-07-18 11:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: rpeterso, linux-fsdevel, xfs

On Mon, Jul 18, 2016 at 09:14:00PM +1000, Dave Chinner wrote:
> On Thu, Jun 30, 2016 at 07:22:39PM +0200, Christoph Hellwig wrote:
> > On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote:
> > > Christoph, it looks like there's an ENOSPC+ENOMEM behavioural regression here.
> > > generic/224 on my 1p/1GB RAM VM using a 1k block size filesystem has
> > > significantly different behaviour once ENOSPC is hit with this patchset.
> > > 
> > > It ends up with an endless stream of errors like this:
> > 
> > I've spent some time trying to reproduce this.  I'm actually getting
> > the OOM killer almost reproducibly for for-next without the iomap
> > patches as well when just using 1GB of mem.  1400 MB is the minimum
> > I can reproducibly finish the test with either code base.
> > 
> > But with the 1400 MB setup I see a few interesting things.  Even
> > with the baseline, no-iomap case I see a few errors in the log:
> > 
> > [   70.407465] Filesystem "vdc": reserve blocks depleted! Consider increasing reserve pool size.
> > [   70.195645] XFS (vdc): page discard on page ffff88005682a988, inode 0xd3, offset 761856.
> > [   70.408079] Buffer I/O error on dev vdc, logical block 1048513, lost async page write
> > [   70.408598] Buffer I/O error on dev vdc, logical block 1048514, lost async page write
> > 
> > With iomap I also see the spew of page discard errors you see, but while
> > I see a lot of them, the rest still finishes after a reasonable time,
> > just a few seconds more than the pre-iomap baseline.  I also see the
> > reserve block depleted message in this case.
> > 
> > Digging into the reserve block depleted message - it seems we have
> > too many parallel iomap_allocate transactions going on.  I suspect
> > this might be because the writeback code will not finish a writeback
> > context if we have multiple blocks inside a page, which can
> > happen easily for this 1k ENOSPC setup.  I've not had time to fully
> > check if this is what really happens, but I did a quick hack (see below)
> > to only allocate 1k at a time in iomap_begin, and with that generic/224
> > finishes without the warning spew.  Of course this isn't a real fix,
> > and I need to fully understand what's going on in writeback due to
> > different allocation / dirtying patterns from the iomap change.
> 
> Any progress here, Christoph? The current test run has been running
> generic/224 on the 1GB mem test VM for almost 6 hours now, and it's
> still discarding pages. This doesn't always happen - sometimes it
> takes the normal amount of time to run, but every so often it falls
> into this "discard every page" loop and it takes hours to
> complete...

.... and I've now got a 16p/16GB RAM VM stuck in this loop in
generic/224, so it's not limited to low memory machines....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: iomap infrastructure and multipage writes V5
  2016-07-18 11:14       ` Dave Chinner
@ 2016-07-19  3:50         ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-07-19  3:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs, rpeterso, linux-fsdevel

On Mon, Jul 18, 2016 at 09:14:00PM +1000, Dave Chinner wrote:
> Any progress here, Christoph?

Nothing after the last posting yet.  Shortly after that I left for Japan
for about two weeks of hiking and a conference, which doesn't help my
ability to spend some quiet hours with the code.  It's on top of
my priority list for non-trivial things, and I should be able to get
back to it tomorrow after finishing the next conference today.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: iomap infrastructure and multipage writes V5
  2016-07-18 11:18         ` Dave Chinner
@ 2016-07-31 19:19           ` Christoph Hellwig
  -1 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2016-07-31 19:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, rpeterso, linux-fsdevel, xfs

Another quiet weekend trying to debug this, and only minor progress.

The biggest difference in traces of the old vs new code is that we manage
to allocate much bigger delalloc reservations at a time in xfs_bmapi_delay
-> xfs_bmapi_reserve_delalloc.  The old code always went for a single FSB,
which also meant allocating an indlen of 7 FSBs.  With the iomap code
we always allocate at least 4 FSBs (aka a page), and sometimes 8 or 12.
All of these still need 7 FSBs for the worst case indirect blocks.  So
what happens here is that in an ENOSPC case we manage to allocate more
actual delalloc blocks before hitting ENOSPC - notwithstanding that the
old case would immediately release them a little later in
xfs_bmap_add_extent_hole_delay after merging the delalloc extents.
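
To make the arithmetic concrete, here is a rough editorial sketch of the
worst-case indlen computation - not the kernel's actual
xfs_bmap_worst_indlen, which reads the bmbt geometry from the mount;
maxrecs and maxlevels are placeholder parameters here:

/*
 * Worst case: every delalloc block could end up as its own bmbt
 * record, so charge ceil(len / maxrecs) btree blocks per level, and
 * one block for each level remaining above once a level fits in a
 * single block.
 */
static unsigned long worst_case_indlen(unsigned long len,
				       unsigned maxrecs, unsigned maxlevels)
{
	unsigned long rval = 0;
	unsigned level;

	for (level = 0; level < maxlevels; level++) {
		len = (len + maxrecs - 1) / maxrecs;
		rval += len;
		if (len == 1)
			return rval + (maxlevels - level - 1);
	}
	return rval;
}

With maxlevels = 7 this returns 7 for a 1, 4, 8 or 12 block delalloc
reservation alike, which is why the indlen charge stays at 7 FSBs while
the number of actual delalloc blocks grows.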

On the writeback side I don't see too many changes either.  We'll
eventually run out of blocks when allocating the transaction in
xfs_iomap_write_allocate because the reserved pool is too small.  The
only real difference to before is that under the ENOSPC / out of memory
case we have allocated between 4 to 12 times more blocks, so we have
to clean up 4 to 12 times as much while write_cache_pages continues
iterating over these dirty delalloc blocks.   For me this happens
~6 times as much as before, but I still don't manage to hit an
endless loop.

Now after spending this much time I've started wondering why we even
reserve blocks in xfs_iomap_write_allocate - after all we've reserved
space for the actual data blocks and the indlen worst case in
xfs_bmapi_reserve_delalloc.  And in fact a little hack to drop that
reservation seems to solve both the root cause (depleted reserved pool)
and the cleanup mess.  I just haven't spent enough time to convince
myself that it's actually safe, and in fact looking at the allocator
makes me think it only works by accident currently despite generally
positive test results.

Here is the quick patch if anyone wants to chime in:

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 620fc91..67c317f 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -717,7 +717,7 @@ xfs_iomap_write_allocate(
 
 		nimaps = 0;
 		while (nimaps == 0) {
-			nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+			nres = 0; // XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
 
 			error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, nres,
 					0, XFS_TRANS_RESERVE, &tp);

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: iomap infrastructure and multipage writes V5
  2016-07-31 19:19           ` Christoph Hellwig
@ 2016-08-01  0:16             ` Dave Chinner
  -1 siblings, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2016-08-01  0:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: rpeterso, linux-fsdevel, xfs

On Sun, Jul 31, 2016 at 09:19:00PM +0200, Christoph Hellwig wrote:
> Another quiet weekend trying to debug this, and only minor progress.
> 
> The biggest difference in traces of the old vs new code is that we manage
> to allocate much bigger delalloc reservations at a time in xfs_bmapi_delay
> -> xfs_bmapi_reserve_delalloc.  The old code always went for a single FSB,
> which also meant allocating an indlen of 7 FSBs.  With the iomap code
> we always allocate at least 4 FSBs (aka a page), and sometimes 8 or 12.
> All of these still need 7 FSBs for the worst case indirect blocks.  So
> what happens here is that in an ENOSPC case we manage to allocate more
> actual delalloc blocks before hitting ENOSPC - notwithstanding that the
> old case would immediately release them a little later in
> xfs_bmap_add_extent_hole_delay after merging the delalloc extents.
> 
> On the writeback side I don't see too many changes either.  We'll
> eventually run out of blocks when allocating the transaction in
> xfs_iomap_write_allocate because the reserved pool is too small.

Yup, that's exactly what generic/224 is testing - it sets the
reserve pool to 4 blocks so it does get exhausted very quickly
and then it exposes the underlying ENOSPC issue. Most users won't
ever see reserve pool exhaustion, which is why I didn't worry too
much about solving this before merging.
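
For anyone who wants to reproduce that setup by hand, here is a minimal
editorial sketch that shrinks the pool via the XFS_IOC_SET_RESBLKS
ioctl (the <xfs/xfs.h> header comes from xfsprogs, and /mnt/scratch is
a placeholder mount point); xfs_io -x -c "resblks 4" <mnt> should do
the same thing:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* xfs_fsop_resblks_t, XFS_IOC_SET_RESBLKS */

int main(void)
{
	xfs_fsop_resblks_t res = { .resblks = 4 };	/* tiny reserve pool */
	int fd = open("/mnt/scratch", O_RDONLY);	/* placeholder path */

	if (fd < 0 || ioctl(fd, XFS_IOC_SET_RESBLKS, &res) < 0) {
		perror("XFS_IOC_SET_RESBLKS");
		return 1;
	}
	/* the kernel reports the resulting pool size back in the struct */
	printf("reserve pool: %llu blocks, %llu available\n",
	       (unsigned long long)res.resblks,
	       (unsigned long long)res.resblks_avail);
	close(fd);
	return 0;
}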

> The
> only real difference to before is that under the ENOSPC / out of memory
> case we have allocated between 4 to 12 times more blocks, so we have
> to clean up 4 to 12 times as much while write_cache_pages continues
> iterating over these dirty delalloc blocks.   For me this happens
> ~6 times as much as before, but I still don't manage to hit an
> endless loop.

Ok, I'd kind of got that far myself, but then never really got much
further than that - I suspected some kind of "split a delalloc
extent too many times, run out of reservation" type of issue, but
couldn't isolate such a problem in any of the traces.

> Now after spending this much time I've started wondering why we even
> reserve blocks in xfs_iomap_write_allocate - after all we've reserved
> space for the actual data blocks and the indlen worst case in
> xfs_bmapi_reserve_delalloc.  And in fact a little hack to drop that
> reservation seems to solve both the root cause (depleted reserved pool)
> and the cleanup mess.  I just haven't spent enough time to convince
> myself that it's actually safe, and in fact looking at the allocator
> makes me think it only works by accident currently despite generally
> positive test results.

Hmmm, interesting. I didn't think about that. I have been looking at
this exact code as a result of rmap ENOSPC problems, and now that
you mention this, I can't see why we'd need a block reservation here
for delalloc conversion, either. Time for more <reverb on>
Adventures in Code Archeology! <reverb off>

/me digs

First stop - just after we removed the behaviour layer. Only caller
of xfs_iomap_write_allocate was:

writepage
  xfs_map_block(BMAPI_ALLOCATE) // only for delalloc
    xfs_iomap
      xfs_iomap_write_allocate

Which is essentially the same single caller we have now, just with
much less indirection.

Looking at the code before the behaviour layer removal, there was
also an "xfs_iocore" abstraction, which abstracted inode locking,
block mapping and allocation and a few other miscellaneous IO
functions. This was so the CXFS server could plug into the XFS IO path
and intercept allocation requests on the CXFS client side.  This
leads me to think that the CXFS server could call
xfs_iomap_write_allocate() directly. Whether or not the server kept
the delalloc reservation, I'm not sure.

So, let's go back to before this abstraction was in place. Takes us
back to before the linux port was started, back to pure Irix
code....

.... and there's no block reservation done for delalloc conversion.

Ok, so here's the commit that introduced the block reservation for
delalloc conversion:

commit 6706e96e324a2fa9636e93facfd4b7fbbf5b85f8
Author: Glen Overby <overby@sgi.com>
Date:   Tue Mar 4 20:15:43 2003 +0000

    Add error reporting calls in error paths that return EFSCORRUPTED


Yup, hidden deep inside the commit that added the
XFS_CORRUPTION_ERROR and XFS_ERROR_REPORT macros for better error
reporting was this unexplained, uncommented hunk:

@@ -562,9 +563,19 @@ xfs_iomap_write_allocate(
                nimaps = 0;
                while (nimaps == 0) {
                        tp = xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE);
-                       error = xfs_trans_reserve(tp, 0, XFS_WRITE_LOG_RES(mp),
+                       nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+                       error = xfs_trans_reserve(tp, nres,
+                                       XFS_WRITE_LOG_RES(mp),
                                        0, XFS_TRANS_PERM_LOG_RES,
                                        XFS_WRITE_LOG_COUNT);
+
+                       if (error == ENOSPC) {
+                               error = xfs_trans_reserve(tp, 0,
+                                               XFS_WRITE_LOG_RES(mp),
+                                               0,
+                                               XFS_TRANS_PERM_LOG_RES,
+                                               XFS_WRITE_LOG_COUNT);
+                       }
                        if (error) {
                                xfs_trans_cancel(tp, 0);
                                return XFS_ERROR(error);


It's completely out of place compared to the rest of the patch which
didn't change any code logic or algorithms - it only added error
reporting macros. Hence THIS looks like it may have been an
accidental/unintended change in the commit.

The ENOSPC check here went away in 2007 when I expanded the reserve
block pool and added XFS_TRANS_RESERVE to this function to allow it to
dip into the reserve pool (commit bdebc6a4 "Prevent ENOSPC from
aborting transactions that need to succeed"). I didn't pick up on
the history back then, I bet I was just focussed on fixing the
ENOSPC issue....

So, essentially, by emptying the reserve block pool, we've opened
this code back up to whatever underlying ENOSPC issue it had prior
to bdebc6a4. And looking back at 6706e96e, I can only guess that the
block reservation was added for a CXFS use case, because XFS still
only called this from a single place - delalloc conversion.

Christoph - it does look like you've found the problem - I agree
with your analysis that the delalloc already reserves space for the
bmbt blocks in the indlen reservation, and that adding another
reservation for bmbt blocks at transaction allocation makes no
obvious sense. The code history doesn't explain it - it only raises
more questions as to why this was done - it may even have been an
accidental change in the first place...

> Here is the quick patch if anyone wants to chime in:
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 620fc91..67c317f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -717,7 +717,7 @@ xfs_iomap_write_allocate(
>  
>  		nimaps = 0;
>  		while (nimaps == 0) {
> -			nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> +			nres = 0; // XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
>  
>  			error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, nres,
>  					0, XFS_TRANS_RESERVE, &tp);

Let me go test it, see what comes up.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: iomap infrastructure and multipage writes V5
  2016-07-31 19:19           ` Christoph Hellwig
@ 2016-08-02 23:42             ` Dave Chinner
  -1 siblings, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2016-08-02 23:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: rpeterso, linux-fsdevel, xfs

On Sun, Jul 31, 2016 at 09:19:00PM +0200, Christoph Hellwig wrote:
> Now after spending this much time I've started wondering why we even
> reserve blocks in xfs_iomap_write_allocate - after all we've reserved
> space for the actual data blocks and the indlen worst case in
> xfs_bmapi_reserve_delalloc.  And in fact a little hack to drop that
> reservation seems to solve both the root cause (depleted reserved pool)
> and the cleanup mess.  I just haven't spent enough time to convince
> myself that it's actually safe, and in fact looking at the allocator
> makes me think it only works by accident currently despite generally
> positive test results.
> 
> Here is the quick patch if anyone wants to chime in:
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 620fc91..67c317f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -717,7 +717,7 @@ xfs_iomap_write_allocate(
>  
>  		nimaps = 0;
>  		while (nimaps == 0) {
> -			nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> +			nres = 0; // XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
>  
>  			error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, nres,
>  					0, XFS_TRANS_RESERVE, &tp);
> 

This solves the problem for me, and from history appears to be the
right thing to do. Christoph, can you send a proper patch for this?

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 08/14] fs: iomap based fiemap implementation
  2016-06-01 14:44   ` Christoph Hellwig
@ 2016-08-18 13:18     ` Andreas Grünbacher
  -1 siblings, 0 replies; 67+ messages in thread
From: Andreas Grünbacher @ 2016-08-18 13:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs, Bob Peterson, Linux FS-devel Mailing List

Christoph,

2016-06-01 16:44 GMT+02:00 Christoph Hellwig <hch@lst.de>:
> Add a simple fiemap implementation based on iomap_ops, partially based
> on a previous implementation from Bob Peterson <rpeterso@redhat.com>.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap.c            | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/iomap.h |  3 ++
>  2 files changed, 93 insertions(+)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index e51adb6..e874c78 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -405,3 +405,93 @@ out_unlock:
>         return ret;
>  }
>  EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
> +
> +struct fiemap_ctx {
> +       struct fiemap_extent_info *fi;
> +       struct iomap prev;
> +};
> +
> +static int iomap_to_fiemap(struct fiemap_extent_info *fi,
> +               struct iomap *iomap, u32 flags)
> +{
> +       switch (iomap->type) {
> +       case IOMAP_HOLE:
> +               /* skip holes */
> +               return 0;
> +       case IOMAP_DELALLOC:
> +               flags |= FIEMAP_EXTENT_DELALLOC | FIEMAP_EXTENT_UNKNOWN;
> +               break;
> +       case IOMAP_UNWRITTEN:
> +               flags |= FIEMAP_EXTENT_UNWRITTEN;
> +               break;
> +       case IOMAP_MAPPED:
> +               break;
> +       }
> +
> +       return fiemap_fill_next_extent(fi, iomap->offset,
> +                       iomap->blkno != IOMAP_NULL_BLOCK ? iomap->blkno << 9: 0,
> +                       iomap->length, flags | FIEMAP_EXTENT_MERGED);

According to Documentation/filesystems/fiemap.txt, it seems that the
FIEMAP_EXTENT_MERGED flag should only be set by filesystems with
block-based rather than extent-based allocation. Am I overlooking
something?
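
For anyone wanting to check which flags a given kernel actually
reports, here is a quick editorial sketch of a userspace FIEMAP dump
(single fixed-size extent array for brevity; a real tool would loop
until it sees FIEMAP_EXTENT_LAST):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define NEXT	32

int main(int argc, char **argv)
{
	char buf[sizeof(struct fiemap) + NEXT * sizeof(struct fiemap_extent)];
	struct fiemap *fm = (struct fiemap *)buf;
	unsigned int i;
	int fd;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	memset(buf, 0, sizeof(buf));
	fm->fm_length = ~0ULL;			/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush dirty data first */
	fm->fm_extent_count = NEXT;
	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		return 1;
	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("%llu..%llu: flags 0x%x%s\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)(fe->fe_logical + fe->fe_length),
		       fe->fe_flags,
		       (fe->fe_flags & FIEMAP_EXTENT_MERGED) ? " [MERGED]" : "");
	}
	close(fd);
	return 0;
}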

> +
> +}
> +
> +static loff_t
> +iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> +               struct iomap *iomap)
> +{
> +       struct fiemap_ctx *ctx = data;
> +       loff_t ret = length;
> +
> +       if (iomap->type == IOMAP_HOLE)
> +               return length;
> +
> +       ret = iomap_to_fiemap(ctx->fi, &ctx->prev, 0);
> +       ctx->prev = *iomap;
> +       switch (ret) {
> +       case 0:         /* success */
> +               return length;
> +       case 1:         /* extent array full */
> +               return 0;
> +       default:
> +               return ret;
> +       }
> +}
> +
> +int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
> +               loff_t start, loff_t len, struct iomap_ops *ops)
> +{
> +       struct fiemap_ctx ctx;
> +       loff_t ret;
> +
> +       memset(&ctx, 0, sizeof(ctx));
> +       ctx.fi = fi;
> +       ctx.prev.type = IOMAP_HOLE;
> +
> +       ret = fiemap_check_flags(fi, FIEMAP_FLAG_SYNC);
> +       if (ret)
> +               return ret;
> +
> +       ret = filemap_write_and_wait(inode->i_mapping);
> +       if (ret)
> +               return ret;
> +
> +       while (len > 0) {
> +               ret = iomap_apply(inode, start, len, 0, ops, &ctx,
> +                               iomap_fiemap_actor);
> +               if (ret < 0)
> +                       return ret;
> +               if (ret == 0)
> +                       break;
> +
> +               start += ret;
> +               len -= ret;
> +       }
> +
> +       if (ctx.prev.type != IOMAP_HOLE) {
> +               ret = iomap_to_fiemap(fi, &ctx.prev, FIEMAP_EXTENT_LAST);
> +               if (ret < 0)
> +                       return ret;
> +       }
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(iomap_fiemap);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 854766f..b3deee1 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -3,6 +3,7 @@
>
>  #include <linux/types.h>
>
> +struct fiemap_extent_info;
>  struct inode;
>  struct iov_iter;
>  struct kiocb;
> @@ -63,5 +64,7 @@ int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
>                 struct iomap_ops *ops);
>  int iomap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
>                 struct iomap_ops *ops);
> +int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> +               loff_t start, loff_t len, struct iomap_ops *ops);
>
>  #endif /* LINUX_IOMAP_H */
> --
> 2.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: iomap infrastructure and multipage writes V5
  2016-08-02 23:42             ` Dave Chinner
  (?)
@ 2017-02-13 22:31             ` Eric Sandeen
  2017-02-14  6:10               ` Christoph Hellwig
  -1 siblings, 1 reply; 67+ messages in thread
From: Eric Sandeen @ 2017-02-13 22:31 UTC (permalink / raw)
  To: Dave Chinner, Christoph Hellwig; +Cc: rpeterso, linux-fsdevel, xfs

On 8/2/16 6:42 PM, Dave Chinner wrote:
> On Sun, Jul 31, 2016 at 09:19:00PM +0200, Christoph Hellwig wrote:
>> Now after spending this much time I've started wondering why we even
>> reserve blocks in xfs_iomap_write_allocate - after all we've reserved
>> space for the actual data blocks and the indlen worst case in
>> xfs_bmapi_reserve_delalloc.  And in fact a little hack to drop that
>> reservation seems to solve both the root cause (depleted reserved pool)
>> and the cleanup mess.  I just haven't spent enough time to convince
>> myself that it's actually safe, and in fact looking at the allocator
>> makes me think it only works by accident currently despite generally
>> positive test results.
>>
>> Here is the quick patch if anyone wants to chime in:
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 620fc91..67c317f 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -717,7 +717,7 @@ xfs_iomap_write_allocate(
>>  
>>  		nimaps = 0;
>>  		while (nimaps == 0) {
>> -			nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
>> +			nres = 0; // XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
>>  
>>  			error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, nres,
>>  					0, XFS_TRANS_RESERVE, &tp);
>>
> 
> This solves the problem for me, and from history appears to be the
> right thing to do. Christoph, can you send a proper patch for this?

Did anything ever come of this?  I don't think I saw a patch, and it looks
like it is not upstream.

Thanks,
-Eric

> Cheers,
> 
> Dave.
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: iomap infrastructure and multipage writes V5
  2017-02-13 22:31             ` Eric Sandeen
@ 2017-02-14  6:10               ` Christoph Hellwig
  0 siblings, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2017-02-14  6:10 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Dave Chinner, Christoph Hellwig, rpeterso, linux-fsdevel, xfs

On Mon, Feb 13, 2017 at 04:31:55PM -0600, Eric Sandeen wrote:
> > This solves the problem for me, and from history appears to be the
> > right thing to do. Christoph, can you send a proper patch for this?
> 
> Did anything ever come of this?  I don't think I saw a patch, and it looks
> like it is not upstream.

"xfs: fix bogus space reservation in xfs_iomap_write_allocate" got edits
from Dave to fix this issue up in-line.  It's actually still kinda
bogus as I found out while spending more time with the allocator,
though.  Fixing the whole total vs minlen thing is rather invasive,
but I've started working on it and hope to have something for the
next merge window.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: iomap infrastructure and multipage writes V5
@ 2017-02-13 22:32 xfs-owner
  0 siblings, 0 replies; 67+ messages in thread
From: xfs-owner @ 2017-02-13 22:32 UTC (permalink / raw)
  To: sandeen

[-- Attachment #1: Type: text/plain, Size: 268 bytes --]

This list has been closed.  Please subscribe to the
linux-xfs@vger.kernel.org mailing list and send any future messages
there. You can subscribe to the linux-xfs list at
http://vger.kernel.org/vger-lists.html#linux-xfs

For any questions please post to the new list.


^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2017-02-14  6:10 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-01 14:44 iomap infrastructure and multipage writes V5 Christoph Hellwig
2016-06-01 14:44 ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 01/14] fs: move struct iomap from exportfs.h to a separate header Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 02/14] fs: introduce iomap infrastructure Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-06-16 16:12   ` Jan Kara
2016-06-16 16:12     ` Jan Kara
2016-06-17 12:01     ` Christoph Hellwig
2016-06-17 12:01       ` Christoph Hellwig
2016-06-20  2:29       ` Dave Chinner
2016-06-20  2:29         ` Dave Chinner
2016-06-20 12:22         ` Christoph Hellwig
2016-06-20 12:22           ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 03/14] fs: support DAX based iomap zeroing Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 04/14] xfs: make xfs_bmbt_to_iomap available outside of xfs_pnfs.c Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 05/14] xfs: reorder zeroing and flushing sequence in truncate Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 06/14] xfs: implement iomap based buffered write path Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 07/14] xfs: remove buffered write support from __xfs_get_blocks Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 08/14] fs: iomap based fiemap implementation Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-08-18 13:18   ` Andreas Grünbacher
2016-08-18 13:18     ` Andreas Grünbacher
2016-06-01 14:44 ` [PATCH 09/14] xfs: use iomap " Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 10/14] xfs: use iomap infrastructure for DAX zeroing Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 11/14] xfs: handle 64-bit length in xfs_iozero Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 12/14] xfs: use xfs_zero_range in xfs_zero_eof Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 13/14] xfs: split xfs_free_file_space in manageable pieces Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 14/14] xfs: kill xfs_zero_remaining_bytes Christoph Hellwig
2016-06-01 14:44   ` Christoph Hellwig
2016-06-01 14:46 ` iomap infrastructure and multipage writes V5 Christoph Hellwig
2016-06-01 14:46   ` Christoph Hellwig
2016-06-28  0:26 ` Dave Chinner
2016-06-28  0:26   ` Dave Chinner
2016-06-28 13:28   ` Christoph Hellwig
2016-06-28 13:28     ` Christoph Hellwig
2016-06-28 13:38     ` Christoph Hellwig
2016-06-28 13:38       ` Christoph Hellwig
2016-06-30 17:22   ` Christoph Hellwig
2016-06-30 17:22     ` Christoph Hellwig
2016-06-30 23:16     ` Dave Chinner
2016-06-30 23:16       ` Dave Chinner
2016-07-18 11:14     ` Dave Chinner
2016-07-18 11:14       ` Dave Chinner
2016-07-18 11:18       ` Dave Chinner
2016-07-18 11:18         ` Dave Chinner
2016-07-31 19:19         ` Christoph Hellwig
2016-07-31 19:19           ` Christoph Hellwig
2016-08-01  0:16           ` Dave Chinner
2016-08-01  0:16             ` Dave Chinner
2016-08-02 23:42           ` Dave Chinner
2016-08-02 23:42             ` Dave Chinner
2017-02-13 22:31             ` Eric Sandeen
2017-02-14  6:10               ` Christoph Hellwig
2016-07-19  3:50       ` Christoph Hellwig
2016-07-19  3:50         ` Christoph Hellwig
2017-02-13 22:32 xfs-owner
