[RFC] failure atomic writes for file systems and block devices

* [RFC] failure atomic writes for file systems and block devices
@ 2017-02-28 14:57 Christoph Hellwig
  2017-02-28 14:57 ` [PATCH 01/12] uapi/fs: add O_ATOMIC to the open flags Christoph Hellwig
                   ` (14 more replies)
  0 siblings, 15 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-block

Hi all,

this series implements a new O_ATOMIC flag for failure atomic writes
to files.  It is based on and tries to unify two earlier proposals,
the first one for block devices by Chris Mason:

	https://lwn.net/Articles/573092/

and the second one for regular files, published by HP Research at
Usenix FAST 2015:

	https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma

It adds a new O_ATOMIC flag for open, which requests writes to be
failure-atomic, that is either the whole write makes it to persistent
storage, or none of it, even in case of power or other failures.

There are two implementation variants of this:  on block devices O_ATOMIC
must be combined with O_(D)SYNC so that storage devices that can handle
large writes atomically can simply do that without any additional work.
This case is supported by NVMe.

The second case is for file systems, where we simply write new blocks
out of place and then remap them into the file atomically on either
completion of an O_(D)SYNC write or when fsync is called explicitly.

The semantics of the latter case are explained in detail in the Usenix
paper above.

Last but not least a new fcntl is implemented to provide information
about I/O restrictions such as alignment requirements and the maximum
atomic write size.

The implementation is simple and clean, but I'm rather unhappy about
the interface as it has too many failure modes to bulletproof.  For
one, old kernels silently ignore unknown open flags, so applications
have to check the F_IOINFO fcntl first, which is a bit of a killer.
Because of that I've also not implemented any other validity checks
yet, as they might make things even worse when an open on an unsupported
file system or device fails, but not on an old kernel.  Maybe we need
a new open version that checks arguments properly first?
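
To illustrate the kind of check an application would need, here is a
minimal userspace sketch.  It assumes the struct fcntl_ioinfo layout and
FIO_FL_* values from patch 05 of this series; none of this exists in
mainline headers, so the definitions are copied locally.  The helper
atomic_write_ok() is hypothetical, and treating fio_alignment == 0 as
"no alignment constraint" is an assumption, not something the series
specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the definitions proposed in patch 05 (not in mainline). */
#define FIO_FL_ATOMIC_OSYNC	(1 << 0)	/* atomic via O_(D)SYNC writes */
#define FIO_FL_ATOMIC_FSYNC	(1 << 1)	/* atomic via fsync/fdatasync/msync */

struct fcntl_ioinfo {
	uint16_t	fio_flags;	/* FIO_FL_* */
	uint16_t	fio_alignment;	/* required I/O alignment on disk */
	uint32_t	__reserved1;	/* must be zero */
	uint64_t	fio_max_atomic;	/* max size for atomic writes */
	uint64_t	__reserved2[14];/* must be zero */
};

/*
 * Hypothetical helper: decide from an F_IOINFO result whether a write of
 * len bytes at offset off may be issued as a failure-atomic write.
 * commit_flag is the FIO_FL_ATOMIC_* mode the application intends to use.
 */
static bool atomic_write_ok(const struct fcntl_ioinfo *fio, uint64_t off,
		uint64_t len, unsigned commit_flag)
{
	/* An all-zero result means no atomic write support was reported. */
	if (!(fio->fio_flags & commit_flag))
		return false;
	if (len == 0 || len > fio->fio_max_atomic)
		return false;
	/* Assumption: an alignment of 0 means "no constraint". */
	if (fio->fio_alignment &&
	    (off % fio->fio_alignment || len % fio->fio_alignment))
		return false;
	return true;
}
```

An application would fill the structure with fcntl(fd, F_IOINFO, &fio) on
a descriptor opened with O_ATOMIC, and fall back to something like double
writes whenever the fcntl fails or the helper returns false.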

Also I'm really worried about the NVMe failure modes - devices simply
advertise an atomic write size, with no way for the device to know
that the host requested a given write to be atomic, and thus no
error reporting.  This is made worse by NVMe 1.2 adding per-namespace
atomic I/O parameters that devices can use to introduce additional
odd alignment quirks - while there is some language in the spec
requiring them not to weaken the per-controller guarantees it all
looks rather weak and I'm not too confident in all implementations
getting everything right.
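
As a rough illustration of why the alignment quirks matter, the sketch
below checks whether a write stays within a device's atomic limits.  It
is only loosely modelled on the per-namespace atomic parameters; the
struct atomic_limits fields and write_is_atomic() are hypothetical names,
the values are assumed to be already decoded into logical blocks, and the
exact encodings from the spec are deliberately not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, already-decoded limits; all values in logical blocks. */
struct atomic_limits {
	uint32_t max_blocks;		/* largest power-fail atomic write */
	uint32_t boundary_blocks;	/* 0: no atomic boundary */
	uint32_t boundary_offset;	/* LBA of the first boundary */
};

/* True if a write of nr_blocks at lba stays inside one atomic window. */
static bool write_is_atomic(const struct atomic_limits *lim, uint64_t lba,
		uint32_t nr_blocks)
{
	uint64_t end, first, last;

	if (nr_blocks == 0 || nr_blocks > lim->max_blocks)
		return false;
	if (!lim->boundary_blocks)
		return true;

	/* Conservative assumption: refuse writes before the first boundary. */
	if (lba < lim->boundary_offset)
		return false;

	end = lba + nr_blocks - 1;	/* last LBA touched by the write */
	first = (lba - lim->boundary_offset) / lim->boundary_blocks;
	last = (end - lim->boundary_offset) / lim->boundary_blocks;
	return first == last;		/* must not cross a boundary */
}
```

Note that even a check like this happens purely on the host: the device
is never told that a given write was meant to be atomic, so it cannot
report an error if the host gets the limits wrong.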

Last but not least this depends on a few XFS patches, so to actually
apply / run the patches please use this git tree:

    git://git.infradead.org/users/hch/vfs.git O_ATOMIC

Gitweb:

    http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC

* [PATCH 01/12] uapi/fs: add O_ATOMIC to the open flags
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
@ 2017-02-28 14:57 ` Christoph Hellwig
  2017-02-28 14:57 ` [PATCH 02/12] iomap: pass IOMAP_* flags to actors Christoph Hellwig
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-block

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/fcntl.c                       | 3 ++-
 include/uapi/asm-generic/fcntl.h | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index e1c54f20325c..ca5d228be7ea 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -741,7 +741,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
 		O_RDONLY	| O_WRONLY	| O_RDWR	|
 		O_CREAT		| O_EXCL	| O_NOCTTY	|
 		O_TRUNC		| O_APPEND	| /* O_NONBLOCK	| */
@@ -749,6 +749,7 @@ static int __init fcntl_init(void)
 		O_DIRECT	| O_LARGEFILE	| O_DIRECTORY	|
 		O_NOFOLLOW	| O_NOATIME	| O_CLOEXEC	|
 		__FMODE_EXEC	| O_PATH	| __O_TMPFILE	|
+		O_ATOMIC	|
 		__FMODE_NONOTIFY
 		));
 
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index e063effe0cc1..26ab7622238a 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -92,6 +92,8 @@
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
 
+#define O_ATOMIC	040000000
+
 #ifndef O_NDELAY
 #define O_NDELAY	O_NONBLOCK
 #endif
-- 
2.11.0
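
The BUILD_BUG_ON() touched above relies on simple bit counting: if every
open flag is a single distinct bit, the OR of all of them has exactly as
many bits set as there are flags.  The following userspace sketch shows
the same arithmetic.  The EX_* names and flags_are_distinct() are
hypothetical, and hweight32() here is a portable stand-in for the
kernel's HWEIGHT32() macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Octal values from include/uapi/asm-generic/fcntl.h plus the new flag. */
#define EX_O_PATH		010000000
#define EX_O_TMPFILE_BIT	020000000	/* __O_TMPFILE */
#define EX_O_ATOMIC		040000000	/* proposed in this patch */

/* Count set bits; stand-in for the kernel's HWEIGHT32(). */
static unsigned hweight32(unsigned v)
{
	unsigned n = 0;

	for (; v; v &= v - 1)	/* clear the lowest set bit each round */
		n++;
	return n;
}

/*
 * Given nr flags that are each expected to be a single bit, return true
 * only if no two of them share a bit: the OR then has nr bits set.
 */
static bool flags_are_distinct(const unsigned *flags, size_t nr)
{
	unsigned all = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		all |= flags[i];
	return hweight32(all) == nr;
}
```

The kernel version subtracts one from the expected count because
O_RDONLY is 0 and contributes no bit.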

* [PATCH 02/12] iomap: pass IOMAP_* flags to actors
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
  2017-02-28 14:57 ` [PATCH 01/12] uapi/fs: add O_ATOMIC to the open flags Christoph Hellwig
@ 2017-02-28 14:57 ` Christoph Hellwig
  2017-02-28 14:57 ` [PATCH 03/12] iomap: add a IOMAP_ATOMIC flag Christoph Hellwig
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-block

This will be needed to implement O_ATOMIC.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/dax.c      |  2 +-
 fs/internal.h |  2 +-
 fs/iomap.c    | 39 +++++++++++++++++++++------------------
 3 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 78b9651576c6..5d71fc5f0a08 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -997,7 +997,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
 
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, unsigned flags)
 {
 	struct iov_iter *iter = data;
 	loff_t end = pos + length, done = 0;
diff --git a/fs/internal.h b/fs/internal.h
index 11c6d89dce9c..1934fdb2bb27 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -179,7 +179,7 @@ extern long vfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
  * iomap support:
  */
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
-		void *data, struct iomap *iomap);
+		void *data, struct iomap *iomap, unsigned flags);
 
 loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 		unsigned flags, const struct iomap_ops *ops, void *data,
diff --git a/fs/iomap.c b/fs/iomap.c
index 7f08ca03d95d..16a9d2b89cb6 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -76,7 +76,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	 * we can do the copy-in page by page without having to worry about
 	 * failures exposing transient data.
 	 */
-	written = actor(inode, pos, length, data, &iomap);
+	written = actor(inode, pos, length, data, &iomap, flags);
 
 	/*
 	 * Now the data has been copied, commit the range we've copied.  This
@@ -105,8 +105,9 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 }
 
 static int
-iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
-		struct page **pagep, struct iomap *iomap)
+iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
+		unsigned aop_flags, struct page **pagep, struct iomap *iomap,
+		unsigned flags)
 {
 	pgoff_t index = pos >> PAGE_SHIFT;
 	struct page *page;
@@ -114,7 +115,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 
 	BUG_ON(pos + len > iomap->offset + iomap->length);
 
-	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
+	page = grab_cache_page_write_begin(inode->i_mapping, index, aop_flags);
 	if (!page)
 		return -ENOMEM;
 
@@ -146,18 +147,18 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, unsigned flags)
 {
 	struct iov_iter *i = data;
 	long status = 0;
 	ssize_t written = 0;
-	unsigned int flags = AOP_FLAG_NOFS;
+	unsigned int aop_flags = AOP_FLAG_NOFS;
 
 	/*
 	 * Copies from kernel address space cannot fail (NFSD is a big user).
 	 */
 	if (!iter_is_iovec(i))
-		flags |= AOP_FLAG_UNINTERRUPTIBLE;
+		aop_flags |= AOP_FLAG_UNINTERRUPTIBLE;
 
 	do {
 		struct page *page;
@@ -187,8 +188,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			break;
 		}
 
-		status = iomap_write_begin(inode, pos, bytes, flags, &page,
-				iomap);
+		status = iomap_write_begin(inode, pos, bytes, aop_flags, &page,
+				iomap, flags);
 		if (unlikely(status))
 			break;
 
@@ -268,7 +269,7 @@ __iomap_read_page(struct inode *inode, loff_t offset)
 
 static loff_t
 iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, unsigned flags)
 {
 	long status = 0;
 	ssize_t written = 0;
@@ -287,7 +288,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 		status = iomap_write_begin(inode, pos, bytes,
 				AOP_FLAG_NOFS | AOP_FLAG_UNINTERRUPTIBLE,
-				&page, iomap);
+				&page, iomap, flags);
 		put_page(rpage);
 		if (unlikely(status))
 			return status;
@@ -333,13 +334,14 @@ iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
 EXPORT_SYMBOL_GPL(iomap_file_dirty);
 
 static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
-		unsigned bytes, struct iomap *iomap)
+		unsigned bytes, struct iomap *iomap, unsigned flags)
 {
 	struct page *page;
 	int status;
 
 	status = iomap_write_begin(inode, pos, bytes,
-			AOP_FLAG_UNINTERRUPTIBLE | AOP_FLAG_NOFS, &page, iomap);
+			AOP_FLAG_UNINTERRUPTIBLE | AOP_FLAG_NOFS, &page, iomap,
+			flags);
 	if (status)
 		return status;
 
@@ -360,7 +362,7 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
 
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-		void *data, struct iomap *iomap)
+		void *data, struct iomap *iomap, unsigned flags)
 {
 	bool *did_zero = data;
 	loff_t written = 0;
@@ -379,7 +381,8 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		if (IS_DAX(inode))
 			status = iomap_dax_zero(pos, offset, bytes, iomap);
 		else
-			status = iomap_zero(inode, pos, offset, bytes, iomap);
+			status = iomap_zero(inode, pos, offset, bytes, iomap,
+					flags);
 		if (status < 0)
 			return status;
 
@@ -429,7 +432,7 @@ EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
 static loff_t
 iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap)
+		void *data, struct iomap *iomap, unsigned flags)
 {
 	struct page *page = data;
 	int ret;
@@ -521,7 +524,7 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
 
 static loff_t
 iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, unsigned flags)
 {
 	struct fiemap_ctx *ctx = data;
 	loff_t ret = length;
@@ -730,7 +733,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 
 static loff_t
 iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap)
+		void *data, struct iomap *iomap, unsigned flags)
 {
 	struct iomap_dio *dio = data;
 	unsigned blkbits = blksize_bits(bdev_logical_block_size(iomap->bdev));
-- 
2.11.0

* [PATCH 03/12] iomap: add a IOMAP_ATOMIC flag
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
  2017-02-28 14:57 ` [PATCH 01/12] uapi/fs: add O_ATOMIC to the open flags Christoph Hellwig
  2017-02-28 14:57 ` [PATCH 02/12] iomap: pass IOMAP_* flags to actors Christoph Hellwig
@ 2017-02-28 14:57 ` Christoph Hellwig
  2017-02-28 14:57 ` [PATCH 04/12] fs: add a BH_Atomic flag Christoph Hellwig
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-block

This passes O_ATOMIC through to the iomap_begin methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c            | 13 +++++++++++--
 include/linux/iomap.h |  1 +
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 16a9d2b89cb6..096cbf573932 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -237,6 +237,10 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
 {
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
 	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
+	unsigned flags = IOMAP_WRITE;
+
+	if (iocb->ki_filp->f_flags & O_ATOMIC)
+		flags |= IOMAP_ATOMIC;
 
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter),
@@ -452,8 +456,12 @@ int iomap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	struct inode *inode = file_inode(vma->vm_file);
 	unsigned long length;
 	loff_t offset, size;
+	unsigned flags = IOMAP_WRITE | IOMAP_FAULT;
 	ssize_t ret;
 
+	if (vma->vm_file->f_flags & O_ATOMIC)
+		flags |= IOMAP_ATOMIC;
+
 	lock_page(page);
 	size = i_size_read(inode);
 	if ((page->mapping != inode->i_mapping) ||
@@ -471,8 +479,7 @@ int iomap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 
 	offset = page_offset(page);
 	while (length > 0) {
-		ret = iomap_apply(inode, offset, length,
-				IOMAP_WRITE | IOMAP_FAULT, ops, page,
+		ret = iomap_apply(inode, offset, length, flags, ops, page,
 				iomap_page_mkwrite_actor);
 		if (unlikely(ret <= 0))
 			goto out_unlock;
@@ -883,6 +890,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	} else {
 		dio->flags |= IOMAP_DIO_WRITE;
 		flags |= IOMAP_WRITE;
+		if (iocb->ki_filp->f_flags & O_ATOMIC)
+			flags |= IOMAP_ATOMIC;
 	}
 
 	if (mapping->nrpages) {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 891459caa278..a670ff18ccd6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -51,6 +51,7 @@ struct iomap {
 #define IOMAP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
+#define IOMAP_ATOMIC		(1 << 5) /* atomic write vs power fail */
 
 struct iomap_ops {
 	/*
-- 
2.11.0
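
Patches 02 and 03 together form a small pipeline: the open flag is
translated into an IOMAP_* flag once per operation, and that flag word is
then handed to the actor.  The userspace sketch below models only this
data flow.  All EX_* and ex_* names are hypothetical; the flag values
mirror the ones in this series, with IOMAP_WRITE assumed to be (1 << 0).

```c
#include <assert.h>

#define EX_O_ATOMIC		040000000	/* value proposed in patch 01 */
#define EX_IOMAP_WRITE		(1 << 0)	/* assumed value */
#define EX_IOMAP_ATOMIC		(1 << 5)	/* value added by this patch */

/* Hypothetical stand-in for an iomap actor: it only sees IOMAP_* flags. */
typedef unsigned (*ex_actor_t)(unsigned flags);

/* Translate file open flags into the flag word for a write operation. */
static unsigned ex_write_flags(unsigned f_flags)
{
	unsigned flags = EX_IOMAP_WRITE;

	if (f_flags & EX_O_ATOMIC)
		flags |= EX_IOMAP_ATOMIC;
	return flags;
}

/* Stand-in for iomap_apply(): forwards the flag word to the actor. */
static unsigned ex_apply(unsigned f_flags, ex_actor_t actor)
{
	return actor(ex_write_flags(f_flags));
}

/* Example actor: reports whether it was asked for an atomic write. */
static unsigned ex_is_atomic_actor(unsigned flags)
{
	return !!(flags & EX_IOMAP_ATOMIC);
}
```

Keeping the translation in one place per entry point means the actors
never need to look at the struct file themselves.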

* [PATCH 04/12] fs: add a BH_Atomic flag
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
                   ` (2 preceding siblings ...)
  2017-02-28 14:57 ` [PATCH 03/12] iomap: add a IOMAP_ATOMIC flag Christoph Hellwig
@ 2017-02-28 14:57 ` Christoph Hellwig
  2017-02-28 14:57 ` [PATCH 05/12] fs: add a F_IOINFO fcntl Christoph Hellwig
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-block

This allows us to propagate the O_ATOMIC flag through the writeback code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/buffer.c                 | 13 +++++++++----
 fs/internal.h               |  2 +-
 fs/iomap.c                  |  4 ++--
 include/linux/buffer_head.h |  2 ++
 4 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 0e87401cf335..85b0dce31b34 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1937,7 +1937,7 @@ EXPORT_SYMBOL(page_zero_new_buffers);
 
 static void
 iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
-		struct iomap *iomap)
+		struct iomap *iomap, unsigned flags)
 {
 	loff_t offset = block << inode->i_blkbits;
 
@@ -1987,10 +1987,15 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
 		set_buffer_mapped(bh);
 		break;
 	}
+
+	if (flags & IOMAP_ATOMIC)
+		set_buffer_atomic(bh);
+	else
+		clear_buffer_atomic(bh);
 }
 
 int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
-		get_block_t *get_block, struct iomap *iomap)
+		get_block_t *get_block, struct iomap *iomap, unsigned flags)
 {
 	unsigned from = pos & (PAGE_SIZE - 1);
 	unsigned to = from + len;
@@ -2031,7 +2036,7 @@ int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
 				if (err)
 					break;
 			} else {
-				iomap_to_bh(inode, block, bh, iomap);
+				iomap_to_bh(inode, block, bh, iomap, flags);
 			}
 
 			if (buffer_new(bh)) {
@@ -2077,7 +2082,7 @@ int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
 int __block_write_begin(struct page *page, loff_t pos, unsigned len,
 		get_block_t *get_block)
 {
-	return __block_write_begin_int(page, pos, len, get_block, NULL);
+	return __block_write_begin_int(page, pos, len, get_block, NULL, 0);
 }
 EXPORT_SYMBOL(__block_write_begin);
 
diff --git a/fs/internal.h b/fs/internal.h
index 1934fdb2bb27..5f4cbdedafdb 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -42,7 +42,7 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
  */
 extern void guard_bio_eod(int rw, struct bio *bio);
 extern int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
-		get_block_t *get_block, struct iomap *iomap);
+		get_block_t *get_block, struct iomap *iomap, unsigned flags);
 
 /*
  * char_dev.c
diff --git a/fs/iomap.c b/fs/iomap.c
index 096cbf573932..3c3c09104dcd 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -119,7 +119,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
 	if (!page)
 		return -ENOMEM;
 
-	status = __block_write_begin_int(page, pos, len, NULL, iomap);
+	status = __block_write_begin_int(page, pos, len, NULL, iomap, flags);
 	if (unlikely(status)) {
 		unlock_page(page);
 		put_page(page);
@@ -441,7 +441,7 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct page *page = data;
 	int ret;
 
-	ret = __block_write_begin_int(page, pos, length, NULL, iomap);
+	ret = __block_write_begin_int(page, pos, length, NULL, iomap, flags);
 	if (ret)
 		return ret;
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index d67ab83823ad..baff49fdfbe8 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -37,6 +37,7 @@ enum bh_state_bits {
 	BH_Meta,	/* Buffer contains metadata */
 	BH_Prio,	/* Buffer should be submitted with REQ_PRIO */
 	BH_Defer_Completion, /* Defer AIO completion to workqueue */
+	BH_Atomic,	/* part of an O_ATOMIC write */
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
@@ -130,6 +131,7 @@ BUFFER_FNS(Unwritten, unwritten)
 BUFFER_FNS(Meta, meta)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)
+BUFFER_FNS(Atomic, atomic);
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
 
-- 
2.11.0
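
For readers unfamiliar with BUFFER_FNS(), the sketch below is a
simplified userspace model of what the new BH_Atomic bit provides.  The
EX_* and ex_* names are hypothetical, and plain bit operations stand in
for the kernel's atomic set_bit()/clear_bit()/test_bit(), so this shows
the state handling only, not the real concurrency behaviour.

```c
#include <assert.h>
#include <stdbool.h>

enum ex_bh_state_bits { EX_BH_Mapped, EX_BH_Atomic };

struct ex_buffer_head {
	unsigned long b_state;	/* bitmap of EX_BH_* state bits */
};

/* Simplified, non-atomic stand-in for the kernel's BUFFER_FNS() macro. */
#define EX_BUFFER_FNS(bit, name)					\
static inline void ex_set_buffer_##name(struct ex_buffer_head *bh)	\
{									\
	bh->b_state |= 1UL << EX_BH_##bit;				\
}									\
static inline void ex_clear_buffer_##name(struct ex_buffer_head *bh)	\
{									\
	bh->b_state &= ~(1UL << EX_BH_##bit);				\
}									\
static inline bool ex_buffer_##name(const struct ex_buffer_head *bh)	\
{									\
	return bh->b_state & (1UL << EX_BH_##bit);			\
}

EX_BUFFER_FNS(Atomic, atomic)

#define EX_IOMAP_ATOMIC	(1 << 5)

/* Mirrors the tail added to iomap_to_bh(): tag or untag the buffer. */
static void ex_tag_buffer(struct ex_buffer_head *bh, unsigned flags)
{
	if (flags & EX_IOMAP_ATOMIC)
		ex_set_buffer_atomic(bh);
	else
		ex_clear_buffer_atomic(bh);
}
```

Clearing the bit in the non-atomic case matters because a buffer head can
be reused by a later write that was not issued through an O_ATOMIC
descriptor.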

* [PATCH 05/12] fs: add a F_IOINFO fcntl
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
                   ` (3 preceding siblings ...)
  2017-02-28 14:57 ` [PATCH 04/12] fs: add a BH_Atomic flag Christoph Hellwig
@ 2017-02-28 14:57 ` Christoph Hellwig
  2017-02-28 16:51   ` Darrick J. Wong
  2017-02-28 14:57 ` [PATCH 06/12] xfs: cleanup is_reflink checks Christoph Hellwig
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-block

This fcntl can be used to query I/O parameters for the given file
descriptor.  Initially it is used for the I/O alignment and atomic
write parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/fcntl.c                 | 18 ++++++++++++++++++
 include/linux/fs.h         |  1 +
 include/uapi/linux/fcntl.h | 16 ++++++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ca5d228be7ea..248fb4cc66a6 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -241,6 +241,21 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
 }
 #endif
 
+static int fcntl_ioinfo(struct file *file, void __user *argp)
+{
+	struct fcntl_ioinfo fio = { 0, };
+
+	if (file->f_op->ioinfo) {
+		int ret = file->f_op->ioinfo(file, &fio);
+		if (ret)
+			return ret;
+	}
+
+	if (copy_to_user(argp, &fio, sizeof(fio)))
+		return -EFAULT;
+	return 0;
+}
+
 static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		struct file *filp)
 {
@@ -335,6 +350,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_GET_SEALS:
 		err = shmem_fcntl(filp, cmd, arg);
 		break;
+	case F_IOINFO:
+		err = fcntl_ioinfo(filp, (void __user *)arg);
+		break;
 	default:
 		break;
 	}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2ba074328894..33b08a8c2bc3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1680,6 +1680,7 @@ struct file_operations {
 			u64);
 	ssize_t (*dedupe_file_range)(struct file *, u64, u64, struct file *,
 			u64);
+	int (*ioinfo)(struct file *, struct fcntl_ioinfo *);
 };
 
 struct inode_operations {
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index beed138bd359..6b0aaba7c623 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -42,6 +42,22 @@
 #define F_SEAL_WRITE	0x0008	/* prevent writes */
 /* (1U << 31) is reserved for signed error codes */
 
+
+#define F_IOINFO	(F_LINUX_SPECIFIC_BASE +  11)
+
+struct fcntl_ioinfo {
+	__u16		fio_flags;	/* FIO_FL_* */
+	__u16		fio_alignment;	/* required I/O alignment on disk */
+	__u32		__reserved1;	/* must be zero */
+	__u64		fio_max_atomic;	/* max size for atomic writes */
+	__u64		__reserved2[14];/* must be zero */
+};
+
+/* supports atomic writes using O_(D)SYNC */
+#define FIO_FL_ATOMIC_OSYNC	(1 << 0)
+/* supports atomic writes committed using fsync/fdatasync/msync */
+#define FIO_FL_ATOMIC_FSYNC	(1 << 1)
+
 /*
  * Types of directory notifications that may be requested.
  */
-- 
2.11.0
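
The structure above is laid out so that it has no implicit padding and a
fixed size of 128 bytes (2 + 2 + 4 + 8 + 14 * 8).  A userspace mirror of
it, using <stdint.h> types instead of the kernel's __u16/__u32/__u64, can
verify that at compile time.  The helper ioinfo_reserved_is_zero() is a
hypothetical addition showing how a caller might sanity-check the
"must be zero" fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the structure proposed in this patch. */
struct fcntl_ioinfo {
	uint16_t	fio_flags;	/* FIO_FL_* */
	uint16_t	fio_alignment;	/* required I/O alignment on disk */
	uint32_t	__reserved1;	/* must be zero */
	uint64_t	fio_max_atomic;	/* max size for atomic writes */
	uint64_t	__reserved2[14];/* must be zero */
};

/* Compile-time layout checks: no padding, 128 bytes in total. */
_Static_assert(offsetof(struct fcntl_ioinfo, fio_alignment) == 2,
	       "fio_alignment offset");
_Static_assert(offsetof(struct fcntl_ioinfo, fio_max_atomic) == 8,
	       "fio_max_atomic offset");
_Static_assert(sizeof(struct fcntl_ioinfo) == 128,
	       "struct fcntl_ioinfo size");

/* Hypothetical helper: return 1 if all reserved fields are zero. */
static int ioinfo_reserved_is_zero(const struct fcntl_ioinfo *fio)
{
	size_t i;

	if (fio->__reserved1)
		return 0;
	for (i = 0; i < 14; i++)
		if (fio->__reserved2[i])
			return 0;
	return 1;
}
```

The large reserved area leaves room to report further I/O parameters
later without changing the size of the structure.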

* [PATCH 06/12] xfs: cleanup is_reflink checks
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
                   ` (4 preceding siblings ...)
  2017-02-28 14:57 ` [PATCH 05/12] fs: add a F_IOINFO fcntl Christoph Hellwig
@ 2017-02-28 14:57 ` Christoph Hellwig
  2017-02-28 14:57 ` [PATCH 07/12] xfs: implement failure-atomic writes Christoph Hellwig
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-block

We'll soon need to distinguish between inodes that actually are reflinked,
and those that just use the COW fork for atomic write operations.  Switch
a few places to check for the existence of a COW fork instead of the
reflink flag to prepare for that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c    | 2 +-
 fs/xfs/xfs_icache.c  | 2 +-
 fs/xfs/xfs_reflink.c | 7 +++----
 fs/xfs/xfs_super.c   | 2 +-
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index fe244648fff0..c78b585b3d84 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -885,7 +885,7 @@ xfs_writepage_map(
 			continue;
 		}
 
-		if (xfs_is_reflink_inode(XFS_I(inode))) {
+		if (XFS_I(inode)->i_cowfp) {
 			error = xfs_map_cow(wpc, inode, offset, &new_type);
 			if (error)
 				goto out;
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 1f7d158266c1..1673a41db731 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1574,7 +1574,7 @@ xfs_inode_free_cowblocks(
 	 * Just clear the tag if we have an empty cow fork or none at all. It's
 	 * possible the inode was fully unshared since it was originally tagged.
 	 */
-	if (!xfs_is_reflink_inode(ip) || !ifp->if_bytes) {
+	if (!ifp || !ifp->if_bytes) {
 		trace_xfs_inode_free_cowblocks_invalid(ip);
 		xfs_inode_clear_cowblocks_tag(ip);
 		return 0;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 83605af3b135..4225b5e67b17 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -496,7 +496,6 @@ xfs_reflink_find_cow_mapping(
 	xfs_extnum_t			idx;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED));
-	ASSERT(xfs_is_reflink_inode(ip));
 
 	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
 	if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, &idx, &got))
@@ -523,7 +522,7 @@ xfs_reflink_trim_irec_to_next_cow(
 	struct xfs_bmbt_irec		got;
 	xfs_extnum_t			idx;
 
-	if (!xfs_is_reflink_inode(ip))
+	if (!ifp)
 		return;
 
 	/* Find the extent in the CoW fork. */
@@ -561,7 +560,7 @@ xfs_reflink_cancel_cow_blocks(
 	struct xfs_defer_ops		dfops;
 	int				error = 0;
 
-	if (!xfs_is_reflink_inode(ip))
+	if (!ifp)
 		return 0;
 	if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, &idx, &got))
 		return 0;
@@ -634,7 +633,7 @@ xfs_reflink_cancel_cow_range(
 	int			error;
 
 	trace_xfs_reflink_cancel_cow_range(ip, offset, count);
-	ASSERT(xfs_is_reflink_inode(ip));
+	ASSERT(ip->i_cowfp);
 
 	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
 	if (count == NULLFILEOFF)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 9136854030d5..868860354f09 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -952,7 +952,7 @@ xfs_fs_destroy_inode(
 	XFS_STATS_INC(ip->i_mount, vn_rele);
 	XFS_STATS_INC(ip->i_mount, vn_remove);
 
-	if (xfs_is_reflink_inode(ip)) {
+	if (ip->i_cowfp) {
 		error = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, 0);
 		if (error && !XFS_FORCED_SHUTDOWN(ip->i_mount))
 			xfs_warn(ip->i_mount,
-- 
2.11.0

* [PATCH 07/12] xfs: implement failure-atomic writes
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
                   ` (5 preceding siblings ...)
  2017-02-28 14:57 ` [PATCH 06/12] xfs: cleanup is_reflink checks Christoph Hellwig
@ 2017-02-28 14:57 ` Christoph Hellwig
  2017-02-28 23:09   ` Darrick J. Wong
  2017-02-28 14:57 ` [PATCH 08/12] xfs: implement the F_IOINFO fcntl Christoph Hellwig
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-block

If O_ATOMIC is specified in the open flags this will cause XFS to
allocate new extents in the COW fork even if overwriting existing data,
and not remap them into the data fork until ->fsync is called,
at which point the whole range will be atomically remapped into the
data fork.  This allows applications to safely overwrite data instead
of having to do double writes.
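
As a toy model of the idea (not of the XFS code), the sketch below stages
writes out of place and only makes them visible on an explicit commit.
All ex_* names are hypothetical.  The commit here is a plain loop; in the
real implementation the atomicity of the remap comes from the file
system's transaction machinery, which this model does not attempt to
reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define EX_NBLOCKS	8

struct ex_file {
	char data[EX_NBLOCKS];		/* stands in for the data fork */
	char staged[EX_NBLOCKS];	/* stands in for the COW fork */
	bool has_staged[EX_NBLOCKS];	/* which blocks have staged data */
};

/* Write out of place: the committed data is left untouched. */
static bool ex_write(struct ex_file *f, unsigned blk, char val)
{
	if (blk >= EX_NBLOCKS)
		return false;
	f->staged[blk] = val;
	f->has_staged[blk] = true;
	return true;
}

/* Commit: remap every staged block into the committed data. */
static void ex_fsync(struct ex_file *f)
{
	unsigned i;

	for (i = 0; i < EX_NBLOCKS; i++)
		if (f->has_staged[i])
			f->data[i] = f->staged[i];
	memset(f->has_staged, 0, sizeof(f->has_staged));
}

/* Failure before commit: staged blocks are simply dropped. */
static void ex_crash(struct ex_file *f)
{
	memset(f->has_staged, 0, sizeof(f->has_staged));
}
```

After ex_crash() the file still holds the old contents in full, and after
ex_fsync() it holds the new contents in full, which is the property an
application otherwise has to build itself with double writes.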

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c    | 18 +++++++++-----
 fs/xfs/xfs_aops.h    |  4 ++-
 fs/xfs/xfs_file.c    | 17 +++++++++++++
 fs/xfs/xfs_iomap.c   | 18 ++++++++------
 fs/xfs/xfs_reflink.c | 69 ++++++++++++++++++++++++++++++++++------------------
 fs/xfs/xfs_reflink.h |  5 ++--
 6 files changed, 91 insertions(+), 40 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index c78b585b3d84..1c5efbb05b47 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -292,6 +292,7 @@ xfs_end_io(
 	if (unlikely(error)) {
 		switch (ioend->io_type) {
 		case XFS_IO_COW:
+		case XFS_IO_ATOMIC:
 			xfs_reflink_cancel_cow_range(ip, offset, size, 0);
 			break;
 		}
@@ -327,7 +328,9 @@ xfs_end_bio(
 	struct xfs_ioend	*ioend = bio->bi_private;
 	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
 
-	if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_type == XFS_IO_COW)
+	if (ioend->io_type == XFS_IO_UNWRITTEN ||
+	    ioend->io_type == XFS_IO_COW ||
+	    ioend->io_type == XFS_IO_ATOMIC)
 		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
 	else if (ioend->io_append_trans)
 		queue_work(mp->m_data_workqueue, &ioend->io_work);
@@ -354,6 +357,7 @@ xfs_map_blocks(
 		return -EIO;
 
 	ASSERT(type != XFS_IO_COW);
+	ASSERT(type != XFS_IO_ATOMIC);
 	if (type == XFS_IO_UNWRITTEN)
 		bmapi_flags |= XFS_BMAPI_IGSTATE;
 
@@ -768,7 +772,8 @@ xfs_map_cow(
 	struct xfs_writepage_ctx *wpc,
 	struct inode		*inode,
 	loff_t			offset,
-	unsigned int		*new_type)
+	unsigned int		*new_type,
+	bool			atomic)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_bmbt_irec	imap;
@@ -778,10 +783,10 @@ xfs_map_cow(
 	/*
 	 * If we already have a valid COW mapping keep using it.
 	 */
-	if (wpc->io_type == XFS_IO_COW) {
+	if (wpc->io_type == XFS_IO_COW || wpc->io_type == XFS_IO_ATOMIC) {
 		wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap, offset);
 		if (wpc->imap_valid) {
-			*new_type = XFS_IO_COW;
+			*new_type = wpc->io_type;
 			return 0;
 		}
 	}
@@ -807,7 +812,7 @@ xfs_map_cow(
 			return error;
 	}
 
-	wpc->io_type = *new_type = XFS_IO_COW;
+	wpc->io_type = *new_type = atomic ? XFS_IO_ATOMIC : XFS_IO_COW;
 	wpc->imap_valid = true;
 	wpc->imap = imap;
 	return 0;
@@ -886,7 +891,8 @@ xfs_writepage_map(
 		}
 
 		if (XFS_I(inode)->i_cowfp) {
-			error = xfs_map_cow(wpc, inode, offset, &new_type);
+			error = xfs_map_cow(wpc, inode, offset, &new_type,
+					buffer_atomic(bh));
 			if (error)
 				goto out;
 		}
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index cc174ec6c2fd..798e653e68b6 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -29,6 +29,7 @@ enum {
 	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
 	XFS_IO_OVERWRITE,	/* covers already allocated extent */
 	XFS_IO_COW,		/* covers copy-on-write extent */
+	XFS_IO_ATOMIC,		/* atomic write */
 };
 
 #define XFS_IO_TYPES \
@@ -36,7 +37,8 @@ enum {
 	{ XFS_IO_DELALLOC,		"delalloc" }, \
 	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
 	{ XFS_IO_OVERWRITE,		"overwrite" }, \
-	{ XFS_IO_COW,			"CoW" }
+	{ XFS_IO_COW,			"CoW" }, \
+	{ XFS_IO_ATOMIC,		"atomic" }
 
 /*
  * Structure for buffered I/O completions.
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 086440e79b86..a7d8324b59c5 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -160,6 +160,12 @@ xfs_file_fsync(
 	else if (mp->m_logdev_targp != mp->m_ddev_targp)
 		xfs_blkdev_issue_flush(mp->m_ddev_targp);
 
+	if (file->f_flags & O_ATOMIC) {
+		error = xfs_reflink_end_cow(ip, start, end - start + 1);
+		if (error)
+			return error;
+	}
+
 	/*
 	 * All metadata updates are logged, which means that we just have to
 	 * flush the log up to the latest LSN that touched the inode. If we have
@@ -457,6 +463,9 @@ xfs_dio_write_end_io(
 	}
 	spin_unlock(&ip->i_flags_lock);
 
+	if (iocb->ki_filp->f_flags & O_ATOMIC)
+		return 0;
+
 	if (flags & IOMAP_DIO_COW) {
 		error = xfs_reflink_end_cow(ip, offset, size);
 		if (error)
@@ -529,6 +538,12 @@ xfs_file_dio_aio_write(
 		unaligned_io = 1;
 
 		/*
+		 * We need filesystem block alignment to provide atomic commits.
+		 */
+		if (file->f_flags & O_ATOMIC)
+			return -EINVAL;
+
+		/*
 		 * We can't properly handle unaligned direct I/O to reflink
 		 * files yet, as we can't unshare a partial block.
 		 */
@@ -892,6 +907,8 @@ xfs_file_open(
 		return -EFBIG;
 	if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb)))
 		return -EIO;
+	if (file->f_flags & O_ATOMIC)
+		printk_ratelimited("O_ATOMIC!\n");
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5d68b4279016..b686a6bd2db4 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -559,13 +559,14 @@ xfs_file_iomap_begin_delay(
 
 	eof = !xfs_iext_lookup_extent(ip, ifp, offset_fsb, &idx, &got);
 	if (!eof && got.br_startoff <= offset_fsb) {
-		if (xfs_is_reflink_inode(ip)) {
+		if ((flags & IOMAP_ATOMIC) || xfs_is_reflink_inode(ip)) {
 			bool		shared;
 
 			end_fsb = min(XFS_B_TO_FSB(mp, offset + count),
 					maxbytes_fsb);
 			xfs_trim_extent(&got, offset_fsb, end_fsb - offset_fsb);
-			error = xfs_reflink_reserve_cow(ip, &got, &shared);
+			error = xfs_reflink_reserve_cow(ip, &got, &shared,
+					(flags & IOMAP_ATOMIC));
 			if (error)
 				goto out_unlock;
 		}
@@ -951,7 +952,7 @@ static inline bool need_excl_ilock(struct xfs_inode *ip, unsigned flags)
 	 */
 	if (xfs_is_reflink_inode(ip) && (flags & (IOMAP_WRITE | IOMAP_ZERO)))
 		return true;
-	if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE))
+	if ((flags & (IOMAP_DIRECT | IOMAP_ATOMIC)) && (flags & IOMAP_WRITE))
 		return true;
 	return false;
 }
@@ -976,7 +977,8 @@ xfs_file_iomap_begin(
 		return -EIO;
 
 	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
-			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
+	    ((flags & IOMAP_ATOMIC) ||
+	     (!IS_DAX(inode) && !xfs_get_extsz_hint(ip)))) {
 		/* Reserve delalloc blocks for regular writeback. */
 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
 				iomap);
@@ -1008,15 +1010,17 @@ xfs_file_iomap_begin(
 			goto out_unlock;
 	}
 
-	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
+	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
+	    ((flags & IOMAP_ATOMIC) || xfs_is_reflink_inode(ip))) {
 		if (flags & IOMAP_DIRECT) {
 			/* may drop and re-acquire the ilock */
 			error = xfs_reflink_allocate_cow(ip, &imap, &shared,
-					&lockmode);
+					&lockmode, flags & IOMAP_ATOMIC);
 			if (error)
 				goto out_unlock;
 		} else {
-			error = xfs_reflink_reserve_cow(ip, &imap, &shared);
+			error = xfs_reflink_reserve_cow(ip, &imap, &shared,
+					false);
 			if (error)
 				goto out_unlock;
 		}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 4225b5e67b17..4702dd800ab8 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -264,9 +264,9 @@ int
 xfs_reflink_reserve_cow(
 	struct xfs_inode	*ip,
 	struct xfs_bmbt_irec	*imap,
-	bool			*shared)
+	bool			*shared,
+	bool			always_cow)
 {
-	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
 	struct xfs_bmbt_irec	got;
 	int			error = 0;
 	bool			eof = false, trimmed;
@@ -280,26 +280,30 @@ xfs_reflink_reserve_cow(
 	 * extent list is generally faster than going out to the shared extent
 	 * tree.
 	 */
-
-	if (!xfs_iext_lookup_extent(ip, ifp, imap->br_startoff, &idx, &got))
+	if (!ip->i_cowfp) {
+		ASSERT(always_cow);
+		xfs_ifork_init_cow(ip);
 		eof = true;
-	if (!eof && got.br_startoff <= imap->br_startoff) {
-		trace_xfs_reflink_cow_found(ip, imap);
-		xfs_trim_extent(imap, got.br_startoff, got.br_blockcount);
+	} else {
+		if (!xfs_iext_lookup_extent(ip, ip->i_cowfp, imap->br_startoff,
+				&idx, &got))
+			eof = true;
+		if (!eof && got.br_startoff <= imap->br_startoff) {
+			trace_xfs_reflink_cow_found(ip, imap);
+			xfs_trim_extent(imap, got.br_startoff,
+					got.br_blockcount);
+
+			*shared = true;
+			return 0;
+		}
 
-		*shared = true;
-		return 0;
+		/* Trim the mapping to the nearest shared extent boundary. */
+		error = xfs_reflink_trim_around_shared(ip, imap, shared,
+				&trimmed);
+		if (error || !*shared)
+			return error;
 	}
 
-	/* Trim the mapping to the nearest shared extent boundary. */
-	error = xfs_reflink_trim_around_shared(ip, imap, shared, &trimmed);
-	if (error)
-		return error;
-
-	/* Not shared?  Just report the (potentially capped) extent. */
-	if (!*shared)
-		return 0;
-
 	/*
 	 * Fork all the shared blocks from our write offset until the end of
 	 * the extent.
@@ -383,7 +387,8 @@ xfs_reflink_allocate_cow(
 	struct xfs_inode	*ip,
 	struct xfs_bmbt_irec	*imap,
 	bool			*shared,
-	uint			*lockmode)
+	uint			*lockmode,
+	bool			always_cow)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		offset_fsb = imap->br_startoff;
@@ -399,15 +404,19 @@ xfs_reflink_allocate_cow(
 	xfs_extnum_t		idx;
 
 retry:
-	ASSERT(xfs_is_reflink_inode(ip));
+	ASSERT(always_cow || xfs_is_reflink_inode(ip));
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED));
 
+	if (!ip->i_cowfp) {
+		ASSERT(always_cow);
+		xfs_ifork_init_cow(ip);
+
 	/*
 	 * Even if the extent is not shared we might have a preallocation for
 	 * it in the COW fork.  If so use it.
 	 */
-	if (xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &idx, &got) &&
-	    got.br_startoff <= offset_fsb) {
+	} else if (xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &idx,
+			&got) && got.br_startoff <= offset_fsb) {
 		*shared = true;
 
 		/* If we have a real allocation in the COW fork we're done. */
@@ -418,7 +427,7 @@ xfs_reflink_allocate_cow(
 		}
 
 		xfs_trim_extent(imap, got.br_startoff, got.br_blockcount);
-	} else {
+	} else if (!always_cow) {
 		error = xfs_reflink_trim_around_shared(ip, imap, shared, &trimmed);
 		if (error || !*shared)
 			goto out;
@@ -684,6 +693,7 @@ xfs_reflink_end_cow(
 	xfs_fileoff_t			offset_fsb;
 	xfs_fileoff_t			end_fsb;
 	xfs_fsblock_t			firstfsb;
+	xfs_off_t			new_size;
 	struct xfs_defer_ops		dfops;
 	int				error;
 	unsigned int			resblks;
@@ -693,7 +703,7 @@ xfs_reflink_end_cow(
 	trace_xfs_reflink_end_cow(ip, offset, count);
 
 	/* No COW extents?  That's easy! */
-	if (ifp->if_bytes == 0)
+	if (!ifp || ifp->if_bytes == 0)
 		return 0;
 
 	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
@@ -776,6 +786,17 @@ xfs_reflink_end_cow(
 			break;
 	}
 
+	/*
+	 * Update the on-disk inode size if we completed an operation outside
+	 * of the inode size.  This can only happen for atomic writes, and not
+	 * for actual reflinked files.
+	 */
+	new_size = xfs_new_eof(ip, offset + count);
+	if (new_size) {
+		ip->i_d.di_size = new_size;
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	}
+
 	error = xfs_trans_commit(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	if (error)
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 9416279b3c89..0360e2c0f3a5 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -27,9 +27,10 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
 		struct xfs_bmbt_irec *irec, bool *shared, bool *trimmed);
 
 extern int xfs_reflink_reserve_cow(struct xfs_inode *ip,
-		struct xfs_bmbt_irec *imap, bool *shared);
+		struct xfs_bmbt_irec *imap, bool *shared, bool always_cow);
 extern int xfs_reflink_allocate_cow(struct xfs_inode *ip,
-		struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode);
+		struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode,
+		bool always_cow);
 extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 08/12] xfs: implement the F_IOINFO fcntl
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
                   ` (6 preceding siblings ...)
  2017-02-28 14:57 ` [PATCH 07/12] xfs: implement failure-atomic writes Christoph Hellwig
@ 2017-02-28 14:57 ` Christoph Hellwig
  2017-02-28 14:57 ` [PATCH 09/12] block: advertize max atomic write limit Christoph Hellwig
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-block

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a7d8324b59c5..4d955b3266df 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -898,6 +898,39 @@ xfs_file_dedupe_range(
 	return len;
 }
 
+static uint16_t
+xfs_dio_alignment(struct file *file)
+{
+	struct xfs_inode	*ip = XFS_I(file_inode(file));
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (file->f_flags & O_ATOMIC)
+		return mp->m_sb.sb_blocksize;
+	else if (XFS_IS_REALTIME_INODE(ip))
+		return mp->m_rtdev_targp->bt_logical_sectorsize;
+	else
+		return mp->m_ddev_targp->bt_logical_sectorsize;
+}
+
+static int
+xfs_file_ioinfo(
+	struct file		*file,
+	struct fcntl_ioinfo	*fio)
+{
+	if (file->f_flags & O_DIRECT)
+		fio->fio_alignment = xfs_dio_alignment(file);
+
+	if (file->f_flags & O_ATOMIC) {
+		fio->fio_flags = FIO_FL_ATOMIC_OSYNC | FIO_FL_ATOMIC_FSYNC;
+		fio->fio_max_atomic = INT_MAX;
+		if (fio->fio_alignment)
+			fio->fio_max_atomic &= ~(fio->fio_alignment - 1);
+	}
+
+
+	return 0;
+}
+
 STATIC int
 xfs_file_open(
 	struct inode	*inode,
@@ -1556,6 +1589,7 @@ const struct file_operations xfs_file_operations = {
 	.fallocate	= xfs_file_fallocate,
 	.clone_file_range = xfs_file_clone_range,
 	.dedupe_file_range = xfs_file_dedupe_range,
+	.ioinfo		= xfs_file_ioinfo,
 };
 
 const struct file_operations xfs_dir_file_operations = {
-- 
2.11.0
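
For readers following the arithmetic in xfs_file_ioinfo(): the mask
expression rounds fio_max_atomic down to a multiple of fio_alignment,
which is only valid when the alignment is a power of two.  A minimal
userspace sketch of that rounding follows; round_down_pow2() is a name
made up for this illustration and is not part of the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): round v down to a multiple of
 * align.  The mask trick only works for power-of-two alignments, so that
 * precondition is asserted.
 */
static uint64_t round_down_pow2(uint64_t v, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return v & ~(align - 1);
}
```

With a 4096-byte filesystem block size, round_down_pow2(INT_MAX, 4096)
yields 0x7ffff000, i.e. the largest block-aligned value below INT_MAX.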

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 09/12] block: advertize max atomic write limit
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
                   ` (7 preceding siblings ...)
  2017-02-28 14:57 ` [PATCH 08/12] xfs: implement the F_IOINFO fcntl Christoph Hellwig
@ 2017-02-28 14:57 ` Christoph Hellwig
  2017-02-28 14:57 ` [PATCH 10/12] block_dev: set REQ_NOMERGE for O_ATOMIC writes Christoph Hellwig
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-block

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-settings.c   | 22 ++++++++++++++++++++++
 block/blk-sysfs.c      | 12 ++++++++++++
 include/linux/blkdev.h |  9 +++++++++
 3 files changed, 43 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 529e55f52a03..9279542472fb 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -93,6 +93,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->virt_boundary_mask = 0;
 	lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
 	lim->max_sectors = lim->max_hw_sectors = BLK_SAFE_MAX_SECTORS;
+	lim->max_atomic_write_sectors = 0;
 	lim->max_dev_sectors = 0;
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
@@ -129,6 +130,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->discard_zeroes_data = 1;
 	lim->max_segments = USHRT_MAX;
 	lim->max_hw_sectors = UINT_MAX;
+	lim->max_atomic_write_sectors = 0;
 	lim->max_segment_size = UINT_MAX;
 	lim->max_sectors = UINT_MAX;
 	lim->max_dev_sectors = UINT_MAX;
@@ -258,6 +260,24 @@ void blk_queue_max_hw_sectors(struct request_queue *q, unsigned int max_hw_secto
 EXPORT_SYMBOL(blk_queue_max_hw_sectors);
 
 /**
+ * blk_queue_max_atomic_write_sectors - maximum sectors written atomically
+ * @q:  the request queue for the device
+ * @max_atomic_write_sectors:  max atomic write sectors in the usual 512b unit
+ *
+ * Description:
+ *    Enables a low level driver to advertise that it supports writing
+ *    multi-sector I/O atomically.  If the driver has any requirements
+ *    in addition to the maximum size it should not set this field to
+ *    indicate that it supports multi-sector atomic writes.
+ **/
+void blk_queue_max_atomic_write_sectors(struct request_queue *q,
+		unsigned int max_atomic_write_sectors)
+{
+	q->limits.max_atomic_write_sectors = max_atomic_write_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_atomic_write_sectors);
+
+/**
  * blk_queue_chunk_sectors - set size of the chunk for this queue
  * @q:  the request queue for the device
  * @chunk_sectors:  chunk sectors in the usual 512b unit
@@ -541,6 +561,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
 	t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
 	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
+	/* no support for stacking atomic writes */
+	t->max_atomic_write_sectors = 0;
 	t->max_write_same_sectors = min(t->max_write_same_sectors,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1dbce057592d..2f39009731f6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -249,6 +249,12 @@ static ssize_t queue_max_hw_sectors_show(struct request_queue *q, char *page)
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_max_atomic_write_sectors_show(struct request_queue *q,
+		char *page)
+{
+	return queue_var_show(queue_max_atomic_write_sectors(q) >> 1, page);
+}
+
 #define QUEUE_SYSFS_BIT_FNS(name, flag, neg)				\
 static ssize_t								\
 queue_show_##name(struct request_queue *q, char *page)			\
@@ -540,6 +546,11 @@ static struct queue_sysfs_entry queue_max_hw_sectors_entry = {
 	.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_max_atomic_write_sectors_entry = {
+	.attr = {.name = "max_atomic_write_sectors_kb", .mode = S_IRUGO },
+	.show = queue_max_atomic_write_sectors_show,
+};
+
 static struct queue_sysfs_entry queue_max_segments_entry = {
 	.attr = {.name = "max_segments", .mode = S_IRUGO },
 	.show = queue_max_segments_show,
@@ -695,6 +706,7 @@ static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
+	&queue_max_atomic_write_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
 	&queue_max_segments_entry.attr,
 	&queue_max_integrity_segments_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1ca8e8fd1078..c43d952557f9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -323,6 +323,7 @@ struct queue_limits {
 	unsigned int		alignment_offset;
 	unsigned int		io_min;
 	unsigned int		io_opt;
+	unsigned int		max_atomic_write_sectors;
 	unsigned int		max_discard_sectors;
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
@@ -1135,6 +1136,8 @@ extern void blk_cleanup_queue(struct request_queue *);
 extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
 extern void blk_queue_bounce_limit(struct request_queue *, u64);
 extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
+extern void blk_queue_max_atomic_write_sectors(struct request_queue *,
+		unsigned int);
 extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int);
 extern void blk_queue_max_segments(struct request_queue *, unsigned short);
 extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
@@ -1371,6 +1374,12 @@ static inline unsigned int queue_max_hw_sectors(struct request_queue *q)
 	return q->limits.max_hw_sectors;
 }
 
+static inline unsigned int queue_max_atomic_write_sectors(
+		struct request_queue *q)
+{
+	return q->limits.max_atomic_write_sectors;
+}
+
 static inline unsigned short queue_max_segments(struct request_queue *q)
 {
 	return q->limits.max_segments;
-- 
2.11.0
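
The sysfs attribute name ends in _kb while the limit itself is kept in
512-byte sectors.  Two such sectors make one KiB, so converting from
sectors to KiB is a right shift by one.  A small userspace sketch of
that conversion, with a helper name invented for the illustration:

```c
#include <assert.h>

/*
 * Hypothetical helper (not in the patch): convert a count of 512-byte
 * sectors to KiB.  Two sectors make one KiB, hence the shift by one.
 */
static unsigned int sectors_to_kib(unsigned int sectors)
{
	return sectors >> 1;
}
```

For example, a limit of 64 sectors corresponds to 32 KiB.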

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 10/12] block_dev: set REQ_NOMERGE for O_ATOMIC writes
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
                   ` (8 preceding siblings ...)
  2017-02-28 14:57 ` [PATCH 09/12] block: advertize max atomic write limit Christoph Hellwig
@ 2017-02-28 14:57 ` Christoph Hellwig
  2017-02-28 14:57 ` [PATCH 11/12] block_dev: implement the F_IOINFO fcntl Christoph Hellwig
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-block

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/block_dev.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3c47614a4b32..4dd5c54cdefb 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -242,6 +242,10 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		task_io_account_write(ret);
 	}
 
+	/* don't merge atomic requests to avoid going over the limit */
+	if (iocb->ki_filp->f_flags & O_ATOMIC)
+		bio.bi_opf |= REQ_NOMERGE;
+
 	qc = submit_bio(&bio);
 	for (;;) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
@@ -377,6 +381,10 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 			task_io_account_write(bio->bi_iter.bi_size);
 		}
 
+		/* don't merge atomic requests to avoid going over the limit */
+		if (iocb->ki_filp->f_flags & O_ATOMIC)
+			bio->bi_opf |= REQ_NOMERGE;
+
 		dio->size += bio->bi_iter.bi_size;
 		pos += bio->bi_iter.bi_size;
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 11/12] block_dev: implement the F_IOINFO fcntl
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
                   ` (9 preceding siblings ...)
  2017-02-28 14:57 ` [PATCH 10/12] block_dev: set REQ_NOMERGE for O_ATOMIC writes Christoph Hellwig
@ 2017-02-28 14:57 ` Christoph Hellwig
  2017-02-28 14:57 ` [PATCH 12/12] nvme: export the atomic write limit Christoph Hellwig
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-block

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/block_dev.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4dd5c54cdefb..48a799964e1d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2116,6 +2116,26 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 					     end >> PAGE_SHIFT);
 }
 
+static int blkdev_ioinfo(struct file *file, struct fcntl_ioinfo *fio)
+{
+	struct block_device *bdev = I_BDEV(bdev_file_inode(file));
+	struct request_queue *q = bdev_get_queue(bdev);
+	unsigned int atomic_sectors = queue_max_atomic_write_sectors(q);
+
+	if (file->f_flags & O_DIRECT) {
+		fio->fio_alignment = bdev_logical_block_size(bdev);
+
+		if ((file->f_flags & O_ATOMIC) && atomic_sectors) {
+			fio->fio_flags = FIO_FL_ATOMIC_OSYNC;
+			fio->fio_max_atomic = (atomic_sectors << 9);
+			if (fio->fio_alignment)
+				fio->fio_max_atomic &= ~(fio->fio_alignment - 1);
+		}
+	}
+
+	return 0;
+}
+
 const struct file_operations def_blk_fops = {
 	.open		= blkdev_open,
 	.release	= blkdev_close,
@@ -2131,6 +2151,7 @@ const struct file_operations def_blk_fops = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= blkdev_fallocate,
+	.ioinfo		= blkdev_ioinfo,
 };
 
 int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 12/12] nvme: export the atomic write limit
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
                   ` (10 preceding siblings ...)
  2017-02-28 14:57 ` [PATCH 11/12] block_dev: implement the F_IOINFO fcntl Christoph Hellwig
@ 2017-02-28 14:57 ` Christoph Hellwig
  2017-02-28 20:48   ` Chris Mason
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, linux-block

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 10 ++++++++++
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 11 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8a3c3e32a704..e86d07589f18 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -926,6 +926,15 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	ns->pi_type = pi_type;
 	blk_queue_logical_block_size(ns->queue, bs);
 
+	/*
+	 * Advertise the maximum atomic write size.  Don't bother with the
+	 * per-namespace values due to their alignment constraints.
+	 */
+	if (ns->ctrl->awupf > 1) {
+		blk_queue_max_atomic_write_sectors(ns->queue,
+			(ns->ctrl->awupf + 1) << (ns->lba_shift - 9));
+	}
+
 	if (ns->ms && !blk_get_integrity(disk) && !ns->ext)
 		nvme_init_integrity(ns);
 	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk))
@@ -1232,6 +1241,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 
 	ctrl->vid = le16_to_cpu(id->vid);
 	ctrl->oncs = le16_to_cpup(&id->oncs);
+	ctrl->awupf = le16_to_cpu(id->awupf);
 	atomic_set(&ctrl->abort_limit, id->acl + 1);
 	ctrl->vwc = id->vwc;
 	ctrl->cntlid = le16_to_cpup(&id->cntlid);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index aead6d08ed2c..020ffd6f7863 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -143,6 +143,7 @@ struct nvme_ctrl {
 	u32 vs;
 	u32 sgls;
 	u16 kas;
+	u16 awupf;
 	unsigned int kato;
 	bool subsystem;
 	unsigned long quirks;
-- 
2.11.0
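
As background for the expression passed to
blk_queue_max_atomic_write_sectors(): AWUPF is reported in logical
blocks as a 0's based value, so one is added before scaling by the
logical block size to reach 512-byte sectors.  A userspace sketch of
the same computation, assuming lba_shift is at least 9;
awupf_to_sectors() is an illustrative name only:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the expression in the patch: convert the
 * 0's based AWUPF value (in logical blocks) to 512-byte sectors.
 */
static unsigned int awupf_to_sectors(uint16_t awupf, unsigned int lba_shift)
{
	assert(lba_shift >= 9);
	return ((unsigned int)awupf + 1) << (lba_shift - 9);
}
```

With 4096-byte logical blocks (lba_shift of 12) and an AWUPF of 7, this
gives 8 blocks, or 64 sectors, which is 32 KiB.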

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 05/12] fs: add a F_IOINFO fcntl
  2017-02-28 14:57 ` [PATCH 05/12] fs: add a F_IOINFO fcntl Christoph Hellwig
@ 2017-02-28 16:51   ` Darrick J. Wong
  2017-03-01 15:11     ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2017-02-28 16:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs, linux-block

On Tue, Feb 28, 2017 at 06:57:30AM -0800, Christoph Hellwig wrote:
> This fcntl can be used to query I/O parameters for the given file
> descriptor.  Initially it is used for the I/O alignment and atomic
> write parameters.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/fcntl.c                 | 18 ++++++++++++++++++
>  include/linux/fs.h         |  1 +
>  include/uapi/linux/fcntl.h | 16 ++++++++++++++++
>  3 files changed, 35 insertions(+)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index ca5d228be7ea..248fb4cc66a6 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -241,6 +241,21 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
>  }
>  #endif
>  
> +static int fcntl_ioinfo(struct file *file, void __user *argp)
> +{
> +	struct fcntl_ioinfo fio = { 0, };
> +
> +	if (file->f_op->ioinfo) {
> +		int ret = file->f_op->ioinfo(file, &fio);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (copy_to_user(argp, &fio, sizeof(fio)))
> +		return -EFAULT;
> +	return 0;
> +}
> +
>  static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>  		struct file *filp)
>  {
> @@ -335,6 +350,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>  	case F_GET_SEALS:
>  		err = shmem_fcntl(filp, cmd, arg);
>  		break;
> +	case F_IOINFO:
> +		err = fcntl_ioinfo(filp, (void __user *)arg);
> +		break;
>  	default:
>  		break;
>  	}
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2ba074328894..33b08a8c2bc3 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1680,6 +1680,7 @@ struct file_operations {
>  			u64);
>  	ssize_t (*dedupe_file_range)(struct file *, u64, u64, struct file *,
>  			u64);
> +	int (*ioinfo)(struct file *, struct fcntl_ioinfo *);
>  };
>  
>  struct inode_operations {
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index beed138bd359..6b0aaba7c623 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -42,6 +42,22 @@
>  #define F_SEAL_WRITE	0x0008	/* prevent writes */
>  /* (1U << 31) is reserved for signed error codes */
>  
> +
> +#define F_IOINFO	(F_LINUX_SPECIFIC_BASE +  11)
> +
> +struct fcntl_ioinfo {
> +	__u16		fio_flags;	/* FIO_FL_* */
> +	__u16		fio_alignment;	/* required I/O alignment on disk */

Hm... is fio_alignment specified in units of bytes?  If so, then
shouldn't this be a __u32 so that we can handle some weird future device
that wants, say, 1MB alignment for its atomic IO?

I'm not sure there /are/ such weird devices, and the current patchset
assumes (probably sanely) that atomic requests only have to be
lba-aligned, but otoh this is a userland field and we have plenty of
reserved space.

Though, now that I look at the XFS ioinfo patch, I guess fio_alignment
is set only for O_DIRECT files?  So it's really the required alignment
for directio operations.

(Now I think I'm simply confused about this field.)

--D

> +	__u32		__reserved1;	/* must be zero */
> +	__u64		fio_max_atomic;	/* max size for atomic writes */
> +	__u64		__reserved2[14];/* must be zero */
> +};
> +
> +/* supports atomic writes using O_(D)SYNC */
> +#define FIO_FL_ATOMIC_OSYNC	(1 << 0)
> +/* supports atomic writes committed using fsync/fdatasync/msync */
> +#define FIO_FL_ATOMIC_FSYNC	(1 << 1)
> +
>  /*
>   * Types of directory notifications that may be requested.
>   */
> -- 
> 2.11.0
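
To make the size question above concrete, here is a userspace mirror of
the proposed structure.  The field names come from the quoted patch;
the stdint types, the _sketch suffix and the alignment_fits_u16()
helper are assumptions made for this illustration.  The reserved fields
pad the structure to 128 bytes, and a 16-bit fio_alignment cannot
represent any alignment of 64 KiB or more, such as the 1MB case
mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the proposed layout; not the kernel definition. */
struct fcntl_ioinfo_sketch {
	uint16_t fio_flags;	/* FIO_FL_* */
	uint16_t fio_alignment;	/* required I/O alignment */
	uint32_t reserved1;	/* must be zero */
	uint64_t fio_max_atomic;	/* max size for atomic writes */
	uint64_t reserved2[14];	/* must be zero */
};

/* 2 + 2 + 4 + 8 + 14 * 8 = 128 bytes, with no implicit padding. */
_Static_assert(sizeof(struct fcntl_ioinfo_sketch) == 128,
	       "unexpected structure size");

/* Hypothetical check: can this byte alignment be stored in a __u16? */
static bool alignment_fits_u16(uint32_t alignment)
{
	return alignment <= UINT16_MAX;
}
```

Widening fio_alignment to 32 bits could be done by taking space from
the first reserved field, which would leave the overall size unchanged.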

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] failure atomic writes for file systems and block devices
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
@ 2017-02-28 20:48   ` Chris Mason
  2017-02-28 14:57 ` [PATCH 02/12] iomap: pass IOMAP_* flags to actors Christoph Hellwig
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Chris Mason @ 2017-02-28 20:48 UTC (permalink / raw)
  To: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-block



On 02/28/2017 09:57 AM, Christoph Hellwig wrote:
> Hi all,
>
> this series implements a new O_ATOMIC flag for failure atomic writes
> to files.  It is based on and tries to unify two earlier proposals,
> the first one for block devices by Chris Mason:
>
> 	https://lwn.net/Articles/573092/
>
> and the second one for regular files, published by HP Research at
> Usenix FAST 2015:
>
> 	https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma
>
> It adds a new O_ATOMIC flag for open, which requests writes to be
> failure-atomic, that is either the whole write makes it to persistent
> storage, or none of it, even in case of power or other failures.
>
> There are two implementation variants of this:  on block devices O_ATOMIC
> must be combined with O_(D)SYNC so that storage devices that can handle
> large writes atomically can simply do that without any additional work.
> This case is supported by NVMe.
>

Hi Christoph,

This is great, and supporting code in both dio and bio gets rid of some
of the warts from when I tried.  The DIO_PAGES define used to be an
upper limit on the max contiguous bio that would get built, but that's
much better now.

One thing that isn't clear to me is how we're dealing with boundary bio
mappings, which will get submitted by submit_page_section():

sdio->boundary = buffer_boundary(map_bh);

In btrfs I'd just chain things together and do the extent pointer swap
afterwards, but I didn't follow the XFS code well enough to see how it's
handled there.  But either way it feels like an error-prone surprise
waiting for later, and one gap we really want to get right in the FS
support is O_ATOMIC across a fragmented extent.

If I'm reading the XFS patches right, the code always COWs for atomic.
Are you planning on adding an optimization to use atomic support in the
device to skip COW when possible?

To turn off MySQL double buffering, we only need 16K or 64K writes,
which most of the time you'd be able to pass down directly without COWs.

-chris

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 07/12] xfs: implement failure-atomic writes
  2017-02-28 14:57 ` [PATCH 07/12] xfs: implement failure-atomic writes Christoph Hellwig
@ 2017-02-28 23:09   ` Darrick J. Wong
  2017-03-01 15:17     ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2017-02-28 23:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs, linux-block

On Tue, Feb 28, 2017 at 06:57:32AM -0800, Christoph Hellwig wrote:
> If O_ATOMIC is specified in the open flags this will cause XFS to
> allocate new extents in the COW for even if overwriting existing data,

"COW fork"                    ^^^^^^^

The previous patch's commit message also has that quirk.

> and not remap them into the data fork until ->fsync is called,
> at which point the whole range will be atomically remapped into the
> data fork.  This allows applications to safely overwrite data instead
> of having to do double writes.

By the way, the copy on write code remembers the extents it has
allocated for CoW staging in the refcount btree so that it can free them
after a crash, which means that O_ATOMIC requires reflink to be enabled.
There doesn't seem to be any explicit checking that reflink is even
enabled, which will probably just lead to weird crashes on a pre-reflink
xfs.

FWIW I didn't see any checking anywhere (vfs or xfs) that the filesystem
can actually support O_ATOMIC.  If the FS doesn't support atomic writes,
shouldn't the kernel send EINVAL or something back to userspace?

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_aops.c    | 18 +++++++++-----
>  fs/xfs/xfs_aops.h    |  4 ++-
>  fs/xfs/xfs_file.c    | 17 +++++++++++++
>  fs/xfs/xfs_iomap.c   | 18 ++++++++------
>  fs/xfs/xfs_reflink.c | 69 ++++++++++++++++++++++++++++++++++------------------
>  fs/xfs/xfs_reflink.h |  5 ++--
>  6 files changed, 91 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index c78b585b3d84..1c5efbb05b47 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -292,6 +292,7 @@ xfs_end_io(
>  	if (unlikely(error)) {
>  		switch (ioend->io_type) {
>  		case XFS_IO_COW:
> +		case XFS_IO_ATOMIC:

So we cancel the CoW staging blocks if the write was atomic and failed.

Later in the !error case we remap the blocks if it was a cow write, but
leave the mapping in memory if the write was atomic.  That is consistent
with the commit message, good.
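
The completion behaviour described above can be summarised as a small
decision table.  This is a toy model only, not the kernel code: the enum
names and end_io_action() are made up for illustration and merely mirror
the three cases discussed here (cancel on error, remap now for CoW, keep
staged until fsync for atomic).

```c
/* Toy model of the I/O completion decision; all names are hypothetical. */
enum toy_io_type { TOY_IO_OVERWRITE, TOY_IO_COW, TOY_IO_ATOMIC };

enum toy_end_action {
	TOY_ACT_NONE,		/* nothing staged, nothing to do */
	TOY_ACT_CANCEL_STAGING,	/* drop the out-of-place blocks */
	TOY_ACT_REMAP_NOW,	/* move staged blocks into the data fork */
	TOY_ACT_KEEP_STAGED,	/* leave staged blocks until fsync */
};

enum toy_end_action end_io_action(enum toy_io_type type, int error)
{
	if (error) {
		/* Failed CoW and atomic writes both throw away their staging. */
		if (type == TOY_IO_COW || type == TOY_IO_ATOMIC)
			return TOY_ACT_CANCEL_STAGING;
		return TOY_ACT_NONE;
	}

	switch (type) {
	case TOY_IO_COW:
		return TOY_ACT_REMAP_NOW;
	case TOY_IO_ATOMIC:
		/* Remapping is deferred: fsync does it for the whole range. */
		return TOY_ACT_KEEP_STAGED;
	default:
		return TOY_ACT_NONE;
	}
}
```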

At the start of xfs_reflink.c is a long block comment describing how the
copy on write mechanism works.  Since O_ATOMIC is a variant on CoW (it's
basically CoW with remapping deferred until fsync), please update the
comment so that the comments capture the details of how atomic writes
work.

(IOWs: Dave asked me to leave the big comment, so I'm going to try to
keep it fairly up to date.)

>  			xfs_reflink_cancel_cow_range(ip, offset, size, 0);
>  			break;
>  		}
> @@ -327,7 +328,9 @@ xfs_end_bio(
>  	struct xfs_ioend	*ioend = bio->bi_private;
>  	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
>  
> -	if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_type == XFS_IO_COW)
> +	if (ioend->io_type == XFS_IO_UNWRITTEN ||
> +	    ioend->io_type == XFS_IO_COW ||
> +	    ioend->io_type == XFS_IO_ATOMIC)
>  		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
>  	else if (ioend->io_append_trans)
>  		queue_work(mp->m_data_workqueue, &ioend->io_work);
> @@ -354,6 +357,7 @@ xfs_map_blocks(
>  		return -EIO;
>  
>  	ASSERT(type != XFS_IO_COW);
> +	ASSERT(type != XFS_IO_ATOMIC);
>  	if (type == XFS_IO_UNWRITTEN)
>  		bmapi_flags |= XFS_BMAPI_IGSTATE;
>  
> @@ -768,7 +772,8 @@ xfs_map_cow(
>  	struct xfs_writepage_ctx *wpc,
>  	struct inode		*inode,
>  	loff_t			offset,
> -	unsigned int		*new_type)
> +	unsigned int		*new_type,
> +	bool			atomic)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_bmbt_irec	imap;
> @@ -778,10 +783,10 @@ xfs_map_cow(
>  	/*
>  	 * If we already have a valid COW mapping keep using it.
>  	 */
> -	if (wpc->io_type == XFS_IO_COW) {
> +	if (wpc->io_type == XFS_IO_COW || wpc->io_type == XFS_IO_ATOMIC) {
>  		wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap, offset);
>  		if (wpc->imap_valid) {
> -			*new_type = XFS_IO_COW;
> +			*new_type = wpc->io_type;
>  			return 0;
>  		}
>  	}
> @@ -807,7 +812,7 @@ xfs_map_cow(
>  			return error;
>  	}
>  
> -	wpc->io_type = *new_type = XFS_IO_COW;
> +	wpc->io_type = *new_type = atomic ? XFS_IO_ATOMIC : XFS_IO_COW;
>  	wpc->imap_valid = true;
>  	wpc->imap = imap;
>  	return 0;
> @@ -886,7 +891,8 @@ xfs_writepage_map(
>  		}
>  
>  		if (XFS_I(inode)->i_cowfp) {
> -			error = xfs_map_cow(wpc, inode, offset, &new_type);
> +			error = xfs_map_cow(wpc, inode, offset, &new_type,
> +					buffer_atomic(bh));
>  			if (error)
>  				goto out;
>  		}
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index cc174ec6c2fd..798e653e68b6 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -29,6 +29,7 @@ enum {
>  	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
>  	XFS_IO_OVERWRITE,	/* covers already allocated extent */
>  	XFS_IO_COW,		/* covers copy-on-write extent */
> +	XFS_IO_ATOMIC,		/* atomic write */
>  };
>  
>  #define XFS_IO_TYPES \
> @@ -36,7 +37,8 @@ enum {
>  	{ XFS_IO_DELALLOC,		"delalloc" }, \
>  	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
>  	{ XFS_IO_OVERWRITE,		"overwrite" }, \
> -	{ XFS_IO_COW,			"CoW" }
> +	{ XFS_IO_COW,			"CoW" }, \
> +	{ XFS_IO_ATOMIC,		"atomic" }
>  
>  /*
>   * Structure for buffered I/O completions.
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 086440e79b86..a7d8324b59c5 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -160,6 +160,12 @@ xfs_file_fsync(
>  	else if (mp->m_logdev_targp != mp->m_ddev_targp)
>  		xfs_blkdev_issue_flush(mp->m_ddev_targp);
>  
> +	if (file->f_flags & O_ATOMIC) {
> +		error = xfs_reflink_end_cow(ip, start, end - start + 1);
> +		if (error)
> +			return error;
> +	}

I suppose it goes without saying that userspace will have to coordinate
its O_ATOMIC writes to the file.  What if this happens?

Process A			Process B

Open atomic file		Open atomic file
Dirty some pages		Dirty some other pages
fsync
Successful fsync return
Dirty more pages
				Dirty more of some other pages
<system crash>

When we come back up, the file contents will reflect everything A wrote
up to fsync, and (since fsync flushed everything) everything B wrote in
"dirty some other pages", even though it hadn't reached fsync.  Won't
this be surprising to B since it expected that disk mappings don't get
updated until it fsyncs?

Practically speaking, I wonder how often this will come up in the real
world, but it does seem to be a potential downside.  Per *file tracking
sounds like a bigger bookkeeping nightmare.
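
To make the scenario concrete, here is a toy userspace model of it.  It
assumes staging is tracked per inode and that fsync from any opener
remaps everything staged, which is how the scenario above reads; the
struct and function names are invented for this sketch and do not exist
in XFS.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_NPAGES 8

/* One file: committed contents plus a per-inode staging area. */
struct toy_file {
	char committed[TOY_NPAGES];	/* what survives a crash */
	char staged[TOY_NPAGES];	/* written out of place, not remapped */
	bool has_staged[TOY_NPAGES];
};

void toy_init(struct toy_file *f)
{
	memset(f->committed, '.', sizeof(f->committed));
	memset(f->staged, 0, sizeof(f->staged));
	memset(f->has_staged, 0, sizeof(f->has_staged));
}

/* A write by any process only touches the staging area. */
void toy_write(struct toy_file *f, int page, char data)
{
	f->staged[page] = data;
	f->has_staged[page] = true;
}

/* fsync by any process remaps every staged page, whoever wrote it. */
void toy_fsync(struct toy_file *f)
{
	for (int i = 0; i < TOY_NPAGES; i++) {
		if (f->has_staged[i]) {
			f->committed[i] = f->staged[i];
			f->has_staged[i] = false;
		}
	}
}

/* A crash discards whatever was still staged. */
void toy_crash(struct toy_file *f)
{
	memset(f->has_staged, 0, sizeof(f->has_staged));
}
```

Replaying the table above (A writes, B writes, A fsyncs, both write
again, crash) leaves B's first write committed even though B never
called fsync, while both later writes are lost.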

> +
>  	/*
>  	 * All metadata updates are logged, which means that we just have to
>  	 * flush the log up to the latest LSN that touched the inode. If we have
> @@ -457,6 +463,9 @@ xfs_dio_write_end_io(
>  	}
>  	spin_unlock(&ip->i_flags_lock);
>  
> +	if (iocb->ki_filp->f_flags & O_ATOMIC)
> +		return 0;
> +
>  	if (flags & IOMAP_DIO_COW) {
>  		error = xfs_reflink_end_cow(ip, offset, size);
>  		if (error)
> @@ -529,6 +538,12 @@ xfs_file_dio_aio_write(
>  		unaligned_io = 1;
>  
>  		/*
> +		 * We need filesystem block alignment to provide atomic commits.
> +		 */
> +		if (file->f_flags & O_ATOMIC)
> +			return -EINVAL;
> +
> +		/*
>  		 * We can't properly handle unaligned direct I/O to reflink
>  		 * files yet, as we can't unshare a partial block.
>  		 */
> @@ -892,6 +907,8 @@ xfs_file_open(
>  		return -EFBIG;
>  	if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb)))
>  		return -EIO;
> +	if (file->f_flags & O_ATOMIC)
> +		printk_ratelimited("O_ATOMIC!\n");

Per above,

if (file->f_flags & O_ATOMIC) {
	if (!xfs_sb_version_hasreflink(...))
		return -EPROTONOSUPPORT;
	printk_ratelimited("EXPERIMENTAL atomic writes feature in use!\n");
}

>  	return 0;
>  }
>  
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 5d68b4279016..b686a6bd2db4 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -559,13 +559,14 @@ xfs_file_iomap_begin_delay(
>  
>  	eof = !xfs_iext_lookup_extent(ip, ifp, offset_fsb, &idx, &got);
>  	if (!eof && got.br_startoff <= offset_fsb) {
> -		if (xfs_is_reflink_inode(ip)) {
> +		if ((flags & IOMAP_ATOMIC) || xfs_is_reflink_inode(ip)) {
>  			bool		shared;
>  
>  			end_fsb = min(XFS_B_TO_FSB(mp, offset + count),
>  					maxbytes_fsb);
>  			xfs_trim_extent(&got, offset_fsb, end_fsb - offset_fsb);
> -			error = xfs_reflink_reserve_cow(ip, &got, &shared);
> +			error = xfs_reflink_reserve_cow(ip, &got, &shared,
> +					(flags & IOMAP_ATOMIC));
>  			if (error)
>  				goto out_unlock;
>  		}
> @@ -951,7 +952,7 @@ static inline bool need_excl_ilock(struct xfs_inode *ip, unsigned flags)
>  	 */
>  	if (xfs_is_reflink_inode(ip) && (flags & (IOMAP_WRITE | IOMAP_ZERO)))
>  		return true;
> -	if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE))
> +	if ((flags & (IOMAP_DIRECT | IOMAP_ATOMIC)) && (flags & IOMAP_WRITE))
>  		return true;
>  	return false;
>  }
> @@ -976,7 +977,8 @@ xfs_file_iomap_begin(
>  		return -EIO;
>  
>  	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
> -			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> +	    ((flags & IOMAP_ATOMIC) ||
> +	     (!IS_DAX(inode) && !xfs_get_extsz_hint(ip)))) {
>  		/* Reserve delalloc blocks for regular writeback. */
>  		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
>  				iomap);
> @@ -1008,15 +1010,17 @@ xfs_file_iomap_begin(
>  			goto out_unlock;
>  	}
>  
> -	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
> +	    ((flags & IOMAP_ATOMIC) || xfs_is_reflink_inode(ip))) {
>  		if (flags & IOMAP_DIRECT) {
>  			/* may drop and re-acquire the ilock */
>  			error = xfs_reflink_allocate_cow(ip, &imap, &shared,
> -					&lockmode);
> +					&lockmode, flags & IOMAP_ATOMIC);
>  			if (error)
>  				goto out_unlock;
>  		} else {
> -			error = xfs_reflink_reserve_cow(ip, &imap, &shared);
> +			error = xfs_reflink_reserve_cow(ip, &imap, &shared,
> +					false);
>  			if (error)
>  				goto out_unlock;
>  		}
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 4225b5e67b17..4702dd800ab8 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -264,9 +264,9 @@ int
>  xfs_reflink_reserve_cow(
>  	struct xfs_inode	*ip,
>  	struct xfs_bmbt_irec	*imap,
> -	bool			*shared)
> +	bool			*shared,
> +	bool			always_cow)
>  {
> -	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
>  	struct xfs_bmbt_irec	got;
>  	int			error = 0;
>  	bool			eof = false, trimmed;
> @@ -280,26 +280,30 @@ xfs_reflink_reserve_cow(
>  	 * extent list is generally faster than going out to the shared extent
>  	 * tree.
>  	 */
> -
> -	if (!xfs_iext_lookup_extent(ip, ifp, imap->br_startoff, &idx, &got))
> +	if (!ip->i_cowfp) {
> +		ASSERT(always_cow);
> +		xfs_ifork_init_cow(ip);
>  		eof = true;
> -	if (!eof && got.br_startoff <= imap->br_startoff) {
> -		trace_xfs_reflink_cow_found(ip, imap);
> -		xfs_trim_extent(imap, got.br_startoff, got.br_blockcount);
> +	} else {
> +		if (!xfs_iext_lookup_extent(ip, ip->i_cowfp, imap->br_startoff,
> +				&idx, &got))
> +			eof = true;
> +		if (!eof && got.br_startoff <= imap->br_startoff) {
> +			trace_xfs_reflink_cow_found(ip, imap);
> +			xfs_trim_extent(imap, got.br_startoff,
> +					got.br_blockcount);
> +
> +			*shared = true;
> +			return 0;
> +		}
>  
> -		*shared = true;
> -		return 0;
> +		/* Trim the mapping to the nearest shared extent boundary. */
> +		error = xfs_reflink_trim_around_shared(ip, imap, shared,
> +				&trimmed);
> +		if (error || !*shared)
> +			return error;
>  	}
>  
> -	/* Trim the mapping to the nearest shared extent boundary. */
> -	error = xfs_reflink_trim_around_shared(ip, imap, shared, &trimmed);
> -	if (error)
> -		return error;
> -
> -	/* Not shared?  Just report the (potentially capped) extent. */
> -	if (!*shared)
> -		return 0;
> -
>  	/*
>  	 * Fork all the shared blocks from our write offset until the end of
>  	 * the extent.
> @@ -383,7 +387,8 @@ xfs_reflink_allocate_cow(
>  	struct xfs_inode	*ip,
>  	struct xfs_bmbt_irec	*imap,
>  	bool			*shared,
> -	uint			*lockmode)
> +	uint			*lockmode,
> +	bool			always_cow)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  	xfs_fileoff_t		offset_fsb = imap->br_startoff;
> @@ -399,15 +404,19 @@ xfs_reflink_allocate_cow(
>  	xfs_extnum_t		idx;
>  
>  retry:
> -	ASSERT(xfs_is_reflink_inode(ip));
> +	ASSERT(always_cow | xfs_is_reflink_inode(ip));
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED));
>  
> +	if (!ip->i_cowfp) {
> +		ASSERT(always_cow);
> +		xfs_ifork_init_cow(ip);
> +
>  	/*
>  	 * Even if the extent is not shared we might have a preallocation for
>  	 * it in the COW fork.  If so use it.
>  	 */
> -	if (xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &idx, &got) &&
> -	    got.br_startoff <= offset_fsb) {
> +	} else if (xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &idx,
> +			&got) && got.br_startoff <= offset_fsb) {
>  		*shared = true;
>  
>  		/* If we have a real allocation in the COW fork we're done. */
> @@ -418,7 +427,7 @@ xfs_reflink_allocate_cow(
>  		}
>  
>  		xfs_trim_extent(imap, got.br_startoff, got.br_blockcount);
> -	} else {
> +	} else if (!always_cow) {
>  		error = xfs_reflink_trim_around_shared(ip, imap, shared, &trimmed);
>  		if (error || !*shared)
>  			goto out;
> @@ -684,6 +693,7 @@ xfs_reflink_end_cow(
>  	xfs_fileoff_t			offset_fsb;
>  	xfs_fileoff_t			end_fsb;
>  	xfs_fsblock_t			firstfsb;
> +	xfs_off_t			new_size;
>  	struct xfs_defer_ops		dfops;
>  	int				error;
>  	unsigned int			resblks;
> @@ -693,7 +703,7 @@ xfs_reflink_end_cow(
>  	trace_xfs_reflink_end_cow(ip, offset, count);
>  
>  	/* No COW extents?  That's easy! */
> -	if (ifp->if_bytes == 0)
> +	if (!ifp || ifp->if_bytes == 0)
>  		return 0;
>  
>  	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> @@ -776,6 +786,17 @@ xfs_reflink_end_cow(
>  			break;
>  	}
>  
> +	/*
> +	 * Update the on-disk inode size if we completed an operation outside
> +	 * of the inode size.  This can only happen for atomic writes, and not
> +	 * for actual reflinked files.
> +	 */
> +	new_size = xfs_new_eof(ip, offset + count);
> +	if (new_size) {
> +		ip->i_d.di_size = new_size;
> +		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> +	}
> +
>  	error = xfs_trans_commit(tp);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	if (error)
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index 9416279b3c89..0360e2c0f3a5 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -27,9 +27,10 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
>  		struct xfs_bmbt_irec *irec, bool *shared, bool *trimmed);
>  
>  extern int xfs_reflink_reserve_cow(struct xfs_inode *ip,
> -		struct xfs_bmbt_irec *imap, bool *shared);
> +		struct xfs_bmbt_irec *imap, bool *shared, bool always_cow);
>  extern int xfs_reflink_allocate_cow(struct xfs_inode *ip,
> -		struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode);
> +		struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode,
> +		bool always_cow);

manpages/xfstests needed, but the rest of this looks more or less sane.

--D

>  extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
>  		xfs_off_t count);
>  extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] failure atomic writes for file systems and block devices
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
                   ` (12 preceding siblings ...)
  2017-02-28 20:48   ` Chris Mason
@ 2017-02-28 23:22 ` Darrick J. Wong
  2017-03-01 15:09   ` Christoph Hellwig
  2017-03-01 11:21 ` Amir Goldstein
  14 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2017-02-28 23:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs, linux-block

On Tue, Feb 28, 2017 at 06:57:25AM -0800, Christoph Hellwig wrote:
> Hi all,
> 
> this series implements a new O_ATOMIC flag for failure atomic writes
> to files.   It is based on and tries to unify two earlier proposals,
> the first one for block devices by Chris Mason:
> 
> 	https://lwn.net/Articles/573092/
> 
> and the second one for regular files, published by HP Research at
> Usenix FAST 2015:
> 
> 	https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma
> 
> It adds a new O_ATOMIC flag for open, which requests writes to be
> failure-atomic, that is either the whole write makes it to persistent
> storage, or none of it, even in case of power or other failures.
> 
> There are two implementation variants of this:  on block devices O_ATOMIC
> must be combined with O_(D)SYNC so that storage devices that can handle
> large writes atomically can simply do that without any additional work.
> This case is supported by NVMe.
> 
> The second case is for file systems, where we simply write new blocks
> out of place and then remap them into the file atomically on either
> completion of an O_(D)SYNC write or when fsync is called explicitly.
> 
> The semantics of the latter case are explained in detail at the Usenix
> paper above.

(Assuming there's no syncv involved here...?)

> Last but not least a new fcntl is implemented to provide information
> about I/O restrictions such as alignment requirements and the maximum
> atomic write size.
> 
> The implementation is simple and clean, but I'm rather unhappy about
> the interface as it has too many failure modes to bullet proof.  For
> one old kernels ignore unknown open flags silently, so applications

Ok, heh, disregard my review comment (for the xfs part) about the
seemingly insufficient O_ATOMIC validation.

> have to check the F_IOINFO fcntl before, which is a bit of a killer.
> Because of that I've also not implemented any other validity checks
> yet, as they might make things even worse when an open on a not supported
> file system or device fails, but not on an old kernel.  Maybe we need
> a new open version that checks arguments properly first?

Does fcntl(F_SETFL...) suffer from this?

> Also I'm really worried about the NVMe failure modes - devices simply
> advertise an atomic write size, with no way for the device to know
> that the host requested a given write to be atomic, and thus no
> error reporting.

Yikes!

> This is made worse by NVMe 1.2 adding per-namespace
> atomic I/O parameters that devices can use to introduce additional
> odd alignment quirks - while there is some language in the spec
> requiring them not to weaken the per-controller guarantees it all
> looks rather weak and I'm not too confident in all implementations
> getting everything right.
> 
> Last but not least this depends on a few XFS patches, so to actually
> apply / run the patches please use this git tree:

Well, the XFS parts don't look too bad....

--D

> 
>     git://git.infradead.org/users/hch/vfs.git O_ATOMIC
> 
> Gitweb:
> 
>     http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] failure atomic writes for file systems and block devices
  2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
                   ` (13 preceding siblings ...)
  2017-02-28 23:22 ` Darrick J. Wong
@ 2017-03-01 11:21 ` Amir Goldstein
  2017-03-01 15:07     ` Christoph Hellwig
  14 siblings, 1 reply; 25+ messages in thread
From: Amir Goldstein @ 2017-03-01 11:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-xfs, linux-block, linux-api,
	Michael Kerrisk (man-pages)

On Tue, Feb 28, 2017 at 4:57 PM, Christoph Hellwig <hch@lst.de> wrote:
> Hi all,
>
> this series implements a new O_ATOMIC flag for failure atomic writes
> to files.   It is based on and tries to unify two earlier proposals,
> the first one for block devices by Chris Mason:
>
>         https://lwn.net/Articles/573092/
>
> and the second one for regular files, published by HP Research at
> Usenix FAST 2015:
>
>         https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma
>
> It adds a new O_ATOMIC flag for open, which requests writes to be
> failure-atomic, that is either the whole write makes it to persistent
> storage, or none of it, even in case of power or other failures.
>
> There are two implementation variants of this:  on block devices O_ATOMIC
> must be combined with O_(D)SYNC so that storage devices that can handle
> large writes atomically can simply do that without any additional work.
> This case is supported by NVMe.
>
> The second case is for file systems, where we simply write new blocks
> out of place and then remap them into the file atomically on either
> completion of an O_(D)SYNC write or when fsync is called explicitly.
>
> The semantics of the latter case are explained in detail at the Usenix
> paper above.
>
> Last but not least a new fcntl is implemented to provide information
> about I/O restrictions such as alignment requirements and the maximum
> atomic write size.
>
> The implementation is simple and clean, but I'm rather unhappy about
> the interface as it has too many failure modes to bullet proof.  For
> one old kernels ignore unknown open flags silently, so applications
> have to check the F_IOINFO fcntl before, which is a bit of a killer.
> Because of that I've also not implemented any other validity checks
> yet, as they might make things even worse when an open on a not supported
> file system or device fails, but not on an old kernel.  Maybe we need
> a new open version that checks arguments properly first?
>

[CC += linux-api@vger.kernel.org] for that question and for the new API

> Also I'm really worried about the NVMe failure modes - devices simply
> advertise an atomic write size, with no way for the device to know
> that the host requested a given write to be atomic, and thus no
> error reporting.  This is made worse by NVMe 1.2 adding per-namespace
> atomic I/O parameters that devices can use to introduce additional
> odd alignment quirks - while there is some language in the spec
> requiring them not to weaken the per-controller guarantees it all
> looks rather weak and I'm not too confident in all implementations
> getting everything right.
>
> Last but not least this depends on a few XFS patches, so to actually
> apply / run the patches please use this git tree:
>
>     git://git.infradead.org/users/hch/vfs.git O_ATOMIC
>
> Gitweb:
>
>     http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] failure atomic writes for file systems and block devices
  2017-02-28 20:48   ` Chris Mason
  (?)
@ 2017-03-01 15:07   ` Christoph Hellwig
  -1 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-03-01 15:07 UTC (permalink / raw)
  To: Chris Mason; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-block

On Tue, Feb 28, 2017 at 03:48:16PM -0500, Chris Mason wrote:
> One thing that isn't clear to me is how we're dealing with boundary bio 
> mappings, which will get submitted by submit_page_section()
>
> sdio->boundary = buffer_boundary(map_bh);

The old dio code is not supported at all by this code at the moment.
We'll either need the new block dev direct I/O code on block devices,
or we need file system support for out of place writes.  The former
must also limit writes to BIO_MAX_PAGES; not doing so is a bug in this
patchset if people ever have devices with > 1MB atomic write support.
And thanks to NVMe the failure case is silent, sigh..
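
The 1MB figure is just pages-per-bio times page size.  A minimal sketch
of that arithmetic, assuming BIO_MAX_PAGES is 256, pages are 4 KiB and
the buffer is page aligned; the ASSUMED_* constants and helper names are
illustrative and not taken from the patchset:

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumptions for illustration, not read from any kernel header. */
#define ASSUMED_BIO_MAX_PAGES	256u
#define ASSUMED_PAGE_SIZE	4096u

/* Largest page-aligned payload a single bio could carry. */
size_t max_single_bio_bytes(void)
{
	return (size_t)ASSUMED_BIO_MAX_PAGES * ASSUMED_PAGE_SIZE;
}

/*
 * A write that needs more than one bio can be split on the way down,
 * so it could no longer be handed to the device as one atomic unit.
 */
bool fits_in_one_bio(size_t len)
{
	return len <= max_single_bio_bytes();
}
```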

>
> In btrfs I'd just chain things together and do the extent pointer swap 
> afterwards, but I didn't follow the XFS code well enough to see how it's
> handled there.  But either way it feels like an error-prone surprise
> waiting for later, and one gap we really want to get right in the FS 
> support is O_ATOMIC across a fragmented extent.
>
> If I'm reading the XFS patches right, the code always cows for atomic.

It doesn't really COW - it uses the COW infrastructure to write out of
place and then commit it into the file later.  Because of that we don't
really care about things like boundary blocks (which XFS never used in
that form anyway) - data is written first, the cache is flushed and then
we swap around the extent pointers.
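
The same ordering (write the new data somewhere else, make it stable,
then switch the pointer in one step) has a familiar userspace analogue
in the write-temp-file, fsync, rename pattern.  The sketch below shows
that swapped-in technique, not the XFS implementation;
replace_file_atomically() is a made-up helper name, and it replaces a
whole file rather than a range.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 0 on success or a negative errno value. */
int replace_file_atomically(const char *path, const void *buf, size_t len)
{
	char tmp[4096];
	ssize_t n;
	int fd, ret = 0;

	if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int)sizeof(tmp))
		return -ENAMETOOLONG;

	/* Step 1: write the new data out of place. */
	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -errno;
	n = write(fd, buf, len);
	if (n < 0)
		ret = -errno;
	else if ((size_t)n != len)
		ret = -EIO;

	/* Step 2: make the data stable before anything points at it. */
	if (!ret && fsync(fd) < 0)
		ret = -errno;
	if (close(fd) < 0 && !ret)
		ret = -errno;

	/*
	 * Step 3: switch the name over in one step.  A careful caller
	 * would also fsync the containing directory afterwards.
	 */
	if (!ret && rename(tmp, path) < 0)
		ret = -errno;
	if (ret)
		unlink(tmp);
	return ret;
}
```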

> Are 
> you planning on adding an optimization to use atomic support in the device 
> to skip COW when possible?

We could do that fairly easily for files that have a contiguous mapping
for the atomic write I/O.  But at this point I have a lot more trust in
the fs code than the devices, especially due to the silent failure mode.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] failure atomic writes for file systems and block devices
@ 2017-03-01 15:07     ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-03-01 15:07 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-block,
	linux-api, Michael Kerrisk (man-pages)

On Wed, Mar 01, 2017 at 01:21:41PM +0200, Amir Goldstein wrote:
> [CC += linux-api@vger.kernel.org] for that question and for the new API

We'll need to iterate over the API a few more times first I think..

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] failure atomic writes for file systems and block devices
  2017-02-28 23:22 ` Darrick J. Wong
@ 2017-03-01 15:09   ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-03-01 15:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-block

On Tue, Feb 28, 2017 at 03:22:04PM -0800, Darrick J. Wong wrote:
> (Assuming there's no syncv involved here...?)

No.  While I think we could implement it for XFS similar to how we roll
transactions over multiple inodes for a few transactions, the use case
is much more limited, and the potential pitfalls are much bigger.

> > have to check the F_IOINFO fcntl before, which is a bit of a killer.
> > Because of that I've also not implemented any other validity checks
> > yet, as they might make things even worse when an open on a not supported
> > file system or device fails, but not on an old kernel.  Maybe we need
> > a new open version that checks arguments properly first?
> 
> Does fcntl(F_SETFL...) suffer from this?

Yes.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 05/12] fs: add a F_IOINFO fcntl
  2017-02-28 16:51   ` Darrick J. Wong
@ 2017-03-01 15:11     ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-03-01 15:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-block

On Tue, Feb 28, 2017 at 08:51:39AM -0800, Darrick J. Wong wrote:
> Hm... is fio_alignment specified in units of bytes?

Yes.

> If so, then
> shouldn't this be a __u32 so that we can handle some weird future device
> that wants, say, 1MB alignment for its atomic IO?

That would be pretty useless.  Anything bigger than sector / block
size would not really be usable for typical applications.

> Though, now that I look at the XFS ioinfo patch, I guess fio_alignment
> is set only for O_DIRECT files?

Yes.

> So it's really the required alignment
> for directio operations.

For buffered I/O we can write at byte granularity and still use the
atomic commits, but for direct I/O we can only COW at block size
granularity.
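
For the direct I/O case, the check an application would want before
issuing an atomic write looks roughly like the helper below.  It is a
sketch under the assumption that the block size is a power of two and
has already been obtained from somewhere (for example the alignment
value discussed above); dio_atomic_aligned() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if both the start and the end of the write fall on block
 * boundaries.  blocksize must be a non-zero power of two.
 */
bool dio_atomic_aligned(uint64_t offset, uint64_t len, uint32_t blocksize)
{
	uint64_t mask;

	if (blocksize == 0 || (blocksize & (blocksize - 1)) != 0)
		return false;

	mask = (uint64_t)blocksize - 1;
	/* offset and len both aligned implies offset + len is aligned. */
	return ((offset | len) & mask) == 0;
}
```

Buffered writes would not need this check, since they can be staged at
byte granularity as described above.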

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 07/12] xfs: implement failure-atomic writes
  2017-02-28 23:09   ` Darrick J. Wong
@ 2017-03-01 15:17     ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2017-03-01 15:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-block

On Tue, Feb 28, 2017 at 03:09:40PM -0800, Darrick J. Wong wrote:
> By the way, the copy on write code remembers the extents it has
> allocated for CoW staging in the refcount btree so that it can free them
> after a crash, which means that O_ATOMIC requires reflink to be enabled.

Yeah.

> There doesn't seem to be any explicit checking that reflink is even
> enabled, which will probably just lead to weird crashes on a pre-reflink
> xfs.

True.  I had this earlier when I had basic O_ATOMIC validity checking,
but that was dropped from the series I posted.

> 
> FWIW I didn't see any checking anywhere (vfs or xfs) that the filesystem
> can actually support O_ATOMIC.  If the FS doesn't support atomic writes,
> shouldn't the kernel send EINVAL or something back to userspace?

Older kernels can't check it, so having new ones check it creates even
more of a mess.

I'm still not feeling very well about O_ATOMIC - either we need an
open2 that checks for unknown flags, or I need to change this to
a per-op flag - RWF_ATOMIC for write (pwritev2 actually), and MAP_ATOMIC
for mmap.  But given that pwritev2 isn't really supported in common
userland yet that might be rather painful.
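
The attraction of a per-op flag is that pwritev2() rejects flag bits it
does not know, so the caller gets an error instead of a silent
downgrade.  A minimal sketch of that probing, assuming a glibc that
provides the pwritev2() wrapper; PLACEHOLDER_UNKNOWN_RWF is a made-up
bit standing in for a flag the running kernel does not implement, and
write_with_op_flags() is a hypothetical helper:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for pwritev2() */
#endif
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Placeholder only: not a real RWF_* flag. */
#define PLACEHOLDER_UNKNOWN_RWF	(1 << 30)

/* Returns the number of bytes written or a negative errno value. */
ssize_t write_with_op_flags(int fd, const void *buf, size_t len,
			    off_t off, int flags)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t n = pwritev2(fd, &iov, 1, off, flags);

	/* An unsupported flag shows up here as an error, not as success. */
	return n < 0 ? -errno : n;
}
```

An application could treat a negative return for the flagged write as
"no atomic support here" and choose its own fallback, which is exactly
what a flag passed to open() cannot tell it on an old kernel.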

> At the start of xfs_reflink.c is a long block comment describing how the
> copy on write mechanism works.  Since O_ATOMIC is a variant on CoW (it's
> basically CoW with remapping deferred until fsync), please update the
> comment so that the comments capture the details of how atomic writes
> work.
> 
> (IOWs: Dave asked me to leave the big comment, so I'm going to try to
> keep it fairly up to date.)

I'll add some information to it.

> I suppose it goes without saying that userspace will have to coordinate
> its O_ATOMIC writes to the file.

It does - but if you have multiple writers to a file they really need
to be coordinated anyway.  If you have threads whose updates race
you'd need something like

open(O_TMPFILE)
clone file (or range) into tempfile

update tempfile

clone region you want atomically inserted back into the original file.

We can actually do that with existing primitives, but it's a bit more
heavyweight.  We could optimize this a bit by checking if an extent
already points to the same physical blocks before replacing it in
clone_file_range.
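
The steps above map onto the existing FICLONE and FICLONERANGE ioctls
roughly as follows.  This is a sketch, not a tested recipe: it assumes a
reflink-capable filesystem, that dir is a directory on the same
filesystem as fd, and that off and len are block aligned;
atomic_update_via_clone() is a made-up helper name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for O_TMPFILE */
#endif
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 on success or a negative errno value. */
int atomic_update_via_clone(int fd, const char *dir, off_t off,
			    const void *buf, size_t len)
{
	struct file_clone_range range = { 0 };
	ssize_t n;
	int tmp, ret = 0;

	/* open(O_TMPFILE): an unnamed file on the same filesystem. */
	tmp = open(dir, O_TMPFILE | O_RDWR, 0600);
	if (tmp < 0)
		return -errno;

	/* Clone the file into the tempfile; blocks are shared, not copied. */
	if (ioctl(tmp, FICLONE, fd) < 0) {
		ret = -errno;
		goto out;
	}

	/* Update the tempfile; this only touches its private copy. */
	n = pwrite(tmp, buf, len, off);
	if (n < 0)
		ret = -errno;
	else if ((size_t)n != len)
		ret = -EIO;
	if (!ret && fsync(tmp) < 0)
		ret = -errno;
	if (ret)
		goto out;

	/* Clone the updated region back into the original file. */
	range.src_fd = tmp;
	range.src_offset = off;
	range.src_length = len;
	range.dest_offset = off;
	if (ioctl(fd, FICLONERANGE, &range) < 0)
		ret = -errno;
out:
	close(tmp);
	return ret;
}
```

On filesystems without reflink support the first ioctl is expected to
fail, and the caller gets the errno back instead of a partial update.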

> > +	if (file->f_flags & O_ATOMIC)
> > +		printk_ratelimited("O_ATOMIC!\n");
> 
> Per above,
> 
> if (file->f_flags & O_ATOMIC) {
> 	if (!xfs_sb_version_hasreflink(...))
> 		return -EPROTONOSUPPORT;

Yeah.

> 	printk_ratelimited("EXPERIMENTAL atomic writes feature in use!\n");

And that should just go away - it was a local debug aid :)

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2017-03-01 15:18 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-28 14:57 [RFC] failure atomic writes for file systems and block devices Christoph Hellwig
2017-02-28 14:57 ` [PATCH 01/12] uapi/fs: add O_ATOMIC to the open flags Christoph Hellwig
2017-02-28 14:57 ` [PATCH 02/12] iomap: pass IOMAP_* flags to actors Christoph Hellwig
2017-02-28 14:57 ` [PATCH 03/12] iomap: add a IOMAP_ATOMIC flag Christoph Hellwig
2017-02-28 14:57 ` [PATCH 04/12] fs: add a BH_Atomic flag Christoph Hellwig
2017-02-28 14:57 ` [PATCH 05/12] fs: add a F_IOINFO fcntl Christoph Hellwig
2017-02-28 16:51   ` Darrick J. Wong
2017-03-01 15:11     ` Christoph Hellwig
2017-02-28 14:57 ` [PATCH 06/12] xfs: cleanup is_reflink checks Christoph Hellwig
2017-02-28 14:57 ` [PATCH 07/12] xfs: implement failure-atomic writes Christoph Hellwig
2017-02-28 23:09   ` Darrick J. Wong
2017-03-01 15:17     ` Christoph Hellwig
2017-02-28 14:57 ` [PATCH 08/12] xfs: implement the F_IOINFO fcntl Christoph Hellwig
2017-02-28 14:57 ` [PATCH 09/12] block: advertize max atomic write limit Christoph Hellwig
2017-02-28 14:57 ` [PATCH 10/12] block_dev: set REQ_NOMERGE for O_ATOMIC writes Christoph Hellwig
2017-02-28 14:57 ` [PATCH 11/12] block_dev: implement the F_IOINFO fcntl Christoph Hellwig
2017-02-28 14:57 ` [PATCH 12/12] nvme: export the atomic write limit Christoph Hellwig
2017-02-28 20:48 ` [RFC] failure atomic writes for file systems and block devices Chris Mason
2017-02-28 20:48   ` Chris Mason
2017-03-01 15:07   ` Christoph Hellwig
2017-02-28 23:22 ` Darrick J. Wong
2017-03-01 15:09   ` Christoph Hellwig
2017-03-01 11:21 ` Amir Goldstein
2017-03-01 15:07   ` Christoph Hellwig
2017-03-01 15:07     ` Christoph Hellwig
