All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] ext4: DAX fixes
@ 2016-05-11  9:39 ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2016-05-11  9:39 UTC (permalink / raw)
  To: Ted Tso; +Cc: Jan Kara, linux-nvdimm, linux-fsdevel, linux-ext4

Hi!

These are ext4 patches that were originally part of my DAX locking patch
series. They stand on their own so I'm submitting them independently and since
they are non-trivial, I'd like them to be merged through ext4 tree.  The rest
of the DAX locking series depends on these patches so whichever tree is going
to merge them will have to pull the ext4 tree to get these patches.  They
didn't change since the last posting of the DAX locking series (except for
minor fixup due to patch reordering).

Ted, can you please queue these patches for the coming merge window? Thanks!

								Honza
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 0/4] ext4: DAX fixes
@ 2016-05-11  9:39 ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2016-05-11  9:39 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, linux-nvdimm, Vishal Verma, Ross Zwisler,
	Dan Williams, linux-fsdevel, Jan Kara

Hi!

These are ext4 patches that were originally part of my DAX locking patch
series. They stand on their own so I'm submitting them independently and since
they are non-trivial, I'd like them to be merged through ext4 tree.  The rest
of the DAX locking series depends on these patches so whichever tree is going
to merge them will have to pull the ext4 tree to get these patches.  They
didn't change since the last posting of the DAX locking series (except for
minor fixup due to patch reordering).

Ted, can you please queue these patches for the coming merge window? Thanks!

								Honza

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/4] ext4: Handle transient ENOSPC properly for DAX
  2016-05-11  9:39 ` Jan Kara
@ 2016-05-11  9:39   ` Jan Kara
  -1 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2016-05-11  9:39 UTC (permalink / raw)
  To: Ted Tso; +Cc: Jan Kara, linux-nvdimm, linux-fsdevel, linux-ext4

ext4_dax_get_blocks() was accidentally omitted fixing get blocks
handlers to properly handle transient ENOSPC errors. Fix it now to use
ext4_get_blocks_trans() helper which takes care of these errors.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 75 +++++++++++++++------------------------------------------
 1 file changed, 20 insertions(+), 55 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 981a1fc30eaa..3e0c06028668 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3218,72 +3218,37 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
 int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
 			    struct buffer_head *bh_result, int create)
 {
-	int ret, err;
-	int credits;
-	struct ext4_map_blocks map;
-	handle_t *handle = NULL;
-	int flags = 0;
+	int ret;
 
 	ext4_debug("ext4_dax_mmap_get_block: inode %lu, create flag %d\n",
 		   inode->i_ino, create);
-	map.m_lblk = iblock;
-	map.m_len = bh_result->b_size >> inode->i_blkbits;
-	credits = ext4_chunk_trans_blocks(inode, map.m_len);
-	if (create) {
-		flags |= EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_CREATE_ZERO;
-		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			return ret;
-		}
-	}
+	if (!create)
+		return _ext4_get_block(inode, iblock, bh_result, 0);
 
-	ret = ext4_map_blocks(handle, inode, &map, flags);
-	if (create) {
-		err = ext4_journal_stop(handle);
-		if (ret >= 0 && err < 0)
-			ret = err;
-	}
-	if (ret <= 0)
-		goto out;
-	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
-		int err2;
+	ret = ext4_get_block_trans(inode, iblock, bh_result,
+				   EXT4_GET_BLOCKS_PRE_IO |
+				   EXT4_GET_BLOCKS_CREATE_ZERO);
+	if (ret < 0)
+		return ret;
 
+	if (buffer_unwritten(bh_result)) {
 		/*
 		 * We are protected by i_mmap_sem so we know block cannot go
 		 * away from under us even though we dropped i_data_sem.
 		 * Convert extent to written and write zeros there.
-		 *
-		 * Note: We may get here even when create == 0.
 		 */
-		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			goto out;
-		}
-
-		err = ext4_map_blocks(handle, inode, &map,
-		      EXT4_GET_BLOCKS_CONVERT | EXT4_GET_BLOCKS_CREATE_ZERO);
-		if (err < 0)
-			ret = err;
-		err2 = ext4_journal_stop(handle);
-		if (err2 < 0 && ret > 0)
-			ret = err2;
-	}
-out:
-	WARN_ON_ONCE(ret == 0 && create);
-	if (ret > 0) {
-		map_bh(bh_result, inode->i_sb, map.m_pblk);
-		/*
-		 * At least for now we have to clear BH_New so that DAX code
-		 * doesn't attempt to zero blocks again in a racy way.
-		 */
-		map.m_flags &= ~EXT4_MAP_NEW;
-		ext4_update_bh_state(bh_result, map.m_flags);
-		bh_result->b_size = map.m_len << inode->i_blkbits;
-		ret = 0;
+		ret = ext4_get_block_trans(inode, iblock, bh_result,
+					   EXT4_GET_BLOCKS_CONVERT |
+					   EXT4_GET_BLOCKS_CREATE_ZERO);
+		if (ret < 0)
+			return ret;
 	}
-	return ret;
+	/*
+	 * At least for now we have to clear BH_New so that DAX code
+	 * doesn't attempt to zero blocks again in a racy way.
+	 */
+	clear_buffer_new(bh_result);
+	return 0;
 }
 #endif
 
-- 
2.6.6

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 1/4] ext4: Handle transient ENOSPC properly for DAX
@ 2016-05-11  9:39   ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2016-05-11  9:39 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, linux-nvdimm, Vishal Verma, Ross Zwisler,
	Dan Williams, linux-fsdevel, Jan Kara

ext4_dax_get_blocks() was accidentally omitted fixing get blocks
handlers to properly handle transient ENOSPC errors. Fix it now to use
ext4_get_blocks_trans() helper which takes care of these errors.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 75 +++++++++++++++------------------------------------------
 1 file changed, 20 insertions(+), 55 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 981a1fc30eaa..3e0c06028668 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3218,72 +3218,37 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
 int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
 			    struct buffer_head *bh_result, int create)
 {
-	int ret, err;
-	int credits;
-	struct ext4_map_blocks map;
-	handle_t *handle = NULL;
-	int flags = 0;
+	int ret;
 
 	ext4_debug("ext4_dax_mmap_get_block: inode %lu, create flag %d\n",
 		   inode->i_ino, create);
-	map.m_lblk = iblock;
-	map.m_len = bh_result->b_size >> inode->i_blkbits;
-	credits = ext4_chunk_trans_blocks(inode, map.m_len);
-	if (create) {
-		flags |= EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_CREATE_ZERO;
-		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			return ret;
-		}
-	}
+	if (!create)
+		return _ext4_get_block(inode, iblock, bh_result, 0);
 
-	ret = ext4_map_blocks(handle, inode, &map, flags);
-	if (create) {
-		err = ext4_journal_stop(handle);
-		if (ret >= 0 && err < 0)
-			ret = err;
-	}
-	if (ret <= 0)
-		goto out;
-	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
-		int err2;
+	ret = ext4_get_block_trans(inode, iblock, bh_result,
+				   EXT4_GET_BLOCKS_PRE_IO |
+				   EXT4_GET_BLOCKS_CREATE_ZERO);
+	if (ret < 0)
+		return ret;
 
+	if (buffer_unwritten(bh_result)) {
 		/*
 		 * We are protected by i_mmap_sem so we know block cannot go
 		 * away from under us even though we dropped i_data_sem.
 		 * Convert extent to written and write zeros there.
-		 *
-		 * Note: We may get here even when create == 0.
 		 */
-		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			goto out;
-		}
-
-		err = ext4_map_blocks(handle, inode, &map,
-		      EXT4_GET_BLOCKS_CONVERT | EXT4_GET_BLOCKS_CREATE_ZERO);
-		if (err < 0)
-			ret = err;
-		err2 = ext4_journal_stop(handle);
-		if (err2 < 0 && ret > 0)
-			ret = err2;
-	}
-out:
-	WARN_ON_ONCE(ret == 0 && create);
-	if (ret > 0) {
-		map_bh(bh_result, inode->i_sb, map.m_pblk);
-		/*
-		 * At least for now we have to clear BH_New so that DAX code
-		 * doesn't attempt to zero blocks again in a racy way.
-		 */
-		map.m_flags &= ~EXT4_MAP_NEW;
-		ext4_update_bh_state(bh_result, map.m_flags);
-		bh_result->b_size = map.m_len << inode->i_blkbits;
-		ret = 0;
+		ret = ext4_get_block_trans(inode, iblock, bh_result,
+					   EXT4_GET_BLOCKS_CONVERT |
+					   EXT4_GET_BLOCKS_CREATE_ZERO);
+		if (ret < 0)
+			return ret;
 	}
-	return ret;
+	/*
+	 * At least for now we have to clear BH_New so that DAX code
+	 * doesn't attempt to zero blocks again in a racy way.
+	 */
+	clear_buffer_new(bh_result);
+	return 0;
 }
 #endif
 
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/4] ext4: Fix race in transient ENOSPC detection
  2016-05-11  9:39 ` Jan Kara
@ 2016-05-11  9:39   ` Jan Kara
  -1 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2016-05-11  9:39 UTC (permalink / raw)
  To: Ted Tso; +Cc: Jan Kara, linux-nvdimm, linux-fsdevel, linux-ext4

When there are blocks to free in the running transaction, block
allocator can return ENOSPC although the filesystem has some blocks to
free. We use ext4_should_retry_alloc() to force commit of the current
transaction and return whether anything was committed so that it makes
sense to retry the allocation. However the transaction may get committed
after block allocation fails but before we call
ext4_should_retry_alloc(). So ext4_should_retry_alloc() returns false
because there is nothing to commit and we wrongly return ENOSPC.

Fix the race by unconditionally returning 1 from ext4_should_retry_alloc()
when we tried to commit a transaction. This should not add any
unnecessary retries since we had a transaction running a while ago when
trying to allocate blocks and we want to retry the allocation once that
transaction has committed anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/balloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index fe1f50fe764f..3020fd70c392 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -610,7 +610,8 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries)
 
 	jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
 
-	return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
+	jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
+	return 1;
 }
 
 /*
-- 
2.6.6

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/4] ext4: Fix race in transient ENOSPC detection
@ 2016-05-11  9:39   ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2016-05-11  9:39 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, linux-nvdimm, Vishal Verma, Ross Zwisler,
	Dan Williams, linux-fsdevel, Jan Kara

When there are blocks to free in the running transaction, block
allocator can return ENOSPC although the filesystem has some blocks to
free. We use ext4_should_retry_alloc() to force commit of the current
transaction and return whether anything was committed so that it makes
sense to retry the allocation. However the transaction may get committed
after block allocation fails but before we call
ext4_should_retry_alloc(). So ext4_should_retry_alloc() returns false
because there is nothing to commit and we wrongly return ENOSPC.

Fix the race by unconditionally returning 1 from ext4_should_retry_alloc()
when we tried to commit a transaction. This should not add any
unnecessary retries since we had a transaction running a while ago when
trying to allocate blocks and we want to retry the allocation once that
transaction has committed anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/balloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index fe1f50fe764f..3020fd70c392 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -610,7 +610,8 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries)
 
 	jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
 
-	return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
+	jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
+	return 1;
 }
 
 /*
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/4] ext4: Refactor direct IO code
  2016-05-11  9:39 ` Jan Kara
@ 2016-05-11  9:39   ` Jan Kara
  -1 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2016-05-11  9:39 UTC (permalink / raw)
  To: Ted Tso; +Cc: Jan Kara, linux-nvdimm, linux-fsdevel, linux-ext4

Currently ext4 direct IO handling is split between ext4_ext_direct_IO()
and ext4_ind_direct_IO(). However the extent based function calls into
the indirect based one for some cases and for example it is not able to
handle file extending. Previously it was not also properly handling
retries in case of ENOSPC errors. With DAX things would get even more
contrieved so just refactor the direct IO code and instead of indirect /
extent split do the split to read vs writes.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h     |   2 -
 fs/ext4/indirect.c | 127 ---------------------------------------------------
 fs/ext4/inode.c    | 131 ++++++++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 114 insertions(+), 146 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 349afebe21ee..35792b430fb6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2581,8 +2581,6 @@ extern int ext4_get_next_extent(struct inode *inode, ext4_lblk_t lblk,
 /* indirect.c */
 extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
 				struct ext4_map_blocks *map, int flags);
-extern ssize_t ext4_ind_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
-				  loff_t offset);
 extern int ext4_ind_calc_metadata_amount(struct inode *inode, sector_t lblock);
 extern int ext4_ind_trans_blocks(struct inode *inode, int nrblocks);
 extern void ext4_ind_truncate(handle_t *, struct inode *inode);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 3027fa681de5..bc15c2c17633 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -649,133 +649,6 @@ out:
 }
 
 /*
- * O_DIRECT for ext3 (or indirect map) based files
- *
- * If the O_DIRECT write will extend the file then add this inode to the
- * orphan list.  So recovery will truncate it back to the original size
- * if the machine crashes during the write.
- *
- * If the O_DIRECT write is intantiating holes inside i_size and the machine
- * crashes then stale disk data _may_ be exposed inside the file. But current
- * VFS code falls back into buffered path in that case so we are safe.
- */
-ssize_t ext4_ind_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
-			   loff_t offset)
-{
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	handle_t *handle;
-	ssize_t ret;
-	int orphan = 0;
-	size_t count = iov_iter_count(iter);
-	int retries = 0;
-
-	if (iov_iter_rw(iter) == WRITE) {
-		loff_t final_size = offset + count;
-
-		if (final_size > inode->i_size) {
-			/* Credits for sb + inode write */
-			handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-			if (IS_ERR(handle)) {
-				ret = PTR_ERR(handle);
-				goto out;
-			}
-			ret = ext4_orphan_add(handle, inode);
-			if (ret) {
-				ext4_journal_stop(handle);
-				goto out;
-			}
-			orphan = 1;
-			ei->i_disksize = inode->i_size;
-			ext4_journal_stop(handle);
-		}
-	}
-
-retry:
-	if (iov_iter_rw(iter) == READ && ext4_should_dioread_nolock(inode)) {
-		/*
-		 * Nolock dioread optimization may be dynamically disabled
-		 * via ext4_inode_block_unlocked_dio(). Check inode's state
-		 * while holding extra i_dio_count ref.
-		 */
-		inode_dio_begin(inode);
-		smp_mb();
-		if (unlikely(ext4_test_inode_state(inode,
-						    EXT4_STATE_DIOREAD_LOCK))) {
-			inode_dio_end(inode);
-			goto locked;
-		}
-		if (IS_DAX(inode))
-			ret = dax_do_io(iocb, inode, iter, offset,
-					ext4_dio_get_block, NULL, 0);
-		else
-			ret = __blockdev_direct_IO(iocb, inode,
-						   inode->i_sb->s_bdev, iter,
-						   offset, ext4_dio_get_block,
-						   NULL, NULL, 0);
-		inode_dio_end(inode);
-	} else {
-locked:
-		if (IS_DAX(inode))
-			ret = dax_do_io(iocb, inode, iter, offset,
-					ext4_dio_get_block, NULL, DIO_LOCKING);
-		else
-			ret = blockdev_direct_IO(iocb, inode, iter, offset,
-						 ext4_dio_get_block);
-
-		if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
-			loff_t isize = i_size_read(inode);
-			loff_t end = offset + count;
-
-			if (end > isize)
-				ext4_truncate_failed_write(inode);
-		}
-	}
-	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
-		goto retry;
-
-	if (orphan) {
-		int err;
-
-		/* Credits for sb + inode write */
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-		if (IS_ERR(handle)) {
-			/* This is really bad luck. We've written the data
-			 * but cannot extend i_size. Bail out and pretend
-			 * the write failed... */
-			ret = PTR_ERR(handle);
-			if (inode->i_nlink)
-				ext4_orphan_del(NULL, inode);
-
-			goto out;
-		}
-		if (inode->i_nlink)
-			ext4_orphan_del(handle, inode);
-		if (ret > 0) {
-			loff_t end = offset + ret;
-			if (end > inode->i_size) {
-				ei->i_disksize = end;
-				i_size_write(inode, end);
-				/*
-				 * We're going to return a positive `ret'
-				 * here due to non-zero-length I/O, so there's
-				 * no way of reporting error returns from
-				 * ext4_mark_inode_dirty() to userspace.  So
-				 * ignore it.
-				 */
-				ext4_mark_inode_dirty(handle, inode);
-			}
-		}
-		err = ext4_journal_stop(handle);
-		if (ret == 0)
-			ret = err;
-	}
-out:
-	return ret;
-}
-
-/*
  * Calculate the number of metadata blocks need to reserve
  * to allocate a new block at @lblocks for non extent file based file
  */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3e0c06028668..23fd0e0a9223 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3281,7 +3281,9 @@ static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
 }
 
 /*
- * For ext4 extent files, ext4 will do direct-io write to holes,
+ * Handling of direct IO writes.
+ *
+ * For ext4 extent files, ext4 will do direct-io write even to holes,
  * preallocated extents, and those write extend the file, no need to
  * fall back to buffered IO.
  *
@@ -3299,21 +3301,37 @@ static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
  * if the machine crashes during the write.
  *
  */
-static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
-				  loff_t offset)
+static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter,
+				    loff_t offset)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
+	struct ext4_inode_info *ei = EXT4_I(inode);
 	ssize_t ret;
 	size_t count = iov_iter_count(iter);
 	int overwrite = 0;
 	get_block_t *get_block_func = NULL;
 	int dio_flags = 0;
 	loff_t final_size = offset + count;
+	int orphan = 0;
+	handle_t *handle;
 
-	/* Use the old path for reads and writes beyond i_size. */
-	if (iov_iter_rw(iter) != WRITE || final_size > inode->i_size)
-		return ext4_ind_direct_IO(iocb, iter, offset);
+	if (final_size > inode->i_size) {
+		/* Credits for sb + inode write */
+		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			goto out;
+		}
+		ret = ext4_orphan_add(handle, inode);
+		if (ret) {
+			ext4_journal_stop(handle);
+			goto out;
+		}
+		orphan = 1;
+		ei->i_disksize = inode->i_size;
+		ext4_journal_stop(handle);
+	}
 
 	BUG_ON(iocb->private == NULL);
 
@@ -3322,8 +3340,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	 * conversion. This also disallows race between truncate() and
 	 * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
 	 */
-	if (iov_iter_rw(iter) == WRITE)
-		inode_dio_begin(inode);
+	inode_dio_begin(inode);
 
 	/* If we do a overwrite dio, i_mutex locking can be released */
 	overwrite = *((int *)iocb->private);
@@ -3332,7 +3349,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		inode_unlock(inode);
 
 	/*
-	 * We could direct write to holes and fallocate.
+	 * For extent mapped files we could direct write to holes and fallocate.
 	 *
 	 * Allocated blocks to fill the hole are marked as unwritten to prevent
 	 * parallel buffered read to expose the stale data before DIO complete
@@ -3354,7 +3371,11 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	iocb->private = NULL;
 	if (overwrite)
 		get_block_func = ext4_dio_get_block_overwrite;
-	else if (is_sync_kiocb(iocb)) {
+	else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
+		 round_down(offset, 1 << inode->i_blkbits) >= inode->i_size) {
+		get_block_func = ext4_dio_get_block;
+		dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
+	} else if (is_sync_kiocb(iocb)) {
 		get_block_func = ext4_dio_get_block_unwritten_sync;
 		dio_flags = DIO_LOCKING;
 	} else {
@@ -3364,10 +3385,11 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 #ifdef CONFIG_EXT4_FS_ENCRYPTION
 	BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
 #endif
-	if (IS_DAX(inode))
+	if (IS_DAX(inode)) {
+		dio_flags &= ~DIO_SKIP_HOLES;
 		ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
 				ext4_end_io_dio, dio_flags);
-	else
+	} else
 		ret = __blockdev_direct_IO(iocb, inode,
 					   inode->i_sb->s_bdev, iter, offset,
 					   get_block_func,
@@ -3387,12 +3409,87 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
 	}
 
-	if (iov_iter_rw(iter) == WRITE)
-		inode_dio_end(inode);
+	inode_dio_end(inode);
 	/* take i_mutex locking again if we do a ovewrite dio */
 	if (overwrite)
 		inode_lock(inode);
 
+	if (ret < 0 && final_size > inode->i_size)
+		ext4_truncate_failed_write(inode);
+
+	/* Handle extending of i_size after direct IO write */
+	if (orphan) {
+		int err;
+
+		/* Credits for sb + inode write */
+		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+		if (IS_ERR(handle)) {
+			/* This is really bad luck. We've written the data
+			 * but cannot extend i_size. Bail out and pretend
+			 * the write failed... */
+			ret = PTR_ERR(handle);
+			if (inode->i_nlink)
+				ext4_orphan_del(NULL, inode);
+
+			goto out;
+		}
+		if (inode->i_nlink)
+			ext4_orphan_del(handle, inode);
+		if (ret > 0) {
+			loff_t end = offset + ret;
+			if (end > inode->i_size) {
+				ei->i_disksize = end;
+				i_size_write(inode, end);
+				/*
+				 * We're going to return a positive `ret'
+				 * here due to non-zero-length I/O, so there's
+				 * no way of reporting error returns from
+				 * ext4_mark_inode_dirty() to userspace.  So
+				 * ignore it.
+				 */
+				ext4_mark_inode_dirty(handle, inode);
+			}
+		}
+		err = ext4_journal_stop(handle);
+		if (ret == 0)
+			ret = err;
+	}
+out:
+	return ret;
+}
+
+static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter,
+				   loff_t offset)
+{
+	int unlocked = 0;
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	ssize_t ret;
+
+	if (ext4_should_dioread_nolock(inode)) {
+		/*
+		 * Nolock dioread optimization may be dynamically disabled
+		 * via ext4_inode_block_unlocked_dio(). Check inode's state
+		 * while holding extra i_dio_count ref.
+		 */
+		inode_dio_begin(inode);
+		smp_mb();
+		if (unlikely(ext4_test_inode_state(inode,
+						    EXT4_STATE_DIOREAD_LOCK)))
+			inode_dio_end(inode);
+		else
+			unlocked = 1;
+	}
+	if (IS_DAX(inode)) {
+		ret = dax_do_io(iocb, inode, iter, offset, ext4_dio_get_block,
+				NULL, unlocked ? 0 : DIO_LOCKING);
+	} else {
+		ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
+					   iter, offset, ext4_dio_get_block,
+					   NULL, NULL,
+					   unlocked ? 0 : DIO_LOCKING);
+	}
+	if (unlocked)
+		inode_dio_end(inode);
 	return ret;
 }
 
@@ -3420,10 +3517,10 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		return 0;
 
 	trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
-	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
-		ret = ext4_ext_direct_IO(iocb, iter, offset);
+	if (iov_iter_rw(iter) == READ)
+		ret = ext4_direct_IO_read(iocb, iter, offset);
 	else
-		ret = ext4_ind_direct_IO(iocb, iter, offset);
+		ret = ext4_direct_IO_write(iocb, iter, offset);
 	trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
 	return ret;
 }
-- 
2.6.6

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/4] ext4: Refactor direct IO code
@ 2016-05-11  9:39   ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2016-05-11  9:39 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, linux-nvdimm, Vishal Verma, Ross Zwisler,
	Dan Williams, linux-fsdevel, Jan Kara

Currently ext4 direct IO handling is split between ext4_ext_direct_IO()
and ext4_ind_direct_IO(). However the extent based function calls into
the indirect based one for some cases and for example it is not able to
handle file extending. Previously it was not also properly handling
retries in case of ENOSPC errors. With DAX things would get even more
contrieved so just refactor the direct IO code and instead of indirect /
extent split do the split to read vs writes.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h     |   2 -
 fs/ext4/indirect.c | 127 ---------------------------------------------------
 fs/ext4/inode.c    | 131 ++++++++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 114 insertions(+), 146 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 349afebe21ee..35792b430fb6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2581,8 +2581,6 @@ extern int ext4_get_next_extent(struct inode *inode, ext4_lblk_t lblk,
 /* indirect.c */
 extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
 				struct ext4_map_blocks *map, int flags);
-extern ssize_t ext4_ind_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
-				  loff_t offset);
 extern int ext4_ind_calc_metadata_amount(struct inode *inode, sector_t lblock);
 extern int ext4_ind_trans_blocks(struct inode *inode, int nrblocks);
 extern void ext4_ind_truncate(handle_t *, struct inode *inode);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 3027fa681de5..bc15c2c17633 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -649,133 +649,6 @@ out:
 }
 
 /*
- * O_DIRECT for ext3 (or indirect map) based files
- *
- * If the O_DIRECT write will extend the file then add this inode to the
- * orphan list.  So recovery will truncate it back to the original size
- * if the machine crashes during the write.
- *
- * If the O_DIRECT write is intantiating holes inside i_size and the machine
- * crashes then stale disk data _may_ be exposed inside the file. But current
- * VFS code falls back into buffered path in that case so we are safe.
- */
-ssize_t ext4_ind_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
-			   loff_t offset)
-{
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	handle_t *handle;
-	ssize_t ret;
-	int orphan = 0;
-	size_t count = iov_iter_count(iter);
-	int retries = 0;
-
-	if (iov_iter_rw(iter) == WRITE) {
-		loff_t final_size = offset + count;
-
-		if (final_size > inode->i_size) {
-			/* Credits for sb + inode write */
-			handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-			if (IS_ERR(handle)) {
-				ret = PTR_ERR(handle);
-				goto out;
-			}
-			ret = ext4_orphan_add(handle, inode);
-			if (ret) {
-				ext4_journal_stop(handle);
-				goto out;
-			}
-			orphan = 1;
-			ei->i_disksize = inode->i_size;
-			ext4_journal_stop(handle);
-		}
-	}
-
-retry:
-	if (iov_iter_rw(iter) == READ && ext4_should_dioread_nolock(inode)) {
-		/*
-		 * Nolock dioread optimization may be dynamically disabled
-		 * via ext4_inode_block_unlocked_dio(). Check inode's state
-		 * while holding extra i_dio_count ref.
-		 */
-		inode_dio_begin(inode);
-		smp_mb();
-		if (unlikely(ext4_test_inode_state(inode,
-						    EXT4_STATE_DIOREAD_LOCK))) {
-			inode_dio_end(inode);
-			goto locked;
-		}
-		if (IS_DAX(inode))
-			ret = dax_do_io(iocb, inode, iter, offset,
-					ext4_dio_get_block, NULL, 0);
-		else
-			ret = __blockdev_direct_IO(iocb, inode,
-						   inode->i_sb->s_bdev, iter,
-						   offset, ext4_dio_get_block,
-						   NULL, NULL, 0);
-		inode_dio_end(inode);
-	} else {
-locked:
-		if (IS_DAX(inode))
-			ret = dax_do_io(iocb, inode, iter, offset,
-					ext4_dio_get_block, NULL, DIO_LOCKING);
-		else
-			ret = blockdev_direct_IO(iocb, inode, iter, offset,
-						 ext4_dio_get_block);
-
-		if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
-			loff_t isize = i_size_read(inode);
-			loff_t end = offset + count;
-
-			if (end > isize)
-				ext4_truncate_failed_write(inode);
-		}
-	}
-	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
-		goto retry;
-
-	if (orphan) {
-		int err;
-
-		/* Credits for sb + inode write */
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-		if (IS_ERR(handle)) {
-			/* This is really bad luck. We've written the data
-			 * but cannot extend i_size. Bail out and pretend
-			 * the write failed... */
-			ret = PTR_ERR(handle);
-			if (inode->i_nlink)
-				ext4_orphan_del(NULL, inode);
-
-			goto out;
-		}
-		if (inode->i_nlink)
-			ext4_orphan_del(handle, inode);
-		if (ret > 0) {
-			loff_t end = offset + ret;
-			if (end > inode->i_size) {
-				ei->i_disksize = end;
-				i_size_write(inode, end);
-				/*
-				 * We're going to return a positive `ret'
-				 * here due to non-zero-length I/O, so there's
-				 * no way of reporting error returns from
-				 * ext4_mark_inode_dirty() to userspace.  So
-				 * ignore it.
-				 */
-				ext4_mark_inode_dirty(handle, inode);
-			}
-		}
-		err = ext4_journal_stop(handle);
-		if (ret == 0)
-			ret = err;
-	}
-out:
-	return ret;
-}
-
-/*
  * Calculate the number of metadata blocks need to reserve
  * to allocate a new block at @lblocks for non extent file based file
  */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3e0c06028668..23fd0e0a9223 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3281,7 +3281,9 @@ static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
 }
 
 /*
- * For ext4 extent files, ext4 will do direct-io write to holes,
+ * Handling of direct IO writes.
+ *
+ * For ext4 extent files, ext4 will do direct-io write even to holes,
  * preallocated extents, and those write extend the file, no need to
  * fall back to buffered IO.
  *
@@ -3299,21 +3301,37 @@ static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
  * if the machine crashes during the write.
  *
  */
-static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
-				  loff_t offset)
+static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter,
+				    loff_t offset)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
+	struct ext4_inode_info *ei = EXT4_I(inode);
 	ssize_t ret;
 	size_t count = iov_iter_count(iter);
 	int overwrite = 0;
 	get_block_t *get_block_func = NULL;
 	int dio_flags = 0;
 	loff_t final_size = offset + count;
+	int orphan = 0;
+	handle_t *handle;
 
-	/* Use the old path for reads and writes beyond i_size. */
-	if (iov_iter_rw(iter) != WRITE || final_size > inode->i_size)
-		return ext4_ind_direct_IO(iocb, iter, offset);
+	if (final_size > inode->i_size) {
+		/* Credits for sb + inode write */
+		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			goto out;
+		}
+		ret = ext4_orphan_add(handle, inode);
+		if (ret) {
+			ext4_journal_stop(handle);
+			goto out;
+		}
+		orphan = 1;
+		ei->i_disksize = inode->i_size;
+		ext4_journal_stop(handle);
+	}
 
 	BUG_ON(iocb->private == NULL);
 
@@ -3322,8 +3340,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	 * conversion. This also disallows race between truncate() and
 	 * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
 	 */
-	if (iov_iter_rw(iter) == WRITE)
-		inode_dio_begin(inode);
+	inode_dio_begin(inode);
 
 	/* If we do a overwrite dio, i_mutex locking can be released */
 	overwrite = *((int *)iocb->private);
@@ -3332,7 +3349,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		inode_unlock(inode);
 
 	/*
-	 * We could direct write to holes and fallocate.
+	 * For extent mapped files we could direct write to holes and fallocate.
 	 *
 	 * Allocated blocks to fill the hole are marked as unwritten to prevent
 	 * parallel buffered read to expose the stale data before DIO complete
@@ -3354,7 +3371,11 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	iocb->private = NULL;
 	if (overwrite)
 		get_block_func = ext4_dio_get_block_overwrite;
-	else if (is_sync_kiocb(iocb)) {
+	else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
+		 round_down(offset, 1 << inode->i_blkbits) >= inode->i_size) {
+		get_block_func = ext4_dio_get_block;
+		dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
+	} else if (is_sync_kiocb(iocb)) {
 		get_block_func = ext4_dio_get_block_unwritten_sync;
 		dio_flags = DIO_LOCKING;
 	} else {
@@ -3364,10 +3385,11 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 #ifdef CONFIG_EXT4_FS_ENCRYPTION
 	BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
 #endif
-	if (IS_DAX(inode))
+	if (IS_DAX(inode)) {
+		dio_flags &= ~DIO_SKIP_HOLES;
 		ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
 				ext4_end_io_dio, dio_flags);
-	else
+	} else
 		ret = __blockdev_direct_IO(iocb, inode,
 					   inode->i_sb->s_bdev, iter, offset,
 					   get_block_func,
@@ -3387,12 +3409,87 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
 	}
 
-	if (iov_iter_rw(iter) == WRITE)
-		inode_dio_end(inode);
+	inode_dio_end(inode);
 	/* take i_mutex locking again if we do a ovewrite dio */
 	if (overwrite)
 		inode_lock(inode);
 
+	if (ret < 0 && final_size > inode->i_size)
+		ext4_truncate_failed_write(inode);
+
+	/* Handle extending of i_size after direct IO write */
+	if (orphan) {
+		int err;
+
+		/* Credits for sb + inode write */
+		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+		if (IS_ERR(handle)) {
+			/* This is really bad luck. We've written the data
+			 * but cannot extend i_size. Bail out and pretend
+			 * the write failed... */
+			ret = PTR_ERR(handle);
+			if (inode->i_nlink)
+				ext4_orphan_del(NULL, inode);
+
+			goto out;
+		}
+		if (inode->i_nlink)
+			ext4_orphan_del(handle, inode);
+		if (ret > 0) {
+			loff_t end = offset + ret;
+			if (end > inode->i_size) {
+				ei->i_disksize = end;
+				i_size_write(inode, end);
+				/*
+				 * We're going to return a positive `ret'
+				 * here due to non-zero-length I/O, so there's
+				 * no way of reporting error returns from
+				 * ext4_mark_inode_dirty() to userspace.  So
+				 * ignore it.
+				 */
+				ext4_mark_inode_dirty(handle, inode);
+			}
+		}
+		err = ext4_journal_stop(handle);
+		if (ret == 0)
+			ret = err;
+	}
+out:
+	return ret;
+}
+
+static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter,
+				   loff_t offset)
+{
+	int unlocked = 0;
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	ssize_t ret;
+
+	if (ext4_should_dioread_nolock(inode)) {
+		/*
+		 * Nolock dioread optimization may be dynamically disabled
+		 * via ext4_inode_block_unlocked_dio(). Check inode's state
+		 * while holding extra i_dio_count ref.
+		 */
+		inode_dio_begin(inode);
+		smp_mb();
+		if (unlikely(ext4_test_inode_state(inode,
+						    EXT4_STATE_DIOREAD_LOCK)))
+			inode_dio_end(inode);
+		else
+			unlocked = 1;
+	}
+	if (IS_DAX(inode)) {
+		ret = dax_do_io(iocb, inode, iter, offset, ext4_dio_get_block,
+				NULL, unlocked ? 0 : DIO_LOCKING);
+	} else {
+		ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
+					   iter, offset, ext4_dio_get_block,
+					   NULL, NULL,
+					   unlocked ? 0 : DIO_LOCKING);
+	}
+	if (unlocked)
+		inode_dio_end(inode);
 	return ret;
 }
 
@@ -3420,10 +3517,10 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		return 0;
 
 	trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
-	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
-		ret = ext4_ext_direct_IO(iocb, iter, offset);
+	if (iov_iter_rw(iter) == READ)
+		ret = ext4_direct_IO_read(iocb, iter, offset);
 	else
-		ret = ext4_ind_direct_IO(iocb, iter, offset);
+		ret = ext4_direct_IO_write(iocb, iter, offset);
 	trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
 	return ret;
 }
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/4] ext4: Pre-zero allocated blocks for DAX IO
  2016-05-11  9:39 ` Jan Kara
@ 2016-05-11  9:39   ` Jan Kara
  -1 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2016-05-11  9:39 UTC (permalink / raw)
  To: Ted Tso; +Cc: Jan Kara, linux-nvdimm, linux-fsdevel, linux-ext4

Currently ext4 treats DAX IO the same way as direct IO. I.e., it
allocates unwritten extents before IO is done and converts unwritten
extents afterwards. However this way DAX IO can race with page fault to
the same area:

ext4_ext_direct_IO()				dax_fault()
  dax_io()
    get_block() - allocates unwritten extent
    copy_from_iter_pmem()
						  get_block() - converts
						    unwritten block to
						    written and zeroes it
						    out
  ext4_convert_unwritten_extents()

So data written with DAX IO gets lost. Similarly dax_new_buf() called
from dax_io() can overwrite data that has been already written to the
block via mmap.

Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
use them for DAX mmap. The downside of this solution is that every
allocating write writes each block twice (once zeros, once data). Fixing
the race with locking is possible as well however we would need to
lock-out faults for the whole range written to by DAX IO. And that is
not easy to do without locking-out faults for the whole file which seems
too aggressive.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h  | 11 +++++++++--
 fs/ext4/file.c  |  4 ++--
 fs/ext4/inode.c | 43 +++++++++++++++++++++++++++++++++----------
 3 files changed, 44 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 35792b430fb6..516e3dd506c0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2521,8 +2521,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
 struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
 int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
 			     struct buffer_head *bh_result, int create);
-int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
-			    struct buffer_head *bh_result, int create);
+int ext4_dax_get_block(struct inode *inode, sector_t iblock,
+		       struct buffer_head *bh_result, int create);
 int ext4_get_block(struct inode *inode, sector_t iblock,
 		   struct buffer_head *bh_result, int create);
 int ext4_dio_get_block(struct inode *inode, sector_t iblock,
@@ -3328,6 +3328,13 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
 	}
 }
 
+static inline bool ext4_aligned_io(struct inode *inode, loff_t off, loff_t len)
+{
+	int blksize = 1 << inode->i_blkbits;
+
+	return IS_ALIGNED(off, blksize) && IS_ALIGNED(len, blksize);
+}
+
 #endif	/* __KERNEL__ */
 
 #define EFSBADCRC	EBADMSG		/* Bad CRC detected */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index fa2208bae2e1..dfb33da04589 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -207,7 +207,7 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (IS_ERR(handle))
 		result = VM_FAULT_SIGBUS;
 	else
-		result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
+		result = __dax_fault(vma, vmf, ext4_dax_get_block, NULL);
 
 	if (write) {
 		if (!IS_ERR(handle))
@@ -243,7 +243,7 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 		result = VM_FAULT_SIGBUS;
 	else
 		result = __dax_pmd_fault(vma, addr, pmd, flags,
-				ext4_dax_mmap_get_block, NULL);
+					 ext4_dax_get_block, NULL);
 
 	if (write) {
 		if (!IS_ERR(handle))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 23fd0e0a9223..a2b7e4761c82 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3215,13 +3215,17 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
 }
 
 #ifdef CONFIG_FS_DAX
-int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
-			    struct buffer_head *bh_result, int create)
+/*
+ * Get block function for DAX IO and mmap faults. It takes care of converting
+ * unwritten extents to written ones and initializes new / converted blocks
+ * to zeros.
+ */
+int ext4_dax_get_block(struct inode *inode, sector_t iblock,
+		       struct buffer_head *bh_result, int create)
 {
 	int ret;
 
-	ext4_debug("ext4_dax_mmap_get_block: inode %lu, create flag %d\n",
-		   inode->i_ino, create);
+	ext4_debug("inode %lu, create flag %d\n", inode->i_ino, create);
 	if (!create)
 		return _ext4_get_block(inode, iblock, bh_result, 0);
 
@@ -3233,9 +3237,9 @@ int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
 
 	if (buffer_unwritten(bh_result)) {
 		/*
-		 * We are protected by i_mmap_sem so we know block cannot go
-		 * away from under us even though we dropped i_data_sem.
-		 * Convert extent to written and write zeros there.
+		 * We are protected by i_mmap_sem or i_mutex so we know block
+		 * cannot go away from under us even though we dropped
+		 * i_data_sem. Convert extent to written and write zeros there.
 		 */
 		ret = ext4_get_block_trans(inode, iblock, bh_result,
 					   EXT4_GET_BLOCKS_CONVERT |
@@ -3250,6 +3254,14 @@ int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
 	clear_buffer_new(bh_result);
 	return 0;
 }
+#else
+/* Just define empty function, it will never get called. */
+int ext4_dax_get_block(struct inode *inode, sector_t iblock,
+		       struct buffer_head *bh_result, int create)
+{
+	BUG();
+	return 0;
+}
 #endif
 
 static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
@@ -3371,8 +3383,20 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter,
 	iocb->private = NULL;
 	if (overwrite)
 		get_block_func = ext4_dio_get_block_overwrite;
-	else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
-		 round_down(offset, 1 << inode->i_blkbits) >= inode->i_size) {
+	else if (IS_DAX(inode)) {
+		/*
+		 * We can avoid zeroing for aligned DAX writes beyond EOF. Other
+		 * writes need zeroing either because they can race with page
+		 * faults or because they use partial blocks.
+		 */
+		if (round_down(offset, 1<<inode->i_blkbits) >= inode->i_size &&
+		    ext4_aligned_io(inode, offset, count))
+			get_block_func = ext4_dio_get_block;
+		else
+			get_block_func = ext4_dax_get_block;
+		dio_flags = DIO_LOCKING;
+	} else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
+		   round_down(offset, 1 << inode->i_blkbits) >= inode->i_size) {
 		get_block_func = ext4_dio_get_block;
 		dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
 	} else if (is_sync_kiocb(iocb)) {
@@ -3386,7 +3410,6 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter,
 	BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
 #endif
 	if (IS_DAX(inode)) {
-		dio_flags &= ~DIO_SKIP_HOLES;
 		ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
 				ext4_end_io_dio, dio_flags);
 	} else
-- 
2.6.6

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/4] ext4: Pre-zero allocated blocks for DAX IO
@ 2016-05-11  9:39   ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2016-05-11  9:39 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, linux-nvdimm, Vishal Verma, Ross Zwisler,
	Dan Williams, linux-fsdevel, Jan Kara

Currently ext4 treats DAX IO the same way as direct IO. I.e., it
allocates unwritten extents before IO is done and converts unwritten
extents afterwards. However this way DAX IO can race with page fault to
the same area:

ext4_ext_direct_IO()				dax_fault()
  dax_io()
    get_block() - allocates unwritten extent
    copy_from_iter_pmem()
						  get_block() - converts
						    unwritten block to
						    written and zeroes it
						    out
  ext4_convert_unwritten_extents()

So data written with DAX IO gets lost. Similarly dax_new_buf() called
from dax_io() can overwrite data that has been already written to the
block via mmap.

Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
use them for DAX mmap. The downside of this solution is that every
allocating write writes each block twice (once zeros, once data). Fixing
the race with locking is possible as well however we would need to
lock-out faults for the whole range written to by DAX IO. And that is
not easy to do without locking-out faults for the whole file which seems
too aggressive.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h  | 11 +++++++++--
 fs/ext4/file.c  |  4 ++--
 fs/ext4/inode.c | 43 +++++++++++++++++++++++++++++++++----------
 3 files changed, 44 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 35792b430fb6..516e3dd506c0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2521,8 +2521,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
 struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
 int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
 			     struct buffer_head *bh_result, int create);
-int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
-			    struct buffer_head *bh_result, int create);
+int ext4_dax_get_block(struct inode *inode, sector_t iblock,
+		       struct buffer_head *bh_result, int create);
 int ext4_get_block(struct inode *inode, sector_t iblock,
 		   struct buffer_head *bh_result, int create);
 int ext4_dio_get_block(struct inode *inode, sector_t iblock,
@@ -3328,6 +3328,13 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
 	}
 }
 
+static inline bool ext4_aligned_io(struct inode *inode, loff_t off, loff_t len)
+{
+	int blksize = 1 << inode->i_blkbits;
+
+	return IS_ALIGNED(off, blksize) && IS_ALIGNED(len, blksize);
+}
+
 #endif	/* __KERNEL__ */
 
 #define EFSBADCRC	EBADMSG		/* Bad CRC detected */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index fa2208bae2e1..dfb33da04589 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -207,7 +207,7 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (IS_ERR(handle))
 		result = VM_FAULT_SIGBUS;
 	else
-		result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
+		result = __dax_fault(vma, vmf, ext4_dax_get_block, NULL);
 
 	if (write) {
 		if (!IS_ERR(handle))
@@ -243,7 +243,7 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 		result = VM_FAULT_SIGBUS;
 	else
 		result = __dax_pmd_fault(vma, addr, pmd, flags,
-				ext4_dax_mmap_get_block, NULL);
+					 ext4_dax_get_block, NULL);
 
 	if (write) {
 		if (!IS_ERR(handle))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 23fd0e0a9223..a2b7e4761c82 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3215,13 +3215,17 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
 }
 
 #ifdef CONFIG_FS_DAX
-int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
-			    struct buffer_head *bh_result, int create)
+/*
+ * Get block function for DAX IO and mmap faults. It takes care of converting
+ * unwritten extents to written ones and initializes new / converted blocks
+ * to zeros.
+ */
+int ext4_dax_get_block(struct inode *inode, sector_t iblock,
+		       struct buffer_head *bh_result, int create)
 {
 	int ret;
 
-	ext4_debug("ext4_dax_mmap_get_block: inode %lu, create flag %d\n",
-		   inode->i_ino, create);
+	ext4_debug("inode %lu, create flag %d\n", inode->i_ino, create);
 	if (!create)
 		return _ext4_get_block(inode, iblock, bh_result, 0);
 
@@ -3233,9 +3237,9 @@ int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
 
 	if (buffer_unwritten(bh_result)) {
 		/*
-		 * We are protected by i_mmap_sem so we know block cannot go
-		 * away from under us even though we dropped i_data_sem.
-		 * Convert extent to written and write zeros there.
+		 * We are protected by i_mmap_sem or i_mutex so we know block
+		 * cannot go away from under us even though we dropped
+		 * i_data_sem. Convert extent to written and write zeros there.
 		 */
 		ret = ext4_get_block_trans(inode, iblock, bh_result,
 					   EXT4_GET_BLOCKS_CONVERT |
@@ -3250,6 +3254,14 @@ int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
 	clear_buffer_new(bh_result);
 	return 0;
 }
+#else
+/* Just define empty function, it will never get called. */
+int ext4_dax_get_block(struct inode *inode, sector_t iblock,
+		       struct buffer_head *bh_result, int create)
+{
+	BUG();
+	return 0;
+}
 #endif
 
 static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
@@ -3371,8 +3383,20 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter,
 	iocb->private = NULL;
 	if (overwrite)
 		get_block_func = ext4_dio_get_block_overwrite;
-	else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
-		 round_down(offset, 1 << inode->i_blkbits) >= inode->i_size) {
+	else if (IS_DAX(inode)) {
+		/*
+		 * We can avoid zeroing for aligned DAX writes beyond EOF. Other
+		 * writes need zeroing either because they can race with page
+		 * faults or because they use partial blocks.
+		 */
+		if (round_down(offset, 1<<inode->i_blkbits) >= inode->i_size &&
+		    ext4_aligned_io(inode, offset, count))
+			get_block_func = ext4_dio_get_block;
+		else
+			get_block_func = ext4_dax_get_block;
+		dio_flags = DIO_LOCKING;
+	} else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
+		   round_down(offset, 1 << inode->i_blkbits) >= inode->i_size) {
 		get_block_func = ext4_dio_get_block;
 		dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
 	} else if (is_sync_kiocb(iocb)) {
@@ -3386,7 +3410,6 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter,
 	BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
 #endif
 	if (IS_DAX(inode)) {
-		dio_flags &= ~DIO_SKIP_HOLES;
 		ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
 				ext4_end_io_dio, dio_flags);
 	} else
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/4] ext4: DAX fixes
  2016-05-11  9:39 ` Jan Kara
                   ` (4 preceding siblings ...)
  (?)
@ 2016-05-13  5:24 ` Theodore Ts'o
  2016-05-13 13:56   ` Theodore Ts'o
  -1 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2016-05-13  5:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-nvdimm, Vishal Verma, Ross Zwisler,
	Dan Williams, linux-fsdevel

On Wed, May 11, 2016 at 11:39:07AM +0200, Jan Kara wrote:
> Hi!
> 
> These are ext4 patches that were originally part of my DAX locking patch
> series. They stand on their own so I'm submitting them independently and since
> they are non-trivial, I'd like them to be merged through ext4 tree.  The rest
> of the DAX locking series depends on these patches so whichever tree is going
> to merge them will have to pull the ext4 tree to get these patches.  They
> didn't change since the last posting of the DAX locking series (except for
> minor fixup due to patch reordering).
> 
> Ted, can you please queue these patches for the coming merge window? Thanks!

Queued and I'm running tests before I push them out.  The patches
passed the smoke tests[1], and the full tests should be running for the
next seven hours or so at:   http://104.197.222.91

I'm also doing a DAX run which should done in a half an hour or so.

    	       	     	       	      	      - Ted

[1] CMDLINE: smoke
FSTESTVER: e2fsprogs	v1.43-WIP-2015-05-18-64-g2334bd3 (Sat, 5 Sep 2015 22:21:35 -0400)
FSTESTVER: fio		fio-2.6-8-ge6989e1 (Thu, 4 Feb 2016 12:09:48 -0700)
FSTESTVER: quota		67fd9cc (Mon, 4 Apr 2016 00:32:39 -0400)
FSTESTVER: xfsprogs	v4.3.0 (Mon, 23 Nov 2015 15:24:24 +1100)
FSTESTVER: xfstests-bld	ccae8d1 (Tue, 26 Apr 2016 00:42:18 -0400)
FSTESTVER: xfstests	linux-v3.8-1010-g97dd2d8 (Wed, 27 Apr 2016 00:04:44 -0400)
FSTESTVER: kernel	4.6.0-rc4-ext4-00022-gaacd0c8 #282 SMP Fri May 13 00:52:07 EDT 2016 x86_64
FSTESTCFG: "4k"
FSTESTSET: "-g quick"
FSTESTEXC: ""
FSTESTOPT: "aex"
MNTOPTS: ""
CPUS: "2"
MEM: "7497.07"
MEM: 7680 MB (Max capacity)
BEGIN TEST 4k: Ext4 4k block Fri May 13 01:05:52 EDT 2016
Passed all 158 tests



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/4] ext4: DAX fixes
  2016-05-13  5:24 ` [PATCH 0/4] ext4: DAX fixes Theodore Ts'o
@ 2016-05-13 13:56   ` Theodore Ts'o
  2016-05-16  9:35     ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2016-05-13 13:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-nvdimm, Vishal Verma, Ross Zwisler,
	Dan Williams, linux-fsdevel

On Fri, May 13, 2016 at 01:24:19AM -0400, Theodore Ts'o wrote:
> Queued and I'm running tests before I push them out.  The patches
> passed the smoke tests[1], and the full tests should be running for the
> next seven hours or so at:   http://104.197.222.91

So the good news is that the full tests ("gce-xfstests full") look
really good[1]) Unfortunately the DAX tests ("gce-xfstests -c dax -g
auto") show a regression.  Is this expected?  (i.e., were you
expecting test regressions that will be fixed by the DAX patch
series?)

Before this patch series:

BEGIN TEST dax: Ext4 4k block using DAX Fri May 13 08:01:57 EDT 2016
Failures: generic/250 generic/252 generic/299 generic/338

After the patch series:

CMDLINE: -c dax -g auto
FSTESTVER: e2fsprogs	v1.43-WIP-2015-05-18-64-g2334bd3 (Sat, 5 Sep 2015 22:21:35 -0400)
FSTESTVER: fio		fio-2.6-8-ge6989e1 (Thu, 4 Feb 2016 12:09:48 -0700)
FSTESTVER: quota		67fd9cc (Mon, 4 Apr 2016 00:32:39 -0400)
FSTESTVER: xfsprogs	v4.3.0 (Mon, 23 Nov 2015 15:24:24 +1100)
FSTESTVER: xfstests-bld	ccae8d1 (Tue, 26 Apr 2016 00:42:18 -0400)
FSTESTVER: xfstests	linux-v3.8-1010-g97dd2d8 (Wed, 27 Apr 2016 00:04:44 -0400)
FSTESTVER: kernel	4.6.0-rc4-ext4-00022-gaacd0c8 #282 SMP Fri May 13 00:52:07 EDT 2016 x86_64
FSTESTCFG: "dax"
FSTESTSET: "-g auto"
FSTESTEXC: ""
FSTESTOPT: "aex"
MNTOPTS: ""
CPUS: "4"
MEM: "13052.8"
MEM: 26 GB (Max capacity)
BEGIN TEST dax: Ext4 4k block using DAX Fri May 13 01:21:58 EDT 2016
Failures: generic/075 generic/091 generic/112 generic/127 generic/231 generic/250 generic/252 generic/263 generic/299 generic/338 generic/340

				- Ted

[1] CMDLINE: full
FSTESTVER: e2fsprogs	v1.43-WIP-2015-05-18-64-g2334bd3 (Sat, 5 Sep 2015 22:21:35 -0400)
FSTESTVER: fio		fio-2.6-8-ge6989e1 (Thu, 4 Feb 2016 12:09:48 -0700)
FSTESTVER: quota		67fd9cc (Mon, 4 Apr 2016 00:32:39 -0400)
FSTESTVER: xfsprogs	v4.3.0 (Mon, 23 Nov 2015 15:24:24 +1100)
FSTESTVER: xfstests-bld	ccae8d1 (Tue, 26 Apr 2016 00:42:18 -0400)
FSTESTVER: xfstests	linux-v3.8-1010-g97dd2d8 (Wed, 27 Apr 2016 00:04:44 -0400)
FSTESTVER: kernel	4.6.0-rc4-ext4-00018-g816cd71 #279 SMP Thu May 5 22:45:36 EDT 2016 x86_64
FSTESTCFG: "4k 1k ext3 encrypt nojournal ext3conv adv dioread_nolock data_journal bigalloc bigalloc_1k"
FSTESTSET: "-g auto"
FSTESTEXC: ""
FSTESTOPT: "aex"
MNTOPTS: ""
CPUS: "2"
MEM: "7497.07"
MEM: 7680 MB (Max capacity)
BEGIN TEST 4k: Ext4 4k block Thu May  5 22:52:10 EDT 2016
Passed all 212 tests
BEGIN TEST 1k: Ext4 1k block Thu May  5 23:38:50 EDT 2016
Failures: generic/018
BEGIN TEST ext3: Ext4 4k block emulating ext3 Fri May  6 00:30:11 EDT 2016
Passed all 167 tests
BEGIN TEST encrypt: Ext4 encryption Fri May  6 01:15:41 EDT 2016
Failures: ext4/020
BEGIN TEST nojournal: Ext4 4k block w/ no journal Fri May  6 01:40:25 EDT 2016
Failures: ext4/301
BEGIN TEST ext3conv: Ext4 4k block w/nodelalloc and no flex_bg Fri May  6 02:20:39 EDT 2016
Passed all 212 tests
BEGIN TEST adv: Ext4 advanced features (inline_data, metadata_csum, 64bit) Fri May  6 03:06:25 EDT 2016
Passed all 211 tests
BEGIN TEST dioread_nolock: Ext4 4k block w/dioread_nolock Fri May  6 03:53:34 EDT 2016
Passed all 212 tests
BEGIN TEST data_journal: Ext4 4k block w/data=journal Fri May  6 04:40:52 EDT 2016
Failures: ext4/271 generic/018
BEGIN TEST bigalloc: Ext4 4k block w/bigalloc Fri May  6 05:28:12 EDT 2016
Failures: ext4/004 generic/204 generic/219 generic/235 generic/273
BEGIN TEST bigalloc_1k: Ext4 1k block w/bigalloc Fri May  6 06:18:54 EDT 2016
Failures: ext4/004 generic/204 generic/235 generic/270

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/4] ext4: DAX fixes
  2016-05-13 13:56   ` Theodore Ts'o
@ 2016-05-16  9:35     ` Jan Kara
  2016-05-16 14:26       ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2016-05-16  9:35 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, linux-ext4, linux-nvdimm, Vishal Verma, Ross Zwisler,
	Dan Williams, linux-fsdevel

On Fri 13-05-16 09:56:00, Ted Tso wrote:
> On Fri, May 13, 2016 at 01:24:19AM -0400, Theodore Ts'o wrote:
> > Queued and I'm running tests before I push them out.  The patches
> > passed the smoke tests[1], and the full tests should be running for the
> > next seven hours or so at:   http://104.197.222.91
> 
> So the good news is that the full tests ("gce-xfstests full") look
> really good[1]) Unfortunately the DAX tests ("gce-xfstests -c dax -g
> auto") show a regression.  Is this expected?  (i.e., were you
> expecting test regressions that will be fixed by the DAX patch
> series?)

No, I was not expecting to get new xfstests failures from just ext4
patches. That being said I didn't test ext4 patches on their own, just in
combination with other DAX changes so there may be some interaction I
forgot about. I'll see whether I can reproduce the failures and understand
what's going on. Thanks for having a look!

								Honza

> Before this patch series:
> 
> BEGIN TEST dax: Ext4 4k block using DAX Fri May 13 08:01:57 EDT 2016
> Failures: generic/250 generic/252 generic/299 generic/338
> 
> After the patch series:
> 
> CMDLINE: -c dax -g auto
> FSTESTVER: e2fsprogs	v1.43-WIP-2015-05-18-64-g2334bd3 (Sat, 5 Sep 2015 22:21:35 -0400)
> FSTESTVER: fio		fio-2.6-8-ge6989e1 (Thu, 4 Feb 2016 12:09:48 -0700)
> FSTESTVER: quota		67fd9cc (Mon, 4 Apr 2016 00:32:39 -0400)
> FSTESTVER: xfsprogs	v4.3.0 (Mon, 23 Nov 2015 15:24:24 +1100)
> FSTESTVER: xfstests-bld	ccae8d1 (Tue, 26 Apr 2016 00:42:18 -0400)
> FSTESTVER: xfstests	linux-v3.8-1010-g97dd2d8 (Wed, 27 Apr 2016 00:04:44 -0400)
> FSTESTVER: kernel	4.6.0-rc4-ext4-00022-gaacd0c8 #282 SMP Fri May 13 00:52:07 EDT 2016 x86_64
> FSTESTCFG: "dax"
> FSTESTSET: "-g auto"
> FSTESTEXC: ""
> FSTESTOPT: "aex"
> MNTOPTS: ""
> CPUS: "4"
> MEM: "13052.8"
> MEM: 26 GB (Max capacity)
> BEGIN TEST dax: Ext4 4k block using DAX Fri May 13 01:21:58 EDT 2016
> Failures: generic/075 generic/091 generic/112 generic/127 generic/231 generic/250 generic/252 generic/263 generic/299 generic/338 generic/340
> 
> 				- Ted
> 
> [1] CMDLINE: full
> FSTESTVER: e2fsprogs	v1.43-WIP-2015-05-18-64-g2334bd3 (Sat, 5 Sep 2015 22:21:35 -0400)
> FSTESTVER: fio		fio-2.6-8-ge6989e1 (Thu, 4 Feb 2016 12:09:48 -0700)
> FSTESTVER: quota		67fd9cc (Mon, 4 Apr 2016 00:32:39 -0400)
> FSTESTVER: xfsprogs	v4.3.0 (Mon, 23 Nov 2015 15:24:24 +1100)
> FSTESTVER: xfstests-bld	ccae8d1 (Tue, 26 Apr 2016 00:42:18 -0400)
> FSTESTVER: xfstests	linux-v3.8-1010-g97dd2d8 (Wed, 27 Apr 2016 00:04:44 -0400)
> FSTESTVER: kernel	4.6.0-rc4-ext4-00018-g816cd71 #279 SMP Thu May 5 22:45:36 EDT 2016 x86_64
> FSTESTCFG: "4k 1k ext3 encrypt nojournal ext3conv adv dioread_nolock data_journal bigalloc bigalloc_1k"
> FSTESTSET: "-g auto"
> FSTESTEXC: ""
> FSTESTOPT: "aex"
> MNTOPTS: ""
> CPUS: "2"
> MEM: "7497.07"
> MEM: 7680 MB (Max capacity)
> BEGIN TEST 4k: Ext4 4k block Thu May  5 22:52:10 EDT 2016
> Passed all 212 tests
> BEGIN TEST 1k: Ext4 1k block Thu May  5 23:38:50 EDT 2016
> Failures: generic/018
> BEGIN TEST ext3: Ext4 4k block emulating ext3 Fri May  6 00:30:11 EDT 2016
> Passed all 167 tests
> BEGIN TEST encrypt: Ext4 encryption Fri May  6 01:15:41 EDT 2016
> Failures: ext4/020
> BEGIN TEST nojournal: Ext4 4k block w/ no journal Fri May  6 01:40:25 EDT 2016
> Failures: ext4/301
> BEGIN TEST ext3conv: Ext4 4k block w/nodelalloc and no flex_bg Fri May  6 02:20:39 EDT 2016
> Passed all 212 tests
> BEGIN TEST adv: Ext4 advanced features (inline_data, metadata_csum, 64bit) Fri May  6 03:06:25 EDT 2016
> Passed all 211 tests
> BEGIN TEST dioread_nolock: Ext4 4k block w/dioread_nolock Fri May  6 03:53:34 EDT 2016
> Passed all 212 tests
> BEGIN TEST data_journal: Ext4 4k block w/data=journal Fri May  6 04:40:52 EDT 2016
> Failures: ext4/271 generic/018
> BEGIN TEST bigalloc: Ext4 4k block w/bigalloc Fri May  6 05:28:12 EDT 2016
> Failures: ext4/004 generic/204 generic/219 generic/235 generic/273
> BEGIN TEST bigalloc_1k: Ext4 1k block w/bigalloc Fri May  6 06:18:54 EDT 2016
> Failures: ext4/004 generic/204 generic/235 generic/270
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/4] ext4: DAX fixes
  2016-05-16  9:35     ` Jan Kara
@ 2016-05-16 14:26       ` Jan Kara
  2016-05-16 15:08         ` Theodore Ts'o
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2016-05-16 14:26 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, linux-ext4, linux-nvdimm, Vishal Verma, Ross Zwisler,
	Dan Williams, linux-fsdevel

On Mon 16-05-16 11:35:25, Jan Kara wrote:
> On Fri 13-05-16 09:56:00, Ted Tso wrote:
> > On Fri, May 13, 2016 at 01:24:19AM -0400, Theodore Ts'o wrote:
> > > Queued and I'm running tests before I push them out.  The patches
> > > passed the smoke tests[1], and the full tests should be running for the
> > > next seven hours or so at:   http://104.197.222.91
> > 
> > So the good news is that the full tests ("gce-xfstests full") look
> > really good[1]) Unfortunately the DAX tests ("gce-xfstests -c dax -g
> > auto") show a regression.  Is this expected?  (i.e., were you
> > expecting test regressions that will be fixed by the DAX patch
> > series?)
> 
> No, I was not expecting to get new xfstests failures from just ext4
> patches. That being said I didn't test ext4 patches on their own, just in
> combination with other DAX changes so there may be some interaction I
> forgot about. I'll see whether I can reproduce the failures and understand
> what's going on. Thanks for having a look!

So I have nailed this down. The patch "ext4: Handle transient ENOSPC
properly for DAX" changes how ext4_dax_mmap_get_block() (later renamed to
ext4_dax_get_block()) behaves for calls with create == 0 and that trips
over an issue in __dax_fault() (a bug that __dax_fault() doesn't call
get_blocks() with create == 1 for write faults into unwritten extents)
which gets fixed only in "dax: Remove dead zeroing code from fault handlers".

I see two options here:

1) Just push patches as is and have ext4 dax broken between ext4 merge and
nvdimm merge.

2) Split out the one-line change from "dax: Remove dead zeroing code from
fault handlers" in __dax_fault() which fixes the behavior for ext4 and
merge it through ext4 tree. Merge the rest through nvdimm tree.

Dan? Ted?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/4] ext4: DAX fixes
  2016-05-16 14:26       ` Jan Kara
@ 2016-05-16 15:08         ` Theodore Ts'o
  2016-05-16 15:13           ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2016-05-16 15:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-nvdimm, Vishal Verma, Ross Zwisler,
	Dan Williams, linux-fsdevel

On Mon, May 16, 2016 at 04:26:16PM +0200, Jan Kara wrote:
> 1) Just push patches as is and have ext4 dax broken between ext4 merge and
> nvdimm merge.
> 
> 2) Split out the one-line change from "dax: Remove dead zeroing code from
> fault handlers" in __dax_fault() which fixes the behavior for ext4 and
> merge it through ext4 tree. Merge the rest through nvdimm tree.

I'm good either way, although I have a slight preference for (2).
It's really tiny preference, though, so if you or Dan want to run the
fix through the dax branch, that's fine too.

I've updated the ext4 dev branch to point include your dax changes, so
they can get picked up for linux-next integration testing.

     	     	       	   	      		  - Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/4] ext4: DAX fixes
  2016-05-16 15:08         ` Theodore Ts'o
@ 2016-05-16 15:13           ` Dan Williams
  2016-05-16 15:59             ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2016-05-16 15:13 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, linux-ext4, linux-nvdimm, Vishal Verma, Ross Zwisler,
	linux-fsdevel

On Mon, May 16, 2016 at 8:08 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, May 16, 2016 at 04:26:16PM +0200, Jan Kara wrote:
>> 1) Just push patches as is and have ext4 dax broken between ext4 merge and
>> nvdimm merge.
>>
>> 2) Split out the one-line change from "dax: Remove dead zeroing code from
>> fault handlers" in __dax_fault() which fixes the behavior for ext4 and
>> merge it through ext4 tree. Merge the rest through nvdimm tree.
>
> I'm good either way, although I have a slight preference for (2).
> It's really tiny preference, though, so if you or Dan want to run the
> fix through the dax branch, that's fine too.

Would you fold the change and trigger a rebase or just apply it on
top?  If just applying on top then it seems the same exposure as
merging it intact through nvdimm.git.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/4] ext4: DAX fixes
  2016-05-16 15:13           ` Dan Williams
@ 2016-05-16 15:59             ` Jan Kara
  2016-05-16 18:29               ` Dan Williams
  2016-05-16 20:47               ` Theodore Ts'o
  0 siblings, 2 replies; 19+ messages in thread
From: Jan Kara @ 2016-05-16 15:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Theodore Ts'o, Jan Kara, linux-ext4, linux-nvdimm,
	Vishal Verma, Ross Zwisler, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1298 bytes --]

On Mon 16-05-16 08:13:50, Dan Williams wrote:
> On Mon, May 16, 2016 at 8:08 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> > On Mon, May 16, 2016 at 04:26:16PM +0200, Jan Kara wrote:
> >> 1) Just push patches as is and have ext4 dax broken between ext4 merge and
> >> nvdimm merge.
> >>
> >> 2) Split out the one-line change from "dax: Remove dead zeroing code from
> >> fault handlers" in __dax_fault() which fixes the behavior for ext4 and
> >> merge it through ext4 tree. Merge the rest through nvdimm tree.
> >
> > I'm good either way, although I have a slight preference for (2).
> > It's really tiny preference, though, so if you or Dan want to run the
> > fix through the dax branch, that's fine too.
> 
> Would you fold the change and trigger a rebase or just apply it on
> top?  If just applying on top then it seems the same exposure as
> merging it intact through nvdimm.git.

The patch which fixes ext4 behavior is attached. Just that we know what we
are speaking about... Rebasing all the patches on top of this is trivial
(git rebase just handles the conflict automatically).

I've scheduled full ext4 & XFS xfstest run with just this patch and ext4
fixes to make sure it doesn't introduce some intermediate regresion
somewhere.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-dax-Call-get_blocks-with-create-1-for-write-faults-t.patch --]
[-- Type: text/x-patch, Size: 1547 bytes --]

>From c8963499eeba112430fc4499c675cca243b8b4a6 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 16 May 2016 17:42:11 +0200
Subject: [PATCH] dax: Call get_blocks() with create == 1 for write faults to
 unwritten extents

Currently, __dax_fault() does not call get_blocks() callback with create
argument set, when we got back unwritten extent from the initial
get_blocks() call during a write fault. This is because originally
filesystems were supposed to convert unwritten extents to written ones
using complete_unwritten() callback. Later this was abandoned in favor of
using pre-zeroed blocks however the condition whether get_blocks() needs
to be called with create == 1 remained.

Fix the condition so that filesystems are not forced to zero-out and
convert unwritten extents when get_blocks() is called with create == 0
(which introduces unnecessary overhead for read faults and can be
problematic as the filesystem may possibly be read-only).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 75ba46d82a76..2494255c5785 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -667,7 +667,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (error)
 		goto unlock_page;
 
-	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
+	if (!buffer_mapped(&bh) && !vmf->cow_page) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			error = get_block(inode, block, &bh, 1);
 			count_vm_event(PGMAJFAULT);
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/4] ext4: DAX fixes
  2016-05-16 15:59             ` Jan Kara
@ 2016-05-16 18:29               ` Dan Williams
  2016-05-16 20:47               ` Theodore Ts'o
  1 sibling, 0 replies; 19+ messages in thread
From: Dan Williams @ 2016-05-16 18:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, linux-ext4, linux-nvdimm, Vishal Verma,
	Ross Zwisler, linux-fsdevel

On Mon, May 16, 2016 at 8:59 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 16-05-16 08:13:50, Dan Williams wrote:
>> On Mon, May 16, 2016 at 8:08 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>> > On Mon, May 16, 2016 at 04:26:16PM +0200, Jan Kara wrote:
>> >> 1) Just push patches as is and have ext4 dax broken between ext4 merge and
>> >> nvdimm merge.
>> >>
>> >> 2) Split out the one-line change from "dax: Remove dead zeroing code from
>> >> fault handlers" in __dax_fault() which fixes the behavior for ext4 and
>> >> merge it through ext4 tree. Merge the rest through nvdimm tree.
>> >
>> > I'm good either way, although I have a slight preference for (2).
>> > It's really tiny preference, though, so if you or Dan want to run the
>> > fix through the dax branch, that's fine too.
>>
>> Would you fold the change and trigger a rebase or just apply it on
>> top?  If just applying on top then it seems the same exposure as
>> merging it intact through nvdimm.git.
>
> The patch which fixes ext4 behavior is attached. Just that we know what we
> are speaking about... Rebasing all the patches on top of this is trivial
> (git rebase just handles the conflict automatically).
>
> I've scheduled full ext4 & XFS xfstest run with just this patch and ext4
> fixes to make sure it doesn't introduce some intermediate regresion
> somewhere.
>

Ok, sounds good.  It makes the most sense to base the nvdimm.git
branch on top of this fix.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/4] ext4: DAX fixes
  2016-05-16 15:59             ` Jan Kara
  2016-05-16 18:29               ` Dan Williams
@ 2016-05-16 20:47               ` Theodore Ts'o
  1 sibling, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2016-05-16 20:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, linux-ext4, linux-nvdimm, Vishal Verma,
	Ross Zwisler, linux-fsdevel

On Mon, May 16, 2016 at 05:59:50PM +0200, Jan Kara wrote:
> The patch which fixes ext4 behavior is attached. Just that we know what we
> are speaking about... Rebasing all the patches on top of this is trivial
> (git rebase just handles the conflict automatically).

OK, I've added this one-liner to the ext4 tree.  I presume that if git
rebase handles the conflict automatically, git merge should be able to
handle automatically it as well.

					- Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-05-16 20:47 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-11  9:39 [PATCH 0/4] ext4: DAX fixes Jan Kara
2016-05-11  9:39 ` Jan Kara
2016-05-11  9:39 ` [PATCH 1/4] ext4: Handle transient ENOSPC properly for DAX Jan Kara
2016-05-11  9:39   ` Jan Kara
2016-05-11  9:39 ` [PATCH 2/4] ext4: Fix race in transient ENOSPC detection Jan Kara
2016-05-11  9:39   ` Jan Kara
2016-05-11  9:39 ` [PATCH 3/4] ext4: Refactor direct IO code Jan Kara
2016-05-11  9:39   ` Jan Kara
2016-05-11  9:39 ` [PATCH 4/4] ext4: Pre-zero allocated blocks for DAX IO Jan Kara
2016-05-11  9:39   ` Jan Kara
2016-05-13  5:24 ` [PATCH 0/4] ext4: DAX fixes Theodore Ts'o
2016-05-13 13:56   ` Theodore Ts'o
2016-05-16  9:35     ` Jan Kara
2016-05-16 14:26       ` Jan Kara
2016-05-16 15:08         ` Theodore Ts'o
2016-05-16 15:13           ` Dan Williams
2016-05-16 15:59             ` Jan Kara
2016-05-16 18:29               ` Dan Williams
2016-05-16 20:47               ` Theodore Ts'o

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.