All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] i_size/i_disksize update fixes and cleanup series
@ 2014-08-22 11:32 Dmitry Monakhov
  2014-08-22 11:32 ` [PATCH 1/5] use ext4_update_i_disksize instead of opencoded ones Dmitry Monakhov
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Dmitry Monakhov @ 2014-08-22 11:32 UTC (permalink / raw)
  To: linux-ext4; +Cc: lczerner, Dmitry Monakhov

#First two patches are simply cleanups
use ext4_update_i_disksize instead of opencoded ones
move i_size, i_diskzie update routines to helper function
# Simple fix found during code review
ext4_zero_range fix incorect journal credits reservation
# Fixes i_size,i_disksize,mtime,ctime update issues
# primary test: xfstest generic/019
fix transaction issues for fallocate and zero_range
update sync i_disksize coherently with block allocation

In fact last two issues can be fixed by inserting inode to orphan list
during file grow, but this turns to be a problem for ext4_writepages because
orhpan list is protected by i_mutex.

BTW: May be it is reasonable serialize orhpan list manipulations by i_data_sem?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/5] use ext4_update_i_disksize instead of opencoded ones
  2014-08-22 11:32 [PATCH 0/5] i_size/i_disksize update fixes and cleanup series Dmitry Monakhov
@ 2014-08-22 11:32 ` Dmitry Monakhov
  2014-08-31  3:17   ` Theodore Ts'o
  2014-08-22 11:32 ` [PATCH 2/5] move i_size,i_diskzie update routines to helper function Dmitry Monakhov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Dmitry Monakhov @ 2014-08-22 11:32 UTC (permalink / raw)
  To: linux-ext4; +Cc: lczerner, Dmitry Monakhov


Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/inode.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 367a60c..b3b755b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2670,10 +2670,7 @@ static int ext4_da_write_end(struct file *file,
 	if (copied && new_i_size > EXT4_I(inode)->i_disksize) {
 		if (ext4_has_inline_data(inode) ||
 		    ext4_da_should_update_i_disksize(page, end)) {
-			down_write(&EXT4_I(inode)->i_data_sem);
-			if (new_i_size > EXT4_I(inode)->i_disksize)
-				EXT4_I(inode)->i_disksize = new_i_size;
-			up_write(&EXT4_I(inode)->i_data_sem);
+			ext4_update_i_disksize(inode, new_i_size);
 			/* We need to mark inode dirty even if
 			 * new_i_size is less that inode->i_size
 			 * bu greater than i_disksize.(hint delalloc)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/5] move i_size,i_diskzie update routines to helper function
  2014-08-22 11:32 [PATCH 0/5] i_size/i_disksize update fixes and cleanup series Dmitry Monakhov
  2014-08-22 11:32 ` [PATCH 1/5] use ext4_update_i_disksize instead of opencoded ones Dmitry Monakhov
@ 2014-08-22 11:32 ` Dmitry Monakhov
  2014-08-26  1:30   ` Theodore Ts'o
  2014-08-22 11:32 ` [PATCH 3/5] ext4_zero_range: fix incorect journal credits reservation Dmitry Monakhov
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Dmitry Monakhov @ 2014-08-22 11:32 UTC (permalink / raw)
  To: linux-ext4; +Cc: lczerner, Dmitry Monakhov


Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/ext4.h    |   16 ++++++++++++++++
 fs/ext4/extents.c |   17 ++++-------------
 fs/ext4/inode.c   |   34 ++++++++--------------------------
 3 files changed, 28 insertions(+), 39 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 847f7a4..a6b4f1e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2454,6 +2454,22 @@ static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
 	up_write(&EXT4_I(inode)->i_data_sem);
 }
 
+/* Update i_size, i_disksize. Requires i_mutex to avoid races with truncate */
+static inline int ext4_update_inode_size(struct inode *inode, loff_t newsize)
+{
+	int changed = 0;
+
+	if (newsize > inode->i_size) {
+		i_size_write(inode, newsize);
+		changed = 1;
+	}
+	if (newsize > EXT4_I(inode)->i_disksize) {
+		ext4_update_i_disksize(inode, newsize);
+		changed |= 2;
+	}
+	return changed;
+}
+
 struct ext4_group_info {
 	unsigned long   bb_state;
 	struct rb_root  bb_free_root;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index ee7489f..5bf718b 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4848,12 +4848,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	}
 
 	inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
-
 	if (new_size) {
-		if (new_size > i_size_read(inode))
-			i_size_write(inode, new_size);
-		if (new_size > EXT4_I(inode)->i_disksize)
-			ext4_update_i_disksize(inode, new_size);
+		ext4_update_inode_size(inode, new_size);
 	} else {
 		/*
 		* Mark that we allocate beyond EOF so the subsequent truncate
@@ -4895,7 +4891,6 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	int ret = 0;
 	int flags;
 	ext4_lblk_t lblk;
-	struct timespec tv;
 	unsigned int blkbits = inode->i_blkbits;
 
 	/* Return error if mode is not supported */
@@ -4954,15 +4949,11 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (IS_ERR(handle))
 		goto out;
 
-	tv = inode->i_ctime = ext4_current_time(inode);
+	inode->i_ctime = ext4_current_time(inode);
 
 	if (new_size) {
-		if (new_size > i_size_read(inode)) {
-			i_size_write(inode, new_size);
-			inode->i_mtime = tv;
-		}
-		if (new_size > EXT4_I(inode)->i_disksize)
-			ext4_update_i_disksize(inode, new_size);
+		if (ext4_update_inode_size(inode, new_size) & 0x1)
+			inode->i_mtime = inode->i_ctime;
 	} else {
 		/*
 		* Mark that we allocate beyond EOF so the subsequent truncate
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b3b755b..3919e25 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1055,27 +1055,11 @@ static int ext4_write_end(struct file *file,
 	} else
 		copied = block_write_end(file, mapping, pos,
 					 len, copied, page, fsdata);
-
 	/*
-	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hole i_mutex.
-	 *
-	 * But it's important to update i_size while still holding page lock:
+	 * it's important to update i_size while still holding page lock:
 	 * page writeout could otherwise come in and zero beyond i_size.
 	 */
-	if (pos + copied > inode->i_size) {
-		i_size_write(inode, pos + copied);
-		i_size_changed = 1;
-	}
-
-	if (pos + copied > EXT4_I(inode)->i_disksize) {
-		/* We need to mark inode dirty even if
-		 * new_i_size is less that inode->i_size
-		 * but greater than i_disksize. (hint delalloc)
-		 */
-		ext4_update_i_disksize(inode, (pos + copied));
-		i_size_changed = 1;
-	}
+	i_size_changed = ext4_update_inode_size(inode, pos + copied);
 	unlock_page(page);
 	page_cache_release(page);
 
@@ -1123,7 +1107,7 @@ static int ext4_journalled_write_end(struct file *file,
 	int ret = 0, ret2;
 	int partial = 0;
 	unsigned from, to;
-	loff_t new_i_size;
+	int size_changed = 0;
 
 	trace_ext4_journalled_write_end(inode, pos, len, copied);
 	from = pos & (PAGE_CACHE_SIZE - 1);
@@ -1146,20 +1130,18 @@ static int ext4_journalled_write_end(struct file *file,
 		if (!partial)
 			SetPageUptodate(page);
 	}
-	new_i_size = pos + copied;
-	if (new_i_size > inode->i_size)
-		i_size_write(inode, pos+copied);
+	size_changed = ext4_update_inode_size(inode, pos + copied);
 	ext4_set_inode_state(inode, EXT4_STATE_JDATA);
 	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
-	if (new_i_size > EXT4_I(inode)->i_disksize) {
-		ext4_update_i_disksize(inode, new_i_size);
+	unlock_page(page);
+	page_cache_release(page);
+
+	if (size_changed) {
 		ret2 = ext4_mark_inode_dirty(handle, inode);
 		if (!ret)
 			ret = ret2;
 	}
 
-	unlock_page(page);
-	page_cache_release(page);
 	if (pos + len > inode->i_size && ext4_can_truncate(inode))
 		/* if we have allocated more blocks and copied
 		 * less. We will have blocks allocated outside
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/5] ext4_zero_range: fix incorect journal credits reservation
  2014-08-22 11:32 [PATCH 0/5] i_size/i_disksize update fixes and cleanup series Dmitry Monakhov
  2014-08-22 11:32 ` [PATCH 1/5] use ext4_update_i_disksize instead of opencoded ones Dmitry Monakhov
  2014-08-22 11:32 ` [PATCH 2/5] move i_size,i_diskzie update routines to helper function Dmitry Monakhov
@ 2014-08-22 11:32 ` Dmitry Monakhov
  2014-08-23 19:08   ` Theodore Ts'o
  2014-08-22 11:32 ` [PATCH 4/5] fix transaction issues for ext4_fallocate and ext_zero_range Dmitry Monakhov
  2014-08-22 11:32 ` [PATCH 5/5] update i_disksize coherently with block allocation Dmitry Monakhov
  4 siblings, 1 reply; 19+ messages in thread
From: Dmitry Monakhov @ 2014-08-22 11:32 UTC (permalink / raw)
  To: linux-ext4; +Cc: lczerner, Dmitry Monakhov

Currently we reserve only 4 blocks but in worst case scenario
ext4_zero_partial_blocks() may want to zeroout and convert two
non adjacent blocks.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/extents.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 5bf718b..39fbcbb 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4740,6 +4740,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	loff_t new_size = 0;
 	int ret = 0;
 	int flags;
+	int credits;
 	int partial;
 	loff_t start, end;
 	ext4_lblk_t lblk;
@@ -4839,8 +4840,10 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 		if (ret)
 			goto out_dio;
 	}

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/5] fix transaction issues for ext4_fallocate and ext_zero_range
  2014-08-22 11:32 [PATCH 0/5] i_size/i_disksize update fixes and cleanup series Dmitry Monakhov
                   ` (2 preceding siblings ...)
  2014-08-22 11:32 ` [PATCH 3/5] ext4_zero_range: fix incorect journal credits reservation Dmitry Monakhov
@ 2014-08-22 11:32 ` Dmitry Monakhov
  2014-08-26  1:30   ` Theodore Ts'o
  2014-08-22 11:32 ` [PATCH 5/5] update i_disksize coherently with block allocation Dmitry Monakhov
  4 siblings, 1 reply; 19+ messages in thread
From: Dmitry Monakhov @ 2014-08-22 11:32 UTC (permalink / raw)
  To: linux-ext4; +Cc: lczerner, Dmitry Monakhov

After commit f282ac19d86f we use different transactions for preallocation
and i_disksize update which result in complain from fsck after power-failure.
spotted by generic/019. IMHO this is regression because fs becomes inconsistent,
even more 'e2fsck -p' will no longer works (which drives admins go crazy)
Same transaction requirement applies ctime,mtime updates

testcase: xfstest generic/019

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/extents.c |   68 +++++++++++++++++++++++++++-------------------------
 1 files changed, 35 insertions(+), 33 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 39fbcbb..3fae328 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4674,7 +4674,8 @@ retry:
 }
 
 static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
-				  ext4_lblk_t len, int flags, int mode)
+				  ext4_lblk_t len, loff_t new_size,
+				  int flags, int mode)
 {
 	struct inode *inode = file_inode(file);
 	handle_t *handle;
@@ -4683,8 +4684,10 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	int retries = 0;
 	struct ext4_map_blocks map;
 	unsigned int credits;
+	loff_t epos;
 
 	map.m_lblk = offset;
+	map.m_len = len;
 	/*
 	 * Don't normalize the request if it can fit in one extent so
 	 * that it doesn't get unnecessarily split into multiple
@@ -4699,9 +4702,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	credits = ext4_chunk_trans_blocks(inode, len);
 
 retry:
-	while (ret >= 0 && ret < len) {
-		map.m_lblk = map.m_lblk + ret;
-		map.m_len = len = len - ret;
+	while (ret >= 0 && len) {
 		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
 					    credits);
 		if (IS_ERR(handle)) {
@@ -4718,6 +4719,21 @@ retry:
 			ret2 = ext4_journal_stop(handle);
 			break;
 		}
+		map.m_lblk += ret;
+		map.m_len = len = len - ret;
+		epos = (loff_t)map.m_lblk << inode->i_blkbits;
+		inode->i_ctime = ext4_current_time(inode);
+		if (new_size) {
+			if (epos > new_size)
+				epos = new_size;
+			if (ext4_update_inode_size(inode, epos) & 0x1)
+				inode->i_mtime = inode->i_ctime;
+		} else {
+			if (epos > inode->i_size)
+				ext4_set_inode_flag(inode,
+						    EXT4_INODE_EOFBLOCKS);
+		}
+		ext4_mark_inode_dirty(handle, inode);
 		ret2 = ext4_journal_stop(handle);
 		if (ret2)
 			break;
@@ -4741,7 +4757,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	int ret = 0;
 	int flags;
 	int credits;
-	int partial;
+	int partial_begin, partial_end;
 	loff_t start, end;
 	ext4_lblk_t lblk;
 	struct address_space *mapping = inode->i_mapping;
@@ -4781,7 +4797,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 
 	if (start < offset || end > offset + len)
 		return -EINVAL;
-	partial = (offset + len) & ((1 << blkbits) - 1);
+	partial_begin = offset & ((1 << blkbits) - 1);
+	partial_end = (offset + len) & ((1 << blkbits) - 1);
 
 	lblk = start >> blkbits;
 	max_blocks = (end >> blkbits);
@@ -4815,7 +4832,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 		 * If we have a partial block after EOF we have to allocate
 		 * the entire block.
 		 */
-		if (partial)
+		if (partial_end)
 			max_blocks += 1;
 	}
 
@@ -4823,6 +4840,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 
 		/* Now release the pages and zero block aligned part of pages*/
 		truncate_pagecache_range(inode, start, end - 1);
+		inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
 
 		/* Wait all existing dio workers, newcomers will block on i_mutex */
 		ext4_inode_block_unlocked_dio(inode);
@@ -4835,11 +4853,14 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 		if (ret)
 			goto out_dio;
 
-		ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags,
-					     mode);
+		ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
+					     flags, mode);
 		if (ret)
 			goto out_dio;
 	}
+	if (!partial_begin && !partial_end)
+		goto out_dio;
+
 	/* In worst case we have to writeout two nonadjacent unwritten blocks */
 	credits = ext4_chunk_trans_blocks(inode, 1) * 2 -
 		EXT4_META_TRANS_BLOCKS(inode->i_sb);
@@ -4861,7 +4882,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 		if ((offset + len) > i_size_read(inode))
 			ext4_set_inode_flag(inode, EXT4_INODE_EOFBLOCKS);
 	}
-
 	ext4_mark_inode_dirty(handle, inode);
 
 	/* Zero out partial block at the edges of the range */
@@ -4888,7 +4908,6 @@ out_mutex:
 long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 {
 	struct inode *inode = file_inode(file);
-	handle_t *handle;
 	loff_t new_size = 0;
 	unsigned int max_blocks;
 	int ret = 0;
@@ -4944,32 +4963,15 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 			goto out;
 	}
 
-	ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags, mode);
+	ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
+				     flags, mode);
 	if (ret)
 		goto out;
 
-	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-	if (IS_ERR(handle))
-		goto out;
-
-	inode->i_ctime = ext4_current_time(inode);
-
-	if (new_size) {
-		if (ext4_update_inode_size(inode, new_size) & 0x1)
-			inode->i_mtime = inode->i_ctime;
-	} else {
-		/*
-		* Mark that we allocate beyond EOF so the subsequent truncate
-		* can proceed even if the new size is the same as i_size.
-		*/
-		if ((offset + len) > i_size_read(inode))
-			ext4_set_inode_flag(inode, EXT4_INODE_EOFBLOCKS);
+	if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) {
+		ret = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
+						EXT4_I(inode)->i_sync_tid);
 	}
-	ext4_mark_inode_dirty(handle, inode);
-	if (file->f_flags & O_SYNC)
-		ext4_handle_sync(handle);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 5/5] update i_disksize coherently with block allocation
  2014-08-22 11:32 [PATCH 0/5] i_size/i_disksize update fixes and cleanup series Dmitry Monakhov
                   ` (3 preceding siblings ...)
  2014-08-22 11:32 ` [PATCH 4/5] fix transaction issues for ext4_fallocate and ext_zero_range Dmitry Monakhov
@ 2014-08-22 11:32 ` Dmitry Monakhov
  2014-08-23 22:00   ` Theodore Ts'o
  2014-08-26  1:13   ` Theodore Ts'o
  4 siblings, 2 replies; 19+ messages in thread
From: Dmitry Monakhov @ 2014-08-22 11:32 UTC (permalink / raw)
  To: linux-ext4; +Cc: lczerner, Dmitry Monakhov

Writeback call trace looks like follows:
ext4_writepages
 while(nr_pages)
 ->journal_start
 ->mpage_map_and_submit_extent -> may alloc some blocks
   ->mpage_map_one_extent
 ->journal_stop
In case of delalloc block i_disksize may be less than i_size. So we have to
update i_disksize each time we allocated and submitted some blocks beyond
i_disksize. And we MUST update it in the same transaction, otherwise this
result in fs-inconsistency in case of upcoming power-failure.

Another possible way to fix that issue is to insert inode to orhphan list
on ext4_writepages entrance.

testcase: xfstest generic/019

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/inode.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3919e25..b1d92fb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2077,6 +2077,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 	struct ext4_map_blocks *map = &mpd->map;
 	int err;
 	loff_t disksize;
+	int progress = 0;
 
 	mpd->io_submit.io_end->offset =
 				((loff_t)map->m_lblk) << inode->i_blkbits;
@@ -2093,8 +2094,11 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 			 * is non-zero, a commit should free up blocks.
 			 */
 			if ((err == -ENOMEM) ||
-			    (err == -ENOSPC && ext4_count_free_clusters(sb)))
+			    (err == -ENOSPC && ext4_count_free_clusters(sb))) {
+				if (progress)
+					goto update_disksize;
 				return err;
+			}
 			ext4_msg(sb, KERN_CRIT,
 				 "Delayed block allocation failed for "
 				 "inode %lu at logical offset %llu with"
@@ -2111,15 +2115,17 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 			*give_up_on_write = true;
 			return err;
 		}
+		progress = 1;
 		/*
 		 * Update buffer state, submit mapped pages, and get us new
 		 * extent to map
 		 */
 		err = mpage_map_and_submit_buffers(mpd);
 		if (err < 0)
-			return err;
+			goto update_disksize;
 	} while (map->m_len);
 
+update_disksize:
 	/*
 	 * Update on-disk size after IO is submitted.  Races with
 	 * truncate are avoided by checking i_size under i_data_sem.
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/5] ext4_zero_range: fix incorect journal credits reservation
  2014-08-22 11:32 ` [PATCH 3/5] ext4_zero_range: fix incorect journal credits reservation Dmitry Monakhov
@ 2014-08-23 19:08   ` Theodore Ts'o
  2014-08-25  7:19     ` Dmitry Monakhov
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-08-23 19:08 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, lczerner

On Fri, Aug 22, 2014 at 03:32:25PM +0400, Dmitry Monakhov wrote:
> -	handle = ext4_journal_start(inode, EXT4_HT_MISC, 4);
> +	/* In worst case we have to writeout two nonadjacent unwritten blocks */
> +	credits = ext4_chunk_trans_blocks(inode, 1) * 2 -
> +		EXT4_META_TRANS_BLOCKS(inode->i_sb);

This looks like it would be a massive over-estimate, since it includes
the block group allocation bitmaps, which we wouldn't need to update,
no?

Wouldn't

	credts = ext4_index_trans_blocks(inode, 1, 1) * 2;

be sufficient?

					- Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] update i_disksize coherently with block allocation
  2014-08-22 11:32 ` [PATCH 5/5] update i_disksize coherently with block allocation Dmitry Monakhov
@ 2014-08-23 22:00   ` Theodore Ts'o
  2014-08-25  7:59     ` Dmitry Monakhov
  2014-08-26  1:13   ` Theodore Ts'o
  1 sibling, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-08-23 22:00 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, lczerner

On Fri, Aug 22, 2014 at 03:32:27PM +0400, Dmitry Monakhov wrote:
> Writeback call trace looks like follows:
> ext4_writepages
>  while(nr_pages)
>  ->journal_start
>  ->mpage_map_and_submit_extent -> may alloc some blocks
>    ->mpage_map_one_extent
>  ->journal_stop
> In case of delalloc block i_disksize may be less than i_size. So we have to
> update i_disksize each time we allocated and submitted some blocks beyond
> i_disksize. And we MUST update it in the same transaction, otherwise this
> result in fs-inconsistency in case of upcoming power-failure.
> 
> Another possible way to fix that issue is to insert inode to orhphan list
> on ext4_writepages entrance.
> 
> testcase: xfstest generic/019
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>

Hi Dmitry, were you seeing generic/019 fail before this patch series?
I've been trying to build a kernel with CONFIG_FAIL_MAKE_REQUEST and I
haven't been able to get generic/019 to fail on me.  Is there
something else we need in order to reliably trigger the test fail?

Thanks,

					- Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/5] ext4_zero_range: fix incorect journal credits reservation
  2014-08-23 19:08   ` Theodore Ts'o
@ 2014-08-25  7:19     ` Dmitry Monakhov
  2014-08-26  1:30       ` Theodore Ts'o
  2014-08-27 21:31       ` Theodore Ts'o
  0 siblings, 2 replies; 19+ messages in thread
From: Dmitry Monakhov @ 2014-08-25  7:19 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, lczerner

[-- Attachment #1: Type: text/plain, Size: 961 bytes --]

On Sat, 23 Aug 2014 15:08:35 -0400, "Theodore Ts'o" <tytso@mit.edu> wrote:
> On Fri, Aug 22, 2014 at 03:32:25PM +0400, Dmitry Monakhov wrote:
> > -	handle = ext4_journal_start(inode, EXT4_HT_MISC, 4);
> > +	/* In worst case we have to writeout two nonadjacent unwritten blocks */
> > +	credits = ext4_chunk_trans_blocks(inode, 1) * 2 -
> > +		EXT4_META_TRANS_BLOCKS(inode->i_sb);
> 
> This looks like it would be a massive over-estimate, since it includes
> the block group allocation bitmaps, which we wouldn't need to update,
> no?
  Oh. You right. groups metadata/quota should not be affected.
> 
> Wouldn't
> 
> 	credts = ext4_index_trans_blocks(inode, 1, 1) * 2;
Yes, but ext4_index_trans_blocks is statically defined in inode.c,
but since zero_range works only for extent based
ext4_ext_index_trans_blocks is sufficient. Also we must reserve 2 block
for data. So final calculation looks like follows:
credits =  ext4_ext_index_trans_blocks(inode, 2) + 2


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-ext4_zero_range-fix-incorect-journal-credits-reserva.patch --]
[-- Type: text/x-patch, Size: 975 bytes --]

>From ec7e1926e8bb99ccd0eb525efc6666cb8ea36a5e Mon Sep 17 00:00:00 2001
From: Dmitry Monakhov <dmonakhov@openvz.org>
Date: Mon, 25 Aug 2014 11:12:35 +0400
Subject: [PATCH] ext4_zero_range: fix incorect journal credits reservation v2

Currently we reserve only 4 blocks but in worst case scenario
ext4_zero_partial_blocks() may want to zeroout and convert two
non adjacent blocks.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/extents.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 5bf718b..f09dbe7 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4740,6 +4740,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	loff_t new_size = 0;
 	int ret = 0;
 	int flags;
+	int credits;
 	int partial;
 	loff_t start, end;
 	ext4_lblk_t lblk;
@@ -4839,8 +4840,9 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 		if (ret)
 			goto out_dio;
 	}

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] update i_disksize coherently with block allocation
  2014-08-23 22:00   ` Theodore Ts'o
@ 2014-08-25  7:59     ` Dmitry Monakhov
  2014-09-14 12:38       ` Dmitry Monakhov
  0 siblings, 1 reply; 19+ messages in thread
From: Dmitry Monakhov @ 2014-08-25  7:59 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, lczerner

On Sat, 23 Aug 2014 18:00:29 -0400, "Theodore Ts'o" <tytso@mit.edu> wrote:
> On Fri, Aug 22, 2014 at 03:32:27PM +0400, Dmitry Monakhov wrote:
> > Writeback call trace looks like follows:
> > ext4_writepages
> >  while(nr_pages)
> >  ->journal_start
> >  ->mpage_map_and_submit_extent -> may alloc some blocks
> >    ->mpage_map_one_extent
> >  ->journal_stop
> > In case of delalloc block i_disksize may be less than i_size. So we have to
> > update i_disksize each time we allocated and submitted some blocks beyond
> > i_disksize. And we MUST update it in the same transaction, otherwise this
> > result in fs-inconsistency in case of upcoming power-failure.
> > 
> > Another possible way to fix that issue is to insert inode to orhphan list
> > on ext4_writepages entrance.
> > 
> > testcase: xfstest generic/019
> > 
> > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> 
> Hi Dmitry, were you seeing generic/019 fail before this patch series?
> I've been trying to build a kernel with CONFIG_FAIL_MAKE_REQUEST and I
> haven't been able to get generic/019 to fail on me.  Is there
> something else we need in order to reliably trigger the test fail?
As usual this kind of test are not 100% reliable, I've saw failures from
time to time. But I've assumed that it was side effect of incorrect
error detection in e2fsck introduced d3f32c2db8f11, But this week i've
rechecked e2fsck and found that condition was fixed and it is correct.
In order to speedup testing I use ram dev:
options brd rd_nr=4 rd_size=10485760 part_show=1
TEST_DEV=/dev/ram0
SCRATCH_DEV=/dev/ram1
And run several rounds for this test:
for ((i=0;i<20;i++));do ./check generic/019 || break ;done

You also can increase probability by playing with fsstress options
--- a/tests/generic/019
+++ b/tests/generic/019
@@ -135,7 +135,7 @@ FSSTRESS_AVOID="$FSSTRESS_AVOID -ffsync=0 -fsync=0
-ffdatasync=0 -f setattr=1"
 _workout()
 {
        out=$SCRATCH_MNT/fsstress.$$
-       args=`_scale_fsstress_args -p 1 -n999999999 -f setattr=0
$FSSTRESS_AVOID -d $out`
+       args=`_scale_fsstress_args -p 8 -n999999999 -f setattr=0
$FSSTRESS_AVOID -d $out`
        echo ""
        echo "Start fsstress.."
        echo ""

And finally the cherry on top of this cake I've found that this test
provoke orphan list corruption or dangling inodes after failure.
fsck 1.43-WIP (09-Jul-2014)
e2fsck 1.43-WIP (09-Jul-2014)
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 43792 has zero dtime.  Fix<y>? no
Inodes that were part of a corrupted orphan linked list found.  Fix<y>?
no
Inode 493817 was part of the orphaned inode list.  IGNORED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -148712 -148714
Fix<y>? no
Inode bitmap differences:  -43792 -493817
Fix<y>? no

/dev/ram1: ********** WARNING: Filesystem still has errors **********

/dev/ram1: 201244/655360 files (0.0% non-contiguous), 409632/10485760
blocks
[root@ts105 xfstests-dev.git2]# INO=493817
[root@ts105 xfstests-dev.git2]# debugfs /dev/ram1 -R "ex <$INO>" ; \
            debugfs /dev/ram1 -R "stat <$INO>" ; debugfs /dev/ram1 -R "ncheck $INO"
debugfs 1.43-WIP (09-Jul-2014)
Level Entries       Logical            Physical Length Flags
 0/ 0   1/  1     0 -     0   148712 -   148712      1 
debugfs 1.43-WIP (09-Jul-2014)
Inode: 493817   Type: symlink    Mode:  0777   Flags: 0x80000
Generation: 4038911591    Version: 0x00000000:00000001
User:     0   Group:     0   Size: 638
File ACL: 0    Directory ACL: 0
Links: 0   Blockcount: 2
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x53fae8e8:ea861dc0 -- Mon Aug 25 11:42:32 2014
 atime: 0x53fae8e8:ea861dc0 -- Mon Aug 25 11:42:32 2014
 mtime: 0x53fae8e8:ea861dc0 -- Mon Aug 25 11:42:32 2014
crtime: 0x53fae8e8:ea861dc0 -- Mon Aug 25 11:42:32 2014
dtime: 0x0000ab10 -- Thu Jan  1 15:09:52 1970
Size of extra inode fields: 28
EXTENTS:
(0):148712
debugfs 1.43-WIP (09-Jul-2014)
Inode   Pathname

I saw this effect with different file types (synmlink,chdev,regfile)
>From my findings we lost newly created inode during creation.
Actually code is very simple, but at this moment I can not find why and
where this happen.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] update i_disksize coherently with block allocation
  2014-08-22 11:32 ` [PATCH 5/5] update i_disksize coherently with block allocation Dmitry Monakhov
  2014-08-23 22:00   ` Theodore Ts'o
@ 2014-08-26  1:13   ` Theodore Ts'o
  2014-08-26  7:47     ` Dmitry Monakhov
  1 sibling, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-08-26  1:13 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, lczerner

On Fri, Aug 22, 2014 at 03:32:27PM +0400, Dmitry Monakhov wrote:
> Writeback call trace looks like follows:
> ext4_writepages
>  while(nr_pages)
>  ->journal_start
>  ->mpage_map_and_submit_extent -> may alloc some blocks
>    ->mpage_map_one_extent
>  ->journal_stop
> In case of delalloc block i_disksize may be less than i_size. So we have to
> update i_disksize each time we allocated and submitted some blocks beyond
> i_disksize. And we MUST update it in the same transaction, otherwise this
> result in fs-inconsistency in case of upcoming power-failure.

This description doesn't seem to be consistent with the change in the
patch; the patch included makes sure that on an error, we update the
on-disk block size if some blocks had been previously allocated.
Which looks OK, but the commit description seems to imply tha the
patch does more than this.  Did part of the patch get dropped, or
should we rewrite the commit description to be more clear what it is
changing?

Thanks,

					- Ted


> 
> Another possible way to fix that issue is to insert inode to orhphan list
> on ext4_writepages entrance.
> 
> testcase: xfstest generic/019
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  fs/ext4/inode.c |   10 ++++++++--
>  1 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3919e25..b1d92fb 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2077,6 +2077,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  	struct ext4_map_blocks *map = &mpd->map;
>  	int err;
>  	loff_t disksize;
> +	int progress = 0;
>  
>  	mpd->io_submit.io_end->offset =
>  				((loff_t)map->m_lblk) << inode->i_blkbits;
> @@ -2093,8 +2094,11 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  			 * is non-zero, a commit should free up blocks.
>  			 */
>  			if ((err == -ENOMEM) ||
> -			    (err == -ENOSPC && ext4_count_free_clusters(sb)))
> +			    (err == -ENOSPC && ext4_count_free_clusters(sb))) {
> +				if (progress)
> +					goto update_disksize;
>  				return err;
> +			}
>  			ext4_msg(sb, KERN_CRIT,
>  				 "Delayed block allocation failed for "
>  				 "inode %lu at logical offset %llu with"
> @@ -2111,15 +2115,17 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  			*give_up_on_write = true;
>  			return err;
>  		}
> +		progress = 1;
>  		/*
>  		 * Update buffer state, submit mapped pages, and get us new
>  		 * extent to map
>  		 */
>  		err = mpage_map_and_submit_buffers(mpd);
>  		if (err < 0)
> -			return err;
> +			goto update_disksize;
>  	} while (map->m_len);
>  
> +update_disksize:
>  	/*
>  	 * Update on-disk size after IO is submitted.  Races with
>  	 * truncate are avoided by checking i_size under i_data_sem.
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/5] move i_size,i_diskzie update routines to helper function
  2014-08-22 11:32 ` [PATCH 2/5] move i_size,i_diskzie update routines to helper function Dmitry Monakhov
@ 2014-08-26  1:30   ` Theodore Ts'o
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-08-26  1:30 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, lczerner

Applied, thanks.

					- Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/5] ext4_zero_range: fix incorect journal credits reservation
  2014-08-25  7:19     ` Dmitry Monakhov
@ 2014-08-26  1:30       ` Theodore Ts'o
  2014-08-27 21:31       ` Theodore Ts'o
  1 sibling, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-08-26  1:30 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, lczerner

> From ec7e1926e8bb99ccd0eb525efc6666cb8ea36a5e Mon Sep 17 00:00:00 2001
> From: Dmitry Monakhov <dmonakhov@openvz.org>
> Date: Mon, 25 Aug 2014 11:12:35 +0400
> Subject: [PATCH] ext4_zero_range: fix incorect journal credits reservation v2
> 
> Currently we reserve only 4 blocks but in worst case scenario
> ext4_zero_partial_blocks() may want to zeroout and convert two
> non adjacent blocks.
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>

Applied, thanks.

					- Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/5] fix transaction issues for ext4_fallocate and ext_zero_range
  2014-08-22 11:32 ` [PATCH 4/5] fix transaction issues for ext4_fallocate and ext_zero_range Dmitry Monakhov
@ 2014-08-26  1:30   ` Theodore Ts'o
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-08-26  1:30 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, lczerner

On Fri, Aug 22, 2014 at 03:32:26PM +0400, Dmitry Monakhov wrote:
> After commit f282ac19d86f we use different transactions for preallocation
> and i_disksize update which result in complain from fsck after power-failure.
> spotted by generic/019. IMHO this is regression because fs becomes inconsistent,
> even more 'e2fsck -p' will no longer works (which drives admins go crazy)
> Same transaction requirement applies ctime,mtime updates
> 
> testcase: xfstest generic/019
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>

Applied, thanks.

					- Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] update i_disksize coherently with block allocation
  2014-08-26  1:13   ` Theodore Ts'o
@ 2014-08-26  7:47     ` Dmitry Monakhov
  2014-08-26 12:26       ` Theodore Ts'o
  0 siblings, 1 reply; 19+ messages in thread
From: Dmitry Monakhov @ 2014-08-26  7:47 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, lczerner

On Mon, 25 Aug 2014 21:13:10 -0400, Theodore Ts'o <tytso@mit.edu> wrote:
> On Fri, Aug 22, 2014 at 03:32:27PM +0400, Dmitry Monakhov wrote:
> > Writeback call trace looks like follows:
> > ext4_writepages
> >  while(nr_pages)
> >  ->journal_start
> >  ->mpage_map_and_submit_extent -> may alloc some blocks
> >    ->mpage_map_one_extent
> >  ->journal_stop
> > In case of delalloc block i_disksize may be less than i_size. So we have to
> > update i_disksize each time we allocated and submitted some blocks beyond
> > i_disksize. And we MUST update it in the same transaction, otherwise this
> > result in fs-inconsistency in case of upcoming power-failure.
> 
> This description doesn't seem to be consistent with the change in the
> patch; the patch included makes sure that on an error, we update the
> on-disk block size if some blocks had been previously allocated.
> Which looks OK, but the commit description seems to imply tha the
> patch does more than this.

>  Did part of the patch get dropped, or
> should we rewrite the commit description to be more clear what it is
> changing?
No. patch was written like this from very beginning. 
By logic is follows. We must update i_disksize in the same transaction
Before the patch i_disksize was updated only if requested range was fully completed.
But we also have to update it in case of error. And patch fix what.
> > Another possible way to fix that issue is to insert inode to orhphan list
> > on ext4_writepages entrance.
> > 
> > testcase: xfstest generic/019
> > 
> > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> > ---
> >  fs/ext4/inode.c |   10 ++++++++--
> >  1 files changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 3919e25..b1d92fb 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -2077,6 +2077,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
> >  	struct ext4_map_blocks *map = &mpd->map;
> >  	int err;
> >  	loff_t disksize;
> > +	int progress = 0;
> >  
> >  	mpd->io_submit.io_end->offset =
> >  				((loff_t)map->m_lblk) << inode->i_blkbits;
> > @@ -2093,8 +2094,11 @@ static int mpage_map_and_submit_extent(handle_t *handle,
> >  			 * is non-zero, a commit should free up blocks.
> >  			 */
> >  			if ((err == -ENOMEM) ||
> > -			    (err == -ENOSPC && ext4_count_free_clusters(sb)))
> > +			    (err == -ENOSPC && ext4_count_free_clusters(sb))) {
> > +				if (progress)
> > +					goto update_disksize;
> >  				return err;
> > +			}
> >  			ext4_msg(sb, KERN_CRIT,
> >  				 "Delayed block allocation failed for "
> >  				 "inode %lu at logical offset %llu with"
> > @@ -2111,15 +2115,17 @@ static int mpage_map_and_submit_extent(handle_t *handle,
> >  			*give_up_on_write = true;
> >  			return err;
> >  		}
> > +		progress = 1;
> >  		/*
> >  		 * Update buffer state, submit mapped pages, and get us new
> >  		 * extent to map
> >  		 */
> >  		err = mpage_map_and_submit_buffers(mpd);
> >  		if (err < 0)
> > -			return err;
> > +			goto update_disksize;
> >  	} while (map->m_len);
> >  
> > +update_disksize:
> >  	/*
> >  	 * Update on-disk size after IO is submitted.  Races with
> >  	 * truncate are avoided by checking i_size under i_data_sem.
> > -- 
> > 1.7.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] update i_disksize coherently with block allocation
  2014-08-26  7:47     ` Dmitry Monakhov
@ 2014-08-26 12:26       ` Theodore Ts'o
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-08-26 12:26 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, lczerner

On Tue, Aug 26, 2014 at 11:47:10AM +0400, Dmitry Monakhov wrote:
> No. patch was written like this from very beginning. 
> By logic is follows. We must update i_disksize in the same transaction
> Before the patch i_disksize was updated only if requested range was fully completed.
> But we also have to update it in case of error. And patch fix what.

Ok, I'll update the commit description to make it clear that what
we're fixing is on the error path.

Thanks!!

						- Ted

P.S.  I haven't forgotten about your 1/5 patch; I'm just currently
prioritizing patches that will get pushed to Linus during the current
development cycle.  I'll do a second pass for cleanup and new feature
commits (i.e., things that aren't stable kernel fodder) after I push
fixes to Linus for -rc3.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/5] ext4_zero_range: fix incorect journal credits reservation
  2014-08-25  7:19     ` Dmitry Monakhov
  2014-08-26  1:30       ` Theodore Ts'o
@ 2014-08-27 21:31       ` Theodore Ts'o
  1 sibling, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-08-27 21:31 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, lczerner

On Mon, Aug 25, 2014 at 11:19:39AM +0400, Dmitry Monakhov wrote:
> Yes, but ext4_index_trans_blocks is statically defined in inode.c,
> but since zero_range works only for extent based
> ext4_ext_index_trans_blocks is sufficient. Also we must reserve 2 block
> for data. So final calculation looks like follows:
> credits =  ext4_ext_index_trans_blocks(inode, 2) + 2

It turns out this wasn't enough.  This was causing test failures in
the data=journal case for tests generic/008, generic/091, generic/127,
and generic/263, due to not reserving enough credits.  After taking a
closer look at the code, and the requirements of ext4_zero_range, I
believe this is the correct fix.

       	       	   	   	   	   - Ted

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0d531d1..0e9de23 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4831,8 +4831,13 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 		if (ret)
 			goto out_dio;
 	}
-	/* In worst case we have to writeout two nonadjacent unwritten blocks */
-	credits = ext4_ext_index_trans_blocks(inode, 2) + 2;
+	/*
+	 * In worst case we have to writeout two nonadjacent unwritten
+	 * blocks and update the inode
+	 */
+	credits = (2 * ext4_ext_index_trans_blocks(inode, 2)) + 1;
+	if (ext4_should_journal_data(inode))
+		credits += 2;
 	handle = ext4_journal_start(inode, EXT4_HT_MISC, credits);
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/5] use ext4_update_i_disksize instead of opencoded ones
  2014-08-22 11:32 ` [PATCH 1/5] use ext4_update_i_disksize instead of opencoded ones Dmitry Monakhov
@ 2014-08-31  3:17   ` Theodore Ts'o
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-08-31  3:17 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, lczerner

On Fri, Aug 22, 2014 at 03:32:23PM +0400, Dmitry Monakhov wrote:
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>

Thanks, applied.

					- Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] update i_disksize coherently with block allocation
  2014-08-25  7:59     ` Dmitry Monakhov
@ 2014-09-14 12:38       ` Dmitry Monakhov
  0 siblings, 0 replies; 19+ messages in thread
From: Dmitry Monakhov @ 2014-09-14 12:38 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, lczerner

On Mon, 25 Aug 2014 11:59:08 +0400, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
> On Sat, 23 Aug 2014 18:00:29 -0400, "Theodore Ts'o" <tytso@mit.edu> wrote:
> > On Fri, Aug 22, 2014 at 03:32:27PM +0400, Dmitry Monakhov wrote:
> > > Writeback call trace looks like follows:
> > > ext4_writepages
> > >  while(nr_pages)
> > >  ->journal_start
> > >  ->mpage_map_and_submit_extent -> may alloc some blocks
> > >    ->mpage_map_one_extent
> > >  ->journal_stop
> > > In case of delalloc block i_disksize may be less than i_size. So we have to
> > > update i_disksize each time we allocated and submitted some blocks beyond
> > > i_disksize. And we MUST update it in the same transaction, otherwise this
> > > result in fs-inconsistency in case of upcoming power-failure.
> > > 
> > > Another possible way to fix that issue is to insert inode to orhphan list
> > > on ext4_writepages entrance.
> > > 
> > > testcase: xfstest generic/019
> > > 
> > > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> > 
> > Hi Dmitry, were you seeing generic/019 fail before this patch series?
> > I've been trying to build a kernel with CONFIG_FAIL_MAKE_REQUEST and I
> > haven't been able to get generic/019 to fail on me.  Is there
> > something else we need in order to reliably trigger the test fail?
> As usual this kind of test are not 100% reliable, I've saw failures from
> time to time. But I've assumed that it was side effect of incorrect
> error detection in e2fsck introduced d3f32c2db8f11, But this week i've
> rechecked e2fsck and found that condition was fixed and it is correct.
> In order to speedup testing I use ram dev:
> options brd rd_nr=4 rd_size=10485760 part_show=1
> TEST_DEV=/dev/ram0
> SCRATCH_DEV=/dev/ram1
> And run several rounds for this test:
> for ((i=0;i<20;i++));do ./check generic/019 || break ;done
> 
> You also can increase probability by playing with fsstress options
> --- a/tests/generic/019
> +++ b/tests/generic/019
> @@ -135,7 +135,7 @@ FSSTRESS_AVOID="$FSSTRESS_AVOID -ffsync=0 -fsync=0
> -ffdatasync=0 -f setattr=1"
>  _workout()
>  {
>         out=$SCRATCH_MNT/fsstress.$$
> -       args=`_scale_fsstress_args -p 1 -n999999999 -f setattr=0
> $FSSTRESS_AVOID -d $out`
> +       args=`_scale_fsstress_args -p 8 -n999999999 -f setattr=0
> $FSSTRESS_AVOID -d $out`
>         echo ""
>         echo "Start fsstress.."
>         echo ""
> 
> And finally the cherry on top of this cake I've found that this test
> provoke orphan list corruption or dangling inodes after failure.
> fsck 1.43-WIP (09-Jul-2014)
> e2fsck 1.43-WIP (09-Jul-2014)
> Pass 1: Checking inodes, blocks, and sizes
> Deleted inode 43792 has zero dtime.  Fix<y>? no
> Inodes that were part of a corrupted orphan linked list found.  Fix<y>?
> no
> Inode 493817 was part of the orphaned inode list.  IGNORED.
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Block bitmap differences:  -148712 -148714
> Fix<y>? no
> Inode bitmap differences:  -43792 -493817
> Fix<y>? no
> 
> /dev/ram1: ********** WARNING: Filesystem still has errors **********
> 
> /dev/ram1: 201244/655360 files (0.0% non-contiguous), 409632/10485760
> blocks
> [root@ts105 xfstests-dev.git2]# INO=493817
> [root@ts105 xfstests-dev.git2]# debugfs /dev/ram1 -R "ex <$INO>" ; \
>             debugfs /dev/ram1 -R "stat <$INO>" ; debugfs /dev/ram1 -R "ncheck $INO"
> debugfs 1.43-WIP (09-Jul-2014)
> Level Entries       Logical            Physical Length Flags
>  0/ 0   1/  1     0 -     0   148712 -   148712      1 
> debugfs 1.43-WIP (09-Jul-2014)
> Inode: 493817   Type: symlink    Mode:  0777   Flags: 0x80000
> Generation: 4038911591    Version: 0x00000000:00000001
> User:     0   Group:     0   Size: 638
> File ACL: 0    Directory ACL: 0
> Links: 0   Blockcount: 2
> Fragment:  Address: 0    Number: 0    Size: 0
>  ctime: 0x53fae8e8:ea861dc0 -- Mon Aug 25 11:42:32 2014
>  atime: 0x53fae8e8:ea861dc0 -- Mon Aug 25 11:42:32 2014
>  mtime: 0x53fae8e8:ea861dc0 -- Mon Aug 25 11:42:32 2014
> crtime: 0x53fae8e8:ea861dc0 -- Mon Aug 25 11:42:32 2014
> dtime: 0x0000ab10 -- Thu Jan  1 15:09:52 1970
> Size of extra inode fields: 28
> EXTENTS:
> (0):148712
> debugfs 1.43-WIP (09-Jul-2014)
> Inode   Pathname
> 
> I saw this effect with different file types (synmlink,chdev,regfile)
> From my findings we lost newly created inode during creation.
> Actually code is very simple, but at this moment I can not find why and
> where this happen.
I've had plenty of time to brain storm this issue :).
In fact it is very simple test-environment related issue.
Once we force make_request failure for all new IO requests
ext4_error will tag on-disk SB state with EXT4_ERROR_FS. In normal
situation this update should not reach permanent-storage, but in our
case updated EXT4_SB(sb)->s_sbh may be under writeback so ERROR_FS flag
will be visible on next mount and orphan_list cleanup will be skipped
due to ERROR_FS. Latest action is 100% correct.
It looks we have to fix the test by using another failure technique.
At this moment I think that faulty bcache may works for us. 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-09-14 12:38 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-22 11:32 [PATCH 0/5] i_size/i_disksize update fixes and cleanup series Dmitry Monakhov
2014-08-22 11:32 ` [PATCH 1/5] use ext4_update_i_disksize instead of opencoded ones Dmitry Monakhov
2014-08-31  3:17   ` Theodore Ts'o
2014-08-22 11:32 ` [PATCH 2/5] move i_size,i_diskzie update routines to helper function Dmitry Monakhov
2014-08-26  1:30   ` Theodore Ts'o
2014-08-22 11:32 ` [PATCH 3/5] ext4_zero_range: fix incorect journal credits reservation Dmitry Monakhov
2014-08-23 19:08   ` Theodore Ts'o
2014-08-25  7:19     ` Dmitry Monakhov
2014-08-26  1:30       ` Theodore Ts'o
2014-08-27 21:31       ` Theodore Ts'o
2014-08-22 11:32 ` [PATCH 4/5] fix transaction issues for ext4_fallocate and ext_zero_range Dmitry Monakhov
2014-08-26  1:30   ` Theodore Ts'o
2014-08-22 11:32 ` [PATCH 5/5] update i_disksize coherently with block allocation Dmitry Monakhov
2014-08-23 22:00   ` Theodore Ts'o
2014-08-25  7:59     ` Dmitry Monakhov
2014-09-14 12:38       ` Dmitry Monakhov
2014-08-26  1:13   ` Theodore Ts'o
2014-08-26  7:47     ` Dmitry Monakhov
2014-08-26 12:26       ` Theodore Ts'o

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.