All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/5] ext4: try to fix up es regressions
@ 2013-03-06 14:17 Zheng Liu
  2013-03-06 14:17 ` [PATCH v2 1/5] ext4: improve ext4_es_can_be_merged() to avoid a potential overflow Zheng Liu
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Zheng Liu @ 2013-03-06 14:17 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Dmitry Monakhov

Hi all,

The patch series tries to fixup some regressions after applied the extent
status tree.  These patches have rebased against the latest dev branch of
ext4 and have been tested by xfstests.

After rebased the latest dev branch, two patches have been dropped because
they have been applied into the branch.  A new patch is added, which tries
to fix up a wrong return value in ext4_split_extent().  Otherwise, there
are two major changes in this version.  The first one is to improve the
self-testing-infrastructure according to Dmitry's comment.  The second one
is to improve the zero out code.

After applied this patch series, I havn't seen the warning messages from
self-testing infrastructure except the following cases.

 - xfstests #13 with bigalloc or with no journal
 - xfstests #223 with dioread_nolock
The reason is that when we lookup a block mapping from status tree
i_data_sem locking won't be taken.  So there is a race window that an 
unwritten extent could be converted by end_io when we compare the result
between extent tree and status tree.

Dmitry, Ted, could you please confirm that this patch series can fix the
defrag regression?  Thank you so much.  Until now I run #300 and #301 a
lot times but I failed to hit this regression. :-(

*Big Note*
When I am testing this patch series, I found some regressions in dev branch.
Here is a note.  These regressions could be hitted by running test case
serveral times.  So If we just run xfstests one time, they could be missed.

 - xfstests #74 with data=journal
 - xfstests #83 with bigalloc
Some threads could be blocked for 120s.

 - xfstests #247 with data=journal
Some warning messages are printed by ext4_releasepage.  We hit
WARN_ON(PageChecked(page)) in this function.  But the test case itself can
pass.

 - xfstests #269 with dioread_nolock
The system will hang

I don't paste full details here to make description clearly.  I will go on
tracing these problems.  I am happy to provide full details if some one
want to take a close look at these problems.

v2 <- v1:
 * Improve self-testing infrastructure
 * Improve zero out code
 * Fix a wrong return value in ext4_split_extent

v1: http://permalink.gmane.org/gmane.comp.file-systems.ext4/37338

Thanks,
						- Zheng

Dmitry Monakhov (1):
  ext4: add self-testing infrastructure to do a sanity check

Zheng Liu (4):
  ext4: improve ext4_es_can_be_merged() to avoid a potential overflow
  ext4: fix wrong m_len value after unwritten extent conversion
  ext4: update extent status tree after an extent is zeroed out
  ext4: fix wrong the number of the allocted blocks in
    ext4_split_extent

 fs/ext4/extents.c        |  45 ++++++++--
 fs/ext4/extents_status.c | 212 +++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/extents_status.h |   9 ++
 fs/ext4/inode.c          | 106 ++++++++++++++++++++++++
 4 files changed, 362 insertions(+), 10 deletions(-)

-- 
1.7.12.rc2.18.g61b472e


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v2 1/5] ext4: improve ext4_es_can_be_merged() to avoid a potential overflow
  2013-03-06 14:17 [PATCH v2 0/5] ext4: try to fix up es regressions Zheng Liu
@ 2013-03-06 14:17 ` Zheng Liu
  2013-03-11  0:43   ` Theodore Ts'o
  2013-03-06 14:17 ` [PATCH v2 2/5] ext4: add self-testing infrastructure to do a sanity check Zheng Liu
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Zheng Liu @ 2013-03-06 14:17 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Dmitry Monakhov

From: Zheng Liu <wenqing.lz@taobao.com>

This commit tries to improve ext4_es_can_be_merged.  We check the length
of extent to avoid a potential overflow.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/extents_status.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 95796a1..dc4e4c5 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -333,17 +333,33 @@ static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
 static int ext4_es_can_be_merged(struct extent_status *es1,
 				 struct extent_status *es2)
 {
-	if (es1->es_lblk + es1->es_len != es2->es_lblk)
+	__u64 es1_len, es2_len, max_len;
+
+	if (ext4_es_status(es1) ^ ext4_es_status(es2))
 		return 0;
 
-	if (ext4_es_status(es1) != ext4_es_status(es2))
+	max_len = 0xFFFFFFFFULL;
+	es1_len = es1->es_len;
+	es2_len = es2->es_len;
+
+	if (es1_len + es2_len > max_len)
 		return 0;
 
-	if ((ext4_es_is_written(es1) || ext4_es_is_unwritten(es1)) &&
-	    (ext4_es_pblock(es1) + es1->es_len != ext4_es_pblock(es2)))
+	if (es1->es_lblk + es1_len != es2->es_lblk)
 		return 0;
 
-	return 1;
+	if ((ext4_es_is_written(es1) || ext4_es_is_unwritten(es1)) &&
+	    (ext4_es_pblock(es1) + es1_len == ext4_es_pblock(es2)))
+		return 1;
+
+	if (ext4_es_is_hole(es1))
+		return 1;
+
+	/* we need to check delayed extent is without unwritten status */
+	if (ext4_es_is_delayed(es1) && !ext4_es_is_unwritten(es1))
+		return 1;
+
+	return 0;
 }
 
 static struct extent_status *
-- 
1.7.12.rc2.18.g61b472e


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 2/5] ext4: add self-testing infrastructure to do a sanity check
  2013-03-06 14:17 [PATCH v2 0/5] ext4: try to fix up es regressions Zheng Liu
  2013-03-06 14:17 ` [PATCH v2 1/5] ext4: improve ext4_es_can_be_merged() to avoid a potential overflow Zheng Liu
@ 2013-03-06 14:17 ` Zheng Liu
  2013-03-07 15:41   ` Dmitry Monakhov
  2013-03-11  1:01   ` Theodore Ts'o
  2013-03-06 14:17 ` [PATCH v2 3/5] ext4: fix wrong m_len value after unwritten extent conversion Zheng Liu
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Zheng Liu @ 2013-03-06 14:17 UTC (permalink / raw)
  To: linux-ext4; +Cc: Dmitry Monakhov, Zheng Liu, Theodore Ts'o

From: Dmitry Monakhov <dmonakhov@openvz.org>

This commit adds a self-testing infrastructure like extent tree does to
do a sanity check for extent status tree.  After status tree is as a
extent cache, we'd better to make sure that it caches right result.

After applied this commit, we will get a lot of messages when we run
xfstests as below.

...
kernel: ES len assertation failed for inode: 230 retval 1 != map->m_len
3 in ext4_map_blocks (allocation)
...
kernel: ES cache assertation failed for inode: 230 es_cached ex
[974/2/4781/20] != found ex [974/1/4781/1000]
...
kernel: ES insert assertation failed for inode: 635 ex_status
[0/45/21388/w] != es_status [44/1/21432/u]
...

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/extents_status.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/extents_status.h |   6 ++
 fs/ext4/inode.c          |  96 +++++++++++++++++++++++++++
 3 files changed, 271 insertions(+)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index dc4e4c5..a434f81 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -405,6 +405,171 @@ ext4_es_try_to_merge_right(struct inode *inode, struct extent_status *es)
 	return es;
 }
 
+#ifdef ES_AGGRESSIVE_TEST
+static void ext4_es_insert_extent_ext_check(struct inode *inode,
+					    struct extent_status *es)
+{
+	struct ext4_ext_path *path = NULL;
+	struct ext4_extent *ex;
+	ext4_lblk_t ee_block;
+	ext4_fsblk_t ee_start;
+	unsigned short ee_len;
+	int depth, ee_status, es_status;
+
+	path = ext4_ext_find_extent(inode, es->es_lblk, NULL);
+	if (IS_ERR(path))
+		return;
+
+	depth = ext_depth(inode);
+	ex = path[depth].p_ext;
+
+	if (ex) {
+
+		ee_block = le32_to_cpu(ex->ee_block);
+		ee_start = ext4_ext_pblock(ex);
+		ee_len = ext4_ext_get_actual_len(ex);
+
+		ee_status = ext4_ext_is_uninitialized(ex) ? 1 : 0;
+		es_status = ext4_es_is_unwritten(es) ? 1 : 0;
+
+		/*
+		 * Make sure ex and es are not overlap when we try to insert
+		 * a delayed/hole extent.
+		 */
+		if (!ext4_es_is_written(es) && !ext4_es_is_unwritten(es)) {
+			if (in_range(es->es_lblk, ee_block, ee_len)) {
+				printk("ES insert assertation failed for inode: %lu "
+				       "we can find an extent at block "
+				       "[%d/%d/%llu/%c], but we want to add an "
+				       "delayed/hole extent [%d/%d/%llu/%llx]\n",
+				       inode->i_ino, ee_block, ee_len, ee_start,
+				       ee_status ? 'u' : 'w', es->es_lblk, es->es_len,
+				       ext4_es_pblock(es), ext4_es_status(es));
+			}
+			goto out;
+		}
+
+		/*
+		 * We don't check ee_block == es->es_lblk, etc. because es
+		 * might be a part of whole extent, vice versa.
+		 */
+		if (es->es_lblk < ee_block ||
+		    ext4_es_pblock(es) != ee_start + es->es_lblk - ee_block) {
+			printk("ES insert assertation failed for inode: %lu "
+			       "ex_status [%d/%d/%llu/%c] != "
+			       "es_status [%d/%d/%llu/%c]\n", inode->i_ino,
+			       ee_block, ee_len, ee_start, ee_status ? 'u' : 'w',
+			       es->es_lblk, es->es_len, ext4_es_pblock(es),
+			       es_status ? 'u' : 'w');
+			goto out;
+		}
+
+		if (ee_status ^ es_status) {
+			printk("ES insert assertation failed for inode: %lu "
+			       "ex_status [%d/%d/%llu/%c] != "
+			       "es_status [%d/%d/%llu/%c]\n", inode->i_ino,
+			       ee_block, ee_len, ee_start, ee_status ? 'u' : 'w',
+			       es->es_lblk, es->es_len, ext4_es_pblock(es),
+			       es_status ? 'u' : 'w');
+		}
+	} else {
+		/*
+		 * We can't find an extent on disk.  So we need to make sure
+		 * that we don't want to add an written/unwritten extent.
+		 */
+		if (!ext4_es_is_delayed(es) && !ext4_es_is_hole(es)) {
+			printk("ES insert assertation failed for inode: %lu "
+			       "can't find an extent at block %d but we want "
+			       "to add an written/unwritten extent "
+			       "[%d/%d/%llu/%llx]\n", inode->i_ino,
+			       es->es_lblk, es->es_lblk, es->es_len,
+			       ext4_es_pblock(es), ext4_es_status(es));
+		}
+	}
+out:
+	if (path) {
+		ext4_ext_drop_refs(path);
+		kfree(path);
+	}
+}
+
+static void ext4_es_insert_extent_ind_check(struct inode *inode,
+					    struct extent_status *es)
+{
+	struct ext4_map_blocks map;
+	int retval;
+
+	/*
+	 * Here we call ext4_ind_map_blocks to lookup a block mapping because
+	 * 'Indirect' structure is defined in indirect.c.  So we couldn't
+	 * access direct/indirect tree from outside.  It is too dirty to define
+	 * this function in indirect.c file.
+	 */
+
+	map.m_lblk = es->es_lblk;
+	map.m_len = es->es_len;
+
+	retval = ext4_ind_map_blocks(NULL, inode, &map, 0);
+	if (retval > 0) {
+		if (ext4_es_is_delayed(es) || ext4_es_is_hole(es)) {
+			/*
+			 * We want to add a delayed/hole extent but this
+			 * block has been allocated.
+			 */
+			printk("ES insert assertation failed for inode: %lu "
+			       "We can find blocks but we want to add a "
+			       "delayed/hole extent [%d/%d/%llu/%llx]\n",
+			       inode->i_ino, es->es_lblk, es->es_len,
+			       ext4_es_pblock(es), ext4_es_status(es));
+			return;
+		} else if (ext4_es_is_written(es)) {
+			if (retval != es->es_len) {
+				printk("ES insert assertation failed for inode: "
+				       "%lu retval %d != es_len %d\n",
+				       inode->i_ino, retval, es->es_len);
+				return;
+			}
+			if (map.m_pblk != ext4_es_pblock(es)) {
+				printk("ES insert assertation failed for inode: "
+				       "%lu m_pblk %llu != es_pblk %llu\n",
+				       inode->i_ino, map.m_pblk,
+				       ext4_es_pblock(es));
+				return;
+			}
+		} else {
+			/*
+			 * We don't need to check unwritten extent because
+			 * indirect-based file doesn't have it.
+			 */
+			BUG_ON(1);
+		}
+	} else if (retval == 0) {
+		if (ext4_es_is_written(es)) {
+			printk("ES insert assertation failed for inode: %lu "
+			       "We can't find the block but we want to add "
+			       "an written extent [%d/%d/%llu/%llx]\n",
+			       inode->i_ino, es->es_lblk, es->es_len,
+			       ext4_es_pblock(es), ext4_es_status(es));
+			return;
+		}
+	}
+}
+
+static inline void ext4_es_insert_extent_check(struct inode *inode,
+					       struct extent_status *es)
+{
+	/*
+	 * We don't need to worry about the race condition because
+	 * caller takes i_data_sem locking.
+	 */
+	BUG_ON(!rwsem_is_locked(&EXT4_I(inode)->i_data_sem));
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+		ext4_es_insert_extent_ext_check(inode, es);
+	else
+		ext4_es_insert_extent_ind_check(inode, es);
+}
+#endif
+
 static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
 {
 	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
@@ -487,6 +652,10 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	ext4_es_store_status(&newes, status);
 	trace_ext4_es_insert_extent(inode, &newes);
 
+#ifdef ES_AGGRESSIVE_TEST
+	ext4_es_insert_extent_check(inode, &newes);
+#endif
+
 	write_lock(&EXT4_I(inode)->i_es_lock);
 	err = __es_remove_extent(inode, lblk, end);
 	if (err != 0)
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index f190dfe..56140ad 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -21,6 +21,12 @@
 #endif
 
 /*
+ * With ES_AGGRESSIVE_TEST defined, the result of es caching will be
+ * checked with old map_block's result.
+ */
+#define ES_AGGRESSIVE_TEST__
+
+/*
  * These flags live in the high bits of extent_status.es_pblk
  */
 #define EXTENT_STATUS_WRITTEN	(1ULL << 63)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 95a0c62..3186a43 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -482,6 +482,58 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
 	return num;
 }
 
+#ifdef ES_AGGRESSIVE_TEST
+static void ext4_map_blocks_es_recheck(handle_t *handle,
+				       struct inode *inode,
+				       struct ext4_map_blocks *es_map,
+				       struct ext4_map_blocks *map,
+				       int flags)
+{
+	int retval;
+
+	map->m_flags = 0;
+	/*
+	 * There is a race window that the result is not the same.
+	 * e.g. xfstests #223 when dioread_nolock enables.  The reason
+	 * is that we lookup a block mapping in extent status tree with
+	 * out taking i_data_sem.  So at the time the unwritten extent
+	 * could be converted.
+	 */
+	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
+		down_read((&EXT4_I(inode)->i_data_sem));
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
+		retval = ext4_ext_map_blocks(handle, inode, map, flags &
+					     EXT4_GET_BLOCKS_KEEP_SIZE);
+	} else {
+		retval = ext4_ind_map_blocks(handle, inode, map, flags &
+					     EXT4_GET_BLOCKS_KEEP_SIZE);
+	}
+	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
+		up_read((&EXT4_I(inode)->i_data_sem));
+	/*
+	 * Clear EXT4_MAP_FROM_CLUSTER and EXT4_MAP_BOUNDARY flag
+	 * because it shouldn't be marked in es_map->m_flags.
+	 */
+	map->m_flags &= ~(EXT4_MAP_FROM_CLUSTER | EXT4_MAP_BOUNDARY);
+
+	/*
+	 * We don't check m_len because extent will be collpased in status
+	 * tree.  So the m_len might not equal.
+	 */
+	if (es_map->m_lblk != map->m_lblk ||
+	    es_map->m_flags != map->m_flags ||
+	    es_map->m_pblk != map->m_pblk) {
+		printk("ES cache assertation failed for inode: %lu "
+		       "es_cached ex [%d/%d/%llu/%x] != "
+		       "found ex [%d/%d/%llu/%x] retval %d flags %x\n",
+		       inode->i_ino, es_map->m_lblk, es_map->m_len,
+		       es_map->m_pblk, es_map->m_flags, map->m_lblk,
+		       map->m_len, map->m_pblk, map->m_flags,
+		       retval, flags);
+	}
+}
+#endif /* ES_AGGRESSIVE_TEST */
+
 /*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
@@ -509,6 +561,11 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 {
 	struct extent_status es;
 	int retval;
+#ifdef ES_AGGRESSIVE_TEST
+	struct ext4_map_blocks orig_map;
+
+	memcpy(&orig_map, map, sizeof(*map));
+#endif
 
 	map->m_flags = 0;
 	ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
@@ -531,6 +588,10 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		} else {
 			BUG_ON(1);
 		}
+#ifdef ES_AGGRESSIVE_TEST
+		ext4_map_blocks_es_recheck(handle, inode, map,
+					   &orig_map, flags);
+#endif
 		goto found;
 	}
 
@@ -551,6 +612,15 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		int ret;
 		unsigned long long status;
 
+#ifdef ES_AGGRESSIVE_TEST
+		if (retval != map->m_len) {
+			printk("ES len assertation failed for inode: %lu "
+			       "retval %d != map->m_len %d "
+			       "in %s (lookup)\n", inode->i_ino, retval,
+			       map->m_len, __func__);
+		}
+#endif
+
 		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
 				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
 		if (!(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) &&
@@ -643,6 +713,15 @@ found:
 		int ret;
 		unsigned long long status;
 
+#ifdef ES_AGGRESSIVE_TEST
+		if (retval != map->m_len) {
+			printk("ES len assertation failed for inode: %lu "
+			       "retval %d != map->m_len %d "
+			       "in %s (allocation)\n", inode->i_ino, retval,
+			       map->m_len, __func__);
+		}
+#endif
+
 		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
 				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
 		if (!(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) &&
@@ -1768,6 +1847,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 	struct extent_status es;
 	int retval;
 	sector_t invalid_block = ~((sector_t) 0xffff);
+#ifdef ES_AGGRESSIVE_TEST
+	struct ext4_map_blocks orig_map;
+
+	memcpy(&orig_map, map, sizeof(*map));
+#endif
 
 	if (invalid_block < ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es))
 		invalid_block = ~0;
@@ -1809,6 +1893,9 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		else
 			BUG_ON(1);
 
+#ifdef ES_AGGRESSIVE_TEST
+		ext4_map_blocks_es_recheck(NULL, inode, map, &orig_map, 0);
+#endif
 		return retval;
 	}
 
@@ -1873,6 +1960,15 @@ add_delayed:
 		int ret;
 		unsigned long long status;
 
+#ifdef ES_AGGRESSIVE_TEST
+		if (retval != map->m_len) {
+			printk("ES len assertation failed for inode: %lu "
+			       "retval %d != map->m_len %d "
+			       "in %s (lookup)\n", inode->i_ino, retval,
+			       map->m_len, __func__);
+		}
+#endif
+
 		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
 				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
 		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-- 
1.7.12.rc2.18.g61b472e


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 3/5] ext4: fix wrong m_len value after unwritten extent conversion
  2013-03-06 14:17 [PATCH v2 0/5] ext4: try to fix up es regressions Zheng Liu
  2013-03-06 14:17 ` [PATCH v2 1/5] ext4: improve ext4_es_can_be_merged() to avoid a potential overflow Zheng Liu
  2013-03-06 14:17 ` [PATCH v2 2/5] ext4: add self-testing infrastructure to do a sanity check Zheng Liu
@ 2013-03-06 14:17 ` Zheng Liu
  2013-03-07 15:42   ` Dmitry Monakhov
  2013-03-11  1:07   ` Theodore Ts'o
  2013-03-06 14:17 ` [PATCH v2 4/5] ext4: update extent status tree after an extent is zeroed out Zheng Liu
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Zheng Liu @ 2013-03-06 14:17 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Dmitry Monakhov

From: Zheng Liu <wenqing.lz@taobao.com>

We always assume that the return value of ext4_ext_map_blocks is equal
to map->m_len but when we try to convert an unwritten extent 'm_len'
value will break this assumption.  It is harmless until we use status
tree as a extent cache because we need to update status tree according
to 'm_len' value.

Meanwhile this commit marks EXT4_MAP_MAPPED flag after unwritten extent
conversion.  It shouldn't cause a bug because we update status tree
according to checking EXT4_MAP_UNWRITTEN flag.  But it should be fixed.

After applied this commit, the following error message from self-testing
infrastructure disappears.

    ...
    kernel: ES len assertation failed for inode: 230 retval 1 !=
    map->m_len 3 in ext4_map_blocks (allocation)
    ...

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/extents.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 25c86aa..110e85a 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3650,6 +3650,10 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
 						 path, map->m_len);
 		} else
 			err = ret;
+		map->m_flags |= EXT4_MAP_MAPPED;
+		if (allocated > map->m_len)
+			allocated = map->m_len;
+		map->m_len = allocated;
 		goto out2;
 	}
 	/* buffered IO case */
-- 
1.7.12.rc2.18.g61b472e


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 4/5] ext4: update extent status tree after an extent is zeroed out
  2013-03-06 14:17 [PATCH v2 0/5] ext4: try to fix up es regressions Zheng Liu
                   ` (2 preceding siblings ...)
  2013-03-06 14:17 ` [PATCH v2 3/5] ext4: fix wrong m_len value after unwritten extent conversion Zheng Liu
@ 2013-03-06 14:17 ` Zheng Liu
  2013-03-07 15:55   ` Dmitry Monakhov
  2013-03-06 14:17 ` [PATCH v2 5/5] ext4: fix wrong the number of the allocted blocks in ext4_split_extent Zheng Liu
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Zheng Liu @ 2013-03-06 14:17 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Dmitry Monakhov

From: Zheng Liu <wenqing.lz@taobao.com>

When we try to split an extent, this extent could be zeroed out and mark
as initialized.  But we don't know this in ext4_map_blocks because it
only returns a length of allocated extent.  Meanwhile we will mark this
extent as uninitialized because we only check m_flags.

This commit update extent status tree when we try to split an unwritten
extent.  We don't need to worry about the status of this extent because
we always mark it as initialized.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/extents.c        | 35 +++++++++++++++++++++++++++++++----
 fs/ext4/extents_status.c | 17 +++++++++++++++++
 fs/ext4/extents_status.h |  3 +++
 fs/ext4/inode.c          | 10 ++++++++++
 4 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 110e85a..7e37018 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2925,7 +2925,7 @@ static int ext4_split_extent_at(handle_t *handle,
 {
 	ext4_fsblk_t newblock;
 	ext4_lblk_t ee_block;
-	struct ext4_extent *ex, newex, orig_ex;
+	struct ext4_extent *ex, newex, orig_ex, zero_ex;
 	struct ext4_extent *ex2 = NULL;
 	unsigned int ee_len, depth;
 	int err = 0;
@@ -2996,12 +2996,26 @@ static int ext4_split_extent_at(handle_t *handle,
 	err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
 	if (err == -ENOSPC && (EXT4_EXT_MAY_ZEROOUT & split_flag)) {
 		if (split_flag & (EXT4_EXT_DATA_VALID1|EXT4_EXT_DATA_VALID2)) {
-			if (split_flag & EXT4_EXT_DATA_VALID1)
+			if (split_flag & EXT4_EXT_DATA_VALID1) {
 				err = ext4_ext_zeroout(inode, ex2);
-			else
+				zero_ex.ee_block = ex2->ee_block;
+				zero_ex.ee_len = ext4_ext_get_actual_len(ex2);
+				ext4_ext_store_pblock(&zero_ex,
+						      ext4_ext_pblock(ex2));
+			} else {
 				err = ext4_ext_zeroout(inode, ex);
-		} else
+				zero_ex.ee_block = ex->ee_block;
+				zero_ex.ee_len = ext4_ext_get_actual_len(ex);
+				ext4_ext_store_pblock(&zero_ex,
+						      ext4_ext_pblock(ex));
+			}
+		} else {
 			err = ext4_ext_zeroout(inode, &orig_ex);
+			zero_ex.ee_block = orig_ex.ee_block;
+			zero_ex.ee_len = ext4_ext_get_actual_len(&orig_ex);
+			ext4_ext_store_pblock(&zero_ex,
+					      ext4_ext_pblock(&orig_ex));
+		}
 
 		if (err)
 			goto fix_extent_len;
@@ -3009,6 +3023,12 @@ static int ext4_split_extent_at(handle_t *handle,
 		ex->ee_len = cpu_to_le16(ee_len);
 		ext4_ext_try_to_merge(handle, inode, path, ex);
 		err = ext4_ext_dirty(handle, inode, path + path->p_depth);
+		if (err)
+			goto fix_extent_len;
+
+		/* update extent status tree */
+		err = ext4_es_zeroout(inode, &zero_ex);
+
 		goto out;
 	} else if (err)
 		goto fix_extent_len;
@@ -3150,6 +3170,7 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
 	ee_block = le32_to_cpu(ex->ee_block);
 	ee_len = ext4_ext_get_actual_len(ex);
 	allocated = ee_len - (map->m_lblk - ee_block);
+	zero_ex.ee_len = 0;
 
 	trace_ext4_ext_convert_to_initialized_enter(inode, map, ex);
 
@@ -3247,6 +3268,9 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
 		err = ext4_ext_zeroout(inode, ex);
 		if (err)
 			goto out;
+		zero_ex.ee_block = ex->ee_block;
+		zero_ex.ee_len = ext4_ext_get_actual_len(ex);
+		ext4_ext_store_pblock(&zero_ex, ext4_ext_pblock(ex));
 
 		err = ext4_ext_get_access(handle, inode, path + depth);
 		if (err)
@@ -3305,6 +3329,9 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
 		err = allocated;
 
 out:
+	/* If we have gotten a failure, don't zero out status tree */
+	if (!err)
+		err = ext4_es_zeroout(inode, &zero_ex);
 	return err ? err : allocated;
 }
 
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index a434f81..e7bebbc 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -854,6 +854,23 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	return err;
 }
 
+int ext4_es_zeroout(struct inode *inode, struct ext4_extent *ex)
+{
+	ext4_lblk_t  ee_block;
+	ext4_fsblk_t ee_pblock;
+	unsigned int ee_len;
+
+	ee_block  = le32_to_cpu(ex->ee_block);
+	ee_len    = ext4_ext_get_actual_len(ex);
+	ee_pblock = ext4_ext_pblock(ex);
+
+	if (ee_len == 0)
+		return 0;
+
+	return ext4_es_insert_extent(inode, ee_block, ee_len, ee_pblock,
+				     EXTENT_STATUS_WRITTEN);
+}
+
 static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct ext4_sb_info *sbi = container_of(shrink,
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 56140ad..d8e2d4d 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -39,6 +39,8 @@
 				 EXTENT_STATUS_DELAYED | \
 				 EXTENT_STATUS_HOLE)
 
+struct ext4_extent;
+
 struct extent_status {
 	struct rb_node rb_node;
 	ext4_lblk_t es_lblk;	/* first logical block extent covers */
@@ -64,6 +66,7 @@ extern void ext4_es_find_delayed_extent(struct inode *inode, ext4_lblk_t lblk,
 					struct extent_status *es);
 extern int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
 				 struct extent_status *es);
+extern int ext4_es_zeroout(struct inode *inode, struct ext4_extent *ex);
 
 static inline int ext4_es_is_written(struct extent_status *es)
 {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3186a43..4f1d54a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -722,6 +722,15 @@ found:
 		}
 #endif
 
+		/*
+		 * If the extent has been zeroed out, we don't need to update
+		 * extent status tree.
+		 */
+		if ((flags & EXT4_GET_BLOCKS_PRE_IO) &&
+		    ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
+			if (ext4_es_is_written(&es))
+				goto has_zeroout;
+		}
 		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
 				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
 		if (!(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) &&
@@ -734,6 +743,7 @@ found:
 			retval = ret;
 	}
 
+has_zeroout:
 	up_write((&EXT4_I(inode)->i_data_sem));
 	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
 		int ret = check_block_validity(inode, map);
-- 
1.7.12.rc2.18.g61b472e


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 5/5] ext4: fix wrong the number of the allocted blocks in ext4_split_extent
  2013-03-06 14:17 [PATCH v2 0/5] ext4: try to fix up es regressions Zheng Liu
                   ` (3 preceding siblings ...)
  2013-03-06 14:17 ` [PATCH v2 4/5] ext4: update extent status tree after an extent is zeroed out Zheng Liu
@ 2013-03-06 14:17 ` Zheng Liu
  2013-03-06 22:58 ` Dev branch regressions Theodore Ts'o
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Zheng Liu @ 2013-03-06 14:17 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Dmitry Monakhov

From: Zheng Liu <wenqing.lz@taobao.com>

This commit fixes a wrong return value of the number of the allocated
blocks in ext4_split_extent.  When the length of blocks we want to
allocated is greater than the length of the current extent, we return a
wrong number.  Let's see what happens in the following case when we call
ext4_split_extent().

  map: [48, 72]
  ex:  [32, 64, u]

'ex' will be splitted in to two parts:
  ex1: [32, 47, u]
  ex2: [48, 64, w]

'map->m_len' is returned from this function, and the value is 24.  But
the real length is 16.  So it should be fixed.

Meanwhile in this commit we use right length of the allocated blocks
when get_reserved_cluster_alloc in ext4_ext_handle_uninitialized_extents
is called.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/extents.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 7e37018..69df02f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3067,6 +3067,7 @@ static int ext4_split_extent(handle_t *handle,
 	int err = 0;
 	int uninitialized;
 	int split_flag1, flags1;
+	int allocated = map->m_len;
 
 	depth = ext_depth(inode);
 	ex = path[depth].p_ext;
@@ -3086,6 +3087,8 @@ static int ext4_split_extent(handle_t *handle,
 				map->m_lblk + map->m_len, split_flag1, flags1);
 		if (err)
 			goto out;
+	} else {
+		allocated = ee_len - (map->m_lblk - ee_block);
 	}
 	/*
 	 * Update path is required because previous ext4_split_extent_at() may
@@ -3115,7 +3118,7 @@ static int ext4_split_extent(handle_t *handle,
 
 	ext4_ext_show_leaf(inode, path);
 out:
-	return err ? err : map->m_len;
+	return err ? err : allocated;
 }
 
 /*
@@ -3730,6 +3733,7 @@ out:
 					allocated - map->m_len);
 		allocated = map->m_len;
 	}
+	map->m_len = allocated;
 
 	/*
 	 * If we have done fallocate with the offset that is already
-- 
1.7.12.rc2.18.g61b472e


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Dev branch regressions
  2013-03-06 14:17 [PATCH v2 0/5] ext4: try to fix up es regressions Zheng Liu
                   ` (4 preceding siblings ...)
  2013-03-06 14:17 ` [PATCH v2 5/5] ext4: fix wrong the number of the allocted blocks in ext4_split_extent Zheng Liu
@ 2013-03-06 22:58 ` Theodore Ts'o
  2013-03-07  2:40   ` Zheng Liu
  2013-03-07  6:47   ` Lukáš Czerner
  2013-03-07 16:08 ` [PATCH v2 0/5] ext4: try to fix up es regressions Dmitry Monakhov
  2013-03-11  2:11 ` Theodore Ts'o
  7 siblings, 2 replies; 27+ messages in thread
From: Theodore Ts'o @ 2013-03-06 22:58 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Dmitry Monakhov

On Wed, Mar 06, 2013 at 10:17:10PM +0800, Zheng Liu wrote:
> 
> *Big Note*
> When I am testing this patch series, I found some regressions in dev branch.
> Here is a note.  These regressions could be hitted by running test case
> serveral times.  So If we just run xfstests one time, they could be missed.
> 
>  - xfstests #74 with data=journal
> 
>  - xfstests #247 with data=journal
> Some warning messages are printed by ext4_releasepage.  We hit
> WARN_ON(PageChecked(page)) in this function.  But the test case itself can
> pass.
> 
>  - xfstests #269 with dioread_nolock
> The system will hang

I'm going to guess that you were running this using your SSD test
setup?  I just ran:

kvm-xfstests -c data_journal 74,74,74,74,74,247,247,247,247,247

using my standard hdd setup, and didn't see any failures or warnings.

How frequently are you seeing these failures?  When I have a chance
I'll try running these tests with a tmpfs image and see if I have any
better luck reproducing the problem there.

I did manage to get a hang (preceded with a soft lockup for the
dioread_nolock with test 269).

>  - xfstests #83 with bigalloc
> Some threads could be blocked for 120s.

I've seen this test blocked for hours (but without managing to trigger
the 120s soft lockup warning), but I'm not entirely sure this was a
regression.  I believe I've seen a similar hang with 3.8.0-rc3 if I
recall correctly.  I had been hoping the changes with the extent
status tree would fix it, but apparently no such luck.  :-(

> I don't paste full details here to make description clearly.  I will go on
> tracing these problems.  I am happy to provide full details if some one
> want to take a close look at these problems.

If you have a chance, please do send e-mails with each failure
separated out in a separate e-mail with different subject line so it's
easier for others to follow along.

Thanks!!

						- Ted

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Dev branch regressions
  2013-03-06 22:58 ` Dev branch regressions Theodore Ts'o
@ 2013-03-07  2:40   ` Zheng Liu
  2013-03-07  6:47   ` Lukáš Czerner
  1 sibling, 0 replies; 27+ messages in thread
From: Zheng Liu @ 2013-03-07  2:40 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Zheng Liu, Dmitry Monakhov

On Wed, Mar 06, 2013 at 05:58:18PM -0500, Theodore Ts'o wrote:
> On Wed, Mar 06, 2013 at 10:17:10PM +0800, Zheng Liu wrote:
> > 
> > *Big Note*
> > When I am testing this patch series, I found some regressions in dev branch.
> > Here is a note.  These regressions could be hitted by running test case
> > serveral times.  So If we just run xfstests one time, they could be missed.
> > 
> >  - xfstests #74 with data=journal
> > 
> >  - xfstests #247 with data=journal
> > Some warning messages are printed by ext4_releasepage.  We hit
> > WARN_ON(PageChecked(page)) in this function.  But the test case itself can
> > pass.
> > 
> >  - xfstests #269 with dioread_nolock
> > The system will hang
> 
> I'm going to guess that you were running this using your SSD test
> setup?  I just ran:

Yes, I run these tests in my SSD setup.

> 
> kvm-xfstests -c data_journal 74,74,74,74,74,247,247,247,247,247
> 
> using my standard hdd setup, and didn't see any failures or warnings.

I use the following commands to hit thses warnings.

  for i in {0..9}
  do
    ./chech 74
  done

> 
> How frequently are you seeing these failures?  When I have a chance
> I'll try running these tests with a tmpfs image and see if I have any
> better luck reproducing the problem there.
> 
> I did manage to get a hang (preceded with a soft lockup for the
> dioread_nolock with test 269).
> 
> >  - xfstests #83 with bigalloc
> > Some threads could be blocked for 120s.
> 
> I've seen this test blocked for hours (but without managing to trigger
> the 120s soft lockup warning), but I'm not entirely sure this was a
> regression.  I believe I've seen a similar hang with 3.8.0-rc3 if I
> recall correctly.  I had been hoping the changes with the extent
> status tree would fix it, but apparently no such luck.  :-(
> 
> > I don't paste full details here to make description clearly.  I will go on
> > tracing these problems.  I am happy to provide full details if some one
> > want to take a close look at these problems.
> 
> If you have a chance, please do send e-mails with each failure
> separated out in a separate e-mail with different subject line so it's
> easier for others to follow along.

I will run the test case in 3.8 kernel to understand which one is a
regression, and which one is a bug that has been there for a long time.
Later I will send the report to the mailing list.  Thanks for sharing
the result with me.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Dev branch regressions
  2013-03-06 22:58 ` Dev branch regressions Theodore Ts'o
  2013-03-07  2:40   ` Zheng Liu
@ 2013-03-07  6:47   ` Lukáš Czerner
  2013-03-07 11:54     ` Zheng Liu
  1 sibling, 1 reply; 27+ messages in thread
From: Lukáš Czerner @ 2013-03-07  6:47 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Zheng Liu, linux-ext4, Zheng Liu, Dmitry Monakhov

On Wed, 6 Mar 2013, Theodore Ts'o wrote:

> Date: Wed, 6 Mar 2013 17:58:18 -0500
> From: Theodore Ts'o <tytso@mit.edu>
> To: Zheng Liu <gnehzuil.liu@gmail.com>
> Cc: linux-ext4@vger.kernel.org, Zheng Liu <wenqing.lz@taobao.com>,
>     Dmitry Monakhov <dmonakhov@openvz.org>
> Subject: Dev branch regressions
> 
> On Wed, Mar 06, 2013 at 10:17:10PM +0800, Zheng Liu wrote:
> > 
> > *Big Note*
> > When I am testing this patch series, I found some regressions in dev branch.
> > Here is a note.  These regressions could be hitted by running test case
> > serveral times.  So If we just run xfstests one time, they could be missed.
> > 
> >  - xfstests #74 with data=journal
> > 
> >  - xfstests #247 with data=journal
> > Some warning messages are printed by ext4_releasepage.  We hit
> > WARN_ON(PageChecked(page)) in this function.  But the test case itself can
> > pass.
> > 
> >  - xfstests #269 with dioread_nolock
> > The system will hang
> 
> I'm going to guess that you were running this using your SSD test
> setup?  I just ran:
> 
> kvm-xfstests -c data_journal 74,74,74,74,74,247,247,247,247,247
> 
> using my standard hdd setup, and didn't see any failures or warnings.
> 
> How frequently are you seeing these failures?  When I have a chance
> I'll try running these tests with a tmpfs image and see if I have any
> better luck reproducing the problem there.
> 
> I did manage to get a hang (preceded with a soft lockup for the
> dioread_nolock with test 269).
> 
> >  - xfstests #83 with bigalloc
> > Some threads could be blocked for 120s.
> 
> I've seen this test blocked for hours (but without managing to trigger
> the 120s soft lockup warning), but I'm not entirely sure this was a
> regression.  I believe I've seen a similar hang with 3.8.0-rc3 if I
> recall correctly.  I had been hoping the changes with the extent
> status tree would fix it, but apparently no such luck.  :-(

You're right this is not a regression the problem has always been
there, however now with some bigalloc fixes it becomes more obvious.
I have some patches to address this issue, though it's not ready
yet.

-Lukas

> 
> > I don't paste full details here to make description clearly.  I will go on
> > tracing these problems.  I am happy to provide full details if some one
> > want to take a close look at these problems.
> 
> If you have a chance, please do send e-mails with each failure
> separated out in a separate e-mail with different subject line so it's
> easier for others to follow along.
> 
> Thanks!!
> 
> 						- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Dev branch regressions
  2013-03-07  6:47   ` Lukáš Czerner
@ 2013-03-07 11:54     ` Zheng Liu
  0 siblings, 0 replies; 27+ messages in thread
From: Zheng Liu @ 2013-03-07 11:54 UTC (permalink / raw)
  To: Lukáš Czerner
  Cc: Theodore Ts'o, linux-ext4, Zheng Liu, Dmitry Monakhov

On Thu, Mar 07, 2013 at 07:47:31AM +0100, Lukáš Czerner wrote:
> On Wed, 6 Mar 2013, Theodore Ts'o wrote:
> 
> > Date: Wed, 6 Mar 2013 17:58:18 -0500
> > From: Theodore Ts'o <tytso@mit.edu>
> > To: Zheng Liu <gnehzuil.liu@gmail.com>
> > Cc: linux-ext4@vger.kernel.org, Zheng Liu <wenqing.lz@taobao.com>,
> >     Dmitry Monakhov <dmonakhov@openvz.org>
> > Subject: Dev branch regressions
> > 
> > On Wed, Mar 06, 2013 at 10:17:10PM +0800, Zheng Liu wrote:
> > > 
> > > *Big Note*
> > > When I am testing this patch series, I found some regressions in dev branch.
> > > Here is a note.  These regressions could be hitted by running test case
> > > serveral times.  So If we just run xfstests one time, they could be missed.
> > > 
> > >  - xfstests #74 with data=journal
> > > 
> > >  - xfstests #247 with data=journal
> > > Some warning messages are printed by ext4_releasepage.  We hit
> > > WARN_ON(PageChecked(page)) in this function.  But the test case itself can
> > > pass.
> > > 
> > >  - xfstests #269 with dioread_nolock
> > > The system will hang
> > 
> > I'm going to guess that you were running this using your SSD test
> > setup?  I just ran:
> > 
> > kvm-xfstests -c data_journal 74,74,74,74,74,247,247,247,247,247
> > 
> > using my standard hdd setup, and didn't see any failures or warnings.
> > 
> > How frequently are you seeing these failures?  When I have a chance
> > I'll try running these tests with a tmpfs image and see if I have any
> > better luck reproducing the problem there.
> > 
> > I did manage to get a hang (preceded with a soft lockup for the
> > dioread_nolock with test 269).
> > 
> > >  - xfstests #83 with bigalloc
> > > Some threads could be blocked for 120s.
> > 
> > I've seen this test blocked for hours (but without managing to trigger
> > the 120s soft lockup warning), but I'm not entirely sure this was a
> > regression.  I believe I've seen a similar hang with 3.8.0-rc3 if I
> > recall correctly.  I had been hoping the changes with the extent
> > status tree would fix it, but apparently no such luck.  :-(
> 
> You're right this is not a regression the problem has always been
> there, however now with some bigalloc fixes it becomes more obvious.
> I have some patches to address this issue, though it's not ready
> yet.

Hi Ted, Lukas,

I take a close look at these problems, and I can confirm that they are
not regression because in 3.8 kernel they have been there.  Honestly
they are hard to be hitted because I need to run several times to
trigger them.  IMHO most users don't use this features (data=journal,
bigalloc, dioread_nolock).  So let us fix them one by one.  I will file
these bugs in mailing list to let folks know.

Regards,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/5] ext4: add self-testing infrastructure to do a sanity check
  2013-03-06 14:17 ` [PATCH v2 2/5] ext4: add self-testing infrastructure to do a sanity check Zheng Liu
@ 2013-03-07 15:41   ` Dmitry Monakhov
  2013-03-08 13:01     ` Zheng Liu
  2013-03-11  1:01   ` Theodore Ts'o
  1 sibling, 1 reply; 27+ messages in thread
From: Dmitry Monakhov @ 2013-03-07 15:41 UTC (permalink / raw)
  To: Zheng Liu, linux-ext4; +Cc: Zheng Liu, Theodore Ts'o

On Wed,  6 Mar 2013 22:17:12 +0800, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> From: Dmitry Monakhov <dmonakhov@openvz.org>
Looks good with small comments (see below)
> 
> This commit adds a self-testing infrastructure like extent tree does to
> do a sanity check for extent status tree.  After status tree is as a
> extent cache, we'd better to make sure that it caches right result.
> 
> After applied this commit, we will get a lot of messages when we run
> xfstests as below.
> 
> ...
> kernel: ES len assertation failed for inode: 230 retval 1 != map->m_len
> 3 in ext4_map_blocks (allocation)
> ...
> kernel: ES cache assertation failed for inode: 230 es_cached ex
> [974/2/4781/20] != found ex [974/1/4781/1000]
> ...
> kernel: ES insert assertation failed for inode: 635 ex_status
> [0/45/21388/w] != es_status [44/1/21432/u]
> ...
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/ext4/extents_status.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/ext4/extents_status.h |   6 ++
>  fs/ext4/inode.c          |  96 +++++++++++++++++++++++++++
>  3 files changed, 271 insertions(+)
> 
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index dc4e4c5..a434f81 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -405,6 +405,171 @@ ext4_es_try_to_merge_right(struct inode *inode, struct extent_status *es)
>  	return es;
>  }
>  
> +#ifdef ES_AGGRESSIVE_TEST
> +static void ext4_es_insert_extent_ext_check(struct inode *inode,
> +					    struct extent_status *es)
> +{
> +	struct ext4_ext_path *path = NULL;
> +	struct ext4_extent *ex;
> +	ext4_lblk_t ee_block;
> +	ext4_fsblk_t ee_start;
> +	unsigned short ee_len;
> +	int depth, ee_status, es_status;
> +
> +	path = ext4_ext_find_extent(inode, es->es_lblk, NULL);
> +	if (IS_ERR(path))
> +		return;
> +
> +	depth = ext_depth(inode);
> +	ex = path[depth].p_ext;
> +
> +	if (ex) {
> +
> +		ee_block = le32_to_cpu(ex->ee_block);
> +		ee_start = ext4_ext_pblock(ex);
> +		ee_len = ext4_ext_get_actual_len(ex);
> +
> +		ee_status = ext4_ext_is_uninitialized(ex) ? 1 : 0;
> +		es_status = ext4_es_is_unwritten(es) ? 1 : 0;
> +
> +		/*
> +		 * Make sure ex and es are not overlap when we try to insert
> +		 * a delayed/hole extent.
> +		 */
> +		if (!ext4_es_is_written(es) && !ext4_es_is_unwritten(es)) {
> +			if (in_range(es->es_lblk, ee_block, ee_len)) {
> +				printk("ES insert assertation failed for inode: %lu "
> +				       "we can find an extent at block "
> +				       "[%d/%d/%llu/%c], but we want to add an "
> +				       "delayed/hole extent [%d/%d/%llu/%llx]\n",
> +				       inode->i_ino, ee_block, ee_len, ee_start,
> +				       ee_status ? 'u' : 'w', es->es_lblk, es->es_len,
> +				       ext4_es_pblock(es), ext4_es_status(es));
> +			}
> +			goto out;
> +		}
> +
> +		/*
> +		 * We don't check ee_block == es->es_lblk, etc. because es
> +		 * might be a part of whole extent, vice versa.
> +		 */
> +		if (es->es_lblk < ee_block ||
> +		    ext4_es_pblock(es) != ee_start + es->es_lblk - ee_block) {
> +			printk("ES insert assertation failed for inode: %lu "
> +			       "ex_status [%d/%d/%llu/%c] != "
> +			       "es_status [%d/%d/%llu/%c]\n", inode->i_ino,
> +			       ee_block, ee_len, ee_start, ee_status ? 'u' : 'w',
> +			       es->es_lblk, es->es_len, ext4_es_pblock(es),
> +			       es_status ? 'u' : 'w');
> +			goto out;
> +		}
> +
> +		if (ee_status ^ es_status) {
> +			printk("ES insert assertation failed for inode: %lu "
> +			       "ex_status [%d/%d/%llu/%c] != "
> +			       "es_status [%d/%d/%llu/%c]\n", inode->i_ino,
> +			       ee_block, ee_len, ee_start, ee_status ? 'u' : 'w',
> +			       es->es_lblk, es->es_len, ext4_es_pblock(es),
> +			       es_status ? 'u' : 'w');
> +		}
> +	} else {
> +		/*
> +		 * We can't find an extent on disk.  So we need to make sure
> +		 * that we don't want to add an written/unwritten extent.
> +		 */
> +		if (!ext4_es_is_delayed(es) && !ext4_es_is_hole(es)) {
> +			printk("ES insert assertation failed for inode: %lu "
> +			       "can't find an extent at block %d but we want "
> +			       "to add an written/unwritten extent "
> +			       "[%d/%d/%llu/%llx]\n", inode->i_ino,
> +			       es->es_lblk, es->es_lblk, es->es_len,
> +			       ext4_es_pblock(es), ext4_es_status(es));
> +		}
> +	}
> +out:
> +	if (path) {
> +		ext4_ext_drop_refs(path);
> +		kfree(path);
> +	}
> +}
> +
> +static void ext4_es_insert_extent_ind_check(struct inode *inode,
> +					    struct extent_status *es)
> +{
> +	struct ext4_map_blocks map;
> +	int retval;
> +
> +	/*
> +	 * Here we call ext4_ind_map_blocks to lookup a block mapping because
> +	 * 'Indirect' structure is defined in indirect.c.  So we couldn't
> +	 * access direct/indirect tree from outside.  It is too dirty to define
> +	 * this function in indirect.c file.
> +	 */
> +
> +	map.m_lblk = es->es_lblk;
> +	map.m_len = es->es_len;
> +
> +	retval = ext4_ind_map_blocks(NULL, inode, &map, 0);
> +	if (retval > 0) {
> +		if (ext4_es_is_delayed(es) || ext4_es_is_hole(es)) {
> +			/*
> +			 * We want to add a delayed/hole extent but this
> +			 * block has been allocated.
> +			 */
> +			printk("ES insert assertation failed for inode: %lu "
> +			       "We can find blocks but we want to add a "
> +			       "delayed/hole extent [%d/%d/%llu/%llx]\n",
> +			       inode->i_ino, es->es_lblk, es->es_len,
> +			       ext4_es_pblock(es), ext4_es_status(es));
> +			return;
> +		} else if (ext4_es_is_written(es)) {
> +			if (retval != es->es_len) {
> +				printk("ES insert assertation failed for inode: "
> +				       "%lu retval %d != es_len %d\n",
> +				       inode->i_ino, retval, es->es_len);
> +				return;
> +			}
> +			if (map.m_pblk != ext4_es_pblock(es)) {
> +				printk("ES insert assertation failed for inode: "
> +				       "%lu m_pblk %llu != es_pblk %llu\n",
> +				       inode->i_ino, map.m_pblk,
> +				       ext4_es_pblock(es));
> +				return;
> +			}
> +		} else {
> +			/*
> +			 * We don't need to check unwritten extent because
> +			 * indirect-based file doesn't have it.
> +			 */
> +			BUG_ON(1);
> +		}
> +	} else if (retval == 0) {
> +		if (ext4_es_is_written(es)) {
> +			printk("ES insert assertation failed for inode: %lu "
> +			       "We can't find the block but we want to add "
> +			       "an written extent [%d/%d/%llu/%llx]\n",
> +			       inode->i_ino, es->es_lblk, es->es_len,
> +			       ext4_es_pblock(es), ext4_es_status(es));
> +			return;
> +		}
> +	}
> +}
> +
> +static inline void ext4_es_insert_extent_check(struct inode *inode,
> +					       struct extent_status *es)
> +{
> +	/*
> +	 * We don't need to worry about the race condition because
> +	 * caller takes i_data_sem locking.
> +	 */
> +	BUG_ON(!rwsem_is_locked(&EXT4_I(inode)->i_data_sem));
> +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> +		ext4_es_insert_extent_ext_check(inode, es);
> +	else
> +		ext4_es_insert_extent_ind_check(inode, es);
> +}
> +#endif
> +
>  static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
>  {
>  	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
> @@ -487,6 +652,10 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>  	ext4_es_store_status(&newes, status);
>  	trace_ext4_es_insert_extent(inode, &newes);
>  
> +#ifdef ES_AGGRESSIVE_TEST
> +	ext4_es_insert_extent_check(inode, &newes);
> +#endif
We can avoid #ifdef here simply by defining ext4_es_insert_extent_check
like follows:
#ifdef ES_AGGRESSIVE_TEST
void  ext4_es_insert_extent_check(inode, &newes) {
... our stuff here
}
#elseif
#define ext4_es_insert_extent_check(a, b) do {} while(0)
#endif

> +#endif
> +
>  	write_lock(&EXT4_I(inode)->i_es_lock);
>  	err = __es_remove_extent(inode, lblk, end);
>  	if (err != 0)
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index f190dfe..56140ad 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -21,6 +21,12 @@
>  #endif
>  
>  /*
> + * With ES_AGGRESSIVE_TEST defined, the result of es caching will be
> + * checked with old map_block's result.
> + */
> +#define ES_AGGRESSIVE_TEST__
> +
> +/*
>   * These flags live in the high bits of extent_status.es_pblk
>   */
>  #define EXTENT_STATUS_WRITTEN	(1ULL << 63)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 95a0c62..3186a43 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -482,6 +482,58 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
>  	return num;
>  }
>  
> +#ifdef ES_AGGRESSIVE_TEST
> +static void ext4_map_blocks_es_recheck(handle_t *handle,
> +				       struct inode *inode,
> +				       struct ext4_map_blocks *es_map,
> +				       struct ext4_map_blocks *map,
> +				       int flags)
> +{
> +	int retval;
> +
> +	map->m_flags = 0;
> +	/*
> +	 * There is a race window that the result is not the same.
> +	 * e.g. xfstests #223 when dioread_nolock enables.  The reason
> +	 * is that we lookup a block mapping in extent status tree with
> +	 * out taking i_data_sem.  So at the time the unwritten extent
> +	 * could be converted.
> +	 */
> +	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
> +		down_read((&EXT4_I(inode)->i_data_sem));
> +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
> +		retval = ext4_ext_map_blocks(handle, inode, map, flags &
> +					     EXT4_GET_BLOCKS_KEEP_SIZE);
> +	} else {
> +		retval = ext4_ind_map_blocks(handle, inode, map, flags &
> +					     EXT4_GET_BLOCKS_KEEP_SIZE);
> +	}
> +	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
> +		up_read((&EXT4_I(inode)->i_data_sem));
> +	/*
> +	 * Clear EXT4_MAP_FROM_CLUSTER and EXT4_MAP_BOUNDARY flag
> +	 * because it shouldn't be marked in es_map->m_flags.
> +	 */
> +	map->m_flags &= ~(EXT4_MAP_FROM_CLUSTER | EXT4_MAP_BOUNDARY);
> +
> +	/*
> +	 * We don't check m_len because extent will be collpased in status
> +	 * tree.  So the m_len might not equal.
> +	 */
> +	if (es_map->m_lblk != map->m_lblk ||
> +	    es_map->m_flags != map->m_flags ||
> +	    es_map->m_pblk != map->m_pblk) {
> +		printk("ES cache assertation failed for inode: %lu "
> +		       "es_cached ex [%d/%d/%llu/%x] != "
> +		       "found ex [%d/%d/%llu/%x] retval %d flags %x\n",
> +		       inode->i_ino, es_map->m_lblk, es_map->m_len,
> +		       es_map->m_pblk, es_map->m_flags, map->m_lblk,
> +		       map->m_len, map->m_pblk, map->m_flags,
> +		       retval, flags);
> +	}
> +}
> +#endif /* ES_AGGRESSIVE_TEST */
> +
>  /*
>   * The ext4_map_blocks() function tries to look up the requested blocks,
>   * and returns if the blocks are already mapped.
> @@ -509,6 +561,11 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>  {
>  	struct extent_status es;
>  	int retval;
> +#ifdef ES_AGGRESSIVE_TEST
> +	struct ext4_map_blocks orig_map;
> +
> +	memcpy(&orig_map, map, sizeof(*map));
> +#endif
>  
>  	map->m_flags = 0;
>  	ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
> @@ -531,6 +588,10 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>  		} else {
>  			BUG_ON(1);
>  		}
> +#ifdef ES_AGGRESSIVE_TEST
> +		ext4_map_blocks_es_recheck(handle, inode, map,
> +					   &orig_map, flags);
> +#endif
>  		goto found;
>  	}
>  
> @@ -551,6 +612,15 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>  		int ret;
>  		unsigned long long status;
>  
> +#ifdef ES_AGGRESSIVE_TEST
> +		if (retval != map->m_len) {
> +			printk("ES len assertation failed for inode: %lu "
> +			       "retval %d != map->m_len %d "
> +			       "in %s (lookup)\n", inode->i_ino, retval,
> +			       map->m_len, __func__);
> +		}
> +#endif
It is reasonable to replace this by one-line check
                BUG_ON(retval != map->m_len)
> +
>  		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
>  				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
>  		if (!(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) &&
> @@ -643,6 +713,15 @@ found:
>  		int ret;
>  		unsigned long long status;
>  
> +#ifdef ES_AGGRESSIVE_TEST
> +		if (retval != map->m_len) {
> +			printk("ES len assertation failed for inode: %lu "
> +			       "retval %d != map->m_len %d "
> +			       "in %s (allocation)\n", inode->i_ino, retval,
> +			       map->m_len, __func__);
> +		}
> +#endif
Also one line check BUG_ON(retval != map->m_len)
> +
>  		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
>  				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
>  		if (!(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) &&
> @@ -1768,6 +1847,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
>  	struct extent_status es;
>  	int retval;
>  	sector_t invalid_block = ~((sector_t) 0xffff);
> +#ifdef ES_AGGRESSIVE_TEST
> +	struct ext4_map_blocks orig_map;
> +
> +	memcpy(&orig_map, map, sizeof(*map));
> +#endif
>  
>  	if (invalid_block < ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es))
>  		invalid_block = ~0;
> @@ -1809,6 +1893,9 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
>  		else
>  			BUG_ON(1);
>  
> +#ifdef ES_AGGRESSIVE_TEST
> +		ext4_map_blocks_es_recheck(NULL, inode, map, &orig_map, 0);
> +#endif
>  		return retval;
>  	}
>  
> @@ -1873,6 +1960,15 @@ add_delayed:
>  		int ret;
>  		unsigned long long status;
>  
> +#ifdef ES_AGGRESSIVE_TEST
> +		if (retval != map->m_len) {
> +			printk("ES len assertation failed for inode: %lu "
> +			       "retval %d != map->m_len %d "
> +			       "in %s (lookup)\n", inode->i_ino, retval,
> +			       map->m_len, __func__);
> +		}
> +#endif
> +
>  		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
>  				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
>  		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> -- 
> 1.7.12.rc2.18.g61b472e
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 3/5] ext4: fix wrong m_len value after unwritten extent conversion
  2013-03-06 14:17 ` [PATCH v2 3/5] ext4: fix wrong m_len value after unwritten extent conversion Zheng Liu
@ 2013-03-07 15:42   ` Dmitry Monakhov
  2013-03-11  1:07   ` Theodore Ts'o
  1 sibling, 0 replies; 27+ messages in thread
From: Dmitry Monakhov @ 2013-03-07 15:42 UTC (permalink / raw)
  To: Zheng Liu, linux-ext4; +Cc: Zheng Liu, Theodore Ts'o

On Wed,  6 Mar 2013 22:17:13 +0800, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> We always assume that the return value of ext4_ext_map_blocks is equal
> to map->m_len but when we try to convert an unwritten extent 'm_len'
> value will break this assumption.  It is harmless until we use status
> tree as a extent cache because we need to update status tree according
> to 'm_len' value.
> 
> Meanwhile this commit marks EXT4_MAP_MAPPED flag after unwritten extent
> conversion.  It shouldn't cause a bug because we update status tree
> according to checking EXT4_MAP_UNWRITTEN flag.  But it should be fixed.
> 
> After applied this commit, the following error message from self-testing
> infrastructure disappears.
> 
>     ...
>     kernel: ES len assertation failed for inode: 230 retval 1 !=
>     map->m_len 3 in ext4_map_blocks (allocation)
>     ...
Ok. fill free to add Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  fs/ext4/extents.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 25c86aa..110e85a 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3650,6 +3650,10 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
>  						 path, map->m_len);
>  		} else
>  			err = ret;
> +		map->m_flags |= EXT4_MAP_MAPPED;
> +		if (allocated > map->m_len)
> +			allocated = map->m_len;
> +		map->m_len = allocated;
>  		goto out2;
>  	}
>  	/* buffered IO case */
> -- 
> 1.7.12.rc2.18.g61b472e
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 4/5] ext4: update extent status tree after an extent is zeroed out
  2013-03-06 14:17 ` [PATCH v2 4/5] ext4: update extent status tree after an extent is zeroed out Zheng Liu
@ 2013-03-07 15:55   ` Dmitry Monakhov
  2013-03-08 13:14     ` Zheng Liu
  0 siblings, 1 reply; 27+ messages in thread
From: Dmitry Monakhov @ 2013-03-07 15:55 UTC (permalink / raw)
  To: Zheng Liu, linux-ext4; +Cc: Zheng Liu, Theodore Ts'o

On Wed,  6 Mar 2013 22:17:14 +0800, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> When we try to split an extent, this extent could be zeroed out and mark
> as initialized.  But we don't know this in ext4_map_blocks because it
> only returns a length of allocated extent.  Meanwhile we will mark this
> extent as uninitialized because we only check m_flags.
> 
> This commit update extent status tree when we try to split an unwritten
> extent.  We don't need to worry about the status of this extent because
> we always mark it as initialized.
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  fs/ext4/extents.c        | 35 +++++++++++++++++++++++++++++++----
>  fs/ext4/extents_status.c | 17 +++++++++++++++++
>  fs/ext4/extents_status.h |  3 +++
>  fs/ext4/inode.c          | 10 ++++++++++
>  4 files changed, 61 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 110e85a..7e37018 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -2925,7 +2925,7 @@ static int ext4_split_extent_at(handle_t *handle,
>  {
>  	ext4_fsblk_t newblock;
>  	ext4_lblk_t ee_block;
> -	struct ext4_extent *ex, newex, orig_ex;
> +	struct ext4_extent *ex, newex, orig_ex, zero_ex;
>  	struct ext4_extent *ex2 = NULL;
>  	unsigned int ee_len, depth;
>  	int err = 0;
> @@ -2996,12 +2996,26 @@ static int ext4_split_extent_at(handle_t *handle,
>  	err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
>  	if (err == -ENOSPC && (EXT4_EXT_MAY_ZEROOUT & split_flag)) {
>  		if (split_flag & (EXT4_EXT_DATA_VALID1|EXT4_EXT_DATA_VALID2)) {
> -			if (split_flag & EXT4_EXT_DATA_VALID1)
> +			if (split_flag & EXT4_EXT_DATA_VALID1) {
>  				err = ext4_ext_zeroout(inode, ex2);
> -			else
> +				zero_ex.ee_block = ex2->ee_block;
> +				zero_ex.ee_len = ext4_ext_get_actual_len(ex2);
> +				ext4_ext_store_pblock(&zero_ex,
> +						      ext4_ext_pblock(ex2));
> +			} else {
>  				err = ext4_ext_zeroout(inode, ex);
> -		} else
> +				zero_ex.ee_block = ex->ee_block;
> +				zero_ex.ee_len = ext4_ext_get_actual_len(ex);
> +				ext4_ext_store_pblock(&zero_ex,
> +						      ext4_ext_pblock(ex));
> +			}
> +		} else {
>  			err = ext4_ext_zeroout(inode, &orig_ex);
> +			zero_ex.ee_block = orig_ex.ee_block;
> +			zero_ex.ee_len = ext4_ext_get_actual_len(&orig_ex);
> +			ext4_ext_store_pblock(&zero_ex,
> +					      ext4_ext_pblock(&orig_ex));              
> +		}
>  		if (err)
>  			goto fix_extent_len;
> @@ -3009,6 +3023,12 @@ static int ext4_split_extent_at(handle_t *handle,
>  		ex->ee_len = cpu_to_le16(ee_len);
>  		ext4_ext_try_to_merge(handle, inode, path, ex);
>  		err = ext4_ext_dirty(handle, inode, path + path->p_depth);
> +		if (err)
> +			goto fix_extent_len;
> +
> +		/* update extent status tree */
> +		err = ext4_es_zeroout(inode, &zero_ex);
> +
Previous blocks are correct but too complex.
At this point we know that extent "ex" becomes initialized so just
manually update it like follows:
        err = ext4_es_insert_extent(inode, ee_block, ee_len,
                                                   ext4_ext_pblock(ex),
                                                   EXTENT_STATUS_WRITTEN);
BTW I'm wonder what happen if one of ext4_es_xxx functions failed with
error. ASAIU this possible only incase of ENOMEM so it is very unlikely
but allowed. If this happens then es_tree will be out of sinc with
extent_tree which later result in corruption.
>  		goto out;
>  	} else if (err)
>  		goto fix_extent_len;
> @@ -3150,6 +3170,7 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
>  	ee_block = le32_to_cpu(ex->ee_block);
>  	ee_len = ext4_ext_get_actual_len(ex);
>  	allocated = ee_len - (map->m_lblk - ee_block);
> +	zero_ex.ee_len = 0;
>  
>  	trace_ext4_ext_convert_to_initialized_enter(inode, map, ex);
>  
> @@ -3247,6 +3268,9 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
>  		err = ext4_ext_zeroout(inode, ex);
>  		if (err)
>  			goto out;
> +		zero_ex.ee_block = ex->ee_block;
> +		zero_ex.ee_len = ext4_ext_get_actual_len(ex);
> +		ext4_ext_store_pblock(&zero_ex, ext4_ext_pblock(ex));
>  
>  		err = ext4_ext_get_access(handle, inode, path + depth);
>  		if (err)
> @@ -3305,6 +3329,9 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
>  		err = allocated;
>  
>  out:
> +	/* If we have gotten a failure, don't zero out status tree */
> +	if (!err)
> +		err = ext4_es_zeroout(inode, &zero_ex);
>  	return err ? err : allocated;
>  }
>  
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index a434f81..e7bebbc 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -854,6 +854,23 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  	return err;
>  }
>  
> +int ext4_es_zeroout(struct inode *inode, struct ext4_extent *ex)
> +{
> +	ext4_lblk_t  ee_block;
> +	ext4_fsblk_t ee_pblock;
> +	unsigned int ee_len;
> +
> +	ee_block  = le32_to_cpu(ex->ee_block);
> +	ee_len    = ext4_ext_get_actual_len(ex);
> +	ee_pblock = ext4_ext_pblock(ex);
Andreas Dilger do not like local variables which used only once.  Let's avoid that.
> +
> +	if (ee_len == 0)
> +		return 0;
> +
> +	return ext4_es_insert_extent(inode, ee_block, ee_len, ee_pblock,
> +				     EXTENT_STATUS_WRITTEN);
> +}
> +
>  static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
>  {
>  	struct ext4_sb_info *sbi = container_of(shrink,
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index 56140ad..d8e2d4d 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -39,6 +39,8 @@
>  				 EXTENT_STATUS_DELAYED | \
>  				 EXTENT_STATUS_HOLE)
>  
> +struct ext4_extent;
> +
>  struct extent_status {
>  	struct rb_node rb_node;
>  	ext4_lblk_t es_lblk;	/* first logical block extent covers */
> @@ -64,6 +66,7 @@ extern void ext4_es_find_delayed_extent(struct inode *inode, ext4_lblk_t lblk,
>  					struct extent_status *es);
>  extern int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
>  				 struct extent_status *es);
> +extern int ext4_es_zeroout(struct inode *inode, struct ext4_extent *ex);
>  
>  static inline int ext4_es_is_written(struct extent_status *es)
>  {
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3186a43..4f1d54a 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -722,6 +722,15 @@ found:
>  		}
>  #endif
>  
> +		/*
> +		 * If the extent has been zeroed out, we don't need to update
> +		 * extent status tree.
> +		 */
> +		if ((flags & EXT4_GET_BLOCKS_PRE_IO) &&
> +		    ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
> +			if (ext4_es_is_written(&es))
> +				goto has_zeroout;
> +		}
>  		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
>  				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
>  		if (!(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) &&
> @@ -734,6 +743,7 @@ found:
>  			retval = ret;
>  	}
>  
> +has_zeroout:
>  	up_write((&EXT4_I(inode)->i_data_sem));
>  	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
>  		int ret = check_block_validity(inode, map);
> -- 
> 1.7.12.rc2.18.g61b472e
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 0/5] ext4: try to fix up es regressions
  2013-03-06 14:17 [PATCH v2 0/5] ext4: try to fix up es regressions Zheng Liu
                   ` (5 preceding siblings ...)
  2013-03-06 22:58 ` Dev branch regressions Theodore Ts'o
@ 2013-03-07 16:08 ` Dmitry Monakhov
  2013-03-08 13:18   ` Zheng Liu
  2013-03-11  2:11 ` Theodore Ts'o
  7 siblings, 1 reply; 27+ messages in thread
From: Dmitry Monakhov @ 2013-03-07 16:08 UTC (permalink / raw)
  To: Zheng Liu, linux-ext4; +Cc: Zheng Liu, Theodore Ts'o

On Wed,  6 Mar 2013 22:17:10 +0800, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> Hi all,
> 
> The patch series tries to fixup some regressions after applied the extent
> status tree.  These patches have rebased against the latest dev branch of
> ext4 and have been tested by xfstests.
> 
> After rebased the latest dev branch, two patches have been dropped because
> they have been applied into the branch.  A new patch is added, which tries
> to fix up a wrong return value in ext4_split_extent().  Otherwise, there
> are two major changes in this version.  The first one is to improve the
> self-testing-infrastructure according to Dmitry's comment.  The second one
> is to improve the zero out code.
> 
> After applied this patch series, I havn't seen the warning messages from
> self-testing infrastructure except the following cases.
> 
>  - xfstests #13 with bigalloc or with no journal
>  - xfstests #223 with dioread_nolock
> The reason is that when we lookup a block mapping from status tree
> i_data_sem locking won't be taken.  So there is a race window that an 
> unwritten extent could be converted by end_io when we compare the result
> between extent tree and status tree.
> 
> Dmitry, Ted, could you please confirm that this patch series can fix the
> defrag regression?  Thank you so much.  Until now I run #300 and #301 a
> lot times but I failed to hit this regression. :-(
Good work. All my tests now succeed (no error, no warning, no bugons), 
BUT 301'th (in terms of git://oss.sgi.com/xfs/cmds/xfstests.git)
result in massive memory leakage
about 8gb in an hour
#while true; do ./check 301  ;done
I suspect that 'struct ext4_ext_path' is leaked somewhere, I'm not
even sure that it is new one. 


I'll be out of email next week. 

Good luck.
> 
> *Big Note*
> When I am testing this patch series, I found some regressions in dev branch.
> Here is a note.  These regressions could be hitted by running test case
> serveral times.  So If we just run xfstests one time, they could be missed.
> 
>  - xfstests #74 with data=journal
>  - xfstests #83 with bigalloc
> Some threads could be blocked for 120s.
> 
>  - xfstests #247 with data=journal
> Some warning messages are printed by ext4_releasepage.  We hit
> WARN_ON(PageChecked(page)) in this function.  But the test case itself can
> pass.
> 
>  - xfstests #269 with dioread_nolock
> The system will hang
> 
> I don't paste full details here to make description clearly.  I will go on
> tracing these problems.  I am happy to provide full details if some one
> want to take a close look at these problems.
> 
> v2 <- v1:
>  * Improve self-testing infrastructure
>  * Improve zero out code
>  * Fix a wrong return value in ext4_split_extent
> 
> v1: http://permalink.gmane.org/gmane.comp.file-systems.ext4/37338
> 
> Thanks,
> 						- Zheng
> 
> Dmitry Monakhov (1):
>   ext4: add self-testing infrastructure to do a sanity check
> 
> Zheng Liu (4):
>   ext4: improve ext4_es_can_be_merged() to avoid a potential overflow
>   ext4: fix wrong m_len value after unwritten extent conversion
>   ext4: update extent status tree after an extent is zeroed out
>   ext4: fix wrong the number of the allocted blocks in
>     ext4_split_extent
> 
>  fs/ext4/extents.c        |  45 ++++++++--
>  fs/ext4/extents_status.c | 212 +++++++++++++++++++++++++++++++++++++++++++++--
>  fs/ext4/extents_status.h |   9 ++
>  fs/ext4/inode.c          | 106 ++++++++++++++++++++++++
>  4 files changed, 362 insertions(+), 10 deletions(-)
> 
> -- 
> 1.7.12.rc2.18.g61b472e
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/5] ext4: add self-testing infrastructure to do a sanity check
  2013-03-07 15:41   ` Dmitry Monakhov
@ 2013-03-08 13:01     ` Zheng Liu
  0 siblings, 0 replies; 27+ messages in thread
From: Zheng Liu @ 2013-03-08 13:01 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o

On Thu, Mar 07, 2013 at 07:41:05PM +0400, Dmitry Monakhov wrote:
> On Wed,  6 Mar 2013 22:17:12 +0800, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> > From: Dmitry Monakhov <dmonakhov@openvz.org>
> Looks good with small comments (see below)

Yes, I will fix them in next version.  Thanks for your comments.

Regards,
                                                - Zheng

> > 
> > This commit adds a self-testing infrastructure like extent tree does to
> > do a sanity check for extent status tree.  After status tree is as a
> > extent cache, we'd better to make sure that it caches right result.
> > 
> > After applied this commit, we will get a lot of messages when we run
> > xfstests as below.
> > 
> > ...
> > kernel: ES len assertation failed for inode: 230 retval 1 != map->m_len
> > 3 in ext4_map_blocks (allocation)
> > ...
> > kernel: ES cache assertation failed for inode: 230 es_cached ex
> > [974/2/4781/20] != found ex [974/1/4781/1000]
> > ...
> > kernel: ES insert assertation failed for inode: 635 ex_status
> > [0/45/21388/w] != es_status [44/1/21432/u]
> > ...
> > 
> > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > ---
> >  fs/ext4/extents_status.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/ext4/extents_status.h |   6 ++
> >  fs/ext4/inode.c          |  96 +++++++++++++++++++++++++++
> >  3 files changed, 271 insertions(+)
> > 
> > diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> > index dc4e4c5..a434f81 100644
> > --- a/fs/ext4/extents_status.c
> > +++ b/fs/ext4/extents_status.c
> > @@ -405,6 +405,171 @@ ext4_es_try_to_merge_right(struct inode *inode, struct extent_status *es)
> >  	return es;
> >  }
> >  
> > +#ifdef ES_AGGRESSIVE_TEST
> > +static void ext4_es_insert_extent_ext_check(struct inode *inode,
> > +					    struct extent_status *es)
> > +{
> > +	struct ext4_ext_path *path = NULL;
> > +	struct ext4_extent *ex;
> > +	ext4_lblk_t ee_block;
> > +	ext4_fsblk_t ee_start;
> > +	unsigned short ee_len;
> > +	int depth, ee_status, es_status;
> > +
> > +	path = ext4_ext_find_extent(inode, es->es_lblk, NULL);
> > +	if (IS_ERR(path))
> > +		return;
> > +
> > +	depth = ext_depth(inode);
> > +	ex = path[depth].p_ext;
> > +
> > +	if (ex) {
> > +
> > +		ee_block = le32_to_cpu(ex->ee_block);
> > +		ee_start = ext4_ext_pblock(ex);
> > +		ee_len = ext4_ext_get_actual_len(ex);
> > +
> > +		ee_status = ext4_ext_is_uninitialized(ex) ? 1 : 0;
> > +		es_status = ext4_es_is_unwritten(es) ? 1 : 0;
> > +
> > +		/*
> > +		 * Make sure ex and es are not overlap when we try to insert
> > +		 * a delayed/hole extent.
> > +		 */
> > +		if (!ext4_es_is_written(es) && !ext4_es_is_unwritten(es)) {
> > +			if (in_range(es->es_lblk, ee_block, ee_len)) {
> > +				printk("ES insert assertation failed for inode: %lu "
> > +				       "we can find an extent at block "
> > +				       "[%d/%d/%llu/%c], but we want to add an "
> > +				       "delayed/hole extent [%d/%d/%llu/%llx]\n",
> > +				       inode->i_ino, ee_block, ee_len, ee_start,
> > +				       ee_status ? 'u' : 'w', es->es_lblk, es->es_len,
> > +				       ext4_es_pblock(es), ext4_es_status(es));
> > +			}
> > +			goto out;
> > +		}
> > +
> > +		/*
> > +		 * We don't check ee_block == es->es_lblk, etc. because es
> > +		 * might be a part of whole extent, vice versa.
> > +		 */
> > +		if (es->es_lblk < ee_block ||
> > +		    ext4_es_pblock(es) != ee_start + es->es_lblk - ee_block) {
> > +			printk("ES insert assertation failed for inode: %lu "
> > +			       "ex_status [%d/%d/%llu/%c] != "
> > +			       "es_status [%d/%d/%llu/%c]\n", inode->i_ino,
> > +			       ee_block, ee_len, ee_start, ee_status ? 'u' : 'w',
> > +			       es->es_lblk, es->es_len, ext4_es_pblock(es),
> > +			       es_status ? 'u' : 'w');
> > +			goto out;
> > +		}
> > +
> > +		if (ee_status ^ es_status) {
> > +			printk("ES insert assertation failed for inode: %lu "
> > +			       "ex_status [%d/%d/%llu/%c] != "
> > +			       "es_status [%d/%d/%llu/%c]\n", inode->i_ino,
> > +			       ee_block, ee_len, ee_start, ee_status ? 'u' : 'w',
> > +			       es->es_lblk, es->es_len, ext4_es_pblock(es),
> > +			       es_status ? 'u' : 'w');
> > +		}
> > +	} else {
> > +		/*
> > +		 * We can't find an extent on disk.  So we need to make sure
> > +		 * that we don't want to add an written/unwritten extent.
> > +		 */
> > +		if (!ext4_es_is_delayed(es) && !ext4_es_is_hole(es)) {
> > +			printk("ES insert assertation failed for inode: %lu "
> > +			       "can't find an extent at block %d but we want "
> > +			       "to add an written/unwritten extent "
> > +			       "[%d/%d/%llu/%llx]\n", inode->i_ino,
> > +			       es->es_lblk, es->es_lblk, es->es_len,
> > +			       ext4_es_pblock(es), ext4_es_status(es));
> > +		}
> > +	}
> > +out:
> > +	if (path) {
> > +		ext4_ext_drop_refs(path);
> > +		kfree(path);
> > +	}
> > +}
> > +
> > +static void ext4_es_insert_extent_ind_check(struct inode *inode,
> > +					    struct extent_status *es)
> > +{
> > +	struct ext4_map_blocks map;
> > +	int retval;
> > +
> > +	/*
> > +	 * Here we call ext4_ind_map_blocks to lookup a block mapping because
> > +	 * 'Indirect' structure is defined in indirect.c.  So we couldn't
> > +	 * access direct/indirect tree from outside.  It is too dirty to define
> > +	 * this function in indirect.c file.
> > +	 */
> > +
> > +	map.m_lblk = es->es_lblk;
> > +	map.m_len = es->es_len;
> > +
> > +	retval = ext4_ind_map_blocks(NULL, inode, &map, 0);
> > +	if (retval > 0) {
> > +		if (ext4_es_is_delayed(es) || ext4_es_is_hole(es)) {
> > +			/*
> > +			 * We want to add a delayed/hole extent but this
> > +			 * block has been allocated.
> > +			 */
> > +			printk("ES insert assertation failed for inode: %lu "
> > +			       "We can find blocks but we want to add a "
> > +			       "delayed/hole extent [%d/%d/%llu/%llx]\n",
> > +			       inode->i_ino, es->es_lblk, es->es_len,
> > +			       ext4_es_pblock(es), ext4_es_status(es));
> > +			return;
> > +		} else if (ext4_es_is_written(es)) {
> > +			if (retval != es->es_len) {
> > +				printk("ES insert assertation failed for inode: "
> > +				       "%lu retval %d != es_len %d\n",
> > +				       inode->i_ino, retval, es->es_len);
> > +				return;
> > +			}
> > +			if (map.m_pblk != ext4_es_pblock(es)) {
> > +				printk("ES insert assertation failed for inode: "
> > +				       "%lu m_pblk %llu != es_pblk %llu\n",
> > +				       inode->i_ino, map.m_pblk,
> > +				       ext4_es_pblock(es));
> > +				return;
> > +			}
> > +		} else {
> > +			/*
> > +			 * We don't need to check unwritten extent because
> > +			 * indirect-based file doesn't have it.
> > +			 */
> > +			BUG_ON(1);
> > +		}
> > +	} else if (retval == 0) {
> > +		if (ext4_es_is_written(es)) {
> > +			printk("ES insert assertation failed for inode: %lu "
> > +			       "We can't find the block but we want to add "
> > +			       "an written extent [%d/%d/%llu/%llx]\n",
> > +			       inode->i_ino, es->es_lblk, es->es_len,
> > +			       ext4_es_pblock(es), ext4_es_status(es));
> > +			return;
> > +		}
> > +	}
> > +}
> > +
> > +static inline void ext4_es_insert_extent_check(struct inode *inode,
> > +					       struct extent_status *es)
> > +{
> > +	/*
> > +	 * We don't need to worry about the race condition because
> > +	 * caller takes i_data_sem locking.
> > +	 */
> > +	BUG_ON(!rwsem_is_locked(&EXT4_I(inode)->i_data_sem));
> > +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> > +		ext4_es_insert_extent_ext_check(inode, es);
> > +	else
> > +		ext4_es_insert_extent_ind_check(inode, es);
> > +}
> > +#endif
> > +
> >  static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
> >  {
> >  	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
> > @@ -487,6 +652,10 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> >  	ext4_es_store_status(&newes, status);
> >  	trace_ext4_es_insert_extent(inode, &newes);
> >  
> > +#ifdef ES_AGGRESSIVE_TEST
> > +	ext4_es_insert_extent_check(inode, &newes);
> > +#endif
> We can avoid #ifdef here simply by defining ext4_es_insert_extent_check
> like follows:
> #ifdef ES_AGGRESSIVE_TEST
> void  ext4_es_insert_extent_check(inode, &newes) {
> ... our stuff here
> }
> #elseif
> #define ext4_es_insert_extent_check(a, b) do {} while(0)
> #endif
> 
> > +#endif
> > +
> >  	write_lock(&EXT4_I(inode)->i_es_lock);
> >  	err = __es_remove_extent(inode, lblk, end);
> >  	if (err != 0)
> > diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> > index f190dfe..56140ad 100644
> > --- a/fs/ext4/extents_status.h
> > +++ b/fs/ext4/extents_status.h
> > @@ -21,6 +21,12 @@
> >  #endif
> >  
> >  /*
> > + * With ES_AGGRESSIVE_TEST defined, the result of es caching will be
> > + * checked with old map_block's result.
> > + */
> > +#define ES_AGGRESSIVE_TEST__
> > +
> > +/*
> >   * These flags live in the high bits of extent_status.es_pblk
> >   */
> >  #define EXTENT_STATUS_WRITTEN	(1ULL << 63)
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 95a0c62..3186a43 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -482,6 +482,58 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
> >  	return num;
> >  }
> >  
> > +#ifdef ES_AGGRESSIVE_TEST
> > +static void ext4_map_blocks_es_recheck(handle_t *handle,
> > +				       struct inode *inode,
> > +				       struct ext4_map_blocks *es_map,
> > +				       struct ext4_map_blocks *map,
> > +				       int flags)
> > +{
> > +	int retval;
> > +
> > +	map->m_flags = 0;
> > +	/*
> > +	 * There is a race window that the result is not the same.
> > +	 * e.g. xfstests #223 when dioread_nolock enables.  The reason
> > +	 * is that we lookup a block mapping in extent status tree with
> > +	 * out taking i_data_sem.  So at the time the unwritten extent
> > +	 * could be converted.
> > +	 */
> > +	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
> > +		down_read((&EXT4_I(inode)->i_data_sem));
> > +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
> > +		retval = ext4_ext_map_blocks(handle, inode, map, flags &
> > +					     EXT4_GET_BLOCKS_KEEP_SIZE);
> > +	} else {
> > +		retval = ext4_ind_map_blocks(handle, inode, map, flags &
> > +					     EXT4_GET_BLOCKS_KEEP_SIZE);
> > +	}
> > +	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
> > +		up_read((&EXT4_I(inode)->i_data_sem));
> > +	/*
> > +	 * Clear EXT4_MAP_FROM_CLUSTER and EXT4_MAP_BOUNDARY flag
> > +	 * because it shouldn't be marked in es_map->m_flags.
> > +	 */
> > +	map->m_flags &= ~(EXT4_MAP_FROM_CLUSTER | EXT4_MAP_BOUNDARY);
> > +
> > +	/*
> > +	 * We don't check m_len because extent will be collpased in status
> > +	 * tree.  So the m_len might not equal.
> > +	 */
> > +	if (es_map->m_lblk != map->m_lblk ||
> > +	    es_map->m_flags != map->m_flags ||
> > +	    es_map->m_pblk != map->m_pblk) {
> > +		printk("ES cache assertation failed for inode: %lu "
> > +		       "es_cached ex [%d/%d/%llu/%x] != "
> > +		       "found ex [%d/%d/%llu/%x] retval %d flags %x\n",
> > +		       inode->i_ino, es_map->m_lblk, es_map->m_len,
> > +		       es_map->m_pblk, es_map->m_flags, map->m_lblk,
> > +		       map->m_len, map->m_pblk, map->m_flags,
> > +		       retval, flags);
> > +	}
> > +}
> > +#endif /* ES_AGGRESSIVE_TEST */
> > +
> >  /*
> >   * The ext4_map_blocks() function tries to look up the requested blocks,
> >   * and returns if the blocks are already mapped.
> > @@ -509,6 +561,11 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> >  {
> >  	struct extent_status es;
> >  	int retval;
> > +#ifdef ES_AGGRESSIVE_TEST
> > +	struct ext4_map_blocks orig_map;
> > +
> > +	memcpy(&orig_map, map, sizeof(*map));
> > +#endif
> >  
> >  	map->m_flags = 0;
> >  	ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
> > @@ -531,6 +588,10 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> >  		} else {
> >  			BUG_ON(1);
> >  		}
> > +#ifdef ES_AGGRESSIVE_TEST
> > +		ext4_map_blocks_es_recheck(handle, inode, map,
> > +					   &orig_map, flags);
> > +#endif
> >  		goto found;
> >  	}
> >  
> > @@ -551,6 +612,15 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> >  		int ret;
> >  		unsigned long long status;
> >  
> > +#ifdef ES_AGGRESSIVE_TEST
> > +		if (retval != map->m_len) {
> > +			printk("ES len assertation failed for inode: %lu "
> > +			       "retval %d != map->m_len %d "
> > +			       "in %s (lookup)\n", inode->i_ino, retval,
> > +			       map->m_len, __func__);
> > +		}
> > +#endif
> It is reasonable to replace this by one-line check
>                 BUG_ON(retval != map->m_len)
> > +
> >  		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
> >  				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> >  		if (!(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) &&
> > @@ -643,6 +713,15 @@ found:
> >  		int ret;
> >  		unsigned long long status;
> >  
> > +#ifdef ES_AGGRESSIVE_TEST
> > +		if (retval != map->m_len) {
> > +			printk("ES len assertation failed for inode: %lu "
> > +			       "retval %d != map->m_len %d "
> > +			       "in %s (allocation)\n", inode->i_ino, retval,
> > +			       map->m_len, __func__);
> > +		}
> > +#endif
> Also one line check BUG_ON(retval != map->m_len)
> > +
> >  		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
> >  				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> >  		if (!(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) &&
> > @@ -1768,6 +1847,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
> >  	struct extent_status es;
> >  	int retval;
> >  	sector_t invalid_block = ~((sector_t) 0xffff);
> > +#ifdef ES_AGGRESSIVE_TEST
> > +	struct ext4_map_blocks orig_map;
> > +
> > +	memcpy(&orig_map, map, sizeof(*map));
> > +#endif
> >  
> >  	if (invalid_block < ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es))
> >  		invalid_block = ~0;
> > @@ -1809,6 +1893,9 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
> >  		else
> >  			BUG_ON(1);
> >  
> > +#ifdef ES_AGGRESSIVE_TEST
> > +		ext4_map_blocks_es_recheck(NULL, inode, map, &orig_map, 0);
> > +#endif
> >  		return retval;
> >  	}
> >  
> > @@ -1873,6 +1960,15 @@ add_delayed:
> >  		int ret;
> >  		unsigned long long status;
> >  
> > +#ifdef ES_AGGRESSIVE_TEST
> > +		if (retval != map->m_len) {
> > +			printk("ES len assertation failed for inode: %lu "
> > +			       "retval %d != map->m_len %d "
> > +			       "in %s (lookup)\n", inode->i_ino, retval,
> > +			       map->m_len, __func__);
> > +		}
> > +#endif
> > +
> >  		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
> >  				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> >  		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> > -- 
> > 1.7.12.rc2.18.g61b472e
> > 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 4/5] ext4: update extent status tree after an extent is zeroed out
  2013-03-07 15:55   ` Dmitry Monakhov
@ 2013-03-08 13:14     ` Zheng Liu
  0 siblings, 0 replies; 27+ messages in thread
From: Zheng Liu @ 2013-03-08 13:14 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o

On Thu, Mar 07, 2013 at 07:55:14PM +0400, Dmitry Monakhov wrote:
> On Wed,  6 Mar 2013 22:17:14 +0800, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > When we try to split an extent, this extent could be zeroed out and mark
> > as initialized.  But we don't know this in ext4_map_blocks because it
> > only returns a length of allocated extent.  Meanwhile we will mark this
> > extent as uninitialized because we only check m_flags.
> > 
> > This commit update extent status tree when we try to split an unwritten
> > extent.  We don't need to worry about the status of this extent because
> > we always mark it as initialized.
> > 
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > Cc: Dmitry Monakhov <dmonakhov@openvz.org>
> > ---
> >  fs/ext4/extents.c        | 35 +++++++++++++++++++++++++++++++----
> >  fs/ext4/extents_status.c | 17 +++++++++++++++++
> >  fs/ext4/extents_status.h |  3 +++
> >  fs/ext4/inode.c          | 10 ++++++++++
> >  4 files changed, 61 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index 110e85a..7e37018 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -2925,7 +2925,7 @@ static int ext4_split_extent_at(handle_t *handle,
> >  {
> >  	ext4_fsblk_t newblock;
> >  	ext4_lblk_t ee_block;
> > -	struct ext4_extent *ex, newex, orig_ex;
> > +	struct ext4_extent *ex, newex, orig_ex, zero_ex;
> >  	struct ext4_extent *ex2 = NULL;
> >  	unsigned int ee_len, depth;
> >  	int err = 0;
> > @@ -2996,12 +2996,26 @@ static int ext4_split_extent_at(handle_t *handle,
> >  	err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
> >  	if (err == -ENOSPC && (EXT4_EXT_MAY_ZEROOUT & split_flag)) {
> >  		if (split_flag & (EXT4_EXT_DATA_VALID1|EXT4_EXT_DATA_VALID2)) {
> > -			if (split_flag & EXT4_EXT_DATA_VALID1)
> > +			if (split_flag & EXT4_EXT_DATA_VALID1) {
> >  				err = ext4_ext_zeroout(inode, ex2);
> > -			else
> > +				zero_ex.ee_block = ex2->ee_block;
> > +				zero_ex.ee_len = ext4_ext_get_actual_len(ex2);
> > +				ext4_ext_store_pblock(&zero_ex,
> > +						      ext4_ext_pblock(ex2));
> > +			} else {
> >  				err = ext4_ext_zeroout(inode, ex);
> > -		} else
> > +				zero_ex.ee_block = ex->ee_block;
> > +				zero_ex.ee_len = ext4_ext_get_actual_len(ex);
> > +				ext4_ext_store_pblock(&zero_ex,
> > +						      ext4_ext_pblock(ex));
> > +			}
> > +		} else {
> >  			err = ext4_ext_zeroout(inode, &orig_ex);
> > +			zero_ex.ee_block = orig_ex.ee_block;
> > +			zero_ex.ee_len = ext4_ext_get_actual_len(&orig_ex);
> > +			ext4_ext_store_pblock(&zero_ex,
> > +					      ext4_ext_pblock(&orig_ex));              
> > +		}
> >  		if (err)
> >  			goto fix_extent_len;
> > @@ -3009,6 +3023,12 @@ static int ext4_split_extent_at(handle_t *handle,
> >  		ex->ee_len = cpu_to_le16(ee_len);
> >  		ext4_ext_try_to_merge(handle, inode, path, ex);
> >  		err = ext4_ext_dirty(handle, inode, path + path->p_depth);
> > +		if (err)
> > +			goto fix_extent_len;
> > +
> > +		/* update extent status tree */
> > +		err = ext4_es_zeroout(inode, &zero_ex);
> > +
> Previous blocks are correct but too complex.
> At this point we know that extent "ex" becomes initialized so just
> manually update it like follows:
>         err = ext4_es_insert_extent(inode, ee_block, ee_len,
>                                                    ext4_ext_pblock(ex),
>                                                    EXTENT_STATUS_WRITTEN);

Yeah, but maybe a inline function is better.  As you said below, I need
to avoid to define a local variable that is used only once.  So it look
like this:

inline int ext4_es_zeroout(struct inode *inode, struct ext4_extent *ex)
{
        return ext4_es_insert_extent(inode, le32_to_cpu(ex->ee_block),
                                     ext4_ext_get_actual_len(ex),
                                     ext4_ext_pblock(ex),
                                     EXTENT_STATUS_WRITTEN);
}

What do you think?

> BTW I'm wonder what happen if one of ext4_es_xxx functions failed with
> error. ASAIU this possible only incase of ENOMEM so it is very unlikely
> but allowed. If this happens then es_tree will be out of sinc with
> extent_tree which later result in corruption.

I don't think we need to worry about it because we always remove some
extents from status tree, and then try to insert some one.  -ENOMEM will
be returned when we are trying to insert a extent.  So when -ENOMEM is
returned, that means that no related extent is in status tree.  So later
we won't lookup this extent in map_blocks().

> >  		goto out;
> >  	} else if (err)
> >  		goto fix_extent_len;
> > @@ -3150,6 +3170,7 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
> >  	ee_block = le32_to_cpu(ex->ee_block);
> >  	ee_len = ext4_ext_get_actual_len(ex);
> >  	allocated = ee_len - (map->m_lblk - ee_block);
> > +	zero_ex.ee_len = 0;
> >  
> >  	trace_ext4_ext_convert_to_initialized_enter(inode, map, ex);
> >  
> > @@ -3247,6 +3268,9 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
> >  		err = ext4_ext_zeroout(inode, ex);
> >  		if (err)
> >  			goto out;
> > +		zero_ex.ee_block = ex->ee_block;
> > +		zero_ex.ee_len = ext4_ext_get_actual_len(ex);
> > +		ext4_ext_store_pblock(&zero_ex, ext4_ext_pblock(ex));
> >  
> >  		err = ext4_ext_get_access(handle, inode, path + depth);
> >  		if (err)
> > @@ -3305,6 +3329,9 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
> >  		err = allocated;
> >  
> >  out:
> > +	/* If we have gotten a failure, don't zero out status tree */
> > +	if (!err)
> > +		err = ext4_es_zeroout(inode, &zero_ex);
> >  	return err ? err : allocated;
> >  }
> >  
> > diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> > index a434f81..e7bebbc 100644
> > --- a/fs/ext4/extents_status.c
> > +++ b/fs/ext4/extents_status.c
> > @@ -854,6 +854,23 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
> >  	return err;
> >  }
> >  
> > +int ext4_es_zeroout(struct inode *inode, struct ext4_extent *ex)
> > +{
> > +	ext4_lblk_t  ee_block;
> > +	ext4_fsblk_t ee_pblock;
> > +	unsigned int ee_len;
> > +
> > +	ee_block  = le32_to_cpu(ex->ee_block);
> > +	ee_len    = ext4_ext_get_actual_len(ex);
> > +	ee_pblock = ext4_ext_pblock(ex);
> Andreas Dilger do not like local variables which used only once.  Let's avoid that.

As I said above.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 0/5] ext4: try to fix up es regressions
  2013-03-07 16:08 ` [PATCH v2 0/5] ext4: try to fix up es regressions Dmitry Monakhov
@ 2013-03-08 13:18   ` Zheng Liu
  0 siblings, 0 replies; 27+ messages in thread
From: Zheng Liu @ 2013-03-08 13:18 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o

On Thu, Mar 07, 2013 at 08:08:47PM +0400, Dmitry Monakhov wrote:
> On Wed,  6 Mar 2013 22:17:10 +0800, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> > Hi all,
> > 
> > The patch series tries to fixup some regressions after applied the extent
> > status tree.  These patches have rebased against the latest dev branch of
> > ext4 and have been tested by xfstests.
> > 
> > After rebased the latest dev branch, two patches have been dropped because
> > they have been applied into the branch.  A new patch is added, which tries
> > to fix up a wrong return value in ext4_split_extent().  Otherwise, there
> > are two major changes in this version.  The first one is to improve the
> > self-testing-infrastructure according to Dmitry's comment.  The second one
> > is to improve the zero out code.
> > 
> > After applied this patch series, I havn't seen the warning messages from
> > self-testing infrastructure except the following cases.
> > 
> >  - xfstests #13 with bigalloc or with no journal
> >  - xfstests #223 with dioread_nolock
> > The reason is that when we lookup a block mapping from status tree
> > i_data_sem locking won't be taken.  So there is a race window that an 
> > unwritten extent could be converted by end_io when we compare the result
> > between extent tree and status tree.
> > 
> > Dmitry, Ted, could you please confirm that this patch series can fix the
> > defrag regression?  Thank you so much.  Until now I run #300 and #301 a
> > lot times but I failed to hit this regression. :-(
> Good work. All my tests now succeed (no error, no warning, no bugons), 

Great!  Thanks for your confirmation.

> BUT 301'th (in terms of git://oss.sgi.com/xfs/cmds/xfstests.git)
> result in massive memory leakage
> about 8gb in an hour
> #while true; do ./check 301  ;done
> I suspect that 'struct ext4_ext_path' is leaked somewhere, I'm not
> even sure that it is new one. 

Thanks for the reminder.  Maybe there still has some bugs in extent
tree.  I will look at it.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 1/5] ext4: improve ext4_es_can_be_merged() to avoid a potential overflow
  2013-03-06 14:17 ` [PATCH v2 1/5] ext4: improve ext4_es_can_be_merged() to avoid a potential overflow Zheng Liu
@ 2013-03-11  0:43   ` Theodore Ts'o
  2013-03-11  6:03     ` Zheng Liu
  0 siblings, 1 reply; 27+ messages in thread
From: Theodore Ts'o @ 2013-03-11  0:43 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Dmitry Monakhov

On Wed, Mar 06, 2013 at 10:17:11PM +0800, Zheng Liu wrote:
> +	if (ext4_es_status(es1) ^ ext4_es_status(es2))
>  		return 0;
>  
> -	if (ext4_es_status(es1) != ext4_es_status(es2))

Did you have a reason why changed != to ^?  

It's identical from a functional perspective, but it's less obvious to
future readers of the code what's going on.  I tried checking to see
if GCC did any better optimizing the code, but it doesn't seem to make
any difference.  I'm going to switch it back to !=....

> +	/* we need to check delayed extent is without unwritten status */
> +	if (ext4_es_is_delayed(es1) && !ext4_es_is_unwritten(es1))
> +		return 1;

I'm not sure why we need to check the unwritten status?  Under what
circumstances would we have an extent marked as under delayed
allocation but also unwritten?

							- Ted

This is how I've restructured this function for now mainly to make it
easier to understand;

static int ext4_es_can_be_merged(struct extent_status *es1,
				 struct extent_status *es2)
{
	if (ext4_es_status(es1) != ext4_es_status(es2))
		return 0;

	if (((__u64) es1->es_len) + es2->es_len > 0xFFFFFFFFULL)
		return 0;

	if (((__u64) es1->es_lblk) + es1->es_len != es2->es_lblk)
		return 0;

	if ((ext4_es_is_written(es1) || ext4_es_is_unwritten(es1)) &&
	    (ext4_es_pblock(es1) + es1->es_len == ext4_es_pblock(es2)))
		return 1;

	if (ext4_es_is_hole(es1))
		return 1;

	/* we need to check delayed extent is without unwritten status */
	if (ext4_es_is_delayed(es1) && !ext4_es_is_unwritten(es1))
		return 1;

	return 0;
}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/5] ext4: add self-testing infrastructure to do a sanity check
  2013-03-06 14:17 ` [PATCH v2 2/5] ext4: add self-testing infrastructure to do a sanity check Zheng Liu
  2013-03-07 15:41   ` Dmitry Monakhov
@ 2013-03-11  1:01   ` Theodore Ts'o
  1 sibling, 0 replies; 27+ messages in thread
From: Theodore Ts'o @ 2013-03-11  1:01 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Dmitry Monakhov, Zheng Liu

On Wed, Mar 06, 2013 at 10:17:12PM +0800, Zheng Liu wrote:
> +				printk("ES insert assertation failed for inode: %lu "
> +				       "we can find an extent at block "

FYI, I changed all of the printk() messages to pr_warn().

  	      	     	      	       - Ted

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 3/5] ext4: fix wrong m_len value after unwritten extent conversion
  2013-03-06 14:17 ` [PATCH v2 3/5] ext4: fix wrong m_len value after unwritten extent conversion Zheng Liu
  2013-03-07 15:42   ` Dmitry Monakhov
@ 2013-03-11  1:07   ` Theodore Ts'o
  2013-03-11  5:47     ` Zheng Liu
  1 sibling, 1 reply; 27+ messages in thread
From: Theodore Ts'o @ 2013-03-11  1:07 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Dmitry Monakhov

On Wed, Mar 06, 2013 at 10:17:13PM +0800, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> We always assume that the return value of ext4_ext_map_blocks is equal
> to map->m_len

Note that in general, this is _never_ safe to assume.  There are a
number of times when the number of blocks mapped is less than what the
caller originally requested, both when allocating blocks (and there
isn't the requestd number of contiguous blocks available), and when
EXT4_GET_BLOCKS_CREATE is not set.

					- Ted

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 0/5] ext4: try to fix up es regressions
  2013-03-06 14:17 [PATCH v2 0/5] ext4: try to fix up es regressions Zheng Liu
                   ` (6 preceding siblings ...)
  2013-03-07 16:08 ` [PATCH v2 0/5] ext4: try to fix up es regressions Dmitry Monakhov
@ 2013-03-11  2:11 ` Theodore Ts'o
  2013-03-11  6:23   ` Zheng Liu
  7 siblings, 1 reply; 27+ messages in thread
From: Theodore Ts'o @ 2013-03-11  2:11 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Dmitry Monakhov

I've placed these patches (with a few minor fixups) on the dev branch,
and will start running a test/verification run on them).  Thanks for
working on these patches!

						- Ted

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 3/5] ext4: fix wrong m_len value after unwritten extent conversion
  2013-03-11  1:07   ` Theodore Ts'o
@ 2013-03-11  5:47     ` Zheng Liu
  2013-03-13  1:57       ` Theodore Ts'o
  0 siblings, 1 reply; 27+ messages in thread
From: Zheng Liu @ 2013-03-11  5:47 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Zheng Liu, Dmitry Monakhov

On Sun, Mar 10, 2013 at 09:07:18PM -0400, Theodore Ts'o wrote:
> On Wed, Mar 06, 2013 at 10:17:13PM +0800, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > We always assume that the return value of ext4_ext_map_blocks is equal
> > to map->m_len
> 
> Note that in general, this is _never_ safe to assume.  There are a
> number of times when the number of blocks mapped is less than what the
> caller originally requested, both when allocating blocks (and there
> isn't the requestd number of contiguous blocks available), and when
> EXT4_GET_BLOCKS_CREATE is not set.

Yes, When EXT4_GET_BLOCKS_CREATE is not set, it could be 0 because there
is no block mapping, and we don't create it.  Meanwhile when we want to
allocate some blocks, it could be less than the number of block we
requested.  But IMHO at least when we try to allocate some blocks, m_len
should be changed according to the number of allocated blocks in order
to make them equal if the number of allocated blocks is less than the
number of blocks we requested.  Namely, when the return value (retval)
is greater than 0, this assumption will be right.  Because we will use
m_len value after map_blocks function returns.  We need to let upper
level know it, such as write_begin, DIO, etc...  Am I miss something?

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 1/5] ext4: improve ext4_es_can_be_merged() to avoid a potential overflow
  2013-03-11  0:43   ` Theodore Ts'o
@ 2013-03-11  6:03     ` Zheng Liu
  0 siblings, 0 replies; 27+ messages in thread
From: Zheng Liu @ 2013-03-11  6:03 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Zheng Liu, Dmitry Monakhov

On Sun, Mar 10, 2013 at 08:43:58PM -0400, Theodore Ts'o wrote:
> On Wed, Mar 06, 2013 at 10:17:11PM +0800, Zheng Liu wrote:
> > +	if (ext4_es_status(es1) ^ ext4_es_status(es2))
> >  		return 0;
> >  
> > -	if (ext4_es_status(es1) != ext4_es_status(es2))
> 
> Did you have a reason why changed != to ^?  

Honestly, no.  Just because subconsciously I think bit operation is
faster than other operations, and in ext4_can_extents_be_merged() it
also use '^'.  So I guess there is an optimization.

> 
> It's identical from a functional perspective, but it's less obvious to
> future readers of the code what's going on.  I tried checking to see
> if GCC did any better optimizing the code, but it doesn't seem to make
> any difference.  I'm going to switch it back to !=....

Obviously I'm wrong.  Thanks for checking it.

> 
> > +	/* we need to check delayed extent is without unwritten status */
> > +	if (ext4_es_is_delayed(es1) && !ext4_es_is_unwritten(es1))
> > +		return 1;
> 
> I'm not sure why we need to check the unwritten status?  Under what
> circumstances would we have an extent marked as under delayed
> allocation but also unwritten?

We could do some buffered writes into a hole.  So the extent will be
with delayed status.  Before these extents are written out, user might
uses fallocate(2) to preallocate some blocks at the same offset.  Then
these extents are marked as unwritten status.  But we still need to keep
delayed status because later these extents will be written out and we
will update reserved space according to these extents, especially in a
bigalloc file system.

> 
> 							- Ted
> 
> This is how I've restructured this function for now mainly to make it
> easier to understand;
> 
> static int ext4_es_can_be_merged(struct extent_status *es1,
> 				 struct extent_status *es2)
> {
> 	if (ext4_es_status(es1) != ext4_es_status(es2))
> 		return 0;
> 
> 	if (((__u64) es1->es_len) + es2->es_len > 0xFFFFFFFFULL)
> 		return 0;
> 
> 	if (((__u64) es1->es_lblk) + es1->es_len != es2->es_lblk)
> 		return 0;
> 
> 	if ((ext4_es_is_written(es1) || ext4_es_is_unwritten(es1)) &&
> 	    (ext4_es_pblock(es1) + es1->es_len == ext4_es_pblock(es2)))
> 		return 1;
> 
> 	if (ext4_es_is_hole(es1))
> 		return 1;
> 
> 	/* we need to check delayed extent is without unwritten status */
> 	if (ext4_es_is_delayed(es1) && !ext4_es_is_unwritten(es1))
> 		return 1;
> 
> 	return 0;
> }

It looks good to me.

Thanks,
                                                - Zheng

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 0/5] ext4: try to fix up es regressions
  2013-03-11  2:11 ` Theodore Ts'o
@ 2013-03-11  6:23   ` Zheng Liu
  0 siblings, 0 replies; 27+ messages in thread
From: Zheng Liu @ 2013-03-11  6:23 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Zheng Liu, Dmitry Monakhov

On Sun, Mar 10, 2013 at 10:11:08PM -0400, Theodore Ts'o wrote:
> I've placed these patches (with a few minor fixups) on the dev branch,
> and will start running a test/verification run on them).  Thanks for
> working on these patches!

Thanks.

As Dmitry's comment, I have updated my patches.

In patch 2 (ext4: add self-testing infrastructure to do a sanity check)
Dmitry suggests that we'd better replace 'retval != map->m_len' condition
with a BUG_ON().  But now I don't think we need to touch it because we
are still discussing this assumption.

In patch 4 (ext4: update extent status tree after an extent is zeroed
out) I have moved ext4_es_zeroout() into extents.c file and remove some
local variables that are only used once.  Do I need to send it out?  I
know the test has started.  Maybe I should wait a moment, and let the
tree stable before -rc2.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 3/5] ext4: fix wrong m_len value after unwritten extent conversion
  2013-03-11  5:47     ` Zheng Liu
@ 2013-03-13  1:57       ` Theodore Ts'o
  2013-03-13  2:14         ` Theodore Ts'o
  0 siblings, 1 reply; 27+ messages in thread
From: Theodore Ts'o @ 2013-03-13  1:57 UTC (permalink / raw)
  To: linux-ext4, Zheng Liu, Dmitry Monakhov

On Mon, Mar 11, 2013 at 01:47:07PM +0800, Zheng Liu wrote:
> On Sun, Mar 10, 2013 at 09:07:18PM -0400, Theodore Ts'o wrote:
> > On Wed, Mar 06, 2013 at 10:17:13PM +0800, Zheng Liu wrote:
> > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > 
> > > We always assume that the return value of ext4_ext_map_blocks is equal
> > > to map->m_len
> > 
> > Note that in general, this is _never_ safe to assume.  There are a
> > number of times when the number of blocks mapped is less than what the
> > caller originally requested, both when allocating blocks (and there
> > isn't the requestd number of contiguous blocks available), and when
> > EXT4_GET_BLOCKS_CREATE is not set.
> 
> Yes, When EXT4_GET_BLOCKS_CREATE is not set, it could be 0 because there
> is no block mapping, and we don't create it.  Meanwhile when we want to
> allocate some blocks, it could be less than the number of block we
> requested.  But IMHO at least when we try to allocate some blocks, m_len
> should be changed according to the number of allocated blocks in order
> to make them equal if the number of allocated blocks is less than the
> number of blocks we requested.  Namely, when the return value (retval)
> is greater than 0, this assumption will be right.  Because we will use
> m_len value after map_blocks function returns.  We need to let upper
> level know it, such as write_begin, DIO, etc...  Am I miss something?

No, you didn't miss anything.  I just wanted to say point out that any
assumption that ext4_ext_map_blocks() is equial to map->m_len was
always wrong, and not something that recently changed.  I updated the
commit description lightly to make this clear.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 3/5] ext4: fix wrong m_len value after unwritten extent conversion
  2013-03-13  1:57       ` Theodore Ts'o
@ 2013-03-13  2:14         ` Theodore Ts'o
  2013-03-13  8:53           ` Zheng Liu
  0 siblings, 1 reply; 27+ messages in thread
From: Theodore Ts'o @ 2013-03-13  2:14 UTC (permalink / raw)
  To: linux-ext4, Zheng Liu, Dmitry Monakhov

Oh, now I see my confusion.  You're talking about retval being
different from the output of map->m_len.  I was talking about retval
being often different from the *input* value of map->m_len.

I agree that the output of ext4_ext_map_blocks() and
ext4_ind_map_blocks() should be the same as map->m_len, and having a
BUG_ON there would make sense.  But we can save those changes for
after -rc3 as a cleanup....

					- Ted

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 3/5] ext4: fix wrong m_len value after unwritten extent conversion
  2013-03-13  2:14         ` Theodore Ts'o
@ 2013-03-13  8:53           ` Zheng Liu
  0 siblings, 0 replies; 27+ messages in thread
From: Zheng Liu @ 2013-03-13  8:53 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Zheng Liu, Dmitry Monakhov

On Tue, Mar 12, 2013 at 10:14:20PM -0400, Theodore Ts'o wrote:
> Oh, now I see my confusion.  You're talking about retval being
> different from the output of map->m_len.  I was talking about retval
> being often different from the *input* value of map->m_len.

Ah, sorry for my bad expression. :-(

> 
> I agree that the output of ext4_ext_map_blocks() and
> ext4_ind_map_blocks() should be the same as map->m_len, and having a
> BUG_ON there would make sense.  But we can save those changes for
> after -rc3 as a cleanup....

Actually, in my tree I use a WARN_ON() because I don't want to throw a
kernel panic for user.  I think it is too unfriendly.  I will send a
patch to change it after -rc3.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2013-03-13  8:37 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-06 14:17 [PATCH v2 0/5] ext4: try to fix up es regressions Zheng Liu
2013-03-06 14:17 ` [PATCH v2 1/5] ext4: improve ext4_es_can_be_merged() to avoid a potential overflow Zheng Liu
2013-03-11  0:43   ` Theodore Ts'o
2013-03-11  6:03     ` Zheng Liu
2013-03-06 14:17 ` [PATCH v2 2/5] ext4: add self-testing infrastructure to do a sanity check Zheng Liu
2013-03-07 15:41   ` Dmitry Monakhov
2013-03-08 13:01     ` Zheng Liu
2013-03-11  1:01   ` Theodore Ts'o
2013-03-06 14:17 ` [PATCH v2 3/5] ext4: fix wrong m_len value after unwritten extent conversion Zheng Liu
2013-03-07 15:42   ` Dmitry Monakhov
2013-03-11  1:07   ` Theodore Ts'o
2013-03-11  5:47     ` Zheng Liu
2013-03-13  1:57       ` Theodore Ts'o
2013-03-13  2:14         ` Theodore Ts'o
2013-03-13  8:53           ` Zheng Liu
2013-03-06 14:17 ` [PATCH v2 4/5] ext4: update extent status tree after an extent is zeroed out Zheng Liu
2013-03-07 15:55   ` Dmitry Monakhov
2013-03-08 13:14     ` Zheng Liu
2013-03-06 14:17 ` [PATCH v2 5/5] ext4: fix wrong the number of the allocted blocks in ext4_split_extent Zheng Liu
2013-03-06 22:58 ` Dev branch regressions Theodore Ts'o
2013-03-07  2:40   ` Zheng Liu
2013-03-07  6:47   ` Lukáš Czerner
2013-03-07 11:54     ` Zheng Liu
2013-03-07 16:08 ` [PATCH v2 0/5] ext4: try to fix up es regressions Dmitry Monakhov
2013-03-08 13:18   ` Zheng Liu
2013-03-11  2:11 ` Theodore Ts'o
2013-03-11  6:23   ` Zheng Liu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.