* [PATCH 00/10 v5] ext4: extent status tree (step2)
@ 2013-02-08  8:43 Zheng Liu
  2013-02-08  8:43 ` [PATCH 01/10 v5] ext4: refine extent status tree Zheng Liu
                   ` (10 more replies)
  0 siblings, 11 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-08  8:43 UTC
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Jan kara

Hi all,

This is my fifth attempt at implementing the second step of the extent
status tree.  The patch set can be divided into the following parts.

Patch 1/10
  This patch refines the extent status tree.

Patch 2/10-6/10
  These patches track the status of all extents in the extent status tree
and use it as an extent cache.  The bit field in the extent_status structure
has been removed because it triggers warnings from 'sparse'.  es_pblk and
es_status are now manipulated directly through ext4_es_*_pblock and
ext4_es_*_status.  Currently, when an unwritten extent is allocated, we
cannot tell from map->m_flags because ext4_ext_map_blocks doesn't return the
EXT4_MAP_UNWRITTEN flag.  One patch fixes this so that the extent status can
be determined from m_flags.
  Following Jan's feedback, holes are put into the extent cache so that we
avoid accessing the on-disk extent tree as far as possible.  If the whole
file is a hole, the hole is not cached in the extent status tree because it
is always split immediately.  The hole is also not cached when
ext4_da_map_blocks looks up a block mapping, because it will become a
delayed extent later.
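
To illustrate the idea of keeping the status together with the physical
block number instead of using a bit field, here is a minimal userspace
sketch.  It is only an illustration under stated assumptions: the 4-bit
status field in the top bits of a 64-bit word, the ES_* flag values, the
struct name es_sketch, and the helpers es_pblock(), es_status(),
es_store_pblock() and es_store_status() are all hypothetical and are not
taken from the patches.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: top 4 bits hold the status, low 60 bits the block. */
#define ES_STATUS_SHIFT	60
#define ES_PBLK_MASK	((UINT64_C(1) << ES_STATUS_SHIFT) - 1)
#define ES_STATUS_MASK	(~ES_PBLK_MASK)

/* Hypothetical status flags, one bit per state. */
#define ES_WRITTEN	(UINT64_C(1) << 63)
#define ES_UNWRITTEN	(UINT64_C(1) << 62)
#define ES_DELAYED	(UINT64_C(1) << 61)
#define ES_HOLE		(UINT64_C(1) << 60)

/* Userspace stand-in for an extent status entry (no rb_node here). */
struct es_sketch {
	uint32_t es_lblk;	/* first logical block the extent covers */
	uint32_t es_len;	/* length of the extent in blocks */
	uint64_t es_pblk;	/* status bits | first physical block */
};

/* Return only the physical block number. */
static inline uint64_t es_pblock(const struct es_sketch *es)
{
	return es->es_pblk & ES_PBLK_MASK;
}

/* Return only the status bits. */
static inline uint64_t es_status(const struct es_sketch *es)
{
	return es->es_pblk & ES_STATUS_MASK;
}

/* Replace the physical block and keep the current status bits. */
static inline void es_store_pblock(struct es_sketch *es, uint64_t pblk)
{
	assert((pblk & ES_STATUS_MASK) == 0);	/* must fit in 60 bits */
	es->es_pblk = es_status(es) | pblk;
}

/* Replace the status bits and keep the current physical block. */
static inline void es_store_status(struct es_sketch *es, uint64_t status)
{
	assert((status & ES_PBLK_MASK) == 0);	/* status bits only */
	es->es_pblk = es_pblock(es) | status;
}
```

With accessors like these, callers never touch the raw word, so the
layout can change later without touching the call sites.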

Patch 7/10-8/10
  These two patches reclaim memory from the extent status tree when we
are under high memory pressure.

Patch 9/10-10/10
  These patches are picked up again from the 1st version because I realized
that they can remove a bogus wait in ext4_ind_direct_IO when dioread_nolock
is enabled.  After applying them, the latency of dio reads is reduced.

I measured it using fio; the results are shown below.

config file
-----------
[global]
ioengine=psync
direct=1
bs=4k
thread
group_reporting
directory=/mnt/sda1/
filename=testfile
filesize=10g
size=10g
runtime=120
iodepth=16

[fio]
rw=randrw
numjobs=4

result
------
w/ bogus wait
  read : io=1508.1MB, bw=12876KB/s, iops=3218 , runt=120001msec
    clat (usec): min=128 , max=268738 , avg=718.62, stdev=3703.97
     lat (usec): min=128 , max=268739 , avg=718.78, stdev=3703.97
  write: io=1505.2MB, bw=12843KB/s, iops=3210 , runt=120001msec
    clat (usec): min=47 , max=991727 , avg=520.94, stdev=3451.63
     lat (usec): min=47 , max=991727 , avg=521.31, stdev=3451.63

w/o bogus wait
  read : io=1576.4MB, bw=13451KB/s, iops=3362 , runt=120001msec
    clat (usec): min=128 , max=283906 , avg=685.88, stdev=2762.64
     lat (usec): min=128 , max=283907 , avg=686.05, stdev=2762.64
  write: io=1577.9MB, bw=13458KB/s, iops=3364 , runt=120001msec
    clat (usec): min=48 , max=977942 , avg=498.97, stdev=3093.08
     lat (usec): min=48 , max=977943 , avg=499.33, stdev=3093.08

From the results we can see that the average latency is reduced a little.
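
To put a number on "a little", the relative change in average completion
latency can be computed from the fio output above.  The sketch below is
only a convenience calculation; the helper name relative_reduction() and
the constant names are made up for this example, and the values are the
avg clat figures copied from the results.  By this calculation both reads
and writes improve by roughly 4 to 5 percent.

```c
#include <assert.h>

/* Fractional reduction going from 'before' to 'after' (0.05 means 5%). */
static inline double relative_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before;
}

/* Average clat in usec, copied from the fio results above. */
static const double read_clat_with_wait     = 718.62;
static const double read_clat_without_wait  = 685.88;
static const double write_clat_with_wait    = 520.94;
static const double write_clat_without_wait = 498.97;
```

Note that the run is time-limited (runtime=120), so a single run like
this gives an indication rather than a statistically solid comparison.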

changelog:
v5 <- v4:
 - drop a patch that removes EXT4_MAP_FROM_CLUSTER flag
   (I will revise it in the patch set of get_block_t refinement)
 - fold original patch 3/9 into patch 4/9
 - manipulate es_pblk and es_status directly
   (bit field is removed because it causes some warnings from 'sparse')
 - let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
 - rename ext4_es_find_extent to ext4_es_find_delayed_extent
 - add hole status and put hole into extent status tree as a cache
 - convert unwritten extents from extent status tree in ext4_ext_direct_IO
   and end_io callback
 - remove a bogus wait in ext4_ind_direct_IO when dioread_nolock is enabled

v4 <- v3:
 - register a normal shrinker to reclaim extents from extent status tree

v3 <- v2:
 - use prune_super() to reclaim extents from extent status tree
 - stashed es_status into es_pblk
 - remove single extent cache
 - rebase against 3.8-rc4

v2 <- v1:
 - drop patches that try to improve unwritten extent conversion
 - remove EXT4_MAP_FROM_CLUSTER flag
 - add tracepoint for ext4_es_lookup_extent()
 - drop a patch, which tries to fix a warning when bigalloc and delalloc
   are enabled
 - add a shrinker to reclaim memory from extent status tree
 - rebase against 3.8-rc2

v4: http://lwn.net/Articles/536037/
v3: http://lwn.net/Articles/533730/
v2: http://lwn.net/Articles/532446/
v1: http://lwn.net/Articles/531065/

As always, any comments or feedback are welcome.

FWIW, while implementing patch 3/10, I realized that the get_block_t and
*_map_blocks functions need to be refactored, because ext4 already has
six get_block_t functions
 - ext4_get_block
 - ext4_get_block_write
 - ext4_get_block_write_nolock
 - noalloc_get_block_write
 - ext4_da_get_block_prep
 - _ext4_get_block

and four *_map_blocks functions
 - ext4_map_blocks
 - ext4_da_map_blocks
 - ext4_ext_map_blocks
 - ext4_ind_map_blocks

So I am planning to refine them.  First I will try to split ext4_map_blocks
into two parts, e.g. ext4_map_blocks_read and ext4_map_blocks_write, and
then try other cleanups and improvements.

Thanks,
						- Zheng

Zheng Liu (10):
  ext4: refine extent status tree
  ext4: add physical block and status member into extent status tree
  ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
  ext4: track all extent status in extent status tree
  ext4: lookup block mapping in extent status tree
  ext4: remove single extent cache
  ext4: adjust some functions for reclaiming extents from extent status
    tree
  ext4: reclaim extents from extent status tree
  ext4: convert unwritten extents from extent status tree in end_io
  ext4: remove bogus wait for unwritten extents in ext4_ind_direct_IO

 fs/ext4/ext4.h              |  21 +-
 fs/ext4/ext4_extents.h      |   6 -
 fs/ext4/extents.c           | 211 ++++--------
 fs/ext4/extents_status.c    | 779 +++++++++++++++++++++++++++++++++++---------
 fs/ext4/extents_status.h    |  84 ++++-
 fs/ext4/file.c              |  16 +-
 fs/ext4/indirect.c          |   5 -
 fs/ext4/inode.c             | 148 +++++++--
 fs/ext4/move_extent.c       |   3 -
 fs/ext4/page-io.c           |   8 +-
 fs/ext4/super.c             |   8 +-
 include/trace/events/ext4.h | 207 ++++++++++--
 12 files changed, 1075 insertions(+), 421 deletions(-)

-- 
1.7.12.rc2.18.g61b472e



* [PATCH 01/10 v5] ext4: refine extent status tree
  2013-02-08  8:43 [PATCH 00/10 v5] ext4: extent status tree (step2) Zheng Liu
@ 2013-02-08  8:43 ` Zheng Liu
  2013-02-08 15:35   ` Jan Kara
  2013-02-08  8:43 ` [PATCH 02/10 v5] ext4: add physical block and status member into " Zheng Liu
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Zheng Liu @ 2013-02-08  8:43 UTC
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Jan kara

From: Zheng Liu <wenqing.lz@taobao.com>

This commit refines the extent status tree code.

1) Add the prefix 'es_' to the members of the extent status tree
structure.

2) Refactor ext4_es_remove_extent() so that __es_remove_extent() can be
used by ext4_es_insert_extent() to remove the old extent entry(-ies)
before inserting a new one.

3) Rename extent_status_end() to ext4_es_end().

4) Define ext4_es_can_be_merged() to check whether two extents can be
merged or not.

5) Update and clarify comments.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
---
 fs/ext4/extents.c           |  21 +--
 fs/ext4/extents_status.c    | 318 ++++++++++++++++++++++++--------------------
 fs/ext4/extents_status.h    |   8 +-
 fs/ext4/file.c              |  12 +-
 include/trace/events/ext4.h |  40 +++---
 5 files changed, 217 insertions(+), 182 deletions(-)
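
For readers who want to try the range arithmetic outside the kernel, here
is a small userspace sketch of the merge check and of the remainder
lengths computed when a range is removed from an extent.  It follows the
logic of ext4_es_end(), ext4_es_can_be_merged() and the len1/len2
computation in __es_remove_extent() in the diff below, but the names
lblk_t, es_sketch, es_end(), es_can_be_merged() and es_remainders() are
stand-ins invented for this example, and assert() replaces BUG_ON().

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t lblk_t;	/* stand-in for ext4_lblk_t */

/* Userspace stand-in for struct extent_status without the rb_node. */
struct es_sketch {
	lblk_t es_lblk;		/* first logical block the extent covers */
	lblk_t es_len;		/* length of the extent in blocks */
};

/* Last logical block covered by the extent. */
static inline lblk_t es_end(const struct es_sketch *es)
{
	/* es_lblk + es_len must not wrap around. */
	assert(es->es_lblk + es->es_len >= es->es_lblk);
	return es->es_lblk + es->es_len - 1;
}

/* Two extents can be merged if es2 starts right after es1 ends. */
static inline int es_can_be_merged(const struct es_sketch *es1,
				   const struct es_sketch *es2)
{
	return es1->es_lblk + es1->es_len == es2->es_lblk;
}

/*
 * Number of blocks of 'es' that survive on the left (len1) and on the
 * right (len2) when the range [lblk, end] is removed.  The caller is
 * assumed to have checked that the range overlaps 'es'.
 */
static inline void es_remainders(const struct es_sketch *es, lblk_t lblk,
				 lblk_t end, lblk_t *len1, lblk_t *len2)
{
	*len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
	*len2 = es_end(es) > end ? es_end(es) - end : 0;
}
```

When both len1 and len2 are non-zero the removed range lies strictly
inside the extent, which is the case where the patch has to insert a new
entry for the right-hand remainder.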

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 5ae1674..f7bf616 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3525,13 +3525,14 @@ static int ext4_find_delalloc_range(struct inode *inode,
 {
 	struct extent_status es;
 
-	es.start = lblk_start;
-	ext4_es_find_extent(inode, &es);
-	if (es.len == 0)
+	es.es_lblk = lblk_start;
+	(void)ext4_es_find_extent(inode, &es);
+	if (es.es_len == 0)
 		return 0; /* there is no delay extent in this tree */
-	else if (es.start <= lblk_start && lblk_start < es.start + es.len)
+	else if (es.es_lblk <= lblk_start &&
+		 lblk_start < es.es_lblk + es.es_len)
 		return 1;
-	else if (lblk_start <= es.start && es.start <= lblk_end)
+	else if (lblk_start <= es.es_lblk && es.es_lblk <= lblk_end)
 		return 1;
 	else
 		return 0;
@@ -4567,7 +4568,7 @@ static int ext4_find_delayed_extent(struct inode *inode,
 	struct extent_status es;
 	ext4_lblk_t next_del;
 
-	es.start = newex->ec_block;
+	es.es_lblk = newex->ec_block;
 	next_del = ext4_es_find_extent(inode, &es);
 
 	if (newex->ec_start == 0) {
@@ -4575,18 +4576,18 @@ static int ext4_find_delayed_extent(struct inode *inode,
 		 * No extent in extent-tree contains block @newex->ec_start,
 		 * then the block may stay in 1)a hole or 2)delayed-extent.
 		 */
-		if (es.len == 0)
+		if (es.es_len == 0)
 			/* A hole found. */
 			return 0;
 
-		if (es.start > newex->ec_block) {
+		if (es.es_lblk > newex->ec_block) {
 			/* A hole found. */
-			newex->ec_len = min(es.start - newex->ec_block,
+			newex->ec_len = min(es.es_lblk - newex->ec_block,
 					    newex->ec_len);
 			return 0;
 		}
 
-		newex->ec_len = es.start + es.len - newex->ec_block;
+		newex->ec_len = es.es_lblk + es.es_len - newex->ec_block;
 	}
 
 	return next_del;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 564d981..aa4d346 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -23,40 +23,53 @@
  * (e.g. Reservation space warning), and provide extent-level locking.
  * Delay extent tree is the first step to achieve this goal.  It is
  * original built by Yongqiang Yang.  At that time it is called delay
- * extent tree, whose goal is only track delay extent in memory to
+ * extent tree, whose goal is only track delayed extents in memory to
  * simplify the implementation of fiemap and bigalloc, and introduce
  * lseek SEEK_DATA/SEEK_HOLE support.  That is why it is still called
- * delay extent tree at the following comment.  But for better
- * understand what it does, it has been rename to extent status tree.
+ * delay extent tree at the first commit.  But for better understand
+ * what it does, it has been rename to extent status tree.
  *
- * Currently the first step has been done.  All delay extents are
- * tracked in the tree.  It maintains the delay extent when a delay
- * allocation is issued, and the delay extent is written out or
+ * Step1:
+ * Currently the first step has been done.  All delayed extents are
+ * tracked in the tree.  It maintains the delayed extent when a delayed
+ * allocation is issued, and the delayed extent is written out or
  * invalidated.  Therefore the implementation of fiemap and bigalloc
  * are simplified, and SEEK_DATA/SEEK_HOLE are introduced.
  *
  * The following comment describes the implemenmtation of extent
  * status tree and future works.
+ *
+ * Step2:
+ * In this step all extent status is tracked by extent status tree.
+ * Thus, we can first try to lookup a block mapping in this tree before
+ * finding it in extent tree.  Hence, single extent cache can be removed
+ * because extent status tree can do a better job.  Extents in status
+ * tree are loaded on-demand.  Therefore, the extent status tree may not
+ * contain all of the extents in a file.  Meanwhile we add
+ * nr_cached_objects and free_cached_objects callback functions to
+ * reclaim extents from extent status tree.  These functions make us
+ * reclaim written/unwritten extents from the tree under a heavy memory
+ * pressure.  Delayed extents will not be reclaimed because fiemap,
+ * bigalloc, and seek_data/hole need it.
  */
 
 /*
- * extents status tree implementation for ext4.
+ * Extent status tree implementation for ext4.
  *
  *
  * ==========================================================================
- * Extents status encompass delayed extents and extent locks
+ * Extent status tree tracks all extent status.
  *
- * 1. Why delayed extent implementation ?
+ * 1. Why we need to implement extent status tree?
  *
- * Without delayed extent, ext4 identifies a delayed extent by looking
+ * Without extent status tree, ext4 identifies a delayed extent by looking
  * up page cache, this has several deficiencies - complicated, buggy,
  * and inefficient code.
  *
- * FIEMAP, SEEK_HOLE/DATA, bigalloc, punch hole and writeout all need
- * to know if a block or a range of blocks are belonged to a delayed
- * extent.
+ * FIEMAP, SEEK_HOLE/DATA, bigalloc, and writeout all need to know if a
+ * block or a range of blocks are belonged to a delayed extent.
  *
- * Let us have a look at how they do without delayed extents implementation.
+ * Let us have a look at how they do without extent status tree.
  *   --	FIEMAP
  *	FIEMAP looks up page cache to identify delayed allocations from holes.
  *
@@ -68,47 +81,48 @@
  *	already under delayed allocation or not to determine whether
  *	quota reserving is needed for the cluster.
  *
- *   -- punch hole
- *	punch hole looks up page cache to identify a delayed extent.
- *
  *   --	writeout
  *	Writeout looks up whole page cache to see if a buffer is
  *	mapped, If there are not very many delayed buffers, then it is
  *	time comsuming.
  *
- * With delayed extents implementation, FIEMAP, SEEK_HOLE/DATA,
+ * With extent status tree implementation, FIEMAP, SEEK_HOLE/DATA,
  * bigalloc and writeout can figure out if a block or a range of
  * blocks is under delayed allocation(belonged to a delayed extent) or
- * not by searching the delayed extent tree.
+ * not by searching the extent status tree.
  *
  *
  * ==========================================================================
- * 2. ext4 delayed extents impelmentation
+ * 2. Ext4 extent status tree impelmentation
+ *
+ *   --	extent
+ *	An extent is a range of blocks which are contiguous logically and
+ *	physically.  Unlike extent in extent tree, this extent in ext4 is
+ *	an in-memory struct, there is no corresponding on-disk data.  There
+ *	is no limit on length of extent, so an extent can contain as many
+ *	blocks as they are contiguous logically and physically.
  *
- *   --	delayed extent
- *	A delayed extent is a range of blocks which are contiguous
- *	logically and under delayed allocation.  Unlike extent in
- *	ext4, delayed extent in ext4 is a in-memory struct, there is
- *	no corresponding on-disk data.  There is no limit on length of
- *	delayed extent, so a delayed extent can contain as many blocks
- *	as they are contiguous logically.
+ *   --	extent status tree
+ *	Every inode has an extent status tree and all allocation blocks
+ *	are added to the tree with different status.  The extent in the
+ *	tree are ordered by logical block no.
  *
- *   --	delayed extent tree
- *	Every inode has a delayed extent tree and all under delayed
- *	allocation blocks are added to the tree as delayed extents.
- *	Delayed extents in the tree are ordered by logical block no.
+ *   --	operations on an extent status tree
+ *	There are three important operations on an extent status tree:
+ *	find next extent, adding an extent (a range of blocks) and removing
  *
- *   --	operations on a delayed extent tree
- *	There are three operations on a delayed extent tree: find next
- *	delayed extent, adding a space(a range of blocks) and removing
- *	a space.
+ *   --	race on an extent status tree
+ *	Extent status tree is protected by inode->i_es_lock.
  *
- *   --	race on a delayed extent tree
- *	Delayed extent tree is protected inode->i_es_lock.
+ *   --	memory consumption
+ *      Fragmented extent tree will make extent status tree cost too much
+ *      memory.  Hence, we will reclaim written/unwritten extents from the
+ *      tree under a heavy memory pressure.
  *
  *
  * ==========================================================================
- * 3. performance analysis
+ * 3. Performance analysis
+ *
  *   --	overhead
  *	1. There is a cache extent for write access, so if writes are
  *	not very random, adding space operaions are in O(1) time.
@@ -120,15 +134,19 @@
  *
  * ==========================================================================
  * 4. TODO list
- *   -- Track all extent status
  *
- *   -- Improve get block process
+ *   -- Refactor delayed space reservation
  *
  *   -- Extent-level locking
  */
 
 static struct kmem_cache *ext4_es_cachep;
 
+static int __es_insert_extent(struct ext4_es_tree *tree,
+			      struct extent_status *newes);
+static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
+			      ext4_lblk_t end);
+
 int __init ext4_init_es(void)
 {
 	ext4_es_cachep = KMEM_CACHE(extent_status, SLAB_RECLAIM_ACCOUNT);
@@ -161,7 +179,7 @@ static void ext4_es_print_tree(struct inode *inode)
 	while (node) {
 		struct extent_status *es;
 		es = rb_entry(node, struct extent_status, rb_node);
-		printk(KERN_DEBUG " [%u/%u)", es->start, es->len);
+		printk(KERN_DEBUG " [%u/%u)", es->es_lblk, es->es_len);
 		node = rb_next(node);
 	}
 	printk(KERN_DEBUG "\n");
@@ -170,10 +188,10 @@ static void ext4_es_print_tree(struct inode *inode)
 #define ext4_es_print_tree(inode)
 #endif
 
-static inline ext4_lblk_t extent_status_end(struct extent_status *es)
+static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
 {
-	BUG_ON(es->start + es->len < es->start);
-	return es->start + es->len - 1;
+	BUG_ON(es->es_lblk + es->es_len < es->es_lblk);
+	return es->es_lblk + es->es_len - 1;
 }
 
 /*
@@ -181,25 +199,25 @@ static inline ext4_lblk_t extent_status_end(struct extent_status *es)
  * it can't be found, try to find next extent.
  */
 static struct extent_status *__es_tree_search(struct rb_root *root,
-					      ext4_lblk_t offset)
+					      ext4_lblk_t lblk)
 {
 	struct rb_node *node = root->rb_node;
 	struct extent_status *es = NULL;
 
 	while (node) {
 		es = rb_entry(node, struct extent_status, rb_node);
-		if (offset < es->start)
+		if (lblk < es->es_lblk)
 			node = node->rb_left;
-		else if (offset > extent_status_end(es))
+		else if (lblk > ext4_es_end(es))
 			node = node->rb_right;
 		else
 			return es;
 	}
 
-	if (es && offset < es->start)
+	if (es && lblk < es->es_lblk)
 		return es;
 
-	if (es && offset > extent_status_end(es)) {
+	if (es && lblk > ext4_es_end(es)) {
 		node = rb_next(&es->rb_node);
 		return node ? rb_entry(node, struct extent_status, rb_node) :
 			      NULL;
@@ -209,8 +227,8 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
 }
 
 /*
- * ext4_es_find_extent: find the 1st delayed extent covering @es->start
- * if it exists, otherwise, the next extent after @es->start.
+ * ext4_es_find_extent: find the 1st delayed extent covering @es->es_lblk
+ * if it exists, otherwise, the next extent after @es->es_lblk.
  *
  * @inode: the inode which owns delayed extents
  * @es: delayed extent that we found
@@ -226,7 +244,7 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
 	struct rb_node *node;
 	ext4_lblk_t ret = EXT_MAX_BLOCKS;
 
-	trace_ext4_es_find_extent_enter(inode, es->start);
+	trace_ext4_es_find_extent_enter(inode, es->es_lblk);
 
 	read_lock(&EXT4_I(inode)->i_es_lock);
 	tree = &EXT4_I(inode)->i_es_tree;
@@ -234,25 +252,25 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
 	/* find delay extent in cache firstly */
 	if (tree->cache_es) {
 		es1 = tree->cache_es;
-		if (in_range(es->start, es1->start, es1->len)) {
+		if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
 			es_debug("%u cached by [%u/%u)\n",
-				 es->start, es1->start, es1->len);
+				 es->es_lblk, es1->es_lblk, es1->es_len);
 			goto out;
 		}
 	}
 
-	es->len = 0;
-	es1 = __es_tree_search(&tree->root, es->start);
+	es->es_len = 0;
+	es1 = __es_tree_search(&tree->root, es->es_lblk);
 
 out:
 	if (es1) {
 		tree->cache_es = es1;
-		es->start = es1->start;
-		es->len = es1->len;
+		es->es_lblk = es1->es_lblk;
+		es->es_len = es1->es_len;
 		node = rb_next(&es1->rb_node);
 		if (node) {
 			es1 = rb_entry(node, struct extent_status, rb_node);
-			ret = es1->start;
+			ret = es1->es_lblk;
 		}
 	}
 
@@ -263,14 +281,14 @@ out:
 }
 
 static struct extent_status *
-ext4_es_alloc_extent(ext4_lblk_t start, ext4_lblk_t len)
+ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
 {
 	struct extent_status *es;
 	es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
 	if (es == NULL)
 		return NULL;
-	es->start = start;
-	es->len = len;
+	es->es_lblk = lblk;
+	es->es_len = len;
 	return es;
 }
 
@@ -279,6 +297,20 @@ static void ext4_es_free_extent(struct extent_status *es)
 	kmem_cache_free(ext4_es_cachep, es);
 }
 
+/*
+ * Check whether or not two extents can be merged
+ * Condition:
+ *  - logical block number is contiguous
+ */
+static int ext4_es_can_be_merged(struct extent_status *es1,
+				 struct extent_status *es2)
+{
+	if (es1->es_lblk + es1->es_len != es2->es_lblk)
+		return 0;
+
+	return 1;
+}
+
 static struct extent_status *
 ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
 {
@@ -290,8 +322,8 @@ ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
 		return es;
 
 	es1 = rb_entry(node, struct extent_status, rb_node);
-	if (es->start == extent_status_end(es1) + 1) {
-		es1->len += es->len;
+	if (ext4_es_can_be_merged(es1, es)) {
+		es1->es_len += es->es_len;
 		rb_erase(&es->rb_node, &tree->root);
 		ext4_es_free_extent(es);
 		es = es1;
@@ -311,8 +343,8 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
 		return es;
 
 	es1 = rb_entry(node, struct extent_status, rb_node);
-	if (es1->start == extent_status_end(es) + 1) {
-		es->len += es1->len;
+	if (ext4_es_can_be_merged(es, es1)) {
+		es->es_len += es1->es_len;
 		rb_erase(node, &tree->root);
 		ext4_es_free_extent(es1);
 	}
@@ -320,60 +352,39 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
 	return es;
 }
 
-static int __es_insert_extent(struct ext4_es_tree *tree, ext4_lblk_t offset,
-			      ext4_lblk_t len)
+static int __es_insert_extent(struct ext4_es_tree *tree,
+			      struct extent_status *newes)
 {
 	struct rb_node **p = &tree->root.rb_node;
 	struct rb_node *parent = NULL;
 	struct extent_status *es;
-	ext4_lblk_t end = offset + len - 1;
-
-	BUG_ON(end < offset);
-	es = tree->cache_es;
-	if (es && offset == (extent_status_end(es) + 1)) {
-		es_debug("cached by [%u/%u)\n", es->start, es->len);
-		es->len += len;
-		es = ext4_es_try_to_merge_right(tree, es);
-		goto out;
-	} else if (es && es->start == end + 1) {
-		es_debug("cached by [%u/%u)\n", es->start, es->len);
-		es->start = offset;
-		es->len += len;
-		es = ext4_es_try_to_merge_left(tree, es);
-		goto out;
-	} else if (es && es->start <= offset &&
-		   end <= extent_status_end(es)) {
-		es_debug("cached by [%u/%u)\n", es->start, es->len);
-		goto out;
-	}
 
 	while (*p) {
 		parent = *p;
 		es = rb_entry(parent, struct extent_status, rb_node);
 
-		if (offset < es->start) {
-			if (es->start == end + 1) {
-				es->start = offset;
-				es->len += len;
+		if (newes->es_lblk < es->es_lblk) {
+			if (ext4_es_can_be_merged(newes, es)) {
+				es->es_lblk = newes->es_lblk;
+				es->es_len += newes->es_len;
 				es = ext4_es_try_to_merge_left(tree, es);
 				goto out;
 			}
 			p = &(*p)->rb_left;
-		} else if (offset > extent_status_end(es)) {
-			if (offset == extent_status_end(es) + 1) {
-				es->len += len;
+		} else if (newes->es_lblk > ext4_es_end(es)) {
+			if (ext4_es_can_be_merged(es, newes)) {
+				es->es_len += newes->es_len;
 				es = ext4_es_try_to_merge_right(tree, es);
 				goto out;
 			}
 			p = &(*p)->rb_right;
 		} else {
-			if (extent_status_end(es) <= end)
-				es->len = offset - es->start + len;
-			goto out;
+			BUG_ON(1);
+			return -EINVAL;
 		}
 	}
 
-	es = ext4_es_alloc_extent(offset, len);
+	es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len);
 	if (!es)
 		return -ENOMEM;
 	rb_link_node(&es->rb_node, parent, p);
@@ -385,27 +396,38 @@ out:
 }
 
 /*
- * ext4_es_insert_extent() adds a space to a delayed extent tree.
- * Caller holds inode->i_es_lock.
+ * ext4_es_insert_extent() adds a space to an extent status tree.
  *
  * ext4_es_insert_extent is called by ext4_da_write_begin and
  * ext4_es_remove_extent.
  *
  * Return 0 on success, error code on failure.
  */
-int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t offset,
+int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 			  ext4_lblk_t len)
 {
 	struct ext4_es_tree *tree;
+	struct extent_status newes;
+	ext4_lblk_t end = lblk + len - 1;
 	int err = 0;
 
-	trace_ext4_es_insert_extent(inode, offset, len);
+	trace_ext4_es_insert_extent(inode, lblk, len);
 	es_debug("add [%u/%u) to extent status tree of inode %lu\n",
-		 offset, len, inode->i_ino);
+		 lblk, len, inode->i_ino);
+
+	BUG_ON(end < lblk);
+
+	newes.es_lblk = lblk;
+	newes.es_len = len;
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
 	tree = &EXT4_I(inode)->i_es_tree;
-	err = __es_insert_extent(tree, offset, len);
+	err = __es_remove_extent(tree, lblk, end);
+	if (err != 0)
+		goto error;
+	err = __es_insert_extent(tree, &newes);
+
+error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 
 	ext4_es_print_tree(inode);
@@ -413,57 +435,45 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t offset,
 	return err;
 }
 
-/*
- * ext4_es_remove_extent() removes a space from a delayed extent tree.
- * Caller holds inode->i_es_lock.
- *
- * Return 0 on success, error code on failure.
- */
-int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
-			  ext4_lblk_t len)
+static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
+				 ext4_lblk_t end)
 {
 	struct rb_node *node;
-	struct ext4_es_tree *tree;
 	struct extent_status *es;
 	struct extent_status orig_es;
-	ext4_lblk_t len1, len2, end;
+	ext4_lblk_t len1, len2;
 	int err = 0;
 
-	trace_ext4_es_remove_extent(inode, offset, len);
-	es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
-		 offset, len, inode->i_ino);
-
-	end = offset + len - 1;
-	BUG_ON(end < offset);
-	write_lock(&EXT4_I(inode)->i_es_lock);
-	tree = &EXT4_I(inode)->i_es_tree;
-	es = __es_tree_search(&tree->root, offset);
+	es = __es_tree_search(&tree->root, lblk);
 	if (!es)
 		goto out;
-	if (es->start > end)
+	if (es->es_lblk > end)
 		goto out;
 
 	/* Simply invalidate cache_es. */
 	tree->cache_es = NULL;
 
-	orig_es.start = es->start;
-	orig_es.len = es->len;
-	len1 = offset > es->start ? offset - es->start : 0;
-	len2 = extent_status_end(es) > end ?
-	       extent_status_end(es) - end : 0;
+	orig_es.es_lblk = es->es_lblk;
+	orig_es.es_len = es->es_len;
+	len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
+	len2 = ext4_es_end(es) > end ? ext4_es_end(es) - end : 0;
 	if (len1 > 0)
-		es->len = len1;
+		es->es_len = len1;
 	if (len2 > 0) {
 		if (len1 > 0) {
-			err = __es_insert_extent(tree, end + 1, len2);
+			struct extent_status newes;
+
+			newes.es_lblk = end + 1;
+			newes.es_len = len2;
+			err = __es_insert_extent(tree, &newes);
 			if (err) {
-				es->start = orig_es.start;
-				es->len = orig_es.len;
+				es->es_lblk = orig_es.es_lblk;
+				es->es_len = orig_es.es_len;
 				goto out;
 			}
 		} else {
-			es->start = end + 1;
-			es->len = len2;
+			es->es_lblk = end + 1;
+			es->es_len = len2;
 		}
 		goto out;
 	}
@@ -476,7 +486,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
 			es = NULL;
 	}
 
-	while (es && extent_status_end(es) <= end) {
+	while (es && ext4_es_end(es) <= end) {
 		node = rb_next(&es->rb_node);
 		rb_erase(&es->rb_node, &tree->root);
 		ext4_es_free_extent(es);
@@ -487,13 +497,39 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
 		es = rb_entry(node, struct extent_status, rb_node);
 	}
 
-	if (es && es->start < end + 1) {
-		len1 = extent_status_end(es) - end;
-		es->start = end + 1;
-		es->len = len1;
+	if (es && es->es_lblk < end + 1) {
+		len1 = ext4_es_end(es) - end;
+		es->es_lblk = end + 1;
+		es->es_len = len1;
 	}
 
 out:
+	return err;
+}
+
+/*
+ * ext4_es_remove_extent() removes a space from an extent status tree.
+ *
+ * Return 0 on success, error code on failure.
+ */
+int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
+			  ext4_lblk_t len)
+{
+	struct ext4_es_tree *tree;
+	ext4_lblk_t end;
+	int err = 0;
+
+	trace_ext4_es_remove_extent(inode, lblk, len);
+	es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
+		 lblk, len, inode->i_ino);
+
+	end = lblk + len - 1;
+	BUG_ON(end < lblk);
+
+	tree = &EXT4_I(inode)->i_es_tree;
+
+	write_lock(&EXT4_I(inode)->i_es_lock);
+	err = __es_remove_extent(tree, lblk, end);
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 	ext4_es_print_tree(inode);
 	return err;
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 077f82d..81e9339 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -22,8 +22,8 @@
 
 struct extent_status {
 	struct rb_node rb_node;
-	ext4_lblk_t start;	/* first block extent covers */
-	ext4_lblk_t len;	/* length of extent in block */
+	ext4_lblk_t es_lblk;	/* first logical block extent covers */
+	ext4_lblk_t es_len;	/* length of extent in block */
 };
 
 struct ext4_es_tree {
@@ -35,9 +35,9 @@ extern int __init ext4_init_es(void);
 extern void ext4_exit_es(void);
 extern void ext4_es_init_tree(struct ext4_es_tree *tree);
 
-extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t start,
+extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
-extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t start,
+extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
 extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
 				struct extent_status *es);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 405565a..718c49f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -464,10 +464,9 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
 		 * If there is a delay extent at this offset,
 		 * it will be as a data.
 		 */
-		es.start = last;
+		es.es_lblk = last;
 		(void)ext4_es_find_extent(inode, &es);
-		if (last >= es.start &&
-		    last < es.start + es.len) {
+		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
 			if (last != start)
 				dataoff = last << blkbits;
 			break;
@@ -549,11 +548,10 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
 		 * If there is a delay extent at this offset,
 		 * we will skip this extent.
 		 */
-		es.start = last;
+		es.es_lblk = last;
 		(void)ext4_es_find_extent(inode, &es);
-		if (last >= es.start &&
-		    last < es.start + es.len) {
-			last = es.start + es.len;
+		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
+			last = es.es_lblk + es.es_len;
 			holeoff = last << blkbits;
 			continue;
 		}
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 7e8c36b..952628a 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2068,75 +2068,75 @@ TRACE_EVENT(ext4_ext_remove_space_done,
 );
 
 TRACE_EVENT(ext4_es_insert_extent,
-	TP_PROTO(struct inode *inode, ext4_lblk_t start, ext4_lblk_t len),
+	TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
 
-	TP_ARGS(inode, start, len),
+	TP_ARGS(inode, lblk, len),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(	ino_t,	ino			)
-		__field(	loff_t,	start			)
+		__field(	loff_t,	lblk			)
 		__field(	loff_t, len			)
 	),
 
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= start;
+		__entry->lblk	= lblk;
 		__entry->len	= len;
 	),
 
 	TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->start, __entry->len)
+		  __entry->lblk, __entry->len)
 );
 
 TRACE_EVENT(ext4_es_remove_extent,
-	TP_PROTO(struct inode *inode, ext4_lblk_t start, ext4_lblk_t len),
+	TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
 
-	TP_ARGS(inode, start, len),
+	TP_ARGS(inode, lblk, len),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(	ino_t,	ino			)
-		__field(	loff_t,	start			)
+		__field(	loff_t,	lblk			)
 		__field(	loff_t,	len			)
 	),
 
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= start;
+		__entry->lblk	= lblk;
 		__entry->len	= len;
 	),
 
 	TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->start, __entry->len)
+		  __entry->lblk, __entry->len)
 );
 
 TRACE_EVENT(ext4_es_find_extent_enter,
-	TP_PROTO(struct inode *inode, ext4_lblk_t start),
+	TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
 
-	TP_ARGS(inode, start),
+	TP_ARGS(inode, lblk),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,		dev		)
 		__field(	ino_t,		ino		)
-		__field(	ext4_lblk_t,	start		)
+		__field(	ext4_lblk_t,	lblk		)
 	),
 
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= start;
+		__entry->lblk	= lblk;
 	),
 
-	TP_printk("dev %d,%d ino %lu start %u",
+	TP_printk("dev %d,%d ino %lu lblk %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  (unsigned long) __entry->ino, __entry->start)
+		  (unsigned long) __entry->ino, __entry->lblk)
 );
 
 TRACE_EVENT(ext4_es_find_extent_exit,
@@ -2148,7 +2148,7 @@ TRACE_EVENT(ext4_es_find_extent_exit,
 	TP_STRUCT__entry(
 		__field(	dev_t,		dev		)
 		__field(	ino_t,		ino		)
-		__field(	ext4_lblk_t,	start		)
+		__field(	ext4_lblk_t,	lblk		)
 		__field(	ext4_lblk_t,	len		)
 		__field(	ext4_lblk_t,	ret		)
 	),
@@ -2156,15 +2156,15 @@ TRACE_EVENT(ext4_es_find_extent_exit,
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= es->start;
-		__entry->len	= es->len;
+		__entry->lblk	= es->es_lblk;
+		__entry->len	= es->es_len;
 		__entry->ret	= ret;
 	),
 
 	TP_printk("dev %d,%d ino %lu es [%u/%u) ret %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->start, __entry->len, __entry->ret)
+		  __entry->lblk, __entry->len, __entry->ret)
 );
 
 #endif /* _TRACE_EXT4_H */
-- 
1.7.12.rc2.18.g61b472e


* [PATCH 02/10 v5] ext4: add physical block and status member into extent status tree
  2013-02-08  8:43 [PATCH 00/10 v5] ext4: extent status tree (step2) Zheng Liu
  2013-02-08  8:43 ` [PATCH 01/10 v5] ext4: refine extent status tree Zheng Liu
@ 2013-02-08  8:43 ` Zheng Liu
  2013-02-08 15:39   ` Jan Kara
  2013-02-08  8:43 ` [PATCH 03/10 v5] ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag Zheng Liu
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Zheng Liu @ 2013-02-08  8:43 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Jan kara

From: Zheng Liu <wenqing.lz@taobao.com>

This commit adds physical block and status information to the
extent_status structure.  A single field, es_pblk, records both of them:
a physical block number needs only 48 bits, so the extent status can be
stashed in the same field to save memory.  Four statuses are defined:
written, unwritten, delayed and hole.

Because a new member is added to the extent status structure, all
interfaces need to be adjusted.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
---
 fs/ext4/extents_status.c    | 67 +++++++++++++++++++++++++++++++++++++--------
 fs/ext4/extents_status.h    | 64 ++++++++++++++++++++++++++++++++++++++++++-
 fs/ext4/inode.c             |  3 +-
 include/trace/events/ext4.h | 34 +++++++++++++++--------
 4 files changed, 142 insertions(+), 26 deletions(-)
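
The stashing scheme described in the commit message can be sketched as
follows.  This is a minimal, stand-alone illustration and not the code in
this patch: it assumes the four status flags sit in the top four bits of
a 64-bit word, and every sk_* name is hypothetical.  The bit positions
are chosen for the sketch only and are not the constants this patch
defines.

```c
#include <stdint.h>

/* Assumed layout: 4 status bits above a block number of at most 60 bits. */
#define SK_WRITTEN	(UINT64_C(1) << 60)
#define SK_UNWRITTEN	(UINT64_C(1) << 61)
#define SK_DELAYED	(UINT64_C(1) << 62)
#define SK_HOLE		(UINT64_C(1) << 63)
#define SK_STATUS_MASK	(SK_WRITTEN | SK_UNWRITTEN | SK_DELAYED | SK_HOLE)

struct sk_extent {
	uint32_t lblk;	/* first logical block */
	uint32_t len;	/* length in blocks */
	uint64_t pblk;	/* block number and status share this word */
};

/* Status bits only. */
static inline uint64_t sk_status(const struct sk_extent *es)
{
	return es->pblk & SK_STATUS_MASK;
}

/* Physical block number with the status bits masked off. */
static inline uint64_t sk_pblock(const struct sk_extent *es)
{
	return es->pblk & ~SK_STATUS_MASK;
}

/* Replace the block number, keep the current status bits. */
static inline void sk_store_pblock(struct sk_extent *es, uint64_t pb)
{
	es->pblk = (pb & ~SK_STATUS_MASK) | (es->pblk & SK_STATUS_MASK);
}

/* Replace the status bits, keep the current block number. */
static inline void sk_store_status(struct sk_extent *es, uint64_t status)
{
	es->pblk = (status & SK_STATUS_MASK) | (es->pblk & ~SK_STATUS_MASK);
}

/*
 * Merge test with the three conditions listed in the patch: contiguous
 * logical blocks, equal status, and (for written/unwritten extents only)
 * contiguous physical blocks.
 */
static inline int sk_can_merge(const struct sk_extent *a,
			       const struct sk_extent *b)
{
	if (a->lblk + a->len != b->lblk)
		return 0;
	if (sk_status(a) != sk_status(b))
		return 0;
	if ((sk_status(a) & (SK_WRITTEN | SK_UNWRITTEN)) &&
	    sk_pblock(a) + a->len != sk_pblock(b))
		return 0;
	return 1;
}
```

Because both store helpers read the old word before writing, the sketch
expects pblk to be zero-initialised before the first store.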

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index aa4d346..5093cee 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -179,7 +179,9 @@ static void ext4_es_print_tree(struct inode *inode)
 	while (node) {
 		struct extent_status *es;
 		es = rb_entry(node, struct extent_status, rb_node);
-		printk(KERN_DEBUG " [%u/%u)", es->es_lblk, es->es_len);
+		printk(KERN_DEBUG " [%u/%u) %llu %llx",
+		       es->es_lblk, es->es_len,
+		       ext4_es_pblock(es), ext4_es_status(es));
 		node = rb_next(node);
 	}
 	printk(KERN_DEBUG "\n");
@@ -234,7 +236,7 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
  * @es: delayed extent that we found
  *
  * Returns the first block of the next extent after es, otherwise
- * EXT_MAX_BLOCKS if no delay extent is found.
+ * EXT_MAX_BLOCKS if no extent is found.
  * Delayed extent is returned via @es.
  */
 ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
@@ -249,17 +251,18 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
 	read_lock(&EXT4_I(inode)->i_es_lock);
 	tree = &EXT4_I(inode)->i_es_tree;
 
-	/* find delay extent in cache firstly */
+	/* find extent in cache firstly */
+	es->es_len = es->es_pblk = 0;
 	if (tree->cache_es) {
 		es1 = tree->cache_es;
 		if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
-			es_debug("%u cached by [%u/%u)\n",
-				 es->es_lblk, es1->es_lblk, es1->es_len);
+			es_debug("%u cached by [%u/%u) %llu %llx\n",
+				 es->es_lblk, es1->es_lblk, es1->es_len,
+				 ext4_es_pblock(es1), ext4_es_status(es1));
 			goto out;
 		}
 	}
 
-	es->es_len = 0;
 	es1 = __es_tree_search(&tree->root, es->es_lblk);
 
 out:
@@ -267,6 +270,7 @@ out:
 		tree->cache_es = es1;
 		es->es_lblk = es1->es_lblk;
 		es->es_len = es1->es_len;
+		es->es_pblk = es1->es_pblk;
 		node = rb_next(&es1->rb_node);
 		if (node) {
 			es1 = rb_entry(node, struct extent_status, rb_node);
@@ -281,7 +285,7 @@ out:
 }
 
 static struct extent_status *
-ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
+ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len, ext4_fsblk_t pblk)
 {
 	struct extent_status *es;
 	es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
@@ -289,6 +293,7 @@ ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
 		return NULL;
 	es->es_lblk = lblk;
 	es->es_len = len;
+	es->es_pblk = pblk;
 	return es;
 }
 
@@ -301,6 +306,8 @@ static void ext4_es_free_extent(struct extent_status *es)
  * Check whether or not two extents can be merged
  * Condition:
  *  - logical block number is contiguous
+ *  - physical block number is contiguous
+ *  - status is equal
  */
 static int ext4_es_can_be_merged(struct extent_status *es1,
 				 struct extent_status *es2)
@@ -308,6 +315,13 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
 	if (es1->es_lblk + es1->es_len != es2->es_lblk)
 		return 0;
 
+	if (ext4_es_status(es1) != ext4_es_status(es2))
+		return 0;
+
+	if ((ext4_es_is_written(es1) || ext4_es_is_unwritten(es1)) &&
+	    (ext4_es_pblock(es1) + es1->es_len != ext4_es_pblock(es2)))
+		return 0;
+
 	return 1;
 }
 
@@ -367,6 +381,10 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
 			if (ext4_es_can_be_merged(newes, es)) {
 				es->es_lblk = newes->es_lblk;
 				es->es_len += newes->es_len;
+				if (ext4_es_is_written(es) ||
+				    ext4_es_is_unwritten(es))
+					ext4_es_store_pblock(es,
+							     newes->es_pblk);
 				es = ext4_es_try_to_merge_left(tree, es);
 				goto out;
 			}
@@ -384,7 +402,8 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
 		}
 	}
 
-	es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len);
+	es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len,
+				  newes->es_pblk);
 	if (!es)
 		return -ENOMEM;
 	rb_link_node(&es->rb_node, parent, p);
@@ -404,21 +423,24 @@ out:
  * Return 0 on success, error code on failure.
  */
 int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
-			  ext4_lblk_t len)
+			  ext4_lblk_t len, ext4_fsblk_t pblk,
+			  unsigned long long status)
 {
 	struct ext4_es_tree *tree;
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
 	int err = 0;
 
-	trace_ext4_es_insert_extent(inode, lblk, len);
-	es_debug("add [%u/%u) to extent status tree of inode %lu\n",
-		 lblk, len, inode->i_ino);
+	es_debug("add [%u/%u) %llu %llx to extent status tree of inode %lu\n",
+		 lblk, len, pblk, status, inode->i_ino);
 
 	BUG_ON(end < lblk);
 
 	newes.es_lblk = lblk;
 	newes.es_len = len;
+	ext4_es_store_pblock(&newes, pblk);
+	ext4_es_store_status(&newes, status);
+	trace_ext4_es_insert_extent(inode, &newes);
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
 	tree = &EXT4_I(inode)->i_es_tree;
@@ -442,6 +464,7 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 	struct extent_status *es;
 	struct extent_status orig_es;
 	ext4_lblk_t len1, len2;
+	ext4_fsblk_t block;
 	int err = 0;
 
 	es = __es_tree_search(&tree->root, lblk);
@@ -455,6 +478,8 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 
 	orig_es.es_lblk = es->es_lblk;
 	orig_es.es_len = es->es_len;
+	orig_es.es_pblk = es->es_pblk;
+
 	len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
 	len2 = ext4_es_end(es) > end ? ext4_es_end(es) - end : 0;
 	if (len1 > 0)
@@ -465,6 +490,13 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 
 			newes.es_lblk = end + 1;
 			newes.es_len = len2;
+			if (ext4_es_is_written(&orig_es) ||
+			    ext4_es_is_unwritten(&orig_es)) {
+				block = ext4_es_pblock(&orig_es) +
+					orig_es.es_len - len2;
+				ext4_es_store_pblock(&newes, block);
+			}
+			ext4_es_store_status(&newes, ext4_es_status(&orig_es));
 			err = __es_insert_extent(tree, &newes);
 			if (err) {
 				es->es_lblk = orig_es.es_lblk;
@@ -474,6 +506,11 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 		} else {
 			es->es_lblk = end + 1;
 			es->es_len = len2;
+			if (ext4_es_is_written(es) ||
+			    ext4_es_is_unwritten(es)) {
+				block = orig_es.es_pblk + orig_es.es_len - len2;
+				ext4_es_store_pblock(es, block);
+			}
 		}
 		goto out;
 	}
@@ -498,9 +535,15 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 	}
 
 	if (es && es->es_lblk < end + 1) {
+		ext4_lblk_t orig_len = es->es_len;
+
 		len1 = ext4_es_end(es) - end;
 		es->es_lblk = end + 1;
 		es->es_len = len1;
+		if (ext4_es_is_written(es) || ext4_es_is_unwritten(es)) {
+			block = es->es_pblk + orig_len - len1;
+			ext4_es_store_pblock(es, block);
+		}
 	}
 
 out:
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 81e9339..2a5d69e 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -20,10 +20,21 @@
 #define es_debug(fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
+#define EXTENT_STATUS_WRITTEN	0x10000000	/* written extent */
+#define EXTENT_STATUS_UNWRITTEN	0x20000000	/* unwritten extent */
+#define EXTENT_STATUS_DELAYED	0x40000000	/* delayed extent */
+#define EXTENT_STATUS_HOLE	0x80000000	/* hole */
+
+#define EXTENT_STATUS_FLAGS	(EXTENT_STATUS_WRITTEN | \
+				 EXTENT_STATUS_UNWRITTEN | \
+				 EXTENT_STATUS_DELAYED | \
+				 EXTENT_STATUS_HOLE)
+
 struct extent_status {
 	struct rb_node rb_node;
 	ext4_lblk_t es_lblk;	/* first logical block extent covers */
 	ext4_lblk_t es_len;	/* length of extent in block */
+	ext4_fsblk_t es_pblk;	/* first physical block */
 };
 
 struct ext4_es_tree {
@@ -36,10 +47,61 @@ extern void ext4_exit_es(void);
 extern void ext4_es_init_tree(struct ext4_es_tree *tree);
 
 extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
-				 ext4_lblk_t len);
+				 ext4_lblk_t len, ext4_fsblk_t pblk,
+				 unsigned long long status);
 extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
 extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
 				struct extent_status *es);
 
+static inline int ext4_es_is_written(struct extent_status *es)
+{
+	return (es->es_pblk & EXTENT_STATUS_WRITTEN);
+}
+
+static inline int ext4_es_is_unwritten(struct extent_status *es)
+{
+	return (es->es_pblk & EXTENT_STATUS_UNWRITTEN);
+}
+
+static inline int ext4_es_is_delayed(struct extent_status *es)
+{
+	return (es->es_pblk & EXTENT_STATUS_DELAYED);
+}
+
+static inline int ext4_es_is_hole(struct extent_status *es)
+{
+	return (es->es_pblk & EXTENT_STATUS_HOLE);
+}
+
+static inline ext4_fsblk_t ext4_es_status(struct extent_status *es)
+{
+	return (es->es_pblk & EXTENT_STATUS_FLAGS);
+}
+
+static inline ext4_fsblk_t ext4_es_pblock(struct extent_status *es)
+{
+	return (es->es_pblk & ~EXTENT_STATUS_FLAGS);
+}
+
+static inline void ext4_es_store_pblock(struct extent_status *es,
+					ext4_fsblk_t pb)
+{
+	ext4_fsblk_t block;
+
+	block = (pb & ~EXTENT_STATUS_FLAGS) |
+		(es->es_pblk & EXTENT_STATUS_FLAGS);
+	es->es_pblk = block;
+}
+
+static inline void ext4_es_store_status(struct extent_status *es,
+					unsigned long long status)
+{
+	ext4_fsblk_t block;
+
+	block = (status & EXTENT_STATUS_FLAGS) |
+		(es->es_pblk & ~EXTENT_STATUS_FLAGS);
+	es->es_pblk = block;
+}
+
 #endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cbfe13b..7fb00d8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1821,7 +1821,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 				goto out_unlock;
 		}
 
-		retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len);
+		retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					       ~0, EXTENT_STATUS_DELAYED);
 		if (retval)
 			goto out_unlock;
 
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 952628a..ef2f96e 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2068,28 +2068,33 @@ TRACE_EVENT(ext4_ext_remove_space_done,
 );
 
 TRACE_EVENT(ext4_es_insert_extent,
-	TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
+	TP_PROTO(struct inode *inode, struct extent_status *es),
 
-	TP_ARGS(inode, lblk, len),
+	TP_ARGS(inode, es),
 
 	TP_STRUCT__entry(
-		__field(	dev_t,	dev			)
-		__field(	ino_t,	ino			)
-		__field(	loff_t,	lblk			)
-		__field(	loff_t, len			)
+		__field(	dev_t,		dev		)
+		__field(	ino_t,		ino		)
+		__field(	ext4_lblk_t,	lblk		)
+		__field(	ext4_lblk_t,	len		)
+		__field(	ext4_fsblk_t,	pblk		)
+		__field(	unsigned long long, status	)
 	),
 
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->lblk	= lblk;
-		__entry->len	= len;
+		__entry->lblk	= es->es_lblk;
+		__entry->len	= es->es_len;
+		__entry->pblk	= ext4_es_pblock(es);
+		__entry->status	= ext4_es_status(es);
 	),
 
-	TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
+	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %llx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->lblk, __entry->len)
+		  __entry->lblk, __entry->len,
+		  __entry->pblk, __entry->status)
 );
 
 TRACE_EVENT(ext4_es_remove_extent,
@@ -2150,6 +2155,8 @@ TRACE_EVENT(ext4_es_find_extent_exit,
 		__field(	ino_t,		ino		)
 		__field(	ext4_lblk_t,	lblk		)
 		__field(	ext4_lblk_t,	len		)
+		__field(	ext4_fsblk_t,	pblk		)
+		__field(	unsigned long long, status	)
 		__field(	ext4_lblk_t,	ret		)
 	),
 
@@ -2158,13 +2165,16 @@ TRACE_EVENT(ext4_es_find_extent_exit,
 		__entry->ino	= inode->i_ino;
 		__entry->lblk	= es->es_lblk;
 		__entry->len	= es->es_len;
+		__entry->pblk	= ext4_es_pblock(es);
+		__entry->status	= ext4_es_status(es);
 		__entry->ret	= ret;
 	),
 
-	TP_printk("dev %d,%d ino %lu es [%u/%u) ret %u",
+	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %llx ret %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->lblk, __entry->len, __entry->ret)
+		  __entry->lblk, __entry->len,
+		  __entry->pblk, __entry->status, __entry->ret)
 );
 
 #endif /* _TRACE_EXT4_H */
-- 
1.7.12.rc2.18.g61b472e


* [PATCH 03/10 v5] ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
  2013-02-08  8:43 [PATCH 00/10 v5] ext4: extent status tree (step2) Zheng Liu
  2013-02-08  8:43 ` [PATCH 01/10 v5] ext4: refine extent status tree Zheng Liu
  2013-02-08  8:43 ` [PATCH 02/10 v5] ext4: add physical block and status member into " Zheng Liu
@ 2013-02-08  8:43 ` Zheng Liu
  2013-02-08 15:41   ` Jan Kara
  2013-02-08  8:44 ` [PATCH 04/10 v5] ext4: track all extent status in extent status tree Zheng Liu
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Zheng Liu @ 2013-02-08  8:43 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Jan kara

From: Zheng Liu <wenqing.lz@taobao.com>

This commit lets ext4_ext_map_blocks return the EXT4_MAP_UNWRITTEN flag
because a later commit needs this flag in ext4_map_blocks to determine
the extent status.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
---
 fs/ext4/extents.c |  6 +++++-
 fs/ext4/inode.c   | 12 +++---------
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f7bf616..d92947f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3657,6 +3657,7 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
 			ext4_set_io_unwritten_flag(inode, io);
 		else
 			ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
+		map->m_flags |= EXT4_MAP_UNWRITTEN;
 		if (ext4_should_dioread_nolock(inode))
 			map->m_flags |= EXT4_MAP_UNINIT;
 		goto out;
@@ -3678,8 +3679,10 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
 	 * repeat fallocate creation request
 	 * we already have an unwritten extent
 	 */
-	if (flags & EXT4_GET_BLOCKS_UNINIT_EXT)
+	if (flags & EXT4_GET_BLOCKS_UNINIT_EXT) {
+		map->m_flags |= EXT4_MAP_UNWRITTEN;
 		goto map_out;
+	}
 
 	/* buffered READ or buffered write_begin() lookup */
 	if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
@@ -4109,6 +4112,7 @@ got_allocated_blocks:
 	/* Mark uninitialized */
 	if (flags & EXT4_GET_BLOCKS_UNINIT_EXT){
 		ext4_ext_mark_uninitialized(&newex);
+		map->m_flags |= EXT4_MAP_UNWRITTEN;
 		/*
 		 * io_end structure was created for every IO write to an
 		 * uninitialized extent. To avoid unnecessary conversion,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7fb00d8..c7e9665 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -560,16 +560,10 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		return retval;
 
 	/*
-	 * When we call get_blocks without the create flag, the
-	 * BH_Unwritten flag could have gotten set if the blocks
-	 * requested were part of a uninitialized extent.  We need to
-	 * clear this flag now that we are committed to convert all or
-	 * part of the uninitialized extent to be an initialized
-	 * extent.  This is because we need to avoid the combination
-	 * of BH_Unwritten and BH_Mapped flags being simultaneously
-	 * set on the buffer_head.
+	 * Here we clear m_flags because after allocating a new extent,
+	 * it will be set again.
 	 */
-	map->m_flags &= ~EXT4_MAP_UNWRITTEN;
+	map->m_flags &= ~EXT4_MAP_FLAGS;
 
 	/*
 	 * New blocks allocate and/or writing to uninitialized extent
-- 
1.7.12.rc2.18.g61b472e


* [PATCH 04/10 v5] ext4: track all extent status in extent status tree
  2013-02-08  8:43 [PATCH 00/10 v5] ext4: extent status tree (step2) Zheng Liu
                   ` (2 preceding siblings ...)
  2013-02-08  8:43 ` [PATCH 03/10 v5] ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag Zheng Liu
@ 2013-02-08  8:44 ` Zheng Liu
  2013-02-11 12:21   ` Jan Kara
  2013-02-13  3:28   ` Theodore Ts'o
  2013-02-08  8:44 ` [PATCH 05/10 v5] ext4: lookup block mapping " Zheng Liu
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-08  8:44 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Jan kara

From: Zheng Liu <wenqing.lz@taobao.com>

By recording the physical block and status, the extent status tree is
able to track the status of every extent.  When we call the _map_blocks
functions to look up an extent or create a new written/unwritten/delayed
extent, this extent is inserted into the extent status tree.  Hole
extents are inserted in ext4_ext_put_gap_in_cache().  If the file has no
extent at all, we do not insert a hole extent [0, ~0] into the extent
status tree, in order to reduce the complexity of the code.

We don't load all extents from disk in alloc_inode() because it costs
too much memory, and if a file is opened and closed frequently it would
take too much time to load all extent information.  So currently an
extent is inserted into the extent status tree only when we create it or
look it up.  Hence, the extent status tree may not contain all of the
extents found in the file.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
---
 fs/ext4/extents.c           |  4 +--
 fs/ext4/extents_status.c    | 27 ++++++++++++------
 fs/ext4/extents_status.h    |  4 +--
 fs/ext4/file.c              |  4 +--
 fs/ext4/inode.c             | 68 ++++++++++++++++++++++++++++-----------------
 include/trace/events/ext4.h |  4 +--
 6 files changed, 70 insertions(+), 41 deletions(-)
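
Once the tree holds extents of every status, a search for a delayed
extent has to skip the others.  The sketch below shows that idea on a
sorted array instead of an rb-tree; it is an assumption-laden
illustration, and all sk_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_BLOCKS	UINT32_MAX	/* stand-in for "no further extent" */

enum sk_kind { SK_WRITTEN, SK_UNWRITTEN, SK_DELAYED, SK_HOLE };

struct sk_ext {
	uint32_t lblk;		/* first logical block */
	uint32_t len;		/* length in blocks */
	enum sk_kind kind;
};

/*
 * tab[] is sorted by lblk and its entries do not overlap.
 * On return *out is the first delayed extent that covers lblk or lies
 * after it; out->len == 0 means none was found (out->kind is then
 * meaningless).  The return value is the first block of the next
 * delayed extent after *out, or SK_MAX_BLOCKS if there is none.
 */
static uint32_t sk_find_delayed(const struct sk_ext *tab, size_t n,
				uint32_t lblk, struct sk_ext *out)
{
	uint32_t next = SK_MAX_BLOCKS;
	size_t i = 0;

	out->lblk = lblk;
	out->len = 0;
	out->kind = SK_HOLE;

	/* First extent that covers lblk or starts after it. */
	while (i < n && (uint64_t)tab[i].lblk + tab[i].len <= lblk)
		i++;

	/* Skip written, unwritten and hole extents. */
	while (i < n && tab[i].kind != SK_DELAYED)
		i++;
	if (i == n)
		return next;

	*out = tab[i];

	/* Report where the next delayed extent begins, if any. */
	for (i++; i < n; i++) {
		if (tab[i].kind == SK_DELAYED) {
			next = tab[i].lblk;
			break;
		}
	}
	return next;
}
```

A caller that only cares about delayed blocks can test out->len against
zero without inspecting any non-delayed extent.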

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index d92947f..4b065ff 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3526,7 +3526,7 @@ static int ext4_find_delalloc_range(struct inode *inode,
 	struct extent_status es;
 
 	es.es_lblk = lblk_start;
-	(void)ext4_es_find_extent(inode, &es);
+	(void)ext4_es_find_delayed_extent(inode, &es);
 	if (es.es_len == 0)
 		return 0; /* there is no delay extent in this tree */
 	else if (es.es_lblk <= lblk_start &&
@@ -4573,7 +4573,7 @@ static int ext4_find_delayed_extent(struct inode *inode,
 	ext4_lblk_t next_del;
 
 	es.es_lblk = newex->ec_block;
-	next_del = ext4_es_find_extent(inode, &es);
+	next_del = ext4_es_find_delayed_extent(inode, &es);
 
 	if (newex->ec_start == 0) {
 		/*
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 5093cee..71cb75a 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -229,7 +229,7 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
 }
 
 /*
- * ext4_es_find_extent: find the 1st delayed extent covering @es->lblk
+ * ext4_es_find_delayed_extent: find the 1st delayed extent covering @es->lblk
  * if it exists, otherwise, the next extent after @es->lblk.
  *
  * @inode: the inode which owns delayed extents
@@ -239,14 +239,15 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
  * EXT_MAX_BLOCKS if no extent is found.
  * Delayed extent is returned via @es.
  */
-ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
+ext4_lblk_t ext4_es_find_delayed_extent(struct inode *inode,
+					struct extent_status *es)
 {
 	struct ext4_es_tree *tree = NULL;
 	struct extent_status *es1 = NULL;
 	struct rb_node *node;
 	ext4_lblk_t ret = EXT_MAX_BLOCKS;
 
-	trace_ext4_es_find_extent_enter(inode, es->es_lblk);
+	trace_ext4_es_find_delayed_extent_enter(inode, es->es_lblk);
 
 	read_lock(&EXT4_I(inode)->i_es_lock);
 	tree = &EXT4_I(inode)->i_es_tree;
@@ -266,21 +267,31 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
 	es1 = __es_tree_search(&tree->root, es->es_lblk);
 
 out:
-	if (es1) {
+	if (es1 && !ext4_es_is_delayed(es1)) {
+		while ((node = rb_next(&es1->rb_node)) != NULL) {
+			es1 = rb_entry(node, struct extent_status, rb_node);
+			if (ext4_es_is_delayed(es1))
+				break;
+		}
+	}
+
+	if (es1 && ext4_es_is_delayed(es1)) {
 		tree->cache_es = es1;
 		es->es_lblk = es1->es_lblk;
 		es->es_len = es1->es_len;
 		es->es_pblk = es1->es_pblk;
-		node = rb_next(&es1->rb_node);
-		if (node) {
+		while ((node = rb_next(&es1->rb_node)) != NULL) {
 			es1 = rb_entry(node, struct extent_status, rb_node);
-			ret = es1->es_lblk;
+			if (ext4_es_is_delayed(es1)) {
+				ret = es1->es_lblk;
+				break;
+			}
 		}
 	}
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
 
-	trace_ext4_es_find_extent_exit(inode, es, ret);
+	trace_ext4_es_find_delayed_extent_exit(inode, es, ret);
 	return ret;
 }
 
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 2a5d69e..b5788eb 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -51,8 +51,8 @@ extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 				 unsigned long long status);
 extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
-extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
-				struct extent_status *es);
+extern ext4_lblk_t ext4_es_find_delayed_extent(struct inode *inode,
+					       struct extent_status *es);
 
 static inline int ext4_es_is_written(struct extent_status *es)
 {
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 718c49f..bbfbf78 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -465,7 +465,7 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
 		 * it will be as a data.
 		 */
 		es.es_lblk = last;
-		(void)ext4_es_find_extent(inode, &es);
+		(void)ext4_es_find_delayed_extent(inode, &es);
 		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
 			if (last != start)
 				dataoff = last << blkbits;
@@ -549,7 +549,7 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
 		 * we will skip this extent.
 		 */
 		es.es_lblk = last;
-		(void)ext4_es_find_extent(inode, &es);
+		(void)ext4_es_find_delayed_extent(inode, &es);
 		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
 			last = es.es_lblk + es.es_len;
 			holeoff = last << blkbits;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c7e9665..16454fc 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -527,20 +527,22 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		retval = ext4_ind_map_blocks(handle, inode, map, flags &
 					     EXT4_GET_BLOCKS_KEEP_SIZE);
 	}
+	if (retval > 0) {
+		int ret;
+		unsigned long long status;
+
+		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+		ret = ext4_es_insert_extent(inode, map->m_lblk,
+					    map->m_len, map->m_pblk, status);
+		if (ret < 0)
+			retval = ret;
+	}
 	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
 		up_read((&EXT4_I(inode)->i_data_sem));
 
 	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
-		int ret;
-		if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
-			/* delayed alloc may be allocated by fallocate and
-			 * coverted to initialized by directIO.
-			 * we need to handle delayed extent here.
-			 */
-			down_write((&EXT4_I(inode)->i_data_sem));
-			goto delayed_mapped;
-		}
-		ret = check_block_validity(inode, map);
+		int ret = check_block_validity(inode, map);
 		if (ret != 0)
 			return ret;
 	}
@@ -609,18 +611,19 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE))
 			ext4_da_update_reserve_space(inode, retval, 1);
 	}
-	if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
+	if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
 		ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
 
-		if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
-			int ret;
-delayed_mapped:
-			/* delayed allocation blocks has been allocated */
-			ret = ext4_es_remove_extent(inode, map->m_lblk,
-						    map->m_len);
-			if (ret < 0)
-				retval = ret;
-		}
+	if (retval > 0) {
+		int ret;
+		unsigned long long status;
+
+		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					    map->m_pblk, status);
+		if (ret < 0)
+			retval = ret;
 	}
 
 	up_write((&EXT4_I(inode)->i_data_sem));
@@ -1802,6 +1805,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
 
 	if (retval == 0) {
+		int ret;
 		/*
 		 * XXX: __block_prepare_write() unmaps passed block,
 		 * is it OK?
@@ -1809,16 +1813,20 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		/* If the block was allocated from previously allocated cluster,
 		 * then we dont need to reserve it again. */
 		if (!(map->m_flags & EXT4_MAP_FROM_CLUSTER)) {
-			retval = ext4_da_reserve_space(inode, iblock);
-			if (retval)
+			ret = ext4_da_reserve_space(inode, iblock);
+			if (ret) {
 				/* not enough space to reserve */
+				retval = ret;
 				goto out_unlock;
+			}
 		}
 
-		retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-					       ~0, EXTENT_STATUS_DELAYED);
-		if (retval)
+		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					    ~0, EXTENT_STATUS_DELAYED);
+		if (ret) {
+			retval = ret;
 			goto out_unlock;
+		}
 
 		/* Clear EXT4_MAP_FROM_CLUSTER flag since its purpose is served
 		 * and it should not appear on the bh->b_state.
@@ -1828,6 +1836,16 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		map_bh(bh, inode->i_sb, invalid_block);
 		set_buffer_new(bh);
 		set_buffer_delay(bh);
+	} else if (retval > 0) {
+		int ret;
+		unsigned long long status;
+
+		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					    map->m_pblk, status);
+		if (ret != 0)
+			retval = ret;
 	}
 
 out_unlock:
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index ef2f96e..d278ced 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2122,7 +2122,7 @@ TRACE_EVENT(ext4_es_remove_extent,
 		  __entry->lblk, __entry->len)
 );
 
-TRACE_EVENT(ext4_es_find_extent_enter,
+TRACE_EVENT(ext4_es_find_delayed_extent_enter,
 	TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
 
 	TP_ARGS(inode, lblk),
@@ -2144,7 +2144,7 @@ TRACE_EVENT(ext4_es_find_extent_enter,
 		  (unsigned long) __entry->ino, __entry->lblk)
 );
 
-TRACE_EVENT(ext4_es_find_extent_exit,
+TRACE_EVENT(ext4_es_find_delayed_extent_exit,
 	TP_PROTO(struct inode *inode, struct extent_status *es,
 		 ext4_lblk_t ret),
 
-- 
1.7.12.rc2.18.g61b472e


* [PATCH 05/10 v5] ext4: lookup block mapping in extent status tree
  2013-02-08  8:43 [PATCH 00/10 v5] ext4: extent status tree (step2) Zheng Liu
                   ` (3 preceding siblings ...)
  2013-02-08  8:44 ` [PATCH 04/10 v5] ext4: track all extent status in extent status tree Zheng Liu
@ 2013-02-08  8:44 ` Zheng Liu
  2013-02-12 12:31   ` Jan Kara
  2013-02-08  8:44 ` [PATCH 06/10 v5] ext4: remove single extent cache Zheng Liu
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Zheng Liu @ 2013-02-08  8:44 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Jan kara

From: Zheng Liu <wenqing.lz@taobao.com>

After tracking all extent status, we already have an extent cache in
memory.  Every time we want to look up a block mapping, we can first
try the extent status tree to avoid a potential disk I/O.

A new function called ext4_es_lookup_extent is defined to do this
work.  Block mappings are always looked up through ext4_map_blocks
and/or ext4_da_map_blocks, so these functions now first try to find the
mapping in the extent status tree.

A new flag, EXT4_GET_BLOCKS_NO_PUT_HOLE, is used in ext4_da_map_blocks
so that a hole is not put into the extent status tree, because this hole
would be converted to a delayed extent in the tree immediately.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
---
 fs/ext4/ext4.h              |  2 ++
 fs/ext4/extents.c           |  7 ++++-
 fs/ext4/extents_status.c    | 59 +++++++++++++++++++++++++++++++++++++++++
 fs/ext4/extents_status.h    |  1 +
 fs/ext4/inode.c             | 64 +++++++++++++++++++++++++++++++++++++++++++--
 include/trace/events/ext4.h | 56 +++++++++++++++++++++++++++++++++++++++
 6 files changed, 186 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8462eb3..ad885b5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -582,6 +582,8 @@ enum {
 #define EXT4_GET_BLOCKS_KEEP_SIZE		0x0080
 	/* Do not take i_data_sem locking in ext4_map_blocks */
 #define EXT4_GET_BLOCKS_NO_LOCK			0x0100
+	/* Do not put hole in extent cache */
+#define EXT4_GET_BLOCKS_NO_PUT_HOLE		0x0200
 
 /*
  * Flags used by ext4_free_blocks
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 4b065ff..1be8955 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2154,6 +2154,8 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
 				block,
 				le32_to_cpu(ex->ee_block),
 				 ext4_ext_get_actual_len(ex));
+		ext4_es_insert_extent(inode, lblock, len, ~0,
+				      EXTENT_STATUS_HOLE);
 	} else if (block >= le32_to_cpu(ex->ee_block)
 			+ ext4_ext_get_actual_len(ex)) {
 		ext4_lblk_t next;
@@ -2167,6 +2169,8 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
 				block);
 		BUG_ON(next == lblock);
 		len = next - lblock;
+		ext4_es_insert_extent(inode, lblock, len, ~0,
+				      EXTENT_STATUS_HOLE);
 	} else {
 		lblock = len = 0;
 		BUG();
@@ -4006,7 +4010,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		 * put just found gap into cache to speed up
 		 * subsequent requests
 		 */
-		ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
+		if ((flags & EXT4_GET_BLOCKS_NO_PUT_HOLE) == 0)
+			ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
 		goto out2;
 	}
 
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 71cb75a..ca7dc9f 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -468,6 +468,65 @@ error:
 	return err;
 }
 
+/*
+ * ext4_es_lookup_extent() looks up an extent in extent status tree.
+ *
+ * ext4_es_lookup_extent is called by ext4_map_blocks/ext4_da_map_blocks.
+ *
+ * Return: 1 on found, 0 on not
+ */
+int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es)
+{
+	struct ext4_es_tree *tree;
+	struct extent_status *es1 = NULL;
+	struct rb_node *node;
+	int found = 0;
+
+	trace_ext4_es_lookup_extent_enter(inode, es->es_lblk);
+	es_debug("lookup extent in block %u\n", es->es_lblk);
+
+	tree = &EXT4_I(inode)->i_es_tree;
+	read_lock(&EXT4_I(inode)->i_es_lock);
+
+	/* find extent in cache firstly */
+	es->es_len = es->es_pblk = 0;
+	if (tree->cache_es) {
+		es1 = tree->cache_es;
+		if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
+			es_debug("%u cached by [%u/%u)\n",
+				 es->es_lblk, es1->es_lblk, es1->es_len);
+			found = 1;
+			goto out;
+		}
+	}
+
+	node = tree->root.rb_node;
+	while (node) {
+		es1 = rb_entry(node, struct extent_status, rb_node);
+		if (es->es_lblk < es1->es_lblk)
+			node = node->rb_left;
+		else if (es->es_lblk > ext4_es_end(es1))
+			node = node->rb_right;
+		else {
+			found = 1;
+			break;
+		}
+	}
+
+out:
+	if (found) {
+		BUG_ON(!es1);
+		es->es_lblk = es1->es_lblk;
+		es->es_len = es1->es_len;
+		es->es_pblk = es1->es_pblk;
+	}
+
+	read_unlock(&EXT4_I(inode)->i_es_lock);
+
+	trace_ext4_es_lookup_extent_exit(inode, es, found);
+	return found;
+}
+
 static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 				 ext4_lblk_t end)
 {
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index b5788eb..effe78c 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -53,6 +53,7 @@ extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
 extern ext4_lblk_t ext4_es_find_delayed_extent(struct inode *inode,
 					       struct extent_status *es);
+extern int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es);
 
 static inline int ext4_es_is_written(struct extent_status *es)
 {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 16454fc..670779a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -508,12 +508,34 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
 int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		    struct ext4_map_blocks *map, int flags)
 {
+	struct extent_status es;
 	int retval;
 
 	map->m_flags = 0;
 	ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
 		  "logical block %lu\n", inode->i_ino, flags, map->m_len,
 		  (unsigned long) map->m_lblk);
+
+	/* Lookup extent status tree firstly */
+	es.es_lblk = map->m_lblk;
+	if (ext4_es_lookup_extent(inode, &es)) {
+		if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
+			map->m_pblk = ext4_es_pblock(&es) +
+					map->m_lblk - es.es_lblk;
+			map->m_flags |= ext4_es_is_written(&es) ?
+					EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
+			retval = es.es_len - (map->m_lblk - es.es_lblk);
+			if (retval > map->m_len)
+				retval = map->m_len;
+			map->m_len = retval;
+		} else if (ext4_es_is_delayed(&es) || ext4_es_is_hole(&es)) {
+			retval = 0;
+		} else {
+			BUG_ON(1);
+		}
+		goto found;
+	}
+
 	/*
 	 * Try to see if we can get the block without requesting a new
 	 * file system block.
@@ -541,6 +563,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
 		up_read((&EXT4_I(inode)->i_data_sem));
 
+found:
 	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
 		int ret = check_block_validity(inode, map);
 		if (ret != 0)
@@ -1772,6 +1795,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 			      struct ext4_map_blocks *map,
 			      struct buffer_head *bh)
 {
+	struct extent_status es;
 	int retval;
 	sector_t invalid_block = ~((sector_t) 0xffff);
 
@@ -1782,6 +1806,39 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 	ext_debug("ext4_da_map_blocks(): inode %lu, max_blocks %u,"
 		  "logical block %lu\n", inode->i_ino, map->m_len,
 		  (unsigned long) map->m_lblk);
+
+	/* Lookup extent status tree firstly */
+	es.es_lblk = iblock;
+	if (ext4_es_lookup_extent(inode, &es)) {
+
+		if (ext4_es_is_hole(&es)) {
+			retval = 0;
+			down_read((&EXT4_I(inode)->i_data_sem));
+			goto add_delayed;
+		}
+
+		if (ext4_es_is_delayed(&es)) {
+			map_bh(bh, inode->i_sb, invalid_block);
+			set_buffer_new(bh);
+			set_buffer_delay(bh);
+			return 0;
+		}
+
+		map->m_pblk = ext4_es_pblock(&es) + iblock - es.es_lblk;
+		retval = es.es_len - (iblock - es.es_lblk);
+		if (retval > map->m_len)
+			retval = map->m_len;
+		map->m_len = retval;
+		if (ext4_es_is_written(&es))
+			map->m_flags |= EXT4_MAP_MAPPED;
+		else if (ext4_es_is_unwritten(&es))
+			map->m_flags |= EXT4_MAP_UNWRITTEN;
+		else
+			BUG_ON(1);
+
+		return retval;
+	}
+
 	/*
 	 * Try to see if we can get the block without requesting a new
 	 * file system block.
@@ -1800,10 +1857,13 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 			map->m_flags |= EXT4_MAP_FROM_CLUSTER;
 		retval = 0;
 	} else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
-		retval = ext4_ext_map_blocks(NULL, inode, map, 0);
+		retval = ext4_ext_map_blocks(NULL, inode, map,
+					     EXT4_GET_BLOCKS_NO_PUT_HOLE);
 	else
-		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
+		retval = ext4_ind_map_blocks(NULL, inode, map,
+					     EXT4_GET_BLOCKS_NO_PUT_HOLE);
 
+add_delayed:
 	if (retval == 0) {
 		int ret;
 		/*
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index d278ced..822780a 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2177,6 +2177,62 @@ TRACE_EVENT(ext4_es_find_delayed_extent_exit,
 		  __entry->pblk, __entry->status, __entry->ret)
 );
 
+TRACE_EVENT(ext4_es_lookup_extent_enter,
+	TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
+
+	TP_ARGS(inode, lblk),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,		dev		)
+		__field(	ino_t,		ino		)
+		__field(	ext4_lblk_t,	lblk		)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= inode->i_sb->s_dev;
+		__entry->ino	= inode->i_ino;
+		__entry->lblk	= lblk;
+	),
+
+	TP_printk("dev %d,%d ino %lu lblk %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long) __entry->ino, __entry->lblk)
+);
+
+TRACE_EVENT(ext4_es_lookup_extent_exit,
+	TP_PROTO(struct inode *inode, struct extent_status *es,
+		 int found),
+
+	TP_ARGS(inode, es, found),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,		dev		)
+		__field(	ino_t,		ino		)
+		__field(	ext4_lblk_t,	lblk		)
+		__field(	ext4_lblk_t,	len		)
+		__field(	ext4_fsblk_t,	pblk		)
+		__field(	unsigned long long,	status	)
+		__field(	int,		found		)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= inode->i_sb->s_dev;
+		__entry->ino	= inode->i_ino;
+		__entry->lblk	= es->es_lblk;
+		__entry->len	= es->es_len;
+		__entry->pblk	= ext4_es_pblock(es);
+		__entry->status	= ext4_es_status(es);
+		__entry->found	= found;
+	),
+
+	TP_printk("dev %d,%d ino %lu found %d [%u/%u) %llu %llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long) __entry->ino, __entry->found,
+		  __entry->lblk, __entry->len,
+		  __entry->found ? __entry->pblk : 0,
+		  __entry->found ? __entry->status : 0)
+);
+
 #endif /* _TRACE_EXT4_H */
 
 /* This part must be outside protection */
-- 
1.7.12.rc2.18.g61b472e



* [PATCH 06/10 v5] ext4: remove single extent cache
  2013-02-08  8:43 [PATCH 00/10 v5] ext4: extent status tree (step2) Zheng Liu
                   ` (4 preceding siblings ...)
  2013-02-08  8:44 ` [PATCH 05/10 v5] ext4: lookup block mapping " Zheng Liu
@ 2013-02-08  8:44 ` Zheng Liu
  2013-02-08  8:44 ` [PATCH 07/10 v5] ext4: adjust some functions for reclaiming extents from extent status tree Zheng Liu
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-08  8:44 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Jan kara

From: Zheng Liu <wenqing.lz@taobao.com>

The single extent cache can be removed because the extent status tree
now serves as an extent cache, and a better one.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
---
 fs/ext4/ext4.h         |  12 ----
 fs/ext4/ext4_extents.h |   6 --
 fs/ext4/extents.c      | 177 +++++++++++--------------------------------------
 fs/ext4/move_extent.c  |   3 -
 fs/ext4/super.c        |   1 -
 5 files changed, 37 insertions(+), 162 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ad885b5..12b1fc7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -812,17 +812,6 @@ do {									       \
 
 #endif /* defined(__KERNEL__) || defined(__linux__) */
 
-/*
- * storage for cached extent
- * If ec_len == 0, then the cache is invalid.
- * If ec_start == 0, then the cache represents a gap (null mapping)
- */
-struct ext4_ext_cache {
-	ext4_fsblk_t	ec_start;
-	ext4_lblk_t	ec_block;
-	__u32		ec_len; /* must be 32bit to return holes */
-};
-
 #include "extents_status.h"
 
 /*
@@ -889,7 +878,6 @@ struct ext4_inode_info {
 	struct inode vfs_inode;
 	struct jbd2_inode *jinode;
 
-	struct ext4_ext_cache i_cached_extent;
 	/*
 	 * File creation time. Its function is same as that of
 	 * struct timespec i_{a,c,m}time in the generic inode.
diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
index 487fda1..8643ff5 100644
--- a/fs/ext4/ext4_extents.h
+++ b/fs/ext4/ext4_extents.h
@@ -193,12 +193,6 @@ static inline unsigned short ext_depth(struct inode *inode)
 	return le16_to_cpu(ext_inode_hdr(inode)->eh_depth);
 }
 
-static inline void
-ext4_ext_invalidate_cache(struct inode *inode)
-{
-	EXT4_I(inode)->i_cached_extent.ec_len = 0;
-}
-
 static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext)
 {
 	/* We can not have an uninitialized extent of zero length! */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 1be8955..9f21430 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -112,7 +112,7 @@ static int ext4_split_extent_at(handle_t *handle,
 			     int flags);
 
 static int ext4_find_delayed_extent(struct inode *inode,
-				    struct ext4_ext_cache *newex);
+				    struct extent_status *newes);
 
 static int ext4_ext_truncate_extend_restart(handle_t *handle,
 					    struct inode *inode,
@@ -714,7 +714,6 @@ int ext4_ext_tree_init(handle_t *handle, struct inode *inode)
 	eh->eh_magic = EXT4_EXT_MAGIC;
 	eh->eh_max = cpu_to_le16(ext4_ext_space_root(inode, 0));
 	ext4_mark_inode_dirty(handle, inode);
-	ext4_ext_invalidate_cache(inode);
 	return 0;
 }
 
@@ -1960,7 +1959,6 @@ cleanup:
 		ext4_ext_drop_refs(npath);
 		kfree(npath);
 	}
-	ext4_ext_invalidate_cache(inode);
 	return err;
 }
 
@@ -1969,8 +1967,8 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
 				    struct fiemap_extent_info *fieinfo)
 {
 	struct ext4_ext_path *path = NULL;
-	struct ext4_ext_cache newex;
 	struct ext4_extent *ex;
+	struct extent_status es;
 	ext4_lblk_t next, next_del, start = 0, end = 0;
 	ext4_lblk_t last = block + num;
 	int exists, depth = 0, err = 0;
@@ -2044,31 +2042,31 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
 		BUG_ON(end <= start);
 
 		if (!exists) {
-			newex.ec_block = start;
-			newex.ec_len = end - start;
-			newex.ec_start = 0;
+			es.es_lblk = start;
+			es.es_len = end - start;
+			es.es_pblk = 0;
 		} else {
-			newex.ec_block = le32_to_cpu(ex->ee_block);
-			newex.ec_len = ext4_ext_get_actual_len(ex);
-			newex.ec_start = ext4_ext_pblock(ex);
+			es.es_lblk = le32_to_cpu(ex->ee_block);
+			es.es_len = ext4_ext_get_actual_len(ex);
+			es.es_pblk = ext4_ext_pblock(ex);
 			if (ext4_ext_is_uninitialized(ex))
 				flags |= FIEMAP_EXTENT_UNWRITTEN;
 		}
 
 		/*
-		 * Find delayed extent and update newex accordingly. We call
-		 * it even in !exists case to find out whether newex is the
+		 * Find delayed extent and update es accordingly. We call
+		 * it even in !exists case to find out whether es is the
 		 * last existing extent or not.
 		 */
-		next_del = ext4_find_delayed_extent(inode, &newex);
+		next_del = ext4_find_delayed_extent(inode, &es);
 		if (!exists && next_del) {
 			exists = 1;
 			flags |= FIEMAP_EXTENT_DELALLOC;
 		}
 		up_read(&EXT4_I(inode)->i_data_sem);
 
-		if (unlikely(newex.ec_len == 0)) {
-			EXT4_ERROR_INODE(inode, "newex.ec_len == 0");
+		if (unlikely(es.es_len == 0)) {
+			EXT4_ERROR_INODE(inode, "es.es_len == 0");
 			err = -EIO;
 			break;
 		}
@@ -2089,9 +2087,9 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
 
 		if (exists) {
 			err = fiemap_fill_next_extent(fieinfo,
-				(__u64)newex.ec_block << blksize_bits,
-				(__u64)newex.ec_start << blksize_bits,
-				(__u64)newex.ec_len << blksize_bits,
+				(__u64)es.es_lblk << blksize_bits,
+				(__u64)es.es_pblk << blksize_bits,
+				(__u64)es.es_len << blksize_bits,
 				flags);
 			if (err < 0)
 				break;
@@ -2101,7 +2099,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
 			}
 		}
 
-		block = newex.ec_block + newex.ec_len;
+		block = es.es_lblk + es.es_len;
 	}
 
 	if (path) {
@@ -2112,21 +2110,6 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
 	return err;
 }
 
-static void
-ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
-			__u32 len, ext4_fsblk_t start)
-{
-	struct ext4_ext_cache *cex;
-	BUG_ON(len == 0);
-	spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
-	trace_ext4_ext_put_in_cache(inode, block, len, start);
-	cex = &EXT4_I(inode)->i_cached_extent;
-	cex->ec_block = block;
-	cex->ec_len = len;
-	cex->ec_start = start;
-	spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
-}
-
 /*
  * ext4_ext_put_gap_in_cache:
  * calculate boundaries of the gap that the requested block fits into
@@ -2143,9 +2126,10 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
 
 	ex = path[depth].p_ext;
 	if (ex == NULL) {
-		/* there is no extent yet, so gap is [0;-] */
-		lblock = 0;
-		len = EXT_MAX_BLOCKS;
+		/*
+		 * there is no extent yet, so gap is [0;-] and we
+		 * don't cache it
+		 */
 		ext_debug("cache gap(whole file):");
 	} else if (block < le32_to_cpu(ex->ee_block)) {
 		lblock = block;
@@ -2177,52 +2161,6 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
 	}
 
 	ext_debug(" -> %u:%lu\n", lblock, len);
-	ext4_ext_put_in_cache(inode, lblock, len, 0);
-}
-
-/*
- * ext4_ext_in_cache()
- * Checks to see if the given block is in the cache.
- * If it is, the cached extent is stored in the given
- * cache extent pointer.
- *
- * @inode: The files inode
- * @block: The block to look for in the cache
- * @ex:    Pointer where the cached extent will be stored
- *         if it contains block
- *
- * Return 0 if cache is invalid; 1 if the cache is valid
- */
-static int
-ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
-		  struct ext4_extent *ex)
-{
-	struct ext4_ext_cache *cex;
-	int ret = 0;
-
-	/*
-	 * We borrow i_block_reservation_lock to protect i_cached_extent
-	 */
-	spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
-	cex = &EXT4_I(inode)->i_cached_extent;
-
-	/* has cache valid data? */
-	if (cex->ec_len == 0)
-		goto errout;
-
-	if (in_range(block, cex->ec_block, cex->ec_len)) {
-		ex->ee_block = cpu_to_le32(cex->ec_block);
-		ext4_ext_store_pblock(ex, cex->ec_start);
-		ex->ee_len = cpu_to_le16(cex->ec_len);
-		ext_debug("%u cached by %u:%u:%llu\n",
-				block,
-				cex->ec_block, cex->ec_len, cex->ec_start);
-		ret = 1;
-	}
-errout:
-	trace_ext4_ext_in_cache(inode, block, ret);
-	spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
-	return ret;
 }
 
 /*
@@ -2662,8 +2600,6 @@ static int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 		return PTR_ERR(handle);
 
 again:
-	ext4_ext_invalidate_cache(inode);
-
 	trace_ext4_ext_remove_space(inode, start, depth);
 
 	/*
@@ -3906,35 +3842,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		  map->m_lblk, map->m_len, inode->i_ino);
 	trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);
 
-	/* check in cache */
-	if (ext4_ext_in_cache(inode, map->m_lblk, &newex)) {
-		if (!newex.ee_start_lo && !newex.ee_start_hi) {
-			if ((sbi->s_cluster_ratio > 1) &&
-			    ext4_find_delalloc_cluster(inode, map->m_lblk))
-				map->m_flags |= EXT4_MAP_FROM_CLUSTER;
-
-			if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
-				/*
-				 * block isn't allocated yet and
-				 * user doesn't want to allocate it
-				 */
-				goto out2;
-			}
-			/* we should allocate requested block */
-		} else {
-			/* block is already allocated */
-			if (sbi->s_cluster_ratio > 1)
-				map->m_flags |= EXT4_MAP_FROM_CLUSTER;
-			newblock = map->m_lblk
-				   - le32_to_cpu(newex.ee_block)
-				   + ext4_ext_pblock(&newex);
-			/* number of remaining blocks in the extent */
-			allocated = ext4_ext_get_actual_len(&newex) -
-				(map->m_lblk - le32_to_cpu(newex.ee_block));
-			goto out;
-		}
-	}
-
 	/* find extent for this block */
 	path = ext4_ext_find_extent(inode, map->m_lblk, NULL);
 	if (IS_ERR(path)) {
@@ -3981,15 +3888,9 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 			ext_debug("%u fit into %u:%d -> %llu\n", map->m_lblk,
 				  ee_block, ee_len, newblock);
 
-			/*
-			 * Do not put uninitialized extent
-			 * in the cache
-			 */
-			if (!ext4_ext_is_uninitialized(ex)) {
-				ext4_ext_put_in_cache(inode, ee_block,
-					ee_len, ee_start);
+			if (!ext4_ext_is_uninitialized(ex))
 				goto out;
-			}
+
 			allocated = ext4_ext_handle_uninitialized_extents(
 				handle, inode, map, path, flags,
 				allocated, newblock);
@@ -4251,10 +4152,9 @@ got_allocated_blocks:
 	 * Cache the extent and update transaction to commit on fdatasync only
 	 * when it is _not_ an uninitialized extent.
 	 */
-	if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) {
-		ext4_ext_put_in_cache(inode, map->m_lblk, allocated, newblock);
+	if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0)
 		ext4_update_inode_fsync_trans(handle, inode, 1);
-	} else
+	else
 		ext4_update_inode_fsync_trans(handle, inode, 0);
 out:
 	if (allocated > map->m_len)
@@ -4313,7 +4213,6 @@ void ext4_ext_truncate(struct inode *inode)
 		goto out_stop;
 
 	down_write(&EXT4_I(inode)->i_data_sem);
-	ext4_ext_invalidate_cache(inode);
 
 	ext4_discard_preallocations(inode);
 
@@ -4563,40 +4462,40 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
 }
 
 /*
- * If newex is not existing extent (newex->ec_start equals zero) find
- * delayed extent at start of newex and update newex accordingly and
+ * If newes is not existing extent (newes->es_pblk equals zero) find
+ * delayed extent at start of newes and update newes accordingly and
  * return start of the next delayed extent.
  *
- * If newex is existing extent (newex->ec_start is not equal zero)
+ * If newes is existing extent (newes->es_pblk is not equal zero)
  * return start of next delayed extent or EXT_MAX_BLOCKS if no delayed
- * extent found. Leave newex unmodified.
+ * extent found. Leave newes unmodified.
  */
 static int ext4_find_delayed_extent(struct inode *inode,
-				    struct ext4_ext_cache *newex)
+				    struct extent_status *newes)
 {
 	struct extent_status es;
 	ext4_lblk_t next_del;
 
-	es.es_lblk = newex->ec_block;
+	es.es_lblk = newes->es_lblk;
 	next_del = ext4_es_find_delayed_extent(inode, &es);
 
-	if (newex->ec_start == 0) {
+	if (newes->es_pblk == 0) {
 		/*
-		 * No extent in extent-tree contains block @newex->ec_start,
+		 * No extent in extent-tree contains block @newes->es_pblk,
 		 * then the block may stay in 1)a hole or 2)delayed-extent.
 		 */
 		if (es.es_len == 0)
 			/* A hole found. */
 			return 0;
 
-		if (es.es_lblk > newex->ec_block) {
+		if (es.es_lblk > newes->es_lblk) {
 			/* A hole found. */
-			newex->ec_len = min(es.es_lblk - newex->ec_block,
-					    newex->ec_len);
+			newes->es_len = min(es.es_lblk - newes->es_lblk,
+					    newes->es_len);
 			return 0;
 		}
 
-		newex->ec_len = es.es_lblk + es.es_len - newex->ec_block;
+		newes->es_len = es.es_lblk + es.es_len - newes->es_lblk;
 	}
 
 	return next_del;
@@ -4796,14 +4695,12 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
 		goto out;
 
 	down_write(&EXT4_I(inode)->i_data_sem);
-	ext4_ext_invalidate_cache(inode);
 	ext4_discard_preallocations(inode);
 
 	err = ext4_es_remove_extent(inode, first_block,
 				    stop_block - first_block);
 	err = ext4_ext_remove_space(inode, first_block, stop_block - 1);
 
-	ext4_ext_invalidate_cache(inode);
 	ext4_discard_preallocations(inode);
 
 	if (IS_SYNC(inode))
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index d9cc5ee..b9222c8 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -761,9 +761,6 @@ out:
 		kfree(donor_path);
 	}
 
-	ext4_ext_invalidate_cache(orig_inode);
-	ext4_ext_invalidate_cache(donor_inode);


* [PATCH 07/10 v5] ext4: adjust some functions for reclaiming extents from extent status tree
  2013-02-08  8:43 [PATCH 00/10 v5] ext4: extent status tree (step2) Zheng Liu
                   ` (5 preceding siblings ...)
  2013-02-08  8:44 ` [PATCH 06/10 v5] ext4: remove single extent cache Zheng Liu
@ 2013-02-08  8:44 ` Zheng Liu
  2013-02-08  8:44 ` [PATCH 08/10 v5] ext4: reclaim " Zheng Liu
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-08  8:44 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Jan kara

From: Zheng Liu <wenqing.lz@taobao.com>

This commit changes some interfaces in the extent status tree because
we need the inode in order to count the cached objects in an extent
status tree.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
---
 fs/ext4/extents_status.c | 50 +++++++++++++++++++++++-------------------------
 1 file changed, 24 insertions(+), 26 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index ca7dc9f..84fe78d 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -142,9 +142,8 @@
 
 static struct kmem_cache *ext4_es_cachep;
 
-static int __es_insert_extent(struct ext4_es_tree *tree,
-			      struct extent_status *newes);
-static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
+static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
+static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			      ext4_lblk_t end);
 
 int __init ext4_init_es(void)
@@ -296,7 +295,8 @@ out:
 }
 
 static struct extent_status *
-ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len, ext4_fsblk_t pblk)
+ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
+		     ext4_fsblk_t pblk)
 {
 	struct extent_status *es;
 	es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
@@ -308,7 +308,7 @@ ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len, ext4_fsblk_t pblk)
 	return es;
 }
 
-static void ext4_es_free_extent(struct extent_status *es)
+static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
 {
 	kmem_cache_free(ext4_es_cachep, es);
 }
@@ -337,8 +337,9 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
 }
 
 static struct extent_status *
-ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
+ext4_es_try_to_merge_left(struct inode *inode, struct extent_status *es)
 {
+	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
 	struct extent_status *es1;
 	struct rb_node *node;
 
@@ -350,7 +351,7 @@ ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
 	if (ext4_es_can_be_merged(es1, es)) {
 		es1->es_len += es->es_len;
 		rb_erase(&es->rb_node, &tree->root);
-		ext4_es_free_extent(es);
+		ext4_es_free_extent(inode, es);
 		es = es1;
 	}
 
@@ -358,8 +359,9 @@ ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
 }
 
 static struct extent_status *
-ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
+ext4_es_try_to_merge_right(struct inode *inode, struct extent_status *es)
 {
+	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
 	struct extent_status *es1;
 	struct rb_node *node;
 
@@ -371,15 +373,15 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
 	if (ext4_es_can_be_merged(es, es1)) {
 		es->es_len += es1->es_len;
 		rb_erase(node, &tree->root);
-		ext4_es_free_extent(es1);
+		ext4_es_free_extent(inode, es1);
 	}
 
 	return es;
 }
 
-static int __es_insert_extent(struct ext4_es_tree *tree,
-			      struct extent_status *newes)
+static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
 {
+	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
 	struct rb_node **p = &tree->root.rb_node;
 	struct rb_node *parent = NULL;
 	struct extent_status *es;
@@ -396,14 +398,14 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
 				    ext4_es_is_unwritten(es))
 					ext4_es_store_pblock(es,
 							     newes->es_pblk);
-				es = ext4_es_try_to_merge_left(tree, es);
+				es = ext4_es_try_to_merge_left(inode, es);
 				goto out;
 			}
 			p = &(*p)->rb_left;
 		} else if (newes->es_lblk > ext4_es_end(es)) {
 			if (ext4_es_can_be_merged(es, newes)) {
 				es->es_len += newes->es_len;
-				es = ext4_es_try_to_merge_right(tree, es);
+				es = ext4_es_try_to_merge_right(inode, es);
 				goto out;
 			}
 			p = &(*p)->rb_right;
@@ -413,7 +415,7 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
 		}
 	}
 
-	es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len,
+	es = ext4_es_alloc_extent(inode, newes->es_lblk, newes->es_len,
 				  newes->es_pblk);
 	if (!es)
 		return -ENOMEM;
@@ -437,7 +439,6 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 			  ext4_lblk_t len, ext4_fsblk_t pblk,
 			  unsigned long long status)
 {
-	struct ext4_es_tree *tree;
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
 	int err = 0;
@@ -454,11 +455,10 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	trace_ext4_es_insert_extent(inode, &newes);
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
-	tree = &EXT4_I(inode)->i_es_tree;
-	err = __es_remove_extent(tree, lblk, end);
+	err = __es_remove_extent(inode, lblk, end);
 	if (err != 0)
 		goto error;
-	err = __es_insert_extent(tree, &newes);
+	err = __es_insert_extent(inode, &newes);
 
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
@@ -527,9 +527,10 @@ out:
 	return found;
 }
 
-static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
-				 ext4_lblk_t end)
+static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
+			      ext4_lblk_t end)
 {
+	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
 	struct rb_node *node;
 	struct extent_status *es;
 	struct extent_status orig_es;
@@ -567,7 +568,7 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 				ext4_es_store_pblock(&newes, block);
 			}
 			ext4_es_store_status(&newes, ext4_es_status(&orig_es));
-			err = __es_insert_extent(tree, &newes);
+			err = __es_insert_extent(inode, &newes);
 			if (err) {
 				es->es_lblk = orig_es.es_lblk;
 				es->es_len = orig_es.es_len;
@@ -596,7 +597,7 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 	while (es && ext4_es_end(es) <= end) {
 		node = rb_next(&es->rb_node);
 		rb_erase(&es->rb_node, &tree->root);
-		ext4_es_free_extent(es);
+		ext4_es_free_extent(inode, es);
 		if (!node) {
 			es = NULL;
 			break;
@@ -628,7 +629,6 @@ out:
 int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			  ext4_lblk_t len)
 {
-	struct ext4_es_tree *tree;
 	ext4_lblk_t end;
 	int err = 0;
 
@@ -639,10 +639,8 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	end = lblk + len - 1;
 	BUG_ON(end < lblk);
 
-	tree = &EXT4_I(inode)->i_es_tree;


* [PATCH 08/10 v5] ext4: reclaim extents from extent status tree
  2013-02-08  8:43 [PATCH 00/10 v5] ext4: extent status tree (step2) Zheng Liu
                   ` (6 preceding siblings ...)
  2013-02-08  8:44 ` [PATCH 07/10 v5] ext4: adjust some functions for reclaiming extents from extent status tree Zheng Liu
@ 2013-02-08  8:44 ` Zheng Liu
  2013-02-08  8:44 ` [PATCH 09/10 v5] ext4: convert unwritten extents from extent status tree in end_io Zheng Liu
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-08  8:44 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Jan kara

From: Zheng Liu <wenqing.lz@taobao.com>

Although extent status is loaded on-demand, we also need to reclaim
extent from the tree when we are under a heavy memory pressure because
in some cases fragmented extent tree causes status tree costs too much
memory.

Here we maintain a lru list in super_block.  When the extent status of
an inode is accessed and changed, this inode will be move to the tail
of the list.  The inode will be dropped from this list when it is
cleared.  In the inode, a counter is added to count the number of
cached objects in extent status tree.  Here only written/unwritten
extent is counted because delayed extent doesn't be reclaimed due to
fiemap, bigalloc and seek_data/hole need it.  The counter will be
increased as a new extent is allocated, and it will be decreased as a
extent is freed.

In this commit we use the normal shrinker framework to reclaim memory from
the status tree.  ext4_es_reclaim_extents_count() traverses the LRU list
to count the number of reclaimable extents.  ext4_es_shrink() tries to
reclaim written/unwritten extents from the extent status tree.  An inode
that has been scanned is moved to the tail of the LRU list.
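
The calling convention that ext4_es_shrink() follows can be sketched in
userspace as below.  This is only a model under stated assumptions: the
array toy_reclaimable and the toy_* functions are hypothetical, a plain
array stands in for the LRU list, and no locking is shown.  A call with
nr_to_scan == 0 is a query; any other value asks for up to that many
objects to be freed, and the return value is the number still cached.

```c
#include <assert.h>

#define TOY_NR_INODES 3

/* Hypothetical per-inode counters of reclaimable (non-delayed) extents. */
static int toy_reclaimable[TOY_NR_INODES] = { 4, 0, 5 };

/* Counterpart of ext4_es_reclaim_extents_count(): sum over all inodes. */
static int toy_count(void)
{
	int i, total = 0;

	for (i = 0; i < TOY_NR_INODES; i++)
		total += toy_reclaimable[i];
	return total;
}

/*
 * Counterpart of ext4_es_shrink(): nr_to_scan == 0 only reports the
 * count, otherwise up to nr_to_scan objects are freed, walking the
 * inodes in order and stopping once the budget is used up.
 */
static int toy_shrink(int nr_to_scan)
{
	int i;

	if (nr_to_scan == 0)
		return toy_count();

	for (i = 0; i < TOY_NR_INODES && nr_to_scan > 0; i++) {
		int take = toy_reclaimable[i] < nr_to_scan ?
			   toy_reclaimable[i] : nr_to_scan;

		toy_reclaimable[i] -= take;
		nr_to_scan -= take;
	}
	return toy_count();
}
```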

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
---
 fs/ext4/ext4.h              |   7 ++
 fs/ext4/extents_status.c    | 156 ++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/extents_status.h    |   5 ++
 fs/ext4/super.c             |   7 ++
 include/trace/events/ext4.h |  60 +++++++++++++++++
 5 files changed, 235 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 12b1fc7..dc46fe2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -891,6 +891,8 @@ struct ext4_inode_info {
 	/* extents status tree */
 	struct ext4_es_tree i_es_tree;
 	rwlock_t i_es_lock;
+	struct list_head i_es_lru;
+	unsigned int i_es_lru_nr;	/* protected by i_es_lock */
 
 	/* ialloc */
 	ext4_group_t	i_last_alloc_group;
@@ -1306,6 +1308,11 @@ struct ext4_sb_info {
 
 	/* Precomputed FS UUID checksum for seeding other checksums */
 	__u32 s_csum_seed;
+
+	/* Reclaim extents from extent status tree */
+	struct shrinker s_es_shrinker;
+	struct list_head s_es_lru;
+	spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 84fe78d..bac5286 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -145,6 +145,9 @@ static struct kmem_cache *ext4_es_cachep;
 static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			      ext4_lblk_t end);
+static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
+				       int nr_to_scan);
+static int ext4_es_reclaim_extents_count(struct super_block *sb);
 
 int __init ext4_init_es(void)
 {
@@ -290,6 +293,7 @@ out:
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
 
+	ext4_es_lru_add(inode);
 	trace_ext4_es_find_delayed_extent_exit(inode, es, ret);
 	return ret;
 }
@@ -305,11 +309,24 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
 	es->es_lblk = lblk;
 	es->es_len = len;
 	es->es_pblk = pblk;
+
+	/*
+	 * We don't count delayed extent because we never try to reclaim them
+	 */
+	if (!ext4_es_is_delayed(es))
+		EXT4_I(inode)->i_es_lru_nr++;
+
 	return es;
 }
 
 static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
 {
+	/* Decrease the lru counter when this es is not delayed */
+	if (!ext4_es_is_delayed(es)) {
+		BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
+		EXT4_I(inode)->i_es_lru_nr--;
+	}
+
 	kmem_cache_free(ext4_es_cachep, es);
 }
 
@@ -463,6 +480,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 
+	ext4_es_lru_add(inode);
 	ext4_es_print_tree(inode);
 
 	return err;
@@ -523,6 +541,7 @@ out:
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
 
+	ext4_es_lru_add(inode);
 	trace_ext4_es_lookup_extent_exit(inode, es, found);
 	return found;
 }
@@ -645,3 +664,140 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	ext4_es_print_tree(inode);
 	return err;
 }
+
+static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
+{
+	struct ext4_sb_info *sbi = container_of(shrink,
+					struct ext4_sb_info, s_es_shrinker);
+	struct ext4_inode_info *ei;
+	struct list_head *cur, *tmp, scanned;
+	int nr_to_scan = sc->nr_to_scan;
+	int ret, nr_shrunk = 0;
+
+	trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan);
+
+	if (!nr_to_scan)
+		return ext4_es_reclaim_extents_count(sbi->s_sb);
+
+	INIT_LIST_HEAD(&scanned);
+
+	spin_lock(&sbi->s_es_lru_lock);
+	list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
+		list_move_tail(cur, &scanned);
+
+		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
+
+		read_lock(&ei->i_es_lock);
+		if (ei->i_es_lru_nr == 0) {
+			read_unlock(&ei->i_es_lock);
+			continue;
+		}
+		read_unlock(&ei->i_es_lock);
+
+		write_lock(&ei->i_es_lock);
+		ret = __es_try_to_reclaim_extents(ei, nr_to_scan);
+		write_unlock(&ei->i_es_lock);
+
+		nr_shrunk += ret;
+		nr_to_scan -= ret;
+		if (nr_to_scan == 0)
+			break;
+	}
+	list_splice_tail(&scanned, &sbi->s_es_lru);
+	spin_unlock(&sbi->s_es_lru_lock);
+	trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk);
+
+	return ext4_es_reclaim_extents_count(sbi->s_sb);
+}
+
+void ext4_es_register_shrinker(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi;
+
+	sbi = EXT4_SB(sb);
+	INIT_LIST_HEAD(&sbi->s_es_lru);
+	spin_lock_init(&sbi->s_es_lru_lock);
+	sbi->s_es_shrinker.shrink = ext4_es_shrink;
+	sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
+	register_shrinker(&sbi->s_es_shrinker);
+}
+
+void ext4_es_unregister_shrinker(struct super_block *sb)
+{
+	unregister_shrinker(&EXT4_SB(sb)->s_es_shrinker);
+}
+
+void ext4_es_lru_add(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+	spin_lock(&sbi->s_es_lru_lock);
+	if (list_empty(&ei->i_es_lru))
+		list_add_tail(&ei->i_es_lru, &sbi->s_es_lru);
+	else
+		list_move_tail(&ei->i_es_lru, &sbi->s_es_lru);
+	spin_unlock(&sbi->s_es_lru_lock);
+}
+
+void ext4_es_lru_del(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+	spin_lock(&sbi->s_es_lru_lock);
+	if (!list_empty(&ei->i_es_lru))
+		list_del_init(&ei->i_es_lru);
+	spin_unlock(&sbi->s_es_lru_lock);
+}
+
+static int ext4_es_reclaim_extents_count(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *ei;
+	struct list_head *cur;
+	int nr_cached = 0;
+
+	spin_lock(&sbi->s_es_lru_lock);
+	list_for_each(cur, &sbi->s_es_lru) {
+		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
+		read_lock(&ei->i_es_lock);
+		nr_cached += ei->i_es_lru_nr;
+		read_unlock(&ei->i_es_lock);
+	}
+	spin_unlock(&sbi->s_es_lru_lock);
+	trace_ext4_es_reclaim_extents_count(sb, nr_cached);
+	return nr_cached;
+}
+
+static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
+				       int nr_to_scan)
+{
+	struct inode *inode = &ei->vfs_inode;
+	struct ext4_es_tree *tree = &ei->i_es_tree;
+	struct rb_node *node;
+	struct extent_status *es;
+	int nr_shrunk = 0;
+
+	if (ei->i_es_lru_nr == 0)
+		return 0;
+
+	node = rb_first(&tree->root);
+	while (node != NULL) {
+		es = rb_entry(node, struct extent_status, rb_node);
+		node = rb_next(&es->rb_node);
+		/*
+		 * We can't reclaim delayed extent from status tree because
+		 * fiemap, bigalloc, and seek_data/hole need to use it.
+		 */
+		if (!ext4_es_is_delayed(es)) {
+			rb_erase(&es->rb_node, &tree->root);
+			ext4_es_free_extent(inode, es);
+			nr_shrunk++;
+			if (--nr_to_scan == 0)
+				break;
+		}
+	}
+	tree->cache_es = NULL;
+	return nr_shrunk;
+}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index effe78c..938ad2b 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -105,4 +105,9 @@ static inline void ext4_es_store_status(struct extent_status *es,
 	es->es_pblk = block;
 }
 
+extern void ext4_es_register_shrinker(struct super_block *sb);
+extern void ext4_es_unregister_shrinker(struct super_block *sb);
+extern void ext4_es_lru_add(struct inode *inode);
+extern void ext4_es_lru_del(struct inode *inode);
+
 #endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a35c6c1..64d78b1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -858,6 +858,7 @@ static void ext4_put_super(struct super_block *sb)
 			ext4_abort(sb, "Couldn't clean up the journal");
 	}
 
+	ext4_es_unregister_shrinker(sb);
 	del_timer(&sbi->s_err_report);
 	ext4_release_system_zone(sb);
 	ext4_mb_release(sb);
@@ -943,6 +944,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	spin_lock_init(&ei->i_prealloc_lock);
 	ext4_es_init_tree(&ei->i_es_tree);
 	rwlock_init(&ei->i_es_lock);
+	INIT_LIST_HEAD(&ei->i_es_lru);
+	ei->i_es_lru_nr = 0;
 	ei->i_reserved_data_blocks = 0;
 	ei->i_reserved_meta_blocks = 0;
 	ei->i_allocated_meta_blocks = 0;
@@ -1030,6 +1033,7 @@ void ext4_clear_inode(struct inode *inode)
 	dquot_drop(inode);
 	ext4_discard_preallocations(inode);
 	ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS);
+	ext4_es_lru_del(inode);
 	if (EXT4_I(inode)->jinode) {
 		jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode),
 					       EXT4_I(inode)->jinode);
@@ -3771,6 +3775,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	sbi->s_max_writeback_mb_bump = 128;
 	sbi->s_extent_max_zeroout_kb = 32;
 
+	/* Register extent status tree shrinker */
+	ext4_es_register_shrinker(sb);
+
 	/*
 	 * set up enough so that it can read an inode
 	 */
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 822780a..f0734b3 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2233,6 +2233,66 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
 		  __entry->found ? __entry->status : 0)
 );
 
+TRACE_EVENT(ext4_es_reclaim_extents_count,
+	TP_PROTO(struct super_block *sb, int nr_cached),
+
+	TP_ARGS(sb, nr_cached),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,	dev			)
+		__field(	int,	nr_cached		)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= sb->s_dev;
+		__entry->nr_cached	= nr_cached;
+	),
+
+	TP_printk("dev %d,%d cached objects nr %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->nr_cached)
+);
+
+TRACE_EVENT(ext4_es_shrink_enter,
+	TP_PROTO(struct super_block *sb, int nr_to_scan),
+
+	TP_ARGS(sb, nr_to_scan),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,	dev			)
+		__field(	int,	nr_to_scan		)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= sb->s_dev;
+		__entry->nr_to_scan	= nr_to_scan;
+	),
+
+	TP_printk("dev %d,%d nr to scan %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->nr_to_scan)
+);
+
+TRACE_EVENT(ext4_es_shrink_exit,
+	TP_PROTO(struct super_block *sb, int shrunk_nr),
+
+	TP_ARGS(sb, shrunk_nr),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,	dev			)
+		__field(	int,	shrunk_nr		)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= sb->s_dev;
+		__entry->shrunk_nr	= shrunk_nr;
+	),
+
+	TP_printk("dev %d,%d shrunk_nr %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->shrunk_nr)
+);
+
 #endif /* _TRACE_EXT4_H */
 
 /* This part must be outside protection */
-- 
1.7.12.rc2.18.g61b472e



* [PATCH 09/10 v5] ext4: convert unwritten extents from extent status tree in end_io
  2013-02-08  8:43 [PATCH 00/10 v5] ext4: extent status tree (step2) Zheng Liu
                   ` (7 preceding siblings ...)
  2013-02-08  8:44 ` [PATCH 08/10 v5] ext4: reclaim " Zheng Liu
@ 2013-02-08  8:44 ` Zheng Liu
  2013-02-10  8:45   ` Zheng Liu
  2013-02-12 12:51   ` Jan Kara
  2013-02-08  8:44 ` [PATCH 10/10 v5] ext4: remove bogus wait for unwritten extents in ext4_ind_direct_IO Zheng Liu
  2013-02-10  1:38 ` [PATCH 00/10 v5] ext4: extent status tree (step2) Theodore Ts'o
  10 siblings, 2 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-08  8:44 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Jan kara

From: Zheng Liu <wenqing.lz@taobao.com>

This commit converts unwritten extents to written in the extent status
tree from the end_io callback functions and from ext4_ext_direct_IO.
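
When the converted range covers only part of a cached unwritten extent,
the extent has to be split into up to three pieces.  The arithmetic can be
sketched in userspace as follows.  struct toy_extent and toy_split() are
hypothetical names used only for illustration, the status field is left
out, and the range is assumed to overlap the extent.

```c
#include <assert.h>
#include <stdint.h>

struct toy_extent {
	uint32_t lblk;	/* first logical block */
	uint32_t len;	/* length in blocks */
	uint64_t pblk;	/* first physical block */
};

static uint32_t toy_end(const struct toy_extent *e)
{
	return e->lblk + e->len - 1;
}

/*
 * Split *orig around the converted range [lblk, end].  head and tail
 * keep the old (unwritten) status; conv is the part that becomes
 * written.  A piece with len == 0 is absent.
 */
static void toy_split(const struct toy_extent *orig,
		      uint32_t lblk, uint32_t end,
		      struct toy_extent *head,
		      struct toy_extent *conv,
		      struct toy_extent *tail)
{
	/* Blocks of orig that lie before and after the converted range. */
	uint32_t len1 = lblk > orig->lblk ? lblk - orig->lblk : 0;
	uint32_t len2 = toy_end(orig) > end ? toy_end(orig) - end : 0;

	head->lblk = orig->lblk;
	head->len = len1;
	head->pblk = orig->pblk;

	conv->lblk = orig->lblk + len1;
	conv->len = orig->len - len1 - len2;
	conv->pblk = orig->pblk + len1;

	tail->lblk = end + 1;
	tail->len = len2;
	tail->pblk = orig->pblk + orig->len - len2;
}
```

For an extent covering blocks 10..29 and a converted range 15..19, this
gives a head of 5 blocks, a converted piece of 5 blocks and a tail of 10
blocks, with the physical block numbers shifted by the same offsets.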

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
---
 fs/ext4/extents.c           |   6 +-
 fs/ext4/extents_status.c    | 180 ++++++++++++++++++++++++++++++++++++++++----
 fs/ext4/extents_status.h    |   2 +
 fs/ext4/inode.c             |   5 ++
 fs/ext4/page-io.c           |   8 +-
 include/trace/events/ext4.h |  25 ++++++
 6 files changed, 208 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 9f21430..a03cabf 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4443,8 +4443,10 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
 			ret = PTR_ERR(handle);
 			break;
 		}
-		ret = ext4_map_blocks(handle, inode, &map,
-				      EXT4_GET_BLOCKS_IO_CONVERT_EXT);
+		down_write(&EXT4_I(inode)->i_data_sem);
+		ret = ext4_ext_map_blocks(handle, inode, &map,
+					  EXT4_GET_BLOCKS_IO_CONVERT_EXT);
+		up_write(&EXT4_I(inode)->i_data_sem);
 		if (ret <= 0) {
 			WARN_ON(ret <= 0);
 			ext4_msg(inode->i_sb, KERN_ERR,
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index bac5286..eab8893 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -248,10 +248,11 @@ ext4_lblk_t ext4_es_find_delayed_extent(struct inode *inode,
 	struct extent_status *es1 = NULL;
 	struct rb_node *node;
 	ext4_lblk_t ret = EXT_MAX_BLOCKS;
+	unsigned long flags;
 
 	trace_ext4_es_find_delayed_extent_enter(inode, es->es_lblk);
 
-	read_lock(&EXT4_I(inode)->i_es_lock);
+	read_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
 	tree = &EXT4_I(inode)->i_es_tree;
 
 	/* find extent in cache firstly */
@@ -291,7 +292,7 @@ out:
 		}
 	}
 
-	read_unlock(&EXT4_I(inode)->i_es_lock);
+	read_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
 
 	ext4_es_lru_add(inode);
 	trace_ext4_es_find_delayed_extent_exit(inode, es, ret);
@@ -458,6 +459,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 {
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
+	unsigned long flags;
 	int err = 0;
 
 	es_debug("add [%u/%u) %llu %llx to extent status tree of inode %lu\n",
@@ -471,14 +473,14 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	ext4_es_store_status(&newes, status);
 	trace_ext4_es_insert_extent(inode, &newes);
 
-	write_lock(&EXT4_I(inode)->i_es_lock);
+	write_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
 	err = __es_remove_extent(inode, lblk, end);
 	if (err != 0)
 		goto error;
 	err = __es_insert_extent(inode, &newes);
 
 error:
-	write_unlock(&EXT4_I(inode)->i_es_lock);
+	write_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
 
 	ext4_es_lru_add(inode);
 	ext4_es_print_tree(inode);
@@ -498,13 +500,14 @@ int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es)
 	struct ext4_es_tree *tree;
 	struct extent_status *es1 = NULL;
 	struct rb_node *node;
+	unsigned long flags;
 	int found = 0;
 
 	trace_ext4_es_lookup_extent_enter(inode, es->es_lblk);
 	es_debug("lookup extent in block %u\n", es->es_lblk);
 
 	tree = &EXT4_I(inode)->i_es_tree;
-	read_lock(&EXT4_I(inode)->i_es_lock);
+	read_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
 
 	/* find extent in cache firstly */
 	es->es_len = es->es_pblk = 0;
@@ -539,7 +542,7 @@ out:
 		es->es_pblk = es1->es_pblk;
 	}
 
-	read_unlock(&EXT4_I(inode)->i_es_lock);
+	read_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
 
 	ext4_es_lru_add(inode);
 	trace_ext4_es_lookup_extent_exit(inode, es, found);
@@ -649,6 +652,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			  ext4_lblk_t len)
 {
 	ext4_lblk_t end;
+	unsigned long flags;
 	int err = 0;
 
 	trace_ext4_es_remove_extent(inode, lblk, len);
@@ -658,9 +662,9 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	end = lblk + len - 1;
 	BUG_ON(end < lblk);
 
-	write_lock(&EXT4_I(inode)->i_es_lock);
+	write_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
 	err = __es_remove_extent(inode, lblk, end);
-	write_unlock(&EXT4_I(inode)->i_es_lock);
+	write_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
 	ext4_es_print_tree(inode);
 	return err;
 }
@@ -671,6 +675,7 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 					struct ext4_sb_info, s_es_shrinker);
 	struct ext4_inode_info *ei;
 	struct list_head *cur, *tmp, scanned;
+	unsigned long flags;
 	int nr_to_scan = sc->nr_to_scan;
 	int ret, nr_shrunk = 0;
 
@@ -687,16 +692,16 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 
 		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
 
-		read_lock(&ei->i_es_lock);
+		read_lock_irqsave(&ei->i_es_lock, flags);
 		if (ei->i_es_lru_nr == 0) {
-			read_unlock(&ei->i_es_lock);
+			read_unlock_irqrestore(&ei->i_es_lock, flags);
 			continue;
 		}
-		read_unlock(&ei->i_es_lock);
+		read_unlock_irqrestore(&ei->i_es_lock, flags);
 
-		write_lock(&ei->i_es_lock);
+		write_lock_irqsave(&ei->i_es_lock, flags);
 		ret = __es_try_to_reclaim_extents(ei, nr_to_scan);
-		write_unlock(&ei->i_es_lock);
+		write_unlock_irqrestore(&ei->i_es_lock, flags);
 
 		nr_shrunk += ret;
 		nr_to_scan -= ret;
@@ -756,14 +761,15 @@ static int ext4_es_reclaim_extents_count(struct super_block *sb)
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct ext4_inode_info *ei;
 	struct list_head *cur;
+	unsigned long flags;
 	int nr_cached = 0;
 
 	spin_lock(&sbi->s_es_lru_lock);
 	list_for_each(cur, &sbi->s_es_lru) {
 		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
-		read_lock(&ei->i_es_lock);
+		read_lock_irqsave(&ei->i_es_lock, flags);
 		nr_cached += ei->i_es_lru_nr;
-		read_unlock(&ei->i_es_lock);
+		read_unlock_irqrestore(&ei->i_es_lock, flags);
 	}
 	spin_unlock(&sbi->s_es_lru_lock);
 	trace_ext4_es_reclaim_extents_count(sb, nr_cached);
@@ -801,3 +807,147 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 	tree->cache_es = NULL;
 	return nr_shrunk;
 }
+
+int ext4_es_convert_unwritten_extents(struct inode *inode, loff_t offset,
+				      size_t size)
+{
+	struct ext4_es_tree *tree;
+	struct rb_node *node;
+	struct extent_status *es, orig_es, conv_es;
+	ext4_lblk_t end, len1, len2;
+	ext4_lblk_t lblk = 0, len = 0;
+	ext4_fsblk_t block;
+	unsigned long flags;
+	unsigned int blkbits;
+	int err = 0;
+
+	trace_ext4_es_convert_unwritten_extents(inode, offset, size);
+	blkbits = inode->i_blkbits;
+	lblk = offset >> blkbits;
+	len = (EXT4_BLOCK_ALIGN(offset + size, blkbits) >> blkbits) - lblk;
+
+	end = lblk + len - 1;
+	BUG_ON(end < lblk);
+
+	tree = &EXT4_I(inode)->i_es_tree;
+
+	write_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
+	es = __es_tree_search(&tree->root, lblk);
+	if (!es)
+		goto out;
+	if (es->es_lblk > end)
+		goto out;
+
+	tree->cache_es = NULL;
+
+	orig_es.es_lblk = es->es_lblk;
+	orig_es.es_len = es->es_len;
+	orig_es.es_pblk = es->es_pblk;
+
+	len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
+	len2 = ext4_es_end(es) > end ?
+	       ext4_es_end(es) - end : 0;
+	if (len1 > 0)
+		es->es_len = len1;
+	if (len2 > 0) {
+		if (len1 > 0) {
+			struct extent_status newes;
+
+			newes.es_lblk = end + 1;
+			newes.es_len = len2;
+			block = ext4_es_pblock(&orig_es) +
+				orig_es.es_len - len2;
+			ext4_es_store_pblock(&newes, block);
+			ext4_es_store_status(&newes, ext4_es_status(&orig_es));
+			err = __es_insert_extent(inode, &newes);
+			if (err) {
+				es->es_lblk = orig_es.es_lblk;
+				es->es_len = orig_es.es_len;
+				es->es_pblk = orig_es.es_pblk;
+				goto out;
+			}
+
+			conv_es.es_lblk = orig_es.es_lblk + len1;
+			conv_es.es_len = orig_es.es_len - len1 - len2;
+			block = ext4_es_pblock(&orig_es) + len1;
+			ext4_es_store_pblock(&conv_es, block);
+			ext4_es_store_status(&conv_es, EXTENT_STATUS_WRITTEN);
+			err = __es_insert_extent(inode, &conv_es);
+			if (err) {
+				int err2 = __es_remove_extent(inode,
+							conv_es.es_lblk,
+							ext4_es_end(&newes));
+				if (err2)
+					goto out;
+				es->es_lblk = orig_es.es_lblk;
+				es->es_len = orig_es.es_len;
+				es->es_pblk = orig_es.es_pblk;
+				goto out;
+			}
+		} else {
+			es->es_lblk = end + 1;
+			es->es_len = len2;
+			block = ext4_es_pblock(&orig_es) +
+				orig_es.es_len - len2;
+			ext4_es_store_pblock(es, block);
+
+			conv_es.es_lblk = orig_es.es_lblk;
+			conv_es.es_len = orig_es.es_len - len2;
+			ext4_es_store_pblock(&conv_es,
+					     ext4_es_pblock(&orig_es));
+			ext4_es_store_status(&conv_es, EXTENT_STATUS_WRITTEN);
+			err = __es_insert_extent(inode, &conv_es);
+			if (err) {
+				es->es_lblk = orig_es.es_lblk;
+				es->es_len = orig_es.es_len;
+				es->es_pblk = orig_es.es_pblk;
+			}
+		}
+		goto out;
+	}
+
+	if (len1 > 0) {
+		node = rb_next(&es->rb_node);
+		if (node)
+			es = rb_entry(node, struct extent_status, rb_node);
+		else
+			es = NULL;
+	}
+
+	while (es && ext4_es_end(es) <= end) {
+		node = rb_next(&es->rb_node);
+		ext4_es_store_status(es, EXTENT_STATUS_WRITTEN);
+		if (!node) {
+			es = NULL;
+			break;
+		}
+		es = rb_entry(node, struct extent_status, rb_node);
+	}
+
+	if (es && es->es_lblk < end + 1) {
+		ext4_lblk_t orig_len = es->es_len;
+
+		/*
+		 * Fill in conv_es first so that we don't need to copy es into
+		 * a temporary variable before modifying it.
+		 */
+		len1 = ext4_es_end(es) - end;
+		conv_es.es_lblk = es->es_lblk;
+		conv_es.es_len = es->es_len - len1;
+		ext4_es_store_pblock(&conv_es, ext4_es_pblock(es));
+		ext4_es_store_status(&conv_es, EXTENT_STATUS_WRITTEN);
+
+		es->es_lblk = end + 1;
+		es->es_len = len1;
+		block = ext4_es_pblock(es) + orig_len - len1;
+		ext4_es_store_pblock(es, block);
+
+		err = __es_insert_extent(inode, &conv_es);
+		if (err)
+			goto out;
+	}
+
+out:
+	write_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
+	return err;
+}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 938ad2b..2849d74 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -54,6 +54,8 @@ extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 extern ext4_lblk_t ext4_es_find_delayed_extent(struct inode *inode,
 					       struct extent_status *es);
 extern int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es);
+extern int ext4_es_convert_unwritten_extents(struct inode *inode, loff_t offset,
+					     size_t size);
 
 static inline int ext4_es_is_written(struct extent_status *es)
 {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 670779a..08cf720 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3063,6 +3063,7 @@ out:
 		io_end->result = ret;
 	}
 
+	ext4_es_convert_unwritten_extents(inode, offset, size);
 	ext4_add_complete_io(io_end);
 }
 
@@ -3088,6 +3089,7 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
 	 */
 	inode = io_end->inode;
 	ext4_set_io_unwritten_flag(inode, io_end);
+	ext4_es_convert_unwritten_extents(inode, io_end->offset, io_end->size);
 	ext4_add_complete_io(io_end);
 out:
 	bh->b_private = NULL;
@@ -3246,6 +3248,9 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 	} else if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
 						EXT4_STATE_DIO_UNWRITTEN)) {
 		int err;
+		err = ext4_es_convert_unwritten_extents(inode, offset, ret);
+		if (err)
+			ret = err;
 		/*
 		 * for non AIO case, since the IO is already
 		 * completed, we could do the conversion right here
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 0016fbc..66ea30e 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -276,6 +276,13 @@ static void ext4_end_bio(struct bio *bio, int error)
 		error = 0;
 	bio_put(bio);
 
+	/*
+	 * We need to convert unwritten extents in extent status tree before
+	 * end_page_writeback() is called.  Otherwise, when dioread_nolock is
+	 * enabled, we will be likely to read stale data.
+	 */
+	inode = io_end->inode;
+	ext4_es_convert_unwritten_extents(inode, io_end->offset, io_end->size);
 	for (i = 0; i < io_end->num_io_pages; i++) {
 		struct page *page = io_end->pages[i]->p_page;
 		struct buffer_head *bh, *head;
@@ -305,7 +312,6 @@ static void ext4_end_bio(struct bio *bio, int error)
 		put_io_page(io_end->pages[i]);
 	}
 	io_end->num_io_pages = 0;
-	inode = io_end->inode;
 
 	if (error) {
 		io_end->flag |= EXT4_IO_END_ERROR;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index f0734b3..d32e3d5 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2233,6 +2233,31 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
 		  __entry->found ? __entry->status : 0)
 );
 
+TRACE_EVENT(ext4_es_convert_unwritten_extents,
+	TP_PROTO(struct inode *inode, loff_t offset, loff_t size),
+
+	TP_ARGS(inode, offset, size),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,	dev			)
+		__field(	ino_t,	ino			)
+		__field(	loff_t,	offset			)
+		__field(	loff_t, size			)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= inode->i_sb->s_dev;
+		__entry->ino	= inode->i_ino;
+		__entry->offset	= offset;
+		__entry->size	= size;
+	),
+
+	TP_printk("dev %d,%d ino %lu convert unwritten extents [%llu/%llu)",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long) __entry->ino,
+		  __entry->offset, __entry->size)
+);
+
 TRACE_EVENT(ext4_es_reclaim_extents_count,
 	TP_PROTO(struct super_block *sb, int nr_cached),
 
-- 
1.7.12.rc2.18.g61b472e



* [PATCH 10/10 v5] ext4: remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-02-08  8:43 [PATCH 00/10 v5] ext4: extent status tree (step2) Zheng Liu
                   ` (8 preceding siblings ...)
  2013-02-08  8:44 ` [PATCH 09/10 v5] ext4: convert unwritten extents from extent status tree in end_io Zheng Liu
@ 2013-02-08  8:44 ` Zheng Liu
  2013-02-12 12:58   ` Jan Kara
  2013-02-10  1:38 ` [PATCH 00/10 v5] ext4: extent status tree (step2) Theodore Ts'o
  10 siblings, 1 reply; 37+ messages in thread
From: Zheng Liu @ 2013-02-08  8:44 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Jan kara

From: Zheng Liu <wenqing.lz@taobao.com>

After converting unwritten extents in the extent status tree in end_io, we
can safely remove this bogus wait without worrying about reading stale
data, because we always look up a block mapping in the extent status tree
first, and the unwritten extents in the tree have already been converted
by that time.

Before that commit, we needed to flush unwritten I/O before a dio read
when dioread_nolock is enabled, because in ext4_end_io_buffer_write and
ext4_end_bio, end_page_writeback() is called before the unwritten extents
are converted on disk.  So there is a window in which a dio reader can
read stale data, as shown below, if we don't wait for unwritten extents:

   dio read                         buffered write
                                    ->ext4_file_write
                                      ->ext4_da_write_begin
                                      ->ext4_da_write_end
                                      [buffered write has finished, but
                                       the data and metadata has not
                                       been flushed]
   ->generic_file_aio_read
     ->filemap_write_and_wait_range
       ->do_writepages
         ->ext4_da_writepages
     ->filemap_fdatawait_range
       ->wait_on_page_writeback
                                    ->ext4_end_bio
                                      ->end_page_writeback
                                        [unwritten extent has not been
                                         converted]
     ->ext4_ind_direct_IO
       [here we need to flush unwritten io]

After that commit, we never need to wait for unwritten extents.

   dio read                         buffered write
                                    ->ext4_file_write
                                      ->ext4_da_write_begin
                                      ->ext4_da_write_end
                                      [buffered write has finished, but
                                       the data and metadata has not
                                       been flushed]
   ->generic_file_aio_read
     ->filemap_write_and_wait_range
       ->do_writepages
         ->ext4_da_writepages
     ->filemap_fdatawait_range
       ->wait_on_page_writeback
                                    ->ext4_end_bio
                                      ->ext4_es_convert_unwritten_extents
                                      ->end_page_writeback
                                        [unwritten extent has not been
                                         converted in disk, but they are
                                         converted in extent status tree]
     ->ext4_ind_direct_IO
       [here we will see the written
        extents in extent status tree]


Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 20862f9..993247c 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -807,11 +807,6 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 
 retry:
 	if (rw == READ && ext4_should_dioread_nolock(inode)) {
-		if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
-			mutex_lock(&inode->i_mutex);
-			ext4_flush_unwritten_io(inode);
-			mutex_unlock(&inode->i_mutex);
-		}
 		/*
 		 * Nolock dioread optimization may be dynamically disabled
 		 * via ext4_inode_block_unlocked_dio(). Check inode's state
-- 
1.7.12.rc2.18.g61b472e



* Re: [PATCH 01/10 v5] ext4: refine extent status tree
  2013-02-08  8:43 ` [PATCH 01/10 v5] ext4: refine extent status tree Zheng Liu
@ 2013-02-08 15:35   ` Jan Kara
  2013-02-15  6:38     ` Zheng Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2013-02-08 15:35 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o, Jan kara

On Fri 08-02-13 16:43:57, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> This commit refines the extent status tree code.
> 
> 1) A prefix 'es_' is added to the extent status tree structure
> members.
> 
> 2) Refactored es_remove_extent() so that __es_remove_extent() can be
> used by es_insert_extent() to remove the old extent entry(-ies) before
> inserting a new one.
> 
> 3) Rename extent_status_end() to ext4_es_end()
> 
> 4) ext4_es_can_be_merged() is defined to check whether two extents can
> be merged or not.
> 
> 5) Update and clarified comments.
  Just one minor comment below. Otherwise the patch looks good (although I
admit I didn't check all the renaming changes carefully). You can add:
  Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Jan kara <jack@suse.cz>
> ---
>  fs/ext4/extents.c           |  21 +--
>  fs/ext4/extents_status.c    | 318 ++++++++++++++++++++++++--------------------
>  fs/ext4/extents_status.h    |   8 +-
>  fs/ext4/file.c              |  12 +-
>  include/trace/events/ext4.h |  40 +++---
>  5 files changed, 217 insertions(+), 182 deletions(-)
> 
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
...
> @@ -320,60 +352,39 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
>  	return es;
>  }
>  
> -static int __es_insert_extent(struct ext4_es_tree *tree, ext4_lblk_t offset,
> -			      ext4_lblk_t len)
> +static int __es_insert_extent(struct ext4_es_tree *tree,
> +			      struct extent_status *newes)
>  {
>  	struct rb_node **p = &tree->root.rb_node;
>  	struct rb_node *parent = NULL;
>  	struct extent_status *es;
> -	ext4_lblk_t end = offset + len - 1;
> -
> -	BUG_ON(end < offset);
> -	es = tree->cache_es;
> -	if (es && offset == (extent_status_end(es) + 1)) {
> -		es_debug("cached by [%u/%u)\n", es->start, es->len);
> -		es->len += len;
> -		es = ext4_es_try_to_merge_right(tree, es);
> -		goto out;
> -	} else if (es && es->start == end + 1) {
> -		es_debug("cached by [%u/%u)\n", es->start, es->len);
> -		es->start = offset;
> -		es->len += len;
> -		es = ext4_es_try_to_merge_left(tree, es);
> -		goto out;
> -	} else if (es && es->start <= offset &&
> -		   end <= extent_status_end(es)) {
> -		es_debug("cached by [%u/%u)\n", es->start, es->len);
> -		goto out;
> -	}
>  
>  	while (*p) {
>  		parent = *p;
>  		es = rb_entry(parent, struct extent_status, rb_node);
>  
> -		if (offset < es->start) {
> -			if (es->start == end + 1) {
> -				es->start = offset;
> -				es->len += len;
> +		if (newes->es_lblk < es->es_lblk) {
> +			if (ext4_es_can_be_merged(newes, es)) {
> +				es->es_lblk = newes->es_lblk;
> +				es->es_len += newes->es_len;
  This is wrong, isn't it? You cannot change es->es_lblk because that can
break ordering of elements in the tree... thinking ... ah, it's OK because
you have non-overlapping intervals. But it deserves a comment I guess.
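
The ordering argument can be checked on a small model.  This is only a
sketch, not the ext4 code: a sorted array of disjoint ranges stands in for
the rb-tree, and struct ext, ext_sorted_disjoint() and ext_merge_front()
are names made up for the illustration.  Because the new range never
overlaps the left neighbour, moving an extent's start back over a directly
adjacent range cannot carry it past that neighbour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct extent_status: range [lblk, lblk + len). */
struct ext {
	uint32_t lblk;
	uint32_t len;
};

/* Return 1 if the array is sorted by lblk and no two extents overlap. */
static int ext_sorted_disjoint(const struct ext *v, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		/* the previous (exclusive) end must not pass the next start */
		if ((uint64_t)v[i - 1].lblk + v[i - 1].len > v[i].lblk)
			return 0;
	}
	return 1;
}

/*
 * Merge newes into v[i] when newes ends exactly where v[i] starts.
 * Only the key of v[i] changes; its position in the array does not.
 * Return 1 if merged, 0 if newes is not directly in front of v[i].
 */
static int ext_merge_front(struct ext *v, size_t i, const struct ext *newes)
{
	assert(newes->len > 0 && v[i].len > 0);
	if ((uint64_t)newes->lblk + newes->len != v[i].lblk)
		return 0;
	v[i].lblk = newes->lblk;
	v[i].len += newes->len;
	return 1;
}
```

With extents [0/4), [10/5), [20/2) and a new range [7/3), merging into the
middle entry gives [7/8) and the array is still sorted and disjoint.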

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [PATCH 02/10 v5] ext4: add physical block and status member into extent status tree
  2013-02-08  8:43 ` [PATCH 02/10 v5] ext4: add physical block and status member into " Zheng Liu
@ 2013-02-08 15:39   ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2013-02-08 15:39 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o, Jan kara

On Fri 08-02-13 16:43:58, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> This commit adds two members in extent_status structure to let it record
> physical block and extent status.  Here es_pblk is used to record both
> of them because physical block only has 48 bits.  So extent status could
> be stashed into it so that we can save some memory.  Now written,
> unwritten, delayed and hole are defined as status.
> 
> Because new members are added to the extent status tree, all interfaces
> need to be adjusted.
  The patch looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
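
The idea in the commit message, keeping the status in spare bits of the
physical block word, can be sketched as below.  This is a standalone
model, not the patch itself: struct es_word, the ES_* constants and the
es_* helpers are hypothetical names modelled on the ext4_es_* helpers in
the quoted patch.  The sketch keeps the status in the top four bits, above
a 48-bit block number; that bit position is an assumption made here and is
not the set of constants used in the quoted patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout for this sketch only: block number in the low 60 bits,
 * status in the top 4 bits.
 */
#define ES_SHIFT	60
#define ES_WRITTEN	(UINT64_C(1) << 60)
#define ES_UNWRITTEN	(UINT64_C(1) << 61)
#define ES_DELAYED	(UINT64_C(1) << 62)
#define ES_HOLE		(UINT64_C(1) << 63)
#define ES_MASK		(UINT64_C(0xF) << ES_SHIFT)

struct es_word {
	uint64_t pblk;	/* block number and status share this word */
};

/* Extract only the status bits. */
static uint64_t es_status(const struct es_word *es)
{
	return es->pblk & ES_MASK;
}

/* Extract only the block number. */
static uint64_t es_pblock(const struct es_word *es)
{
	return es->pblk & ~ES_MASK;
}

/* Replace the block number, keep the status bits. */
static void es_store_pblock(struct es_word *es, uint64_t pb)
{
	assert((pb & ES_MASK) == 0);	/* block must fit below the flags */
	es->pblk = (pb & ~ES_MASK) | (es->pblk & ES_MASK);
}

/* Replace the status bits, keep the block number. */
static void es_store_status(struct es_word *es, uint64_t status)
{
	es->pblk = (status & ES_MASK) | (es->pblk & ~ES_MASK);
}
```

Storing a block number and then a status, in either order, leaves both
values readable through es_pblock() and es_status().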
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Jan kara <jack@suse.cz>
> ---
>  fs/ext4/extents_status.c    | 67 +++++++++++++++++++++++++++++++++++++--------
>  fs/ext4/extents_status.h    | 64 ++++++++++++++++++++++++++++++++++++++++++-
>  fs/ext4/inode.c             |  3 +-
>  include/trace/events/ext4.h | 34 +++++++++++++++--------
>  4 files changed, 142 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index aa4d346..5093cee 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -179,7 +179,9 @@ static void ext4_es_print_tree(struct inode *inode)
>  	while (node) {
>  		struct extent_status *es;
>  		es = rb_entry(node, struct extent_status, rb_node);
> -		printk(KERN_DEBUG " [%u/%u)", es->es_lblk, es->es_len);
> +		printk(KERN_DEBUG " [%u/%u) %llu %llx",
> +		       es->es_lblk, es->es_len,
> +		       ext4_es_pblock(es), ext4_es_status(es));
>  		node = rb_next(node);
>  	}
>  	printk(KERN_DEBUG "\n");
> @@ -234,7 +236,7 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
>   * @es: delayed extent that we found
>   *
>   * Returns the first block of the next extent after es, otherwise
> - * EXT_MAX_BLOCKS if no delay extent is found.
> + * EXT_MAX_BLOCKS if no extent is found.
>   * Delayed extent is returned via @es.
>   */
>  ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
> @@ -249,17 +251,18 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
>  	read_lock(&EXT4_I(inode)->i_es_lock);
>  	tree = &EXT4_I(inode)->i_es_tree;
>  
> -	/* find delay extent in cache firstly */
> +	/* find extent in cache firstly */
> +	es->es_len = es->es_pblk = 0;
>  	if (tree->cache_es) {
>  		es1 = tree->cache_es;
>  		if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
> -			es_debug("%u cached by [%u/%u)\n",
> -				 es->es_lblk, es1->es_lblk, es1->es_len);
> +			es_debug("%u cached by [%u/%u) %llu %llx\n",
> +				 es->es_lblk, es1->es_lblk, es1->es_len,
> +				 ext4_es_pblock(es1), ext4_es_status(es1));
>  			goto out;
>  		}
>  	}
>  
> -	es->es_len = 0;
>  	es1 = __es_tree_search(&tree->root, es->es_lblk);
>  
>  out:
> @@ -267,6 +270,7 @@ out:
>  		tree->cache_es = es1;
>  		es->es_lblk = es1->es_lblk;
>  		es->es_len = es1->es_len;
> +		es->es_pblk = es1->es_pblk;
>  		node = rb_next(&es1->rb_node);
>  		if (node) {
>  			es1 = rb_entry(node, struct extent_status, rb_node);
> @@ -281,7 +285,7 @@ out:
>  }
>  
>  static struct extent_status *
> -ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
> +ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len, ext4_fsblk_t pblk)
>  {
>  	struct extent_status *es;
>  	es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
> @@ -289,6 +293,7 @@ ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
>  		return NULL;
>  	es->es_lblk = lblk;
>  	es->es_len = len;
> +	es->es_pblk = pblk;
>  	return es;
>  }
>  
> @@ -301,6 +306,8 @@ static void ext4_es_free_extent(struct extent_status *es)
>   * Check whether or not two extents can be merged
>   * Condition:
>   *  - logical block number is contiguous
> + *  - physical block number is contiguous
> + *  - status is equal
>   */
>  static int ext4_es_can_be_merged(struct extent_status *es1,
>  				 struct extent_status *es2)
> @@ -308,6 +315,13 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
>  	if (es1->es_lblk + es1->es_len != es2->es_lblk)
>  		return 0;
>  
> +	if (ext4_es_status(es1) != ext4_es_status(es2))
> +		return 0;
> +
> +	if ((ext4_es_is_written(es1) || ext4_es_is_unwritten(es1)) &&
> +	    (ext4_es_pblock(es1) + es1->es_len != ext4_es_pblock(es2)))
> +		return 0;
> +
>  	return 1;
>  }
>  
> @@ -367,6 +381,10 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
>  			if (ext4_es_can_be_merged(newes, es)) {
>  				es->es_lblk = newes->es_lblk;
>  				es->es_len += newes->es_len;
> +				if (ext4_es_is_written(es) ||
> +				    ext4_es_is_unwritten(es))
> +					ext4_es_store_pblock(es,
> +							     newes->es_pblk);
>  				es = ext4_es_try_to_merge_left(tree, es);
>  				goto out;
>  			}
> @@ -384,7 +402,8 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
>  		}
>  	}
>  
> -	es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len);
> +	es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len,
> +				  newes->es_pblk);
>  	if (!es)
>  		return -ENOMEM;
>  	rb_link_node(&es->rb_node, parent, p);
> @@ -404,21 +423,24 @@ out:
>   * Return 0 on success, error code on failure.
>   */
>  int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> -			  ext4_lblk_t len)
> +			  ext4_lblk_t len, ext4_fsblk_t pblk,
> +			  unsigned long long status)
>  {
>  	struct ext4_es_tree *tree;
>  	struct extent_status newes;
>  	ext4_lblk_t end = lblk + len - 1;
>  	int err = 0;
>  
> -	trace_ext4_es_insert_extent(inode, lblk, len);
> -	es_debug("add [%u/%u) to extent status tree of inode %lu\n",
> -		 lblk, len, inode->i_ino);
> +	es_debug("add [%u/%u) %llu %llx to extent status tree of inode %lu\n",
> +		 lblk, len, pblk, status, inode->i_ino);
>  
>  	BUG_ON(end < lblk);
>  
>  	newes.es_lblk = lblk;
>  	newes.es_len = len;
> +	ext4_es_store_pblock(&newes, pblk);
> +	ext4_es_store_status(&newes, status);
> +	trace_ext4_es_insert_extent(inode, &newes);
>  
>  	write_lock(&EXT4_I(inode)->i_es_lock);
>  	tree = &EXT4_I(inode)->i_es_tree;
> @@ -442,6 +464,7 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
>  	struct extent_status *es;
>  	struct extent_status orig_es;
>  	ext4_lblk_t len1, len2;
> +	ext4_fsblk_t block;
>  	int err = 0;
>  
>  	es = __es_tree_search(&tree->root, lblk);
> @@ -455,6 +478,8 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
>  
>  	orig_es.es_lblk = es->es_lblk;
>  	orig_es.es_len = es->es_len;
> +	orig_es.es_pblk = es->es_pblk;
> +
>  	len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
>  	len2 = ext4_es_end(es) > end ? ext4_es_end(es) - end : 0;
>  	if (len1 > 0)
> @@ -465,6 +490,13 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
>  
>  			newes.es_lblk = end + 1;
>  			newes.es_len = len2;
> +			if (ext4_es_is_written(&orig_es) ||
> +			    ext4_es_is_unwritten(&orig_es)) {
> +				block = ext4_es_pblock(&orig_es) +
> +					orig_es.es_len - len2;
> +				ext4_es_store_pblock(&newes, block);
> +			}
> +			ext4_es_store_status(&newes, ext4_es_status(&orig_es));
>  			err = __es_insert_extent(tree, &newes);
>  			if (err) {
>  				es->es_lblk = orig_es.es_lblk;
> @@ -474,6 +506,11 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
>  		} else {
>  			es->es_lblk = end + 1;
>  			es->es_len = len2;
> +			if (ext4_es_is_written(es) ||
> +			    ext4_es_is_unwritten(es)) {
> +				block = orig_es.es_pblk + orig_es.es_len - len2;
> +				ext4_es_store_pblock(es, block);
> +			}
>  		}
>  		goto out;
>  	}
> @@ -498,9 +535,15 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
>  	}
>  
>  	if (es && es->es_lblk < end + 1) {
> +		ext4_lblk_t orig_len = es->es_len;
> +
>  		len1 = ext4_es_end(es) - end;
>  		es->es_lblk = end + 1;
>  		es->es_len = len1;
> +		if (ext4_es_is_written(es) || ext4_es_is_unwritten(es)) {
> +			block = es->es_pblk + orig_len - len1;
> +			ext4_es_store_pblock(es, block);
> +		}
>  	}
>  
>  out:
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index 81e9339..2a5d69e 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -20,10 +20,21 @@
>  #define es_debug(fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
>  #endif
>  
> +#define EXTENT_STATUS_WRITTEN	0x10000000	/* written extent */
> +#define EXTENT_STATUS_UNWRITTEN	0x20000000	/* unwritten extent */
> +#define EXTENT_STATUS_DELAYED	0x40000000	/* delayed extent */
> +#define EXTENT_STATUS_HOLE	0x80000000	/* hole */
> +
> +#define EXTENT_STATUS_FLAGS	(EXTENT_STATUS_WRITTEN | \
> +				 EXTENT_STATUS_UNWRITTEN | \
> +				 EXTENT_STATUS_DELAYED | \
> +				 EXTENT_STATUS_HOLE)
> +
>  struct extent_status {
>  	struct rb_node rb_node;
>  	ext4_lblk_t es_lblk;	/* first logical block extent covers */
>  	ext4_lblk_t es_len;	/* length of extent in block */
> +	ext4_fsblk_t es_pblk;	/* first physical block */
>  };
>  
>  struct ext4_es_tree {
> @@ -36,10 +47,61 @@ extern void ext4_exit_es(void);
>  extern void ext4_es_init_tree(struct ext4_es_tree *tree);
>  
>  extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> -				 ext4_lblk_t len);
> +				 ext4_lblk_t len, ext4_fsblk_t pblk,
> +				 unsigned long long status);
>  extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  				 ext4_lblk_t len);
>  extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
>  				struct extent_status *es);
>  
> +static inline int ext4_es_is_written(struct extent_status *es)
> +{
> +	return (es->es_pblk & EXTENT_STATUS_WRITTEN);
> +}
> +
> +static inline int ext4_es_is_unwritten(struct extent_status *es)
> +{
> +	return (es->es_pblk & EXTENT_STATUS_UNWRITTEN);
> +}
> +
> +static inline int ext4_es_is_delayed(struct extent_status *es)
> +{
> +	return (es->es_pblk & EXTENT_STATUS_DELAYED);
> +}
> +
> +static inline int ext4_es_is_hole(struct extent_status *es)
> +{
> +	return (es->es_pblk & EXTENT_STATUS_HOLE);
> +}
> +
> +static inline ext4_fsblk_t ext4_es_status(struct extent_status *es)
> +{
> +	return (es->es_pblk & EXTENT_STATUS_FLAGS);
> +}
> +
> +static inline ext4_fsblk_t ext4_es_pblock(struct extent_status *es)
> +{
> +	return (es->es_pblk & ~EXTENT_STATUS_FLAGS);
> +}
> +
> +static inline void ext4_es_store_pblock(struct extent_status *es,
> +					ext4_fsblk_t pb)
> +{
> +	ext4_fsblk_t block;
> +
> +	block = (pb & ~EXTENT_STATUS_FLAGS) |
> +		(es->es_pblk & EXTENT_STATUS_FLAGS);
> +	es->es_pblk = block;
> +}
> +
> +static inline void ext4_es_store_status(struct extent_status *es,
> +					unsigned long long status)
> +{
> +	ext4_fsblk_t block;
> +
> +	block = (status & EXTENT_STATUS_FLAGS) |
> +		(es->es_pblk & ~EXTENT_STATUS_FLAGS);
> +	es->es_pblk = block;
> +}
> +
>  #endif /* _EXT4_EXTENTS_STATUS_H */
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index cbfe13b..7fb00d8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1821,7 +1821,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
>  				goto out_unlock;
>  		}
>  
> -		retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len);
> +		retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> +					       ~0, EXTENT_STATUS_DELAYED);
>  		if (retval)
>  			goto out_unlock;
>  
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index 952628a..ef2f96e 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -2068,28 +2068,33 @@ TRACE_EVENT(ext4_ext_remove_space_done,
>  );
>  
>  TRACE_EVENT(ext4_es_insert_extent,
> -	TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
> +	TP_PROTO(struct inode *inode, struct extent_status *es),
>  
> -	TP_ARGS(inode, lblk, len),
> +	TP_ARGS(inode, es),
>  
>  	TP_STRUCT__entry(
> -		__field(	dev_t,	dev			)
> -		__field(	ino_t,	ino			)
> -		__field(	loff_t,	lblk			)
> -		__field(	loff_t, len			)
> +		__field(	dev_t,		dev		)
> +		__field(	ino_t,		ino		)
> +		__field(	ext4_lblk_t,	lblk		)
> +		__field(	ext4_lblk_t,	len		)
> +		__field(	ext4_fsblk_t,	pblk		)
> +		__field(	unsigned long long, status	)
>  	),
>  
>  	TP_fast_assign(
>  		__entry->dev	= inode->i_sb->s_dev;
>  		__entry->ino	= inode->i_ino;
> -		__entry->lblk	= lblk;
> -		__entry->len	= len;
> +		__entry->lblk	= es->es_lblk;
> +		__entry->len	= es->es_len;
> +		__entry->pblk	= ext4_es_pblock(es);
> +		__entry->status	= ext4_es_status(es);
>  	),
>  
> -	TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
> +	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %llx",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  (unsigned long) __entry->ino,
> -		  __entry->lblk, __entry->len)
> +		  __entry->lblk, __entry->len,
> +		  __entry->pblk, __entry->status)
>  );
>  
>  TRACE_EVENT(ext4_es_remove_extent,
> @@ -2150,6 +2155,8 @@ TRACE_EVENT(ext4_es_find_extent_exit,
>  		__field(	ino_t,		ino		)
>  		__field(	ext4_lblk_t,	lblk		)
>  		__field(	ext4_lblk_t,	len		)
> +		__field(	ext4_fsblk_t,	pblk		)
> +		__field(	unsigned long long, status	)
>  		__field(	ext4_lblk_t,	ret		)
>  	),
>  
> @@ -2158,13 +2165,16 @@ TRACE_EVENT(ext4_es_find_extent_exit,
>  		__entry->ino	= inode->i_ino;
>  		__entry->lblk	= es->es_lblk;
>  		__entry->len	= es->es_len;
> +		__entry->pblk	= ext4_es_pblock(es);
> +		__entry->status	= ext4_es_status(es);
>  		__entry->ret	= ret;
>  	),
>  
> -	TP_printk("dev %d,%d ino %lu es [%u/%u) ret %u",
> +	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %llx ret %u",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  (unsigned long) __entry->ino,
> -		  __entry->lblk, __entry->len, __entry->ret)
> +		  __entry->lblk, __entry->len,
> +		  __entry->pblk, __entry->status, __entry->ret)
>  );
>  
>  #endif /* _TRACE_EXT4_H */
> -- 
> 1.7.12.rc2.18.g61b472e
> 
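
The three merge conditions listed in the comment above
ext4_es_can_be_merged() in the quoted patch can be sketched on a
standalone model.  struct es_model, enum es_kind and es_can_merge() are
hypothetical names for this illustration, and the status is kept in a
separate field here rather than packed into the block number.

```c
#include <assert.h>
#include <stdint.h>

enum es_kind { ES_K_WRITTEN, ES_K_UNWRITTEN, ES_K_DELAYED, ES_K_HOLE };

struct es_model {
	uint32_t lblk;		/* first logical block */
	uint32_t len;		/* length in blocks */
	uint64_t pblk;		/* meaningful only for written/unwritten */
	enum es_kind kind;
};

/* Only written and unwritten extents carry a real physical block. */
static int es_has_pblk(const struct es_model *es)
{
	return es->kind == ES_K_WRITTEN || es->kind == ES_K_UNWRITTEN;
}

/* Return 1 if b can be appended to a to form one extent, else 0. */
static int es_can_merge(const struct es_model *a, const struct es_model *b)
{
	assert(a->len > 0 && b->len > 0);
	/* logical blocks must be contiguous */
	if ((uint64_t)a->lblk + a->len != b->lblk)
		return 0;
	/* status must be equal */
	if (a->kind != b->kind)
		return 0;
	/* physical blocks must be contiguous when they are meaningful */
	if (es_has_pblk(a) && a->pblk + a->len != b->pblk)
		return 0;
	return 1;
}
```

Two delayed extents merge on logical contiguity alone, while two written
extents also need their physical blocks to line up.
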
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [PATCH 03/10 v5] ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
  2013-02-08  8:43 ` [PATCH 03/10 v5] ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag Zheng Liu
@ 2013-02-08 15:41   ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2013-02-08 15:41 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o, Jan kara

On Fri 08-02-13 16:43:59, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> This commit lets ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
> because in later commit ext4_map_blocks needs to use this flag to
> determine the extent status.
  The patch looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
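
How a caller might derive the extent status from the returned flags can be
sketched as follows.  This is an assumption about the later patch, not
code taken from it: MAP_MAPPED and MAP_UNWRITTEN are stand-ins with
made-up values for the real EXT4_MAP_* flags, and enum status and
status_from_flags() are hypothetical names.

```c
#include <assert.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define MAP_MAPPED	0x1u
#define MAP_UNWRITTEN	0x2u

enum status { STATUS_HOLE, STATUS_WRITTEN, STATUS_UNWRITTEN };

/* Pick the status to record for a mapping from its returned flags. */
static enum status status_from_flags(unsigned int m_flags)
{
	/* the sketch knows only these two flags */
	assert((m_flags & ~(MAP_MAPPED | MAP_UNWRITTEN)) == 0);

	if (m_flags & MAP_UNWRITTEN)
		return STATUS_UNWRITTEN;	/* needs the flag from the lower layer */
	if (m_flags & MAP_MAPPED)
		return STATUS_WRITTEN;
	return STATUS_HOLE;			/* nothing mapped */
}
```

Without the unwritten flag being returned, the first branch could never be
taken and an unwritten extent would be recorded as written.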

> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Jan kara <jack@suse.cz>
> ---
>  fs/ext4/extents.c |  6 +++++-
>  fs/ext4/inode.c   | 12 +++---------
>  2 files changed, 8 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index f7bf616..d92947f 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3657,6 +3657,7 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
>  			ext4_set_io_unwritten_flag(inode, io);
>  		else
>  			ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
> +		map->m_flags |= EXT4_MAP_UNWRITTEN;
>  		if (ext4_should_dioread_nolock(inode))
>  			map->m_flags |= EXT4_MAP_UNINIT;
>  		goto out;
> @@ -3678,8 +3679,10 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
>  	 * repeat fallocate creation request
>  	 * we already have an unwritten extent
>  	 */
> -	if (flags & EXT4_GET_BLOCKS_UNINIT_EXT)
> +	if (flags & EXT4_GET_BLOCKS_UNINIT_EXT) {
> +		map->m_flags |= EXT4_MAP_UNWRITTEN;
>  		goto map_out;
> +	}
>  
>  	/* buffered READ or buffered write_begin() lookup */
>  	if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
> @@ -4109,6 +4112,7 @@ got_allocated_blocks:
>  	/* Mark uninitialized */
>  	if (flags & EXT4_GET_BLOCKS_UNINIT_EXT){
>  		ext4_ext_mark_uninitialized(&newex);
> +		map->m_flags |= EXT4_MAP_UNWRITTEN;
>  		/*
>  		 * io_end structure was created for every IO write to an
>  		 * uninitialized extent. To avoid unnecessary conversion,
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 7fb00d8..c7e9665 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -560,16 +560,10 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>  		return retval;
>  
>  	/*
> -	 * When we call get_blocks without the create flag, the
> -	 * BH_Unwritten flag could have gotten set if the blocks
> -	 * requested were part of a uninitialized extent.  We need to
> -	 * clear this flag now that we are committed to convert all or
> -	 * part of the uninitialized extent to be an initialized
> -	 * extent.  This is because we need to avoid the combination
> -	 * of BH_Unwritten and BH_Mapped flags being simultaneously
> -	 * set on the buffer_head.
> +	 * Here we clear m_flags because after allocating an new extent,
> +	 * it will be set again.
>  	 */
> -	map->m_flags &= ~EXT4_MAP_UNWRITTEN;
> +	map->m_flags &= ~EXT4_MAP_FLAGS;
>  
>  	/*
>  	 * New blocks allocate and/or writing to uninitialized extent
> -- 
> 1.7.12.rc2.18.g61b472e
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [PATCH 00/10 v5] ext4: extent status tree (step2)
  2013-02-08  8:43 [PATCH 00/10 v5] ext4: extent status tree (step2) Zheng Liu
                   ` (9 preceding siblings ...)
  2013-02-08  8:44 ` [PATCH 10/10 v5] ext4: remove bogus wait for unwritten extents in ext4_ind_direct_IO Zheng Liu
@ 2013-02-10  1:38 ` Theodore Ts'o
  2013-02-10  8:40   ` Zheng Liu
  10 siblings, 1 reply; 37+ messages in thread
From: Theodore Ts'o @ 2013-02-10  1:38 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Jan kara, Dmitry Monakhov

Hi Zheng,

Thanks for working on this!  I've included your v5 extent status tree
patches into the ext4 dev tree.  I will start doing further testing.

I suspected it wouldn't make a difference, but I tested and confirmed
that your patches don't help fix the regression introduced by the
patch: disable-merging-of-uninitialized-extents

As a result, I've moved the following patches to the unstable portion
of the ext4 patch queue, pending further investigation and discussion:

disable-merging-of-uninitialized-extents
remove-unnecessary-wait-for-extent-coversion-in-ext4_fallocate
ext4_split_extent_should_take_care_of_extent_zeroout

I will be running a full set of tests and looking more deeply at the
patches, but for now I want to get them into linux-next for more
testing and review.

Cheers, 

					- Ted

On Fri, Feb 08, 2013 at 04:43:56PM +0800, Zheng Liu wrote:
> Hi all,
> 
> This is my fifth try to implement the second step of extent status tree.
> The patch set can be divided into the following parts.
> 
> Patch 1/10
>   This patch refines the extent status tree
> 
> Patch 2/10-6/10
>   These patches try to track all extent status in the extent status tree
> and use it as an extent cache.  In the extent_status structure the bit field
> is removed because we get some warnings from 'sparse'.  Now es_pblk and
> es_status are manipulated by ext4_es_*_pblock and ext4_es_*_status directly.
> Currently, when an unwritten extent is allocated, we cannot tell from
> map->m_flags because ext4_ext_map_blocks doesn't return the
> EXT4_MAP_UNWRITTEN flag.  A patch fixes this so that we can determine the
> extent status from m_flags.
>   Following Jan's feedback, we put holes into the extent cache to avoid
> accessing the extent tree on disk as far as possible.  If the whole file
> is a hole, this hole will not be cached in the extent status tree because
> it is always split immediately.  Likewise, the hole will not be cached
> when ext4_da_map_blocks looks up a block mapping, because this hole will
> become a delayed extent later.
> 
> Patch 7/10-8/10
>   These two patches try to reclaim memory from the extent status tree when
> we are under high memory pressure.
> 
> Patch 9/10-10/10
>   These patches are picked up again from the 1st version because I realized
> that they could remove a bogus wait in ext4_ind_direct_IO when
> dioread_nolock is enabled.  After applying them, the latency of dio read
> can be reduced.
> 
> I measure it using fio and the result shows as below.
> 
> config file
> -----------
> [global]
> ioengine=psync
> direct=1
> bs=4k
> thread
> group_reporting
> directory=/mnt/sda1/
> filename=testfile
> filesize=10g
> size=10g
> runtime=120
> iodepth=16
> 
> [fio]
> rw=randrw
> numjobs=4
> 
> result
> ------
> w/ bogus wait
>   read : io=1508.1MB, bw=12876KB/s, iops=3218 , runt=120001msec
>     clat (usec): min=128 , max=268738 , avg=718.62, stdev=3703.97
>      lat (usec): min=128 , max=268739 , avg=718.78, stdev=3703.97
>   write: io=1505.2MB, bw=12843KB/s, iops=3210 , runt=120001msec
>     clat (usec): min=47 , max=991727 , avg=520.94, stdev=3451.63
>      lat (usec): min=47 , max=991727 , avg=521.31, stdev=3451.63
> 
> w/o bogus wait
>   read : io=1576.4MB, bw=13451KB/s, iops=3362 , runt=120001msec
>     clat (usec): min=128 , max=283906 , avg=685.88, stdev=2762.64
>      lat (usec): min=128 , max=283907 , avg=686.05, stdev=2762.64
>   write: io=1577.9MB, bw=13458KB/s, iops=3364 , runt=120001msec
>     clat (usec): min=48 , max=977942 , avg=498.97, stdev=3093.08
>      lat (usec): min=48 , max=977943 , avg=499.33, stdev=3093.08
> 
> From the result we can see that the average latency is reduced a little
> (read: 718.78 -> 686.05 usec, write: 521.31 -> 499.33 usec, about 4-5%).
> 
> changelog:
> v5 <- v4:
>  - drop a patch that removes EXT4_MAP_FROM_CLUSTER flag
>    (I will revise it in the patch set of get_block_t refinement)
>  - fold original patch 3/9 into patch 4/9
>  - manipulate es_pblk and es_status directly
>    (bit field is removed because it causes some warnings from 'sparse')
>  - let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
>  - rename ext4_es_find_extent with ext4_es_find_delayed_extent
>  - add hole status and put hole into extent status tree as a cache
>  - convert unwritten extents from extent status tree in ext4_ext_direct_IO
>    and end_io callback
>  - remove a bogus wait in ext4_ind_direct_IO when dioread_nolock is enabled
> 
> v4 <- v3:
>  - register a normal shrinker to reclaim extent from extent status tree
> 
> v3 <- v2:
>  - use prune_super() to reclaim extents from extent status tree
>  - stashed es_status into es_pblk
>  - remove single extent cache
>  - rebase against 3.8-rc4
> 
> v2 <- v1:
>  - drop patches that try to improve unwritten extent conversion
>  - remove EXT4_MAP_FROM_CLUSTER flag
>  - add tracepoint for ext4_es_lookup_extent()
>  - drop a patch, which tries to fix a warning when bigalloc and delalloc
>    are enabled
>  - add a shrinker to reclaim memory from extent status tree
>  - rebase against 3.8-rc2
> 
> v4: http://lwn.net/Articles/536037/
> v3: http://lwn.net/Articles/533730/
> v2: http://lwn.net/Articles/532446/
> v1: http://lwn.net/Articles/531065/
> 
> As always, any comments or feedback are welcome.
> 
> FWIW, when I try to implement patch 3/10, I realize that get_block_t and
> *_map_blocks functions need to be refactored because in ext4 we already
> have six get_block_t functions
>  - ext4_get_block
>  - ext4_get_block_write
>  - ext4_get_block_write_nolock
>  - noalloc_get_block_write
>  - ext4_da_get_block_prep
>  - _ext4_get_block
> 
> and four *_map_blocks
>  - ext4_map_blocks
>  - ext4_da_map_blocks
>  - ext4_ext_map_blocks
>  - ext4_ind_map_blocks
> 
> So I am planning to refine them.  First I will try to split ext4_map_blocks
> into two parts, e.g. ext4_map_blocks_read and ext4_map_blocks_write, and
> then try other cleanups and improvements.
> 
> Thanks,
> 						- Zheng
> 
> Zheng Liu (10):
>   ext4: refine extent status tree
>   ext4: add physical block and status member into extent status tree
>   ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
>   ext4: track all extent status in extent status tree
>   ext4: lookup block mapping in extent status tree
>   ext4: remove single extent cache
>   ext4: adjust some functions for reclaiming extents from extent status
>     tree
>   ext4: reclaim extents from extent status tree
>   ext4: convert unwritten extents from extent status tree in end_io
>   ext4: remove bogus wait for unwritten extents in ext4_ind_direct_IO
> 
>  fs/ext4/ext4.h              |  21 +-
>  fs/ext4/ext4_extents.h      |   6 -
>  fs/ext4/extents.c           | 211 ++++--------
>  fs/ext4/extents_status.c    | 779 +++++++++++++++++++++++++++++++++++---------
>  fs/ext4/extents_status.h    |  84 ++++-
>  fs/ext4/file.c              |  16 +-
>  fs/ext4/indirect.c          |   5 -
>  fs/ext4/inode.c             | 148 +++++++--
>  fs/ext4/move_extent.c       |   3 -
>  fs/ext4/page-io.c           |   8 +-
>  fs/ext4/super.c             |   8 +-
>  include/trace/events/ext4.h | 207 ++++++++++--
>  12 files changed, 1075 insertions(+), 421 deletions(-)
> 
> -- 
> 1.7.12.rc2.18.g61b472e
> 

* Re: [PATCH 00/10 v5] ext4: extent status tree (step2)
  2013-02-10  1:38 ` [PATCH 00/10 v5] ext4: extent status tree (step2) Theodore Ts'o
@ 2013-02-10  8:40   ` Zheng Liu
  0 siblings, 0 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-10  8:40 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Zheng Liu, Jan kara, Dmitry Monakhov

Hi Ted,

On Sat, Feb 09, 2013 at 08:38:57PM -0500, Theodore Ts'o wrote:
> Hi Zheng,
> 
> Thanks for working on this!  I've included your v5 extent status tree
> patches into the ext4 dev tree.  I will start doing further testing.
> 
> I suspected it wouldn't make a difference, but I tested and confirmed
> that your patches don't help fix the regression introduced by the
> patch: disable-merging-of-uninitialized-extents
> 
> As a result, I've moved the following patches to the unstable portion
> of the ext4 patch queue, pending further investigation and discussion:
> 
> disable-merging-of-uninitialized-extents
> remove-unnecessary-wait-for-extent-coversion-in-ext4_fallocate
> ext4_split_extent_should_take_care_of_extent_zeroout
> 
> I will be running a full set of tests and looking more deeply at the
> patches, but for now I want to get them into linux-next for more
> testing and review.

I see that the patch series has been merged into the 'dev' branch of the
ext4 repo, but a chunk of patch 9/10 is missing in your branch.  That
patch replaces ext4_map_blocks with ext4_ext_map_blocks, because calling
ext4_map_blocks to convert unwritten extents fails.  ext4_map_blocks
first looks up the extent status tree, and the unwritten extent in that
tree has already been converted in the end_io callback.  So
ext4_map_blocks returns immediately and the unwritten extent on disk is
never converted.  I have replied to the mail for that patch with a
description.  Please check it.
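
The failure described above can be shown on a toy model.  This is a
sketch only, not ext4 code: struct toy_extent and the toy_* functions are
hypothetical, with one field for what the extent status tree says and one
for what the extent tree on disk says.

```c
#include <assert.h>

enum state { UNWRITTEN, WRITTEN };

struct toy_extent {
	enum state cached;	/* what the status tree says */
	enum state on_disk;	/* what the extent tree on disk says */
};

/* end_io step: only the cached copy is converted. */
static void toy_end_io(struct toy_extent *e)
{
	assert(e->cached == UNWRITTEN);
	e->cached = WRITTEN;
}

/*
 * Cache-first conversion: a cache hit that already looks written
 * returns early.  Return 1 if the on-disk extent was converted.
 */
static int toy_convert_cached_first(struct toy_extent *e)
{
	if (e->cached == WRITTEN)
		return 0;	/* early return, disk untouched */
	e->on_disk = WRITTEN;
	e->cached = WRITTEN;
	return 1;
}

/*
 * Direct conversion: always look at the on-disk extent.
 * Return 1 if the on-disk extent was converted.
 */
static int toy_convert_direct(struct toy_extent *e)
{
	if (e->on_disk == WRITTEN)
		return 0;
	e->on_disk = WRITTEN;
	e->cached = WRITTEN;
	return 1;
}
```

After toy_end_io(), the cache-first path leaves on_disk unwritten, while
the direct path converts it; this mirrors why the missing chunk matters.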

Thanks,
                                                - Zheng

> On Fri, Feb 08, 2013 at 04:43:56PM +0800, Zheng Liu wrote:
> > Hi all,
> > 
> > This is my fifth try to implement the second step of extent status tree.
> > The patch set can be divided into the following parts.
> > 
> > Patch 1/10
> >   This patch refines the extent status tree
> > 
> > Patch 2/10-6/10
> >   These patches try to track all extent status in the extent status tree
> > and use it as an extent cache.  In the extent_status structure the bit field
> > is removed because we get some warnings from 'sparse'.  Now es_pblk and
> > es_status are manipulated by ext4_es_*_pblock and ext4_es_*_status directly.
> > Currently, when an unwritten extent is allocated, we cannot tell from
> > map->m_flags because ext4_ext_map_blocks doesn't return the
> > EXT4_MAP_UNWRITTEN flag.  A patch fixes this so that we can determine the
> > extent status from m_flags.
> >   Following Jan's feedback, we put holes into the extent cache to avoid
> > accessing the extent tree on disk as far as possible.  If the whole file
> > is a hole, this hole will not be cached in the extent status tree because
> > it is always split immediately.  Likewise, the hole will not be cached
> > when ext4_da_map_blocks looks up a block mapping, because this hole will
> > become a delayed extent later.
> > 
> > Patch 7/10-8/10
> >   These two patches try to reclaim memory from the extent status tree when
> > we are under high memory pressure.
> > 
> > Patch 9/10-10/10
> >   These patches are picked up again from the 1st version because I realized
> > that they could remove a bogus wait in ext4_ind_direct_IO when
> > dioread_nolock is enabled.  After applying them, the latency of dio read
> > can be reduced.
> > 
> > I measure it using fio and the result shows as below.
> > 
> > config file
> > -----------
> > [global]
> > ioengine=psync
> > direct=1
> > bs=4k
> > thread
> > group_reporting
> > directory=/mnt/sda1/
> > filename=testfile
> > filesize=10g
> > size=10g
> > runtime=120
> > iodepth=16
> > 
> > [fio]
> > rw=randrw
> > numjobs=4
> > 
> > result
> > ------
> > w/ bogus wait
> >   read : io=1508.1MB, bw=12876KB/s, iops=3218 , runt=120001msec
> >     clat (usec): min=128 , max=268738 , avg=718.62, stdev=3703.97
> >      lat (usec): min=128 , max=268739 , avg=718.78, stdev=3703.97
> >   write: io=1505.2MB, bw=12843KB/s, iops=3210 , runt=120001msec
> >     clat (usec): min=47 , max=991727 , avg=520.94, stdev=3451.63
> >      lat (usec): min=47 , max=991727 , avg=521.31, stdev=3451.63
> > 
> > w/o bogus wait
> >   read : io=1576.4MB, bw=13451KB/s, iops=3362 , runt=120001msec
> >     clat (usec): min=128 , max=283906 , avg=685.88, stdev=2762.64
> >      lat (usec): min=128 , max=283907 , avg=686.05, stdev=2762.64
> >   write: io=1577.9MB, bw=13458KB/s, iops=3364 , runt=120001msec
> >     clat (usec): min=48 , max=977942 , avg=498.97, stdev=3093.08
> >      lat (usec): min=48 , max=977943 , avg=499.33, stdev=3093.08
> > 
> > From the result we can see that the average latency is reduced a little
> > (read: 718.78 -> 686.05 usec, write: 521.31 -> 499.33 usec, about 4-5%).
> > 
> > changelog:
> > v5 <- v4:
> >  - drop a patch that removes EXT4_MAP_FROM_CLUSTER flag
> >    (I will revise it in the patch set of get_block_t refinement)
> >  - fold original patch 3/9 into patch 4/9
> >  - manipulate es_pblk and es_status directly
> >    (bit field is removed because it causes some warnings from 'sparse')
> >  - let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
> >  - rename ext4_es_find_extent with ext4_es_find_delayed_extent
> >  - add hole status and put hole into extent status tree as a cache
> >  - convert unwritten extents from extent status tree in ext4_ext_direct_IO
> >    and end_io callback
> >  - remove a bogus wait in ext4_ind_direct_IO when dioread_nolock is enabled
> > 
> > v4 <- v3:
> >  - register a normal shrinker to reclaim extent from extent status tree
> > 
> > v3 <- v2:
> >  - use prune_super() to reclaim extents from extent status tree
> >  - stashed es_status into es_pblk
> >  - remove single extent cache
> >  - rebase against 3.8-rc4
> > 
> > v2 <- v1:
> >  - drop patches that try to improve unwritten extent conversion
> >  - remove EXT4_MAP_FROM_CLUSTER flag
> >  - add tracepoint for ext4_es_lookup_extent()
> >  - drop a patch, which tries to fix a warning when bigalloc and delalloc
> >    are enabled
> >  - add a shrinker to reclaim memory from extent status tree
> >  - rebase against 3.8-rc2
> > 
> > v4: http://lwn.net/Articles/536037/
> > v3: http://lwn.net/Articles/533730/
> > v2: http://lwn.net/Articles/532446/
> > v1: http://lwn.net/Articles/531065/
> > 
> > As always, any comments or feedback are welcome.
> > 
> > FWIW, while implementing patch 3/10, I realized that the get_block_t and
> > *_map_blocks functions need to be refactored because in ext4 we already
> > have six get_block_t functions
> >  - ext4_get_block
> >  - ext4_get_block_write
> >  - ext4_get_block_write_nolock
> >  - noalloc_get_block_write
> >  - ext4_da_get_block_prep
> >  - _ext4_get_block
> > 
> > and four *_map_blocks
> >  - ext4_map_blocks
> >  - ext4_da_map_blocks
> >  - ext4_ext_map_blocks
> >  - ext4_ind_map_blocks
> > 
> > So I am planning to refine them.  First I will try to split ext4_map_blocks
> > into two parts, e.g. ext4_map_blocks_read and ext4_map_blocks_write, and
> > then try other cleanups and improvements.
> > 
> > Thanks,
> > 						- Zheng
> > 
> > Zheng Liu (10):
> >   ext4: refine extent status tree
> >   ext4: add physical block and status member into extent status tree
> >   ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
> >   ext4: track all extent status in extent status tree
> >   ext4: lookup block mapping in extent status tree
> >   ext4: remove single extent cache
> >   ext4: adjust some functions for reclaiming extents from extent status
> >     tree
> >   ext4: reclaim extents from extent status tree
> >   ext4: convert unwritten extents from extent status tree in end_io
> >   ext4: remove bogus wait for unwritten extents in ext4_ind_direct_IO
> > 
> >  fs/ext4/ext4.h              |  21 +-
> >  fs/ext4/ext4_extents.h      |   6 -
> >  fs/ext4/extents.c           | 211 ++++--------
> >  fs/ext4/extents_status.c    | 779 +++++++++++++++++++++++++++++++++++---------
> >  fs/ext4/extents_status.h    |  84 ++++-
> >  fs/ext4/file.c              |  16 +-
> >  fs/ext4/indirect.c          |   5 -
> >  fs/ext4/inode.c             | 148 +++++++--
> >  fs/ext4/move_extent.c       |   3 -
> >  fs/ext4/page-io.c           |   8 +-
> >  fs/ext4/super.c             |   8 +-
> >  include/trace/events/ext4.h | 207 ++++++++++--
> >  12 files changed, 1075 insertions(+), 421 deletions(-)
> > 
> > -- 
> > 1.7.12.rc2.18.g61b472e
> > 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/10 v5] ext4: convert unwritten extents from extent status tree in end_io
  2013-02-08  8:44 ` [PATCH 09/10 v5] ext4: convert unwritten extents from extent status tree in end_io Zheng Liu
@ 2013-02-10  8:45   ` Zheng Liu
  2013-02-11  1:52     ` Theodore Ts'o
  2013-02-12 12:51   ` Jan Kara
  1 sibling, 1 reply; 37+ messages in thread
From: Zheng Liu @ 2013-02-10  8:45 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Zheng Liu, Jan kara, linux-ext4

On Fri, Feb 08, 2013 at 04:44:05PM +0800, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> This commit tries to convert unwritten extents from extent status tree
> in end_io callback functions and ext4_ext_direct_IO.
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Jan kara <jack@suse.cz>
> ---
>  fs/ext4/extents.c           |   6 +-
>  fs/ext4/extents_status.c    | 180 ++++++++++++++++++++++++++++++++++++++++----
>  fs/ext4/extents_status.h    |   2 +
>  fs/ext4/inode.c             |   5 ++
>  fs/ext4/page-io.c           |   8 +-
>  include/trace/events/ext4.h |  25 ++++++
>  6 files changed, 208 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 9f21430..a03cabf 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4443,8 +4443,10 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
>  			ret = PTR_ERR(handle);
>  			break;
>  		}
> -		ret = ext4_map_blocks(handle, inode, &map,
> -				      EXT4_GET_BLOCKS_IO_CONVERT_EXT);
> +		down_write(&EXT4_I(inode)->i_data_sem);
> +		ret = ext4_ext_map_blocks(handle, inode, &map,
> +					  EXT4_GET_BLOCKS_IO_CONVERT_EXT);
> +		up_write(&EXT4_I(inode)->i_data_sem);
>  		if (ret <= 0) {
>  			WARN_ON(ret <= 0);
>  			ext4_msg(inode->i_sb, KERN_ERR,

Hi Ted,

This chunk is missing in your 'dev' branch of ext4.  If we call
ext4_map_blocks here, unwritten extent in disk will never be converted
because ext4_map_blocks first tries to lookup extent status tree and the
unwriten extent in this tree has been converted in end_io callback
function.  It always looks up a written extent in cache, and return
immdiately.  Please check it again.

Thanks,
                                                - Zheng

> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index bac5286..eab8893 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -248,10 +248,11 @@ ext4_lblk_t ext4_es_find_delayed_extent(struct inode *inode,
>  	struct extent_status *es1 = NULL;
>  	struct rb_node *node;
>  	ext4_lblk_t ret = EXT_MAX_BLOCKS;
> +	unsigned long flags;
>  
>  	trace_ext4_es_find_delayed_extent_enter(inode, es->es_lblk);
>  
> -	read_lock(&EXT4_I(inode)->i_es_lock);
> +	read_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
>  	tree = &EXT4_I(inode)->i_es_tree;
>  
>  	/* find extent in cache firstly */
> @@ -291,7 +292,7 @@ out:
>  		}
>  	}
>  
> -	read_unlock(&EXT4_I(inode)->i_es_lock);
> +	read_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
>  
>  	ext4_es_lru_add(inode);
>  	trace_ext4_es_find_delayed_extent_exit(inode, es, ret);
> @@ -458,6 +459,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>  {
>  	struct extent_status newes;
>  	ext4_lblk_t end = lblk + len - 1;
> +	unsigned long flags;
>  	int err = 0;
>  
>  	es_debug("add [%u/%u) %llu %llx to extent status tree of inode %lu\n",
> @@ -471,14 +473,14 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>  	ext4_es_store_status(&newes, status);
>  	trace_ext4_es_insert_extent(inode, &newes);
>  
> -	write_lock(&EXT4_I(inode)->i_es_lock);
> +	write_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
>  	err = __es_remove_extent(inode, lblk, end);
>  	if (err != 0)
>  		goto error;
>  	err = __es_insert_extent(inode, &newes);
>  
>  error:
> -	write_unlock(&EXT4_I(inode)->i_es_lock);
> +	write_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
>  
>  	ext4_es_lru_add(inode);
>  	ext4_es_print_tree(inode);
> @@ -498,13 +500,14 @@ int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es)
>  	struct ext4_es_tree *tree;
>  	struct extent_status *es1 = NULL;
>  	struct rb_node *node;
> +	unsigned long flags;
>  	int found = 0;
>  
>  	trace_ext4_es_lookup_extent_enter(inode, es->es_lblk);
>  	es_debug("lookup extent in block %u\n", es->es_lblk);
>  
>  	tree = &EXT4_I(inode)->i_es_tree;
> -	read_lock(&EXT4_I(inode)->i_es_lock);
> +	read_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
>  
>  	/* find extent in cache firstly */
>  	es->es_len = es->es_pblk = 0;
> @@ -539,7 +542,7 @@ out:
>  		es->es_pblk = es1->es_pblk;
>  	}
>  
> -	read_unlock(&EXT4_I(inode)->i_es_lock);
> +	read_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
>  
>  	ext4_es_lru_add(inode);
>  	trace_ext4_es_lookup_extent_exit(inode, es, found);
> @@ -649,6 +652,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  			  ext4_lblk_t len)
>  {
>  	ext4_lblk_t end;
> +	unsigned long flags;
>  	int err = 0;
>  
>  	trace_ext4_es_remove_extent(inode, lblk, len);
> @@ -658,9 +662,9 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  	end = lblk + len - 1;
>  	BUG_ON(end < lblk);
>  
> -	write_lock(&EXT4_I(inode)->i_es_lock);
> +	write_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
>  	err = __es_remove_extent(inode, lblk, end);
> -	write_unlock(&EXT4_I(inode)->i_es_lock);
> +	write_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
>  	ext4_es_print_tree(inode);
>  	return err;
>  }
> @@ -671,6 +675,7 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
>  					struct ext4_sb_info, s_es_shrinker);
>  	struct ext4_inode_info *ei;
>  	struct list_head *cur, *tmp, scanned;
> +	unsigned long flags;
>  	int nr_to_scan = sc->nr_to_scan;
>  	int ret, nr_shrunk = 0;
>  
> @@ -687,16 +692,16 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
>  
>  		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
>  
> -		read_lock(&ei->i_es_lock);
> +		read_lock_irqsave(&ei->i_es_lock, flags);
>  		if (ei->i_es_lru_nr == 0) {
> -			read_unlock(&ei->i_es_lock);
> +			read_unlock_irqrestore(&ei->i_es_lock, flags);
>  			continue;
>  		}
> -		read_unlock(&ei->i_es_lock);
> +		read_unlock_irqrestore(&ei->i_es_lock, flags);
>  
> -		write_lock(&ei->i_es_lock);
> +		write_lock_irqsave(&ei->i_es_lock, flags);
>  		ret = __es_try_to_reclaim_extents(ei, nr_to_scan);
> -		write_unlock(&ei->i_es_lock);
> +		write_unlock_irqrestore(&ei->i_es_lock, flags);
>  
>  		nr_shrunk += ret;
>  		nr_to_scan -= ret;
> @@ -756,14 +761,15 @@ static int ext4_es_reclaim_extents_count(struct super_block *sb)
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
>  	struct ext4_inode_info *ei;
>  	struct list_head *cur;
> +	unsigned long flags;
>  	int nr_cached = 0;
>  
>  	spin_lock(&sbi->s_es_lru_lock);
>  	list_for_each(cur, &sbi->s_es_lru) {
>  		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
> -		read_lock(&ei->i_es_lock);
> +		read_lock_irqsave(&ei->i_es_lock, flags);
>  		nr_cached += ei->i_es_lru_nr;
> -		read_unlock(&ei->i_es_lock);
> +		read_unlock_irqrestore(&ei->i_es_lock, flags);
>  	}
>  	spin_unlock(&sbi->s_es_lru_lock);
>  	trace_ext4_es_reclaim_extents_count(sb, nr_cached);
> @@ -801,3 +807,147 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
>  	tree->cache_es = NULL;
>  	return nr_shrunk;
>  }
> +
> +int ext4_es_convert_unwritten_extents(struct inode *inode, loff_t offset,
> +				      size_t size)
> +{
> +	struct ext4_es_tree *tree;
> +	struct rb_node *node;
> +	struct extent_status *es, orig_es, conv_es;
> +	ext4_lblk_t end, len1, len2;
> +	ext4_lblk_t lblk = 0, len = 0;
> +	ext4_fsblk_t block;
> +	unsigned long flags;
> +	unsigned int blkbits;
> +	int err = 0;
> +
> +	trace_ext4_es_convert_unwritten_extents(inode, offset, size);
> +	blkbits = inode->i_blkbits;
> +	lblk = offset >> blkbits;
> +	len = (EXT4_BLOCK_ALIGN(offset + size, blkbits) >> blkbits) - lblk;
> +
> +	end = lblk + len - 1;
> +	BUG_ON(end < lblk);
> +
> +	tree = &EXT4_I(inode)->i_es_tree;
> +
> +	write_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
> +	es = __es_tree_search(&tree->root, lblk);
> +	if (!es)
> +		goto out;
> +	if (es->es_lblk > end)
> +		goto out;
> +
> +	tree->cache_es = NULL;
> +
> +	orig_es.es_lblk = es->es_lblk;
> +	orig_es.es_len = es->es_len;
> +	orig_es.es_pblk = es->es_pblk;
> +
> +	len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
> +	len2 = ext4_es_end(es) > end ?
> +	       ext4_es_end(es) - end : 0;
> +	if (len1 > 0)
> +		es->es_len = len1;
> +	if (len2 > 0) {
> +		if (len1 > 0) {
> +			struct extent_status newes;
> +
> +			newes.es_lblk = end + 1;
> +			newes.es_len = len2;
> +			block = ext4_es_pblock(&orig_es) +
> +				orig_es.es_len - len2;
> +			ext4_es_store_pblock(&newes, block);
> +			ext4_es_store_status(&newes, ext4_es_status(&orig_es));
> +			err = __es_insert_extent(inode, &newes);
> +			if (err) {
> +				es->es_lblk = orig_es.es_lblk;
> +				es->es_len = orig_es.es_len;
> +				es->es_pblk = orig_es.es_pblk;
> +				goto out;
> +			}
> +
> +			conv_es.es_lblk = orig_es.es_lblk + len1;
> +			conv_es.es_len = orig_es.es_len - len1 - len2;
> +			block = ext4_es_pblock(&orig_es) + len1;
> +			ext4_es_store_pblock(&conv_es, block);
> +			ext4_es_store_status(&conv_es, EXTENT_STATUS_WRITTEN);
> +			err = __es_insert_extent(inode, &conv_es);
> +			if (err) {
> +				int err2 = __es_remove_extent(inode,
> +							conv_es.es_lblk,
> +							ext4_es_end(&newes));
> +				if (err2)
> +					goto out;
> +				es->es_lblk = orig_es.es_lblk;
> +				es->es_len = orig_es.es_len;
> +				es->es_pblk = orig_es.es_pblk;
> +				goto out;
> +			}
> +		} else {
> +			es->es_lblk = end + 1;
> +			es->es_len = len2;
> +			block = ext4_es_pblock(&orig_es) +
> +				orig_es.es_len - len2;
> +			ext4_es_store_pblock(es, block);
> +
> +			conv_es.es_lblk = orig_es.es_lblk;
> +			conv_es.es_len = orig_es.es_len - len2;
> +			ext4_es_store_pblock(&conv_es,
> +					     ext4_es_pblock(&orig_es));
> +			ext4_es_store_status(&conv_es, EXTENT_STATUS_WRITTEN);
> +			err = __es_insert_extent(inode, &conv_es);
> +			if (err) {
> +				es->es_lblk = orig_es.es_lblk;
> +				es->es_len = orig_es.es_len;
> +				es->es_pblk = orig_es.es_pblk;
> +			}
> +		}
> +		goto out;
> +	}
> +
> +	if (len1 > 0) {
> +		node = rb_next(&es->rb_node);
> +		if (node)
> +			es = rb_entry(node, struct extent_status, rb_node);
> +		else
> +			es = NULL;
> +	}
> +
> +	while (es && ext4_es_end(es) <= end) {
> +		node = rb_next(&es->rb_node);
> +		ext4_es_store_status(es, EXTENT_STATUS_WRITTEN);
> +		if (!node) {
> +			es = NULL;
> +			break;
> +		}
> +		es = rb_entry(node, struct extent_status, rb_node);
> +	}
> +
> +	if (es && es->es_lblk < end + 1) {
> +		ext4_lblk_t orig_len = es->es_len;
> +
> +		/*
> +		 * Set up conv_es first so that we do not need to copy the
> +		 * value of es into a temporary variable.
> +		 */
> +		len1 = ext4_es_end(es) - end;
> +		conv_es.es_lblk = es->es_lblk;
> +		conv_es.es_len = es->es_len - len1;
> +		ext4_es_store_pblock(&conv_es, ext4_es_pblock(es));
> +		ext4_es_store_status(&conv_es, EXTENT_STATUS_WRITTEN);
> +
> +		es->es_lblk = end + 1;
> +		es->es_len = len1;
> +		block = ext4_es_pblock(es) + orig_len - len1;
> +		ext4_es_store_pblock(es, block);
> +
> +		err = __es_insert_extent(inode, &conv_es);
> +		if (err)
> +			goto out;
> +	}
> +
> +out:
> +	write_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
> +	return err;
> +}
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index 938ad2b..2849d74 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -54,6 +54,8 @@ extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  extern ext4_lblk_t ext4_es_find_delayed_extent(struct inode *inode,
>  					       struct extent_status *es);
>  extern int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es);
> +extern int ext4_es_convert_unwritten_extents(struct inode *inode, loff_t offset,
> +					     size_t size);
>  
>  static inline int ext4_es_is_written(struct extent_status *es)
>  {
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 670779a..08cf720 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3063,6 +3063,7 @@ out:
>  		io_end->result = ret;
>  	}
>  
> +	ext4_es_convert_unwritten_extents(inode, offset, size);
>  	ext4_add_complete_io(io_end);
>  }
>  
> @@ -3088,6 +3089,7 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
>  	 */
>  	inode = io_end->inode;
>  	ext4_set_io_unwritten_flag(inode, io_end);
> +	ext4_es_convert_unwritten_extents(inode, io_end->offset, io_end->size);
>  	ext4_add_complete_io(io_end);
>  out:
>  	bh->b_private = NULL;
> @@ -3246,6 +3248,9 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  	} else if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
>  						EXT4_STATE_DIO_UNWRITTEN)) {
>  		int err;
> +		err = ext4_es_convert_unwritten_extents(inode, offset, ret);
> +		if (err)
> +			ret = err;
>  		/*
>  		 * for non AIO case, since the IO is already
>  		 * completed, we could do the conversion right here
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 0016fbc..66ea30e 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -276,6 +276,13 @@ static void ext4_end_bio(struct bio *bio, int error)
>  		error = 0;
>  	bio_put(bio);
>  
> +	/*
> +	 * We need to convert unwritten extents in extent status tree before
> +	 * end_page_writeback() is called.  Otherwise, when dioread_nolock is
> +	 * enabled, we will be likely to read stale data.
> +	 */
> +	inode = io_end->inode;
> +	ext4_es_convert_unwritten_extents(inode, io_end->offset, io_end->size);
>  	for (i = 0; i < io_end->num_io_pages; i++) {
>  		struct page *page = io_end->pages[i]->p_page;
>  		struct buffer_head *bh, *head;
> @@ -305,7 +312,6 @@ static void ext4_end_bio(struct bio *bio, int error)
>  		put_io_page(io_end->pages[i]);
>  	}
>  	io_end->num_io_pages = 0;
> -	inode = io_end->inode;
>  
>  	if (error) {
>  		io_end->flag |= EXT4_IO_END_ERROR;
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index f0734b3..d32e3d5 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -2233,6 +2233,31 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
>  		  __entry->found ? __entry->status : 0)
>  );
>  
> +TRACE_EVENT(ext4_es_convert_unwritten_extents,
> +	TP_PROTO(struct inode *inode, loff_t offset, loff_t size),
> +
> +	TP_ARGS(inode, offset, size),
> +
> +	TP_STRUCT__entry(
> +		__field(	dev_t,	dev			)
> +		__field(	ino_t,	ino			)
> +		__field(	loff_t,	offset			)
> +		__field(	loff_t, size			)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev	= inode->i_sb->s_dev;
> +		__entry->ino	= inode->i_ino;
> +		__entry->offset	= offset;
> +		__entry->size	= size;
> +	),
> +
> +	TP_printk("dev %d,%d ino %lu convert unwritten extents [%llu/%llu)",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  (unsigned long) __entry->ino,
> +		  __entry->offset, __entry->size)
> +);
> +
>  TRACE_EVENT(ext4_es_reclaim_extents_count,
>  	TP_PROTO(struct super_block *sb, int nr_cached),
>  
> -- 
> 1.7.12.rc2.18.g61b472e
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/10 v5] ext4: convert unwritten extents from extent status tree in end_io
  2013-02-10  8:45   ` Zheng Liu
@ 2013-02-11  1:52     ` Theodore Ts'o
  0 siblings, 0 replies; 37+ messages in thread
From: Theodore Ts'o @ 2013-02-11  1:52 UTC (permalink / raw)
  To: Zheng Liu, Jan kara, linux-ext4

On Sun, Feb 10, 2013 at 04:45:11PM +0800, Zheng Liu wrote:
> 
> This hunk is missing from your 'dev' branch of ext4.  If we call
> ext4_map_blocks here, the unwritten extent on disk will never be converted,
> because ext4_map_blocks first looks up the extent status tree, and the
> unwritten extent in that tree has already been converted by the end_io
> callback function.  The lookup therefore always finds a written extent in
> the cache and returns immediately.  Please check it again.

Thanks for catching this!  I missed that this patch hunk was rejected
because I was distracted by another hunk rejection, the one caused by
the fact that we've since dropped the function
ext4_end_io_buffer_write().

I've fixed this up and pushed an updated git "dev" branch.

     	   	       	      	 	 - Ted

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/10 v5] ext4: track all extent status in extent status tree
  2013-02-08  8:44 ` [PATCH 04/10 v5] ext4: track all extent status in extent status tree Zheng Liu
@ 2013-02-11 12:21   ` Jan Kara
  2013-02-15  6:45     ` Zheng Liu
  2013-02-13  3:28   ` Theodore Ts'o
  1 sibling, 1 reply; 37+ messages in thread
From: Jan Kara @ 2013-02-11 12:21 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o, Jan kara

On Fri 08-02-13 16:44:00, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> By recording the physical block and status, the extent status tree is able
> to track the status of every extent.  When we call _map_blocks
> functions to lookup an extent or create a new written/unwritten/delayed
> extent, this extent will be inserted into extent status tree.  The hole
> extent is inserted in ext4_ext_put_gap_in_cache().  If there is no
> extent at all, we will not insert a hole extent [0, ~0] into the extent
> status tree in order to reduce the complexity of the code.
> 
> We don't load all extents from disk in alloc_inode() because it costs
> too much memory, and if a file is opened and closed frequently it will
> take too much time to load all extent information.  So currently when
> we create/lookup an extent, this extent will be inserted into extent
> status tree.  Hence, the extent status tree may not comprehensively
> contain all of the extents found in the file.
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Jan kara <jack@suse.cz>
> ---
>  fs/ext4/extents.c           |  4 +--
>  fs/ext4/extents_status.c    | 27 ++++++++++++------
>  fs/ext4/extents_status.h    |  4 +--
>  fs/ext4/file.c              |  4 +--
>  fs/ext4/inode.c             | 68 ++++++++++++++++++++++++++++-----------------
>  include/trace/events/ext4.h |  4 +--
>  6 files changed, 70 insertions(+), 41 deletions(-)
> 
...
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index 5093cee..71cb75a 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -239,14 +239,15 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
>   * EXT_MAX_BLOCKS if no extent is found.
>   * Delayed extent is returned via @es.
>   */
> -ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
> +ext4_lblk_t ext4_es_find_delayed_extent(struct inode *inode,
> +					struct extent_status *es)
>  {
  I have to say I'm still not very happy about this function (but it's much
better than it used to be so thanks for that!). I have two suggestions for
improvement:
1) 'es' is both input and output argument where for input only es_lblk is
used. That's a bit confusing so how about making the function like:

ext4_es_find_delayed_extent(struct inode *inode, ext4_lblk_t offset,
			    struct extent_status *out);

  to separate input and output? Also you can comment that we use the 'out'
parameter instead of returning the extent_status from the tree because the
tree entry can be freed once we drop the lock protecting the status tree.

2) The returned value is somewhat surprisingly the logical offset of the
*next* delalloc extent. It's used only in ext4_fill_fiemap_extents()
AFAICS. It would be easier to understand if the function didn't return
anything. ext4_fill_fiemap_extents() would use
ext4_es_find_delayed_extent() to find both current and next delalloc extent
(which would become the 'current' one in the next iteration). As a bonus
you would also save some iterations of the extent status tree...
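
To make both suggestions concrete, here is a rough userspace sketch, not
kernel code. Everything in it is a hypothetical stand-in: a sorted array
replaces the rb-tree, there is no locking, and the toy_* names are made up
for illustration. The lookup takes the offset as input, fills a separate
'out' argument, and returns nothing; out->len == 0 means no delayed extent
was found. The caller then restarts the search just past the extent it got.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel types; hypothetical, not ext4 code. */
typedef unsigned int toy_lblk_t;

struct toy_es {
	toy_lblk_t lblk;	/* first logical block */
	toy_lblk_t len;		/* length in blocks; 0 means "not found" */
	int delayed;		/* non-zero for a delayed extent */
};

/*
 * Find the first delayed extent that covers or follows 'offset' in a
 * sorted, non-overlapping array (entries are assumed to have len > 0)
 * and copy it to 'out'.  The result is copied rather than returned by
 * pointer because, in the real tree, the entry could be freed once the
 * lock is dropped.  out->len == 0 means nothing was found.
 */
void toy_find_delayed_extent(const struct toy_es *tree, size_t n,
			     toy_lblk_t offset, struct toy_es *out)
{
	size_t i;

	out->lblk = 0;
	out->len = 0;
	out->delayed = 0;
	for (i = 0; i < n; i++) {
		if (tree[i].lblk + tree[i].len - 1 < offset)
			continue;	/* ends before offset */
		if (!tree[i].delayed)
			continue;	/* not a delayed extent */
		*out = tree[i];
		return;
	}
}

/* Caller side: walk all delayed extents, as a fiemap-style loop might. */
size_t toy_count_delayed(const struct toy_es *tree, size_t n)
{
	struct toy_es cur;
	toy_lblk_t next = 0;
	size_t count = 0;

	for (;;) {
		toy_find_delayed_extent(tree, n, next, &cur);
		if (cur.len == 0)
			break;
		count++;
		next = cur.lblk + cur.len;	/* 'current' for the next pass */
		if (next <= cur.lblk)
			break;			/* wrapped around, stop */
	}
	return count;
}
```

The array scan here is linear, so it does not show the saved tree
iterations; it only shows the shape of the interface and of the caller loop.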

>  	struct ext4_es_tree *tree = NULL;
>  	struct extent_status *es1 = NULL;
>  	struct rb_node *node;
>  	ext4_lblk_t ret = EXT_MAX_BLOCKS;
>  
> -	trace_ext4_es_find_extent_enter(inode, es->es_lblk);
> +	trace_ext4_es_find_delayed_extent_enter(inode, es->es_lblk);
>  
>  	read_lock(&EXT4_I(inode)->i_es_lock);
>  	tree = &EXT4_I(inode)->i_es_tree;
> @@ -266,21 +267,31 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
>  	es1 = __es_tree_search(&tree->root, es->es_lblk);
>  
>  out:
> -	if (es1) {
> +	if (es1 && !ext4_es_is_delayed(es1)) {
> +		while ((node = rb_next(&es1->rb_node)) != NULL) {
> +			es1 = rb_entry(node, struct extent_status, rb_node);
> +			if (ext4_es_is_delayed(es1))
> +				break;
> +		}
> +	}
> +
> +	if (es1 && ext4_es_is_delayed(es1)) {
>  		tree->cache_es = es1;
>  		es->es_lblk = es1->es_lblk;
>  		es->es_len = es1->es_len;
>  		es->es_pblk = es1->es_pblk;
> -		node = rb_next(&es1->rb_node);
> -		if (node) {
> +		while ((node = rb_next(&es1->rb_node)) != NULL) {
>  			es1 = rb_entry(node, struct extent_status, rb_node);
> -			ret = es1->es_lblk;
> +			if (ext4_es_is_delayed(es1)) {
> +				ret = es1->es_lblk;
> +				break;
> +			}
>  		}
>  	}
>  
>  	read_unlock(&EXT4_I(inode)->i_es_lock);
>  
> -	trace_ext4_es_find_extent_exit(inode, es, ret);
> +	trace_ext4_es_find_delayed_extent_exit(inode, es, ret);
>  	return ret;
>  }
>  
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 05/10 v5] ext4: lookup block mapping in extent status tree
  2013-02-08  8:44 ` [PATCH 05/10 v5] ext4: lookup block mapping " Zheng Liu
@ 2013-02-12 12:31   ` Jan Kara
  2013-02-15  7:06     ` Zheng Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2013-02-12 12:31 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o, Jan kara

On Fri 08-02-13 16:44:01, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> After tracking all extent status, we already have an extent cache in
> memory.  Every time we want to lookup a block mapping, we can first
> try to look it up in extent status tree to avoid a potential disk I/O.
> 
> A new function called ext4_es_lookup_extent is defined to finish this
> work.  When we try to lookup a block mapping, we always call
> ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
> first try to lookup a block mapping in extent status tree.
> 
> A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
> in order not to put a hole into extent status tree because this hole
> will be converted to delayed extent in the tree immediately.
  It looks somewhat inconsistent that you put holes into the extent status
tree in ext4_ext_map_blocks() but all other extent types are handled in
ext4_map_blocks() or ext4_da_map_blocks(). Can we put the handling in one
place?

Otherwise the patch looks OK.
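
What I have in mind is roughly the following shape. This is only a toy
userspace sketch under made-up assumptions, not ext4 code: the demo_* names,
the fixed block layout, and the array used as a cache are all hypothetical.
The low-level lookup only reports what it finds, and the top-level mapping
function is the single place that records every result, hole or not.

```c
#include <stddef.h>

/* Toy extent states; hypothetical stand-ins for EXTENT_STATUS_* values. */
enum demo_status { DEMO_HOLE, DEMO_WRITTEN };

struct demo_map {
	unsigned int lblk;
	unsigned int len;
	enum demo_status status;
};

#define DEMO_CACHE_MAX 8

struct demo_cache {
	struct demo_map entries[DEMO_CACHE_MAX];
	size_t nr;
};

/*
 * Low-level lookup over a made-up layout: blocks [0, 8) are written and
 * everything after that is reported as an 8-block hole.  It only reports
 * what it finds and never touches the cache.
 */
void demo_lowlevel_map(unsigned int lblk, struct demo_map *map)
{
	map->lblk = lblk;
	if (lblk < 8) {
		map->len = 8 - lblk;
		map->status = DEMO_WRITTEN;
	} else {
		map->len = 8;
		map->status = DEMO_HOLE;
	}
}

/*
 * Top-level mapping function: the single place where every result,
 * hole or not, is recorded in the cache (silently skipped when the toy
 * cache is full).  Returns the number of mapped blocks, or 0 for a hole.
 */
unsigned int demo_map_blocks(struct demo_cache *cache, unsigned int lblk,
			     struct demo_map *map)
{
	demo_lowlevel_map(lblk, map);
	if (cache->nr < DEMO_CACHE_MAX)
		cache->entries[cache->nr++] = *map;
	return map->status == DEMO_HOLE ? 0 : map->len;
}
```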

								Honza

> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Jan kara <jack@suse.cz>
> ---
>  fs/ext4/ext4.h              |  2 ++
>  fs/ext4/extents.c           |  7 ++++-
>  fs/ext4/extents_status.c    | 59 +++++++++++++++++++++++++++++++++++++++++
>  fs/ext4/extents_status.h    |  1 +
>  fs/ext4/inode.c             | 64 +++++++++++++++++++++++++++++++++++++++++++--
>  include/trace/events/ext4.h | 56 +++++++++++++++++++++++++++++++++++++++
>  6 files changed, 186 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 8462eb3..ad885b5 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -582,6 +582,8 @@ enum {
>  #define EXT4_GET_BLOCKS_KEEP_SIZE		0x0080
>  	/* Do not take i_data_sem locking in ext4_map_blocks */
>  #define EXT4_GET_BLOCKS_NO_LOCK			0x0100
> +	/* Do not put hole in extent cache */
> +#define EXT4_GET_BLOCKS_NO_PUT_HOLE		0x0200
>  
>  /*
>   * Flags used by ext4_free_blocks
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 4b065ff..1be8955 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -2154,6 +2154,8 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
>  				block,
>  				le32_to_cpu(ex->ee_block),
>  				 ext4_ext_get_actual_len(ex));
> +		ext4_es_insert_extent(inode, lblock, len, ~0,
> +				      EXTENT_STATUS_HOLE);
>  	} else if (block >= le32_to_cpu(ex->ee_block)
>  			+ ext4_ext_get_actual_len(ex)) {
>  		ext4_lblk_t next;
> @@ -2167,6 +2169,8 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
>  				block);
>  		BUG_ON(next == lblock);
>  		len = next - lblock;
> +		ext4_es_insert_extent(inode, lblock, len, ~0,
> +				      EXTENT_STATUS_HOLE);
>  	} else {
>  		lblock = len = 0;
>  		BUG();
> @@ -4006,7 +4010,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
>  		 * put just found gap into cache to speed up
>  		 * subsequent requests
>  		 */
> -		ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
> +		if ((flags & EXT4_GET_BLOCKS_NO_PUT_HOLE) == 0)
> +			ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
>  		goto out2;
>  	}
>  
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index 71cb75a..ca7dc9f 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -468,6 +468,65 @@ error:
>  	return err;
>  }
>  
> +/*
> + * ext4_es_lookup_extent() looks up an extent in extent status tree.
> + *
> + * ext4_es_lookup_extent is called by ext4_map_blocks/ext4_da_map_blocks.
> + *
> + * Return: 1 if found, 0 if not
> + */
> +int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es)
> +{
> +	struct ext4_es_tree *tree;
> +	struct extent_status *es1 = NULL;
> +	struct rb_node *node;
> +	int found = 0;
> +
> +	trace_ext4_es_lookup_extent_enter(inode, es->es_lblk);
> +	es_debug("lookup extent in block %u\n", es->es_lblk);
> +
> +	tree = &EXT4_I(inode)->i_es_tree;
> +	read_lock(&EXT4_I(inode)->i_es_lock);
> +
> +	/* find extent in cache firstly */
> +	es->es_len = es->es_pblk = 0;
> +	if (tree->cache_es) {
> +		es1 = tree->cache_es;
> +		if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
> +			es_debug("%u cached by [%u/%u)\n",
> +				 es->es_lblk, es1->es_lblk, es1->es_len);
> +			found = 1;
> +			goto out;
> +		}
> +	}
> +
> +	node = tree->root.rb_node;
> +	while (node) {
> +		es1 = rb_entry(node, struct extent_status, rb_node);
> +		if (es->es_lblk < es1->es_lblk)
> +			node = node->rb_left;
> +		else if (es->es_lblk > ext4_es_end(es1))
> +			node = node->rb_right;
> +		else {
> +			found = 1;
> +			break;
> +		}
> +	}
> +
> +out:
> +	if (found) {
> +		BUG_ON(!es1);
> +		es->es_lblk = es1->es_lblk;
> +		es->es_len = es1->es_len;
> +		es->es_pblk = es1->es_pblk;
> +	}
> +
> +	read_unlock(&EXT4_I(inode)->i_es_lock);
> +
> +	trace_ext4_es_lookup_extent_exit(inode, es, found);
> +	return found;
> +}
> +
>  static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
>  				 ext4_lblk_t end)
>  {
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index b5788eb..effe78c 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -53,6 +53,7 @@ extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  				 ext4_lblk_t len);
>  extern ext4_lblk_t ext4_es_find_delayed_extent(struct inode *inode,
>  					       struct extent_status *es);
> +extern int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es);
>  
>  static inline int ext4_es_is_written(struct extent_status *es)
>  {
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 16454fc..670779a 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -508,12 +508,34 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
>  int ext4_map_blocks(handle_t *handle, struct inode *inode,
>  		    struct ext4_map_blocks *map, int flags)
>  {
> +	struct extent_status es;
>  	int retval;
>  
>  	map->m_flags = 0;
>  	ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
>  		  "logical block %lu\n", inode->i_ino, flags, map->m_len,
>  		  (unsigned long) map->m_lblk);
> +
> +	/* Lookup extent status tree firstly */
> +	es.es_lblk = map->m_lblk;
> +	if (ext4_es_lookup_extent(inode, &es)) {
> +		if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
> +			map->m_pblk = ext4_es_pblock(&es) +
> +					map->m_lblk - es.es_lblk;
> +			map->m_flags |= ext4_es_is_written(&es) ?
> +					EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
> +			retval = es.es_len - (map->m_lblk - es.es_lblk);
> +			if (retval > map->m_len)
> +				retval = map->m_len;
> +			map->m_len = retval;
> +		} else if (ext4_es_is_delayed(&es) || ext4_es_is_hole(&es)) {
> +			retval = 0;
> +		} else {
> +			BUG_ON(1);
> +		}
> +		goto found;
> +	}
> +
>  	/*
>  	 * Try to see if we can get the block without requesting a new
>  	 * file system block.
> @@ -541,6 +563,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>  	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
>  		up_read((&EXT4_I(inode)->i_data_sem));
>  
> +found:
>  	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
>  		int ret = check_block_validity(inode, map);
>  		if (ret != 0)
> @@ -1772,6 +1795,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
>  			      struct ext4_map_blocks *map,
>  			      struct buffer_head *bh)
>  {
> +	struct extent_status es;
>  	int retval;
>  	sector_t invalid_block = ~((sector_t) 0xffff);
>  
> @@ -1782,6 +1806,39 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
>  	ext_debug("ext4_da_map_blocks(): inode %lu, max_blocks %u,"
>  		  "logical block %lu\n", inode->i_ino, map->m_len,
>  		  (unsigned long) map->m_lblk);
> +
> +	/* Lookup extent status tree firstly */
> +	es.es_lblk = iblock;
> +	if (ext4_es_lookup_extent(inode, &es)) {
> +
> +		if (ext4_es_is_hole(&es)) {
> +			retval = 0;
> +			down_read((&EXT4_I(inode)->i_data_sem));
> +			goto add_delayed;
> +		}
> +
> +		if (ext4_es_is_delayed(&es)) {
> +			map_bh(bh, inode->i_sb, invalid_block);
> +			set_buffer_new(bh);
> +			set_buffer_delay(bh);
> +			return 0;
> +		}
> +
> +		map->m_pblk = ext4_es_pblock(&es) + iblock - es.es_lblk;
> +		retval = es.es_len - (iblock - es.es_lblk);
> +		if (retval > map->m_len)
> +			retval = map->m_len;
> +		map->m_len = retval;
> +		if (ext4_es_is_written(&es))
> +			map->m_flags |= EXT4_MAP_MAPPED;
> +		else if (ext4_es_is_unwritten(&es))
> +			map->m_flags |= EXT4_MAP_UNWRITTEN;
> +		else
> +			BUG_ON(1);
> +
> +		return retval;
> +	}
> +
>  	/*
>  	 * Try to see if we can get the block without requesting a new
>  	 * file system block.
> @@ -1800,10 +1857,13 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
>  			map->m_flags |= EXT4_MAP_FROM_CLUSTER;
>  		retval = 0;
>  	} else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> -		retval = ext4_ext_map_blocks(NULL, inode, map, 0);
> +		retval = ext4_ext_map_blocks(NULL, inode, map,
> +					     EXT4_GET_BLOCKS_NO_PUT_HOLE);
>  	else
> -		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
> +		retval = ext4_ind_map_blocks(NULL, inode, map,
> +					     EXT4_GET_BLOCKS_NO_PUT_HOLE);
>  
> +add_delayed:
>  	if (retval == 0) {
>  		int ret;
>  		/*
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index d278ced..822780a 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -2177,6 +2177,62 @@ TRACE_EVENT(ext4_es_find_delayed_extent_exit,
>  		  __entry->pblk, __entry->status, __entry->ret)
>  );
>  
> +TRACE_EVENT(ext4_es_lookup_extent_enter,
> +	TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
> +
> +	TP_ARGS(inode, lblk),
> +
> +	TP_STRUCT__entry(
> +		__field(	dev_t,		dev		)
> +		__field(	ino_t,		ino		)
> +		__field(	ext4_lblk_t,	lblk		)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev	= inode->i_sb->s_dev;
> +		__entry->ino	= inode->i_ino;
> +		__entry->lblk	= lblk;
> +	),
> +
> +	TP_printk("dev %d,%d ino %lu lblk %u",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  (unsigned long) __entry->ino, __entry->lblk)
> +);
> +
> +TRACE_EVENT(ext4_es_lookup_extent_exit,
> +	TP_PROTO(struct inode *inode, struct extent_status *es,
> +		 int found),
> +
> +	TP_ARGS(inode, es, found),
> +
> +	TP_STRUCT__entry(
> +		__field(	dev_t,		dev		)
> +		__field(	ino_t,		ino		)
> +		__field(	ext4_lblk_t,	lblk		)
> +		__field(	ext4_lblk_t,	len		)
> +		__field(	ext4_fsblk_t,	pblk		)
> +		__field(	unsigned long long,	status	)
> +		__field(	int,		found		)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev	= inode->i_sb->s_dev;
> +		__entry->ino	= inode->i_ino;
> +		__entry->lblk	= es->es_lblk;
> +		__entry->len	= es->es_len;
> +		__entry->pblk	= ext4_es_pblock(es);
> +		__entry->status	= ext4_es_status(es);
> +		__entry->found	= found;
> +	),
> +
> +	TP_printk("dev %d,%d ino %lu found %d [%u/%u) %llu %llx",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  (unsigned long) __entry->ino, __entry->found,
> +		  __entry->lblk, __entry->len,
> +		  __entry->found ? __entry->pblk : 0,
> +		  __entry->found ? __entry->status : 0)
> +);
> +
>  #endif /* _TRACE_EXT4_H */
>  
>  /* This part must be outside protection */
> -- 
> 1.7.12.rc2.18.g61b472e
> 
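
The cached-extent fast path in the hunks quoted above comes down to a small piece of arithmetic: offset the physical block by how far into the extent the request starts, and clamp the length to what remains of the extent. A minimal userspace sketch, assuming a stand-in struct (toy_extent and toy_map_from_extent are hypothetical names, not kernel API):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached extent; NOT the kernel's extent_status. */
struct toy_extent {
	uint32_t lblk;	/* first logical block covered */
	uint32_t len;	/* number of blocks covered */
	uint64_t pblk;	/* physical block backing lblk */
};

/*
 * Map up to 'want' blocks starting at 'lblk' through extent 'e'.
 * Stores the physical block for 'lblk' in *pblk and returns how many
 * blocks were mapped.  The caller must pass an lblk inside the extent.
 */
static uint32_t toy_map_from_extent(const struct toy_extent *e, uint32_t lblk,
				    uint32_t want, uint64_t *pblk)
{
	uint32_t skip, avail;

	assert(lblk >= e->lblk && lblk - e->lblk < e->len);
	skip = lblk - e->lblk;		/* blocks before the request */
	avail = e->len - skip;		/* blocks left in the extent */
	*pblk = e->pblk + skip;
	return avail < want ? avail : want;
}
```

A request that starts near the end of the extent is therefore shortened, which is what the "retval > map->m_len" clamp in the patch does in the other direction.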
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [PATCH 09/10 v5] ext4: convert unwritten extents from extent status tree in end_io
  2013-02-08  8:44 ` [PATCH 09/10 v5] ext4: convert unwritten extents from extent status tree in end_io Zheng Liu
  2013-02-10  8:45   ` Zheng Liu
@ 2013-02-12 12:51   ` Jan Kara
  2013-02-15  7:12     ` Zheng Liu
  1 sibling, 1 reply; 37+ messages in thread
From: Jan Kara @ 2013-02-12 12:51 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o, Jan kara

On Fri 08-02-13 16:44:05, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> This commit tries to convert unwritten extents from extent status tree
> in end_io callback functions and ext4_ext_direct_IO.
  Why should we do this?

...
> @@ -801,3 +807,147 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
>  	tree->cache_es = NULL;
>  	return nr_shrunk;
>  }
> +
> +int ext4_es_convert_unwritten_extents(struct inode *inode, loff_t offset,
> +				      size_t size)
> +{
> +	struct ext4_es_tree *tree;
> +	struct rb_node *node;
> +	struct extent_status *es, orig_es, conv_es;
> +	ext4_lblk_t end, len1, len2;
> +	ext4_lblk_t lblk = 0, len = 0;
> +	ext4_fsblk_t block;
> +	unsigned long flags;
> +	unsigned int blkbits;
> +	int err = 0;
> +
> +	trace_ext4_es_convert_unwritten_extents(inode, offset, size);
> +	blkbits = inode->i_blkbits;
> +	lblk = offset >> blkbits;
> +	len = (EXT4_BLOCK_ALIGN(offset + size, blkbits) >> blkbits) - lblk;
> +
> +	end = lblk + len - 1;
> +	BUG_ON(end < lblk);
> +
> +	tree = &EXT4_I(inode)->i_es_tree;
> +
> +	write_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
> +	es = __es_tree_search(&tree->root, lblk);
> +	if (!es)
> +		goto out;
> +	if (es->es_lblk > end)
> +		goto out;
> +
> +	tree->cache_es = NULL;
> +
> +	orig_es.es_lblk = es->es_lblk;
> +	orig_es.es_len = es->es_len;
> +	orig_es.es_pblk = es->es_pblk;
> +
> +	len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
> +	len2 = ext4_es_end(es) > end ?
> +	       ext4_es_end(es) - end : 0;
> +	if (len1 > 0)
> +		es->es_len = len1;
> +	if (len2 > 0) {
> +		if (len1 > 0) {
> +			struct extent_status newes;
> +
> +			newes.es_lblk = end + 1;
> +			newes.es_len = len2;
> +			block = ext4_es_pblock(&orig_es) +
> +				orig_es.es_len - len2;
> +			ext4_es_store_pblock(&newes, block);
> +			ext4_es_store_status(&newes, ext4_es_status(&orig_es));
> +			err = __es_insert_extent(inode, &newes);
> +			if (err) {
> +				es->es_lblk = orig_es.es_lblk;
> +				es->es_len = orig_es.es_len;
> +				es->es_pblk = orig_es.es_pblk;
> +				goto out;
> +			}
> +
> +			conv_es.es_lblk = orig_es.es_lblk + len1;
> +			conv_es.es_len = orig_es.es_len - len1 - len2;
> +			block = ext4_es_pblock(&orig_es) + len1;
> +			ext4_es_store_pblock(&conv_es, block);
> +			ext4_es_store_status(&conv_es, EXTENT_STATUS_WRITTEN);
> +			err = __es_insert_extent(inode, &conv_es);
> +			if (err) {
> +				int err2 = __es_remove_extent(inode,
> +							conv_es.es_lblk,
> +							ext4_es_end(&newes));
> +				if (err2)
> +					goto out;
> +				es->es_lblk = orig_es.es_lblk;
> +				es->es_len = orig_es.es_len;
> +				es->es_pblk = orig_es.es_pblk;
> +				goto out;
> +			}
> +		} else {
> +			es->es_lblk = end + 1;
> +			es->es_len = len2;
> +			block = ext4_es_pblock(&orig_es) +
> +				orig_es.es_len - len2;
> +			ext4_es_store_pblock(es, block);
> +
> +			conv_es.es_lblk = orig_es.es_lblk;
> +			conv_es.es_len = orig_es.es_len - len2;
> +			ext4_es_store_pblock(&conv_es,
> +					     ext4_es_pblock(&orig_es));
> +			ext4_es_store_status(&conv_es, EXTENT_STATUS_WRITTEN);
> +			err = __es_insert_extent(inode, &conv_es);
> +			if (err) {
> +				es->es_lblk = orig_es.es_lblk;
> +				es->es_len = orig_es.es_len;
> +				es->es_pblk = orig_es.es_pblk;
> +			}
> +		}
> +		goto out;
> +	}
> +
> +	if (len1 > 0) {
> +		node = rb_next(&es->rb_node);
> +		if (node)
> +			es = rb_entry(node, struct extent_status, rb_node);
> +		else
> +			es = NULL;
> +	}
> +
> +	while (es && ext4_es_end(es) <= end) {
> +		node = rb_next(&es->rb_node);
> +		ext4_es_store_status(es, EXTENT_STATUS_WRITTEN);
> +		if (!node) {
> +			es = NULL;
> +			break;
> +		}
> +		es = rb_entry(node, struct extent_status, rb_node);
> +	}
> +
> +	if (es && es->es_lblk < end + 1) {
> +		ext4_lblk_t orig_len = es->es_len;
> +
> +		/*
> +		 * Here we first set conv_es just because of avoiding copy the
> +		 * value of es to a temporary variable.
> +		 */
> +		len1 = ext4_es_end(es) - end;
> +		conv_es.es_lblk = es->es_lblk;
> +		conv_es.es_len = es->es_len - len1;
> +		ext4_es_store_pblock(&conv_es, ext4_es_pblock(es));
> +		ext4_es_store_status(&conv_es, EXTENT_STATUS_WRITTEN);
> +
> +		es->es_lblk = end + 1;
> +		es->es_len = len1;
> +		block = ext4_es_pblock(es) + orig_len - len1;
> +		ext4_es_store_pblock(es, block);
> +
> +		err = __es_insert_extent(inode, &conv_es);
> +		if (err)
> +			goto out;
> +	}
> +
> +out:
> +	write_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
> +	return err;
> +}
  Is this really needed? Why don't you just use ext4_es_insert_extent() to
insert a new extent of the proper type? Also, the way you wrote it, we can
return (freshly written) data to the user, then reclaim the extent status
from memory, and later return 0s because we read the original status from
disk (the conversion still hasn't happened on disk). That would certainly
be confusing.
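
The sequence described above can be replayed in a toy userspace model. This is only a sketch under stated assumptions: a one-block "file" with an on-disk status and an optional cached status, and every name here (toy_file, toy_lookup, and so on) is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_status { TOY_UNWRITTEN, TOY_WRITTEN };

/* One-block toy file: on-disk status plus an optional cached copy. */
struct toy_file {
	enum toy_status on_disk;
	bool cached;
	enum toy_status in_cache;
};

/* Cache first; on a miss, fall back to disk and repopulate the cache. */
static enum toy_status toy_lookup(struct toy_file *f)
{
	if (!f->cached) {
		f->in_cache = f->on_disk;
		f->cached = true;
	}
	return f->in_cache;
}

/* What end_io does in this model: convert only the cached copy. */
static void toy_convert_in_cache(struct toy_file *f)
{
	if (f->cached)
		f->in_cache = TOY_WRITTEN;
}

/* Memory pressure: the cached entry is dropped. */
static void toy_reclaim(struct toy_file *f)
{
	f->cached = false;
}

/* The deferred on-disk conversion. */
static void toy_convert_on_disk(struct toy_file *f)
{
	f->on_disk = TOY_WRITTEN;
}

/* Returns true if a second read goes stale after reclaim. */
static bool toy_replay_hazard(void)
{
	struct toy_file f = { TOY_UNWRITTEN, true, TOY_UNWRITTEN };

	toy_convert_in_cache(&f);
	assert(toy_lookup(&f) == TOY_WRITTEN);	/* first read sees the data */
	toy_reclaim(&f);			/* before on-disk conversion */
	return toy_lookup(&f) == TOY_UNWRITTEN;	/* second read sees zeros */
}
```

In this model the hazard disappears if toy_convert_on_disk() runs before toy_reclaim(), which is the ordering constraint the review asks for.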

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [PATCH 10/10 v5] ext4: remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-02-08  8:44 ` [PATCH 10/10 v5] ext4: remove bogus wait for unwritten extents in ext4_ind_direct_IO Zheng Liu
@ 2013-02-12 12:58   ` Jan Kara
  2013-02-15  7:14     ` Zheng Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2013-02-12 12:58 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o, Jan kara

On Fri 08-02-13 16:44:06, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> After converting unwritten extents in the extent status tree in end_io, we
> can safely remove this bogus wait without worrying about reading stale data,
> because we always look up a block mapping in the extent status tree first
> and the unwritten extents in the tree have already been converted by that
> time.
  But you have to make sure the extent for the range really is in the
status tree (you didn't in the previous patch I think) and cannot be
reclaimed until the conversion is performed on disk.

								Honza

> Before that commit, we needed to flush unwritten I/Os before a dio read
> when dioread_nolock is enabled, because in ext4_end_io_buffer_write and
> ext4_end_bio, end_page_writeback() is called before the unwritten extents
> are converted on disk.  So there is a window in which a dio reader will
> read stale data, as shown below, if we don't wait for unwritten extents:
> 
>    dio read                         buffered write
>                                     ->ext4_file_write
>                                       ->ext4_da_write_begin
>                                       ->ext4_da_write_end
>                                       [buffered write has finished, but
>                                        the data and metadata has not
>                                        been flushed]
>    ->generic_file_aio_read
>      ->filemap_write_and_wait_range
>        ->do_writepages
>          ->ext4_da_writepages
>      ->filemap_fdatawait_range
>        ->wait_on_page_writeback
>                                     ->ext4_end_bio
>                                       ->end_page_writeback
>                                         [unwritten extent has not been
>                                          converted]
>      ->ext4_ind_direct_IO
>        [here we need to flush unwritten io]
> 
> After that commit, we never need to wait for unwritten extents.
> 
>    dio read                         buffered write
>                                     ->ext4_file_write
>                                       ->ext4_da_write_begin
>                                       ->ext4_da_write_end
>                                       [buffered write has finished, but
>                                        the data and metadata has not
>                                        been flushed]
>    ->generic_file_aio_read
>      ->filemap_write_and_wait_range
>        ->do_writepages
>          ->ext4_da_writepages
>      ->filemap_fdatawait_range
>        ->wait_on_page_writeback
>                                     ->ext4_end_bio
>                                       ->ext4_es_convert_unwritten_extents
>                                       ->end_page_writeback
>                                         [unwritten extent has not been
>                                          converted in disk, but they are
>                                          converted in extent status tree]
>      ->ext4_ind_direct_IO
>        [here we will see the written
>         extents in extent status tree]
> 
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Jan kara <jack@suse.cz>
> ---
>  fs/ext4/indirect.c | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 20862f9..993247c 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -807,11 +807,6 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>  
>  retry:
>  	if (rw == READ && ext4_should_dioread_nolock(inode)) {
> -		if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
> -			mutex_lock(&inode->i_mutex);
> -			ext4_flush_unwritten_io(inode);
> -			mutex_unlock(&inode->i_mutex);
> -		}
>  		/*
>  		 * Nolock dioread optimization may be dynamically disabled
>  		 * via ext4_inode_block_unlocked_dio(). Check inode's state
> -- 
> 1.7.12.rc2.18.g61b472e
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [PATCH 04/10 v5] ext4: track all extent status in extent status tree
  2013-02-08  8:44 ` [PATCH 04/10 v5] ext4: track all extent status in extent status tree Zheng Liu
  2013-02-11 12:21   ` Jan Kara
@ 2013-02-13  3:28   ` Theodore Ts'o
  2013-02-13  3:46     ` [PATCH 1/2] ext4: rename ext4_es_find_extent() to ext4_es_find_delayed_extent() Theodore Ts'o
                       ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Theodore Ts'o @ 2013-02-13  3:28 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Jan kara

On Fri, Feb 08, 2013 at 04:44:00PM +0800, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> By recording the physical block and status, the extent status tree is able
> to track the status of every extent.  When we call the _map_blocks
> functions to look up an extent or create a new written/unwritten/delayed
> extent, this extent will be inserted into the extent status tree.  The hole
> extent is inserted in ext4_ext_put_gap_in_cache().  If there are no
> extents at all, we will not insert a hole extent [0, ~0] into the extent
> status tree, in order to reduce the complexity of the code.
> 
> We don't load all extents from disk in alloc_inode() because it costs
> too much memory, and if a file is opened and closed frequently it will
> take too much time to load all the extent information.  So currently when
> we create or look up an extent, this extent will be inserted into the
> extent status tree.  Hence, the extent status tree may not comprehensively
> contain all of the extents found in the file.
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Jan kara <jack@suse.cz>

Unfortunately, this commit is apparently causing test failures with
bigalloc:

--- 013.out	2013-01-01 22:52:04.000000000 -0500
+++ 013.out.bad	2013-02-12 22:08:47.110766615 -0500
@@ -8,7 +8,4 @@
 -----------------------------------------------
 fsstress.2 : -p 20 -r
 -----------------------------------------------
-
------------------------------------------------
-fsstress.3 : -p 4 -z -f rmdir=10 -f link=10 -f creat=10 -f mkdir=10 -f rename=30 -f stat=30 -f unlink=30 -f truncate=20
------------------------------------------------
+_check_generic_filesystem: filesystem on /dev/vdd is inconsistent (see 013.full)
_check_generic_filesystem: filesystem on /dev/vdd is inconsistent (see 013.full)
Ran: 013
Failures: 013
Failed 1 of 1 tests
END TEST: Ext4 4k block w/bigalloc Tue Feb 12 22:08:49 EST 2013
e2fsck 1.43-WIP (15-Jan-2013)
Pass 1: Checking inodes, blocks, and sizes
Inode 618, i_blocks is 1408, should be 1536.  Fix? yes

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vdd: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vdd: 3969/81936 files (13.1% non-contiguous), 176208/1310720 blocks


I haven't been able to figure out what is going on here, but if we
can't figure this out I may need to push off this patch series to the
next merge window.  I've tried splitting up this patch into two pieces
to make it clearer what is going on, but I still can't see how this
would be affecting the i_blocks calculation.
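
For scale, the e2fsck report above differs by 1536 - 1408 = 128 in i_blocks. The helper below only converts that gap into filesystem blocks; it is arithmetic on the reported numbers, not a diagnosis. It assumes i_blocks is counted in 512-byte units (the usual case when the inode does not carry the huge-file flag), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a count of 512-byte units into filesystem blocks. */
static uint64_t sectors_to_fs_blocks(uint64_t sectors, uint32_t block_size)
{
	assert(block_size >= 512 && block_size % 512 == 0);
	return sectors / (block_size / 512);
}
```

With a 4k block size, sectors_to_fs_blocks(128, 4096) is 16 blocks, i.e. 64 KiB. That would be exactly one cluster if the test filesystem uses a 16-block bigalloc cluster, which is an assumption here since the report does not state the cluster size.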

							- Ted


					

* [PATCH 1/2] ext4: rename ext4_es_find_extent() to ext4_es_find_delayed_extent()
  2013-02-13  3:28   ` Theodore Ts'o
@ 2013-02-13  3:46     ` Theodore Ts'o
  2013-02-13  3:46       ` [PATCH 2/2] ext4: track all extent status in extent status tree Theodore Ts'o
  2013-02-15  6:53     ` [PATCH 04/10 v5] " Zheng Liu
  2013-02-17 16:26     ` Zheng Liu
  2 siblings, 1 reply; 37+ messages in thread
From: Theodore Ts'o @ 2013-02-13  3:46 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Zheng Liu, Theodore Ts'o

From: Zheng Liu <wenqing.lz@taobao.com>

Rename ext4_es_find_extent() to ext4_es_find_delayed_extent() to make
the purpose of this function clearer.

[ This was originally part of the next commit, but I separated this
  out to make it easier to review and debug the next commit. -- Ted ]

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/extents.c           | 4 ++--
 fs/ext4/extents_status.c    | 9 +++++----
 fs/ext4/extents_status.h    | 4 ++--
 fs/ext4/file.c              | 4 ++--
 include/trace/events/ext4.h | 4 ++--
 5 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 06caa54..9802d64 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3529,7 +3529,7 @@ static int ext4_find_delalloc_range(struct inode *inode,
 	struct extent_status es;
 
 	es.es_lblk = lblk_start;
-	(void)ext4_es_find_extent(inode, &es);
+	(void)ext4_es_find_delayed_extent(inode, &es);
 	if (es.es_len == 0)
 		return 0; /* there is no delay extent in this tree */
 	else if (es.es_lblk <= lblk_start &&
@@ -4575,7 +4575,7 @@ static int ext4_find_delayed_extent(struct inode *inode,
 	ext4_lblk_t next_del;
 
 	es.es_lblk = newex->ec_block;
-	next_del = ext4_es_find_extent(inode, &es);
+	next_del = ext4_es_find_delayed_extent(inode, &es);
 
 	if (newex->ec_start == 0) {
 		/*
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e7e1622..df72e95 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -229,7 +229,7 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
 }
 
 /*
- * ext4_es_find_extent: find the 1st delayed extent covering @es->lblk
+ * ext4_es_find_delayed_extent: find the 1st delayed extent covering @es->lblk
  * if it exists, otherwise, the next extent after @es->lblk.
  *
  * @inode: the inode which owns delayed extents
@@ -239,14 +239,15 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
  * EXT_MAX_BLOCKS if no extent is found.
  * Delayed extent is returned via @es.
  */
-ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
+ext4_lblk_t ext4_es_find_delayed_extent(struct inode *inode,
+					struct extent_status *es)
 {
 	struct ext4_es_tree *tree = NULL;
 	struct extent_status *es1 = NULL;
 	struct rb_node *node;
 	ext4_lblk_t ret = EXT_MAX_BLOCKS;
 
-	trace_ext4_es_find_extent_enter(inode, es->es_lblk);
+	trace_ext4_es_find_delayed_extent_enter(inode, es->es_lblk);
 
 	read_lock(&EXT4_I(inode)->i_es_lock);
 	tree = &EXT4_I(inode)->i_es_tree;
@@ -280,7 +281,7 @@ out:
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
 
-	trace_ext4_es_find_extent_exit(inode, es, ret);
+	trace_ext4_es_find_delayed_extent_exit(inode, es, ret);
 	return ret;
 }
 
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 2a5d69e..b5788eb 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -51,8 +51,8 @@ extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 				 unsigned long long status);
 extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
-extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
-				struct extent_status *es);
+extern ext4_lblk_t ext4_es_find_delayed_extent(struct inode *inode,
+					       struct extent_status *es);
 
 static inline int ext4_es_is_written(struct extent_status *es)
 {
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 209802b..d2df517 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -465,7 +465,7 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
 		 * it will be as a data.
 		 */
 		es.es_lblk = last;
-		(void)ext4_es_find_extent(inode, &es);
+		(void)ext4_es_find_delayed_extent(inode, &es);
 		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
 			if (last != start)
 				dataoff = last << blkbits;
@@ -549,7 +549,7 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
 		 * we will skip this extent.
 		 */
 		es.es_lblk = last;
-		(void)ext4_es_find_extent(inode, &es);
+		(void)ext4_es_find_delayed_extent(inode, &es);
 		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
 			last = es.es_lblk + es.es_len;
 			holeoff = last << blkbits;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 0ee507f..0f30a8e 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2147,7 +2147,7 @@ TRACE_EVENT(ext4_es_remove_extent,
 		  __entry->lblk, __entry->len)
 );
 
-TRACE_EVENT(ext4_es_find_extent_enter,
+TRACE_EVENT(ext4_es_find_delayed_extent_enter,
 	TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
 
 	TP_ARGS(inode, lblk),
@@ -2169,7 +2169,7 @@ TRACE_EVENT(ext4_es_find_extent_enter,
 		  (unsigned long) __entry->ino, __entry->lblk)
 );
 
-TRACE_EVENT(ext4_es_find_extent_exit,
+TRACE_EVENT(ext4_es_find_delayed_extent_exit,
 	TP_PROTO(struct inode *inode, struct extent_status *es,
 		 ext4_lblk_t ret),
 
-- 
1.7.12.rc0.22.gcdd159b


* [PATCH 2/2] ext4: track all extent status in extent status tree
  2013-02-13  3:46     ` [PATCH 1/2] ext4: rename ext4_es_find_extent() to ext4_es_find_delayed_extent() Theodore Ts'o
@ 2013-02-13  3:46       ` Theodore Ts'o
  0 siblings, 0 replies; 37+ messages in thread
From: Theodore Ts'o @ 2013-02-13  3:46 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Zheng Liu, Theodore Ts'o, Jan kara

From: Zheng Liu <wenqing.lz@taobao.com>

By recording the physical block and status, the extent status tree is able
to track the status of every extent.  When we call the _map_blocks
functions to look up an extent or create a new written/unwritten/delayed
extent, this extent will be inserted into the extent status tree.  The hole
extent is inserted in ext4_ext_put_gap_in_cache().  If there are no
extents at all, we will not insert a hole extent [0, ~0] into the extent
status tree, in order to reduce the complexity of the code.

We don't load all extents from disk in alloc_inode() because it costs
too much memory, and if a file is opened and closed frequently it will
take too much time to load all the extent information.  So currently when
we create or look up an extent, this extent will be inserted into the
extent status tree.  Hence, the extent status tree may not comprehensively
contain all of the extents found in the file.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
---
 fs/ext4/extents_status.c | 18 ++++++++++++----
 fs/ext4/inode.c          | 53 ++++++++++++++++++++++++++++++------------------
 2 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index df72e95..839a91d 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -267,15 +267,25 @@ ext4_lblk_t ext4_es_find_delayed_extent(struct inode *inode,
 	es1 = __es_tree_search(&tree->root, es->es_lblk);
 
 out:
-	if (es1) {
+	if (es1 && !ext4_es_is_delayed(es1)) {
+		while ((node = rb_next(&es1->rb_node)) != NULL) {
+			es1 = rb_entry(node, struct extent_status, rb_node);
+			if (ext4_es_is_delayed(es1))
+				break;
+		}
+	}
+
+	if (es1 && ext4_es_is_delayed(es1)) {
 		tree->cache_es = es1;
 		es->es_lblk = es1->es_lblk;
 		es->es_len = es1->es_len;
 		es->es_pblk = es1->es_pblk;
-		node = rb_next(&es1->rb_node);
-		if (node) {
+		while ((node = rb_next(&es1->rb_node)) != NULL) {
 			es1 = rb_entry(node, struct extent_status, rb_node);
-			ret = es1->es_lblk;
+			if (ext4_es_is_delayed(es1)) {
+				ret = es1->es_lblk;
+				break;
+			}
 		}
 	}
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd75f65..01875fe 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -524,20 +524,22 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		retval = ext4_ind_map_blocks(handle, inode, map, flags &
 					     EXT4_GET_BLOCKS_KEEP_SIZE);
 	}
+	if (retval > 0) {
+		int ret;
+		unsigned long long status;
+
+		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+		ret = ext4_es_insert_extent(inode, map->m_lblk,
+					    map->m_len, map->m_pblk, status);
+		if (ret < 0)
+			retval = ret;
+	}
 	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
 		up_read((&EXT4_I(inode)->i_data_sem));
 
 	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
-		int ret;
-		if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
-			/* delayed alloc may be allocated by fallocate and
-			 * coverted to initialized by directIO.
-			 * we need to handle delayed extent here.
-			 */
-			down_write((&EXT4_I(inode)->i_data_sem));
-			goto delayed_mapped;
-		}
-		ret = check_block_validity(inode, map);
+		int ret = check_block_validity(inode, map);
 		if (ret != 0)
 			return ret;
 	}
@@ -606,18 +608,19 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE))
 			ext4_da_update_reserve_space(inode, retval, 1);
 	}
-	if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
+	if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
 		ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
 
-		if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
-			int ret;
-delayed_mapped:
-			/* delayed allocation blocks has been allocated */
-			ret = ext4_es_remove_extent(inode, map->m_lblk,
-						    map->m_len);
-			if (ret < 0)
-				retval = ret;
-		}
+	if (retval > 0) {
+		int ret;
+		unsigned long long status;
+
+		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					    map->m_pblk, status);
+		if (ret < 0)
+			retval = ret;
 	}
 
 	up_write((&EXT4_I(inode)->i_data_sem));
@@ -1787,6 +1790,16 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		map_bh(bh, inode->i_sb, invalid_block);
 		set_buffer_new(bh);
 		set_buffer_delay(bh);
+	} else if (retval > 0) {
+		int ret;
+		unsigned long long status;
+
+		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					    map->m_pblk, status);
+		if (ret != 0)
+			retval = ret;
 	}
 
 out_unlock:
-- 
1.7.12.rc0.22.gcdd159b


* Re: [PATCH 01/10 v5] ext4: refine extent status tree
  2013-02-08 15:35   ` Jan Kara
@ 2013-02-15  6:38     ` Zheng Liu
  0 siblings, 0 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-15  6:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o

On Fri, Feb 08, 2013 at 04:35:00PM +0100, Jan Kara wrote:
[snip]
> > -static int __es_insert_extent(struct ext4_es_tree *tree, ext4_lblk_t offset,
> > -			      ext4_lblk_t len)
> > +static int __es_insert_extent(struct ext4_es_tree *tree,
> > +			      struct extent_status *newes)
> >  {
> >  	struct rb_node **p = &tree->root.rb_node;
> >  	struct rb_node *parent = NULL;
> >  	struct extent_status *es;
> > -	ext4_lblk_t end = offset + len - 1;
> > -
> > -	BUG_ON(end < offset);
> > -	es = tree->cache_es;
> > -	if (es && offset == (extent_status_end(es) + 1)) {
> > -		es_debug("cached by [%u/%u)\n", es->start, es->len);
> > -		es->len += len;
> > -		es = ext4_es_try_to_merge_right(tree, es);
> > -		goto out;
> > -	} else if (es && es->start == end + 1) {
> > -		es_debug("cached by [%u/%u)\n", es->start, es->len);
> > -		es->start = offset;
> > -		es->len += len;
> > -		es = ext4_es_try_to_merge_left(tree, es);
> > -		goto out;
> > -	} else if (es && es->start <= offset &&
> > -		   end <= extent_status_end(es)) {
> > -		es_debug("cached by [%u/%u)\n", es->start, es->len);
> > -		goto out;
> > -	}
> >  
> >  	while (*p) {
> >  		parent = *p;
> >  		es = rb_entry(parent, struct extent_status, rb_node);
> >  
> > -		if (offset < es->start) {
> > -			if (es->start == end + 1) {
> > -				es->start = offset;
> > -				es->len += len;
> > +		if (newes->es_lblk < es->es_lblk) {
> > +			if (ext4_es_can_be_merged(newes, es)) {
> > +				es->es_lblk = newes->es_lblk;
> > +				es->es_len += newes->es_len;
>   This is wrong, isn't it? You cannot change es->es_lblk because that can
> break ordering of elements in the tree... thinking ... ah, it's OK because
> you have non-overlapping intervals. But it deserves a comment I guess.

Hi Jan,

Sorry for the delayed reply.  I will add a comment here to describe why
es_lblk can be changed directly.
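
The reasoning can be checked on a toy model. Because the new extent overlaps nothing already in the tree, it starts at or after the end of the predecessor, so lowering the key of the extent it is merged into cannot move that key past the predecessor. A minimal sketch, using a sorted array in place of the rb-tree (toy_range, toy_sorted_disjoint and toy_merge_front are hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy extent: [lblk, lblk + len). */
struct toy_range {
	uint32_t lblk;
	uint32_t len;
};

/* True if the array is sorted by lblk and no two ranges overlap. */
static bool toy_sorted_disjoint(const struct toy_range *v, size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (v[i - 1].lblk + v[i - 1].len > v[i].lblk)
			return false;
	return true;
}

/*
 * Front-merge: newes ends exactly where v[i] begins, so extend v[i]
 * downwards in place, changing its key without re-inserting it.
 */
static void toy_merge_front(struct toy_range *v, size_t i,
			    struct toy_range newes)
{
	assert(newes.lblk + newes.len == v[i].lblk);
	v[i].lblk = newes.lblk;
	v[i].len += newes.len;
}
```

Merging [7, 10) into a tree holding [0, 4), [10, 15), [20, 22) turns the middle entry into [7, 15) and leaves the ordering intact.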

Thanks,
                                                - Zheng

* Re: [PATCH 04/10 v5] ext4: track all extent status in extent status tree
  2013-02-11 12:21   ` Jan Kara
@ 2013-02-15  6:45     ` Zheng Liu
  0 siblings, 0 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-15  6:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o

On Mon, Feb 11, 2013 at 01:21:27PM +0100, Jan Kara wrote:
> On Fri 08-02-13 16:44:00, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > By recording the physical block and status, the extent status tree is able
> > to track the status of every extent.  When we call the _map_blocks
> > functions to look up an extent or create a new written/unwritten/delayed
> > extent, this extent will be inserted into the extent status tree.  The hole
> > extent is inserted in ext4_ext_put_gap_in_cache().  If there are no
> > extents at all, we will not insert a hole extent [0, ~0] into the extent
> > status tree, in order to reduce the complexity of the code.
> > 
> > We don't load all extents from disk in alloc_inode() because it costs
> > too much memory, and if a file is opened and closed frequently it will
> > take too much time to load all the extent information.  So currently when
> > we create or look up an extent, this extent will be inserted into the
> > extent status tree.  Hence, the extent status tree may not comprehensively
> > contain all of the extents found in the file.
> > 
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > Cc: Jan kara <jack@suse.cz>
> > ---
> >  fs/ext4/extents.c           |  4 +--
> >  fs/ext4/extents_status.c    | 27 ++++++++++++------
> >  fs/ext4/extents_status.h    |  4 +--
> >  fs/ext4/file.c              |  4 +--
> >  fs/ext4/inode.c             | 68 ++++++++++++++++++++++++++++-----------------
> >  include/trace/events/ext4.h |  4 +--
> >  6 files changed, 70 insertions(+), 41 deletions(-)
> > 
> ...
> > diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> > index 5093cee..71cb75a 100644
> > --- a/fs/ext4/extents_status.c
> > +++ b/fs/ext4/extents_status.c
> > @@ -239,14 +239,15 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
> >   * EXT_MAX_BLOCKS if no extent is found.
> >   * Delayed extent is returned via @es.
> >   */
> > -ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
> > +ext4_lblk_t ext4_es_find_delayed_extent(struct inode *inode,
> > +					struct extent_status *es)
> >  {
>   I have to say I'm still not very happy about this function (but it's much
> better than it used to be so thanks for that!). I have two suggestions for
> improvement:
> 1) 'es' is both input and output argument where for input only es_lblk is
> used. That's a bit confusing so how about making the function like:
> 
> ext4_es_find_delayed_extent(struct inode *inode, ext4_lblk_t offset,
> 			    struct extent_status *out);
> 
>   to separate input and output? Also you can comment that we use the 'out'
> parameter instead of returning the extent_status from the tree because that
> can be freed once we drop the spinlock protecting status tree.
> 
> 2) The returned value is somewhat surprisingly the logical offset of the
> *next* delalloc extent. It's used only in ext4_fill_fiemap_extents()
> AFAICS. It would be easier to understand if the function didn't return
> anything. ext4_fill_fiemap_extents() would use
> ext4_es_find_delayed_extent() to find both current and next delalloc extent
> (which would become the 'current' one in the next iteration). As a bonus
> you would also save some iteration of the extent status tree...

I have seen that Ted has sent out two patches that split this patch.  I
will pick them up and polish them according to your comments.  But
before this, I need to fix the problem that triggers a failure of
xfstests #13 with bigalloc.
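
To make sure I understand the two suggestions, here is a rough userspace
sketch of the interface shape (separate input offset and output struct, no
return value, the caller iterating by itself).  It is only an illustration:
the toy_* names and the sorted array standing in for the rb-tree are
assumptions, not ext4 code.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct extent_status; 'delayed' replaces the status bits. */
struct toy_es {
	uint32_t lblk;
	uint32_t len;
	int delayed;
};

/*
 * Find the first delayed extent that ends at or after 'lblk'.  The result
 * is copied into 'out' rather than returned as a pointer into the tree,
 * because a tree entry may be freed once the lock protecting the tree is
 * dropped.  out->len == 0 means "no delayed extent found".
 * Assumes 'tree' is sorted by lblk and every entry has len > 0.
 */
static void toy_find_delayed_extent(const struct toy_es *tree, size_t n,
				    uint32_t lblk, struct toy_es *out)
{
	size_t i;

	out->lblk = 0;
	out->len = 0;
	out->delayed = 0;

	for (i = 0; i < n; i++) {
		if (!tree[i].delayed)
			continue;
		if (tree[i].lblk + tree[i].len - 1 < lblk)
			continue;	/* ends before the offset we care about */
		*out = tree[i];
		return;
	}
}

/* Caller-side iteration: the extent found now tells us where to look next. */
static size_t toy_count_delayed(const struct toy_es *tree, size_t n,
				uint32_t start)
{
	struct toy_es es;
	size_t count = 0;
	uint32_t next = start;

	for (;;) {
		toy_find_delayed_extent(tree, n, next, &es);
		if (es.len == 0)
			break;
		count++;
		next = es.lblk + es.len;	/* first block after this extent */
	}
	return count;
}
```

With this shape the fiemap loop would keep the "current" extent in its own
variable and ask for the next one only when it needs it.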

Thanks,
                                                - Zheng


* Re: [PATCH 04/10 v5] ext4: track all extent status in extent status tree
  2013-02-13  3:28   ` Theodore Ts'o
  2013-02-13  3:46     ` [PATCH 1/2] ext4: rename ext4_es_find_extent() to ext4_es_find_delayed_extent() Theodore Ts'o
@ 2013-02-15  6:53     ` Zheng Liu
  2013-02-17 16:26     ` Zheng Liu
  2 siblings, 0 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-15  6:53 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Zheng Liu, Jan kara

On Tue, Feb 12, 2013 at 10:28:19PM -0500, Theodore Ts'o wrote:
> On Fri, Feb 08, 2013 at 04:44:00PM +0800, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > By recording the physical block and status, the extent status tree is
> > able to track the status of every extent.  When we call the _map_blocks
> > functions to look up an extent or create a new written/unwritten/delayed
> > extent, this extent will be inserted into the extent status tree.  The
> > hole extent is inserted in ext4_ext_put_gap_in_cache().  If there are no
> > extents at all, we will not insert a hole extent [0, ~0] into the extent
> > status tree in order to reduce the complexity of the code.
> > 
> > We don't load all extents from disk in alloc_inode() because it costs
> > too much memory, and if a file is opened and closed frequently it will
> > take too much time to load all extent information.  So currently when
> > we create/lookup an extent, this extent will be inserted into extent
> > status tree.  Hence, the extent status tree may not comprehensively
> > contain all of the extents found in the file.
> > 
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > Cc: Jan kara <jack@suse.cz>
> 
> Unfortunately, this commit is apparently causing test failures with
> bigalloc:
> 
> --- 013.out	2013-01-01 22:52:04.000000000 -0500
> +++ 013.out.bad	2013-02-12 22:08:47.110766615 -0500
> @@ -8,7 +8,4 @@
>  -----------------------------------------------
>  fsstress.2 : -p 20 -r
>  -----------------------------------------------
> -
> ------------------------------------------------
> -fsstress.3 : -p 4 -z -f rmdir=10 -f link=10 -f creat=10 -f mkdir=10 -f rename=30 -f stat=30 -f unlink=30 -f truncate=20
> ------------------------------------------------
> +_check_generic_filesystem: filesystem on /dev/vdd is inconsistent (see 013.full)
> _check_generic_filesystem: filesystem on /dev/vdd is inconsistent (see 013.full)
> Ran: 013
> Failures: 013
> Failed 1 of 1 tests
> END TEST: Ext4 4k block w/bigalloc Tue Feb 12 22:08:49 EST 2013
> e2fsck 1.43-WIP (15-Jan-2013)
> Pass 1: Checking inodes, blocks, and sizes
> Inode 618, i_blocks is 1408, should be 1536.  Fix? yes
> 
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> 
> /dev/vdd: ***** FILE SYSTEM WAS MODIFIED *****
> /dev/vdd: 3969/81936 files (13.1% non-contiguous), 176208/1310720 blocks
> 
> 
> I haven't been able to figure out what is going on here, but if we
> can't figure this out I may need to push off this patch series to the
> next merge window.  I've tried splitting up this patch into two pieces
> to make it clearer what is going on, but I still can't see how this
> would be affecting the i_blocks calculation.

Hi Ted,

Oops, I ran xfstests #13 several times and this bug can be triggered.
Sorry about that.  I will look at it, but I am not sure whether it can
be fixed before the merge window opens.  Thanks for letting me know.

For this bug, my guess is that the problem is in ext4_da_invalidatepages(),
because this function is called when a file is truncated.  But it seems
that the delayed-allocation space reservation is the root cause.  Let me
trace it.

Regards,
                                                - Zheng


* Re: [PATCH 05/10 v5] ext4: lookup block mapping in extent status tree
  2013-02-12 12:31   ` Jan Kara
@ 2013-02-15  7:06     ` Zheng Liu
  2013-02-15 16:47       ` Jan Kara
  2013-02-15 17:25       ` Theodore Ts'o
  0 siblings, 2 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-15  7:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o

On Tue, Feb 12, 2013 at 01:31:42PM +0100, Jan Kara wrote:
> On Fri 08-02-13 16:44:01, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > After tracking all extent status, we already have an extent cache in
> > memory.  Every time we want to lookup a block mapping, we can first
> > try to lookup it in extent status tree to avoid a potential disk I/O.
> > 
> > A new function called ext4_es_lookup_extent is defined to finish this
> > work.  When we try to lookup a block mapping, we always call
> > ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
> > first try to lookup a block mapping in extent status tree.
> > 
> > A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
> > in order not to put a hole into extent status tree because this hole
> > will be converted to delayed extent in the tree immediately.
>   It looks somewhat inconsistent that you put hole into the extent tree in
> ext4_ext_map_blocks() but all other extent types are handled in
> ext4_map_blocks() or ext4_da_map_blocks(). Can we put the handling in one
> place?

It seems that handling everything in one place is too complex because
ext4_da_map_blocks() calls ext4_ext_map_blocks() and ext4_ind_map_blocks()
directly.  So now we insert all extents except holes in ext4_map_blocks()
and ext4_da_map_blocks().  A hole is inserted into the status tree in
ext4_ext_put_gap_in_cache(), because in that function we know the length
of the hole.  If we handled it in ext4_da_map_blocks() or
ext4_map_blocks(), we could only insert a hole of length 1, because those
functions do not know the length of the hole.
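
A small userspace sketch of the difference, only to illustrate the
argument.  The toy_* names and the sorted extent array are made up for
this example and are not the real ext4 structures: with the neighbouring
extents in hand we can cache the whole gap, while a caller that only sees
one unmapped block can cache a single block.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_BLOCKS 0xffffffffu	/* toy upper bound on logical blocks */

/* Toy mapped extent covering [lblk, lblk + len). */
struct toy_extent {
	uint32_t lblk;
	uint32_t len;
};

/* A hole to be cached: [lblk, lblk + len); len == 0 means "not a hole". */
struct toy_hole {
	uint32_t lblk;
	uint32_t len;
};

/*
 * What the extent-tree walk can compute: the full gap around 'lblk',
 * bounded by the previous and the next mapped extent.
 * Assumes 'ext' is sorted by lblk and extents do not overlap.
 */
static struct toy_hole toy_find_hole(const struct toy_extent *ext, size_t n,
				     uint32_t lblk)
{
	struct toy_hole hole = { 0, 0 };
	uint32_t start = 0;
	uint32_t end = TOY_MAX_BLOCKS;	/* exclusive */
	size_t i;

	for (i = 0; i < n; i++) {
		uint32_t e_end = ext[i].lblk + ext[i].len;

		if (lblk < ext[i].lblk) {
			end = ext[i].lblk;	/* gap stops at the next extent */
			break;
		}
		if (lblk < e_end)
			return hole;		/* block is mapped, no hole */
		start = e_end;			/* gap starts after this extent */
	}
	hole.lblk = start;
	hole.len = end - start;
	return hole;
}

/* What a caller that only knows "this block is unmapped" can record. */
static struct toy_hole toy_single_block_hole(uint32_t lblk)
{
	struct toy_hole hole = { lblk, 1 };
	return hole;
}
```

For extents [10, 15) and [30, 40), a lookup of block 20 gives a hole of 15
blocks starting at 15 in the first case and a hole of 1 block in the
second.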

I am planning to refine the get_block_t and *map_blocks functions.  At
that time I will try to fix this problem.

Thanks,
                                                - Zheng


* Re: [PATCH 09/10 v5] ext4: convert unwritten extents from extent status tree in end_io
  2013-02-12 12:51   ` Jan Kara
@ 2013-02-15  7:12     ` Zheng Liu
  0 siblings, 0 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-15  7:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o

On Tue, Feb 12, 2013 at 01:51:59PM +0100, Jan Kara wrote:
> On Fri 08-02-13 16:44:05, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > This commit tries to convert unwritten extents from extent status tree
> > in end_io callback functions and ext4_ext_direct_IO.
>   Why should we do this?
> 
> ...
> > @@ -801,3 +807,147 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> >  	tree->cache_es = NULL;
> >  	return nr_shrunk;
> >  }
> > +
> > +int ext4_es_convert_unwritten_extents(struct inode *inode, loff_t offset,
> > +				      size_t size)
> > +{
> > +	struct ext4_es_tree *tree;
> > +	struct rb_node *node;
> > +	struct extent_status *es, orig_es, conv_es;
> > +	ext4_lblk_t end, len1, len2;
> > +	ext4_lblk_t lblk = 0, len = 0;
> > +	ext4_fsblk_t block;
> > +	unsigned long flags;
> > +	unsigned int blkbits;
> > +	int err = 0;
> > +
> > +	trace_ext4_es_convert_unwritten_extents(inode, offset, size);
> > +	blkbits = inode->i_blkbits;
> > +	lblk = offset >> blkbits;
> > +	len = (EXT4_BLOCK_ALIGN(offset + size, blkbits) >> blkbits) - lblk;
> > +
> > +	end = lblk + len - 1;
> > +	BUG_ON(end < lblk);
> > +
> > +	tree = &EXT4_I(inode)->i_es_tree;
> > +
> > +	write_lock_irqsave(&EXT4_I(inode)->i_es_lock, flags);
> > +	es = __es_tree_search(&tree->root, lblk);
> > +	if (!es)
> > +		goto out;
> > +	if (es->es_lblk > end)
> > +		goto out;
> > +
> > +	tree->cache_es = NULL;
> > +
> > +	orig_es.es_lblk = es->es_lblk;
> > +	orig_es.es_len = es->es_len;
> > +	orig_es.es_pblk = es->es_pblk;
> > +
> > +	len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
> > +	len2 = ext4_es_end(es) > end ?
> > +	       ext4_es_end(es) - end : 0;
> > +	if (len1 > 0)
> > +		es->es_len = len1;
> > +	if (len2 > 0) {
> > +		if (len1 > 0) {
> > +			struct extent_status newes;
> > +
> > +			newes.es_lblk = end + 1;
> > +			newes.es_len = len2;
> > +			block = ext4_es_pblock(&orig_es) +
> > +				orig_es.es_len - len2;
> > +			ext4_es_store_pblock(&newes, block);
> > +			ext4_es_store_status(&newes, ext4_es_status(&orig_es));
> > +			err = __es_insert_extent(inode, &newes);
> > +			if (err) {
> > +				es->es_lblk = orig_es.es_lblk;
> > +				es->es_len = orig_es.es_len;
> > +				es->es_pblk = orig_es.es_pblk;
> > +				goto out;
> > +			}
> > +
> > +			conv_es.es_lblk = orig_es.es_lblk + len1;
> > +			conv_es.es_len = orig_es.es_len - len1 - len2;
> > +			block = ext4_es_pblock(&orig_es) + len1;
> > +			ext4_es_store_pblock(&conv_es, block);
> > +			ext4_es_store_status(&conv_es, EXTENT_STATUS_WRITTEN);
> > +			err = __es_insert_extent(inode, &conv_es);
> > +			if (err) {
> > +				int err2 = __es_remove_extent(inode,
> > +							conv_es.es_lblk,
> > +							ext4_es_end(&newes));
> > +				if (err2)
> > +					goto out;
> > +				es->es_lblk = orig_es.es_lblk;
> > +				es->es_len = orig_es.es_len;
> > +				es->es_pblk = orig_es.es_pblk;
> > +				goto out;
> > +			}
> > +		} else {
> > +			es->es_lblk = end + 1;
> > +			es->es_len = len2;
> > +			block = ext4_es_pblock(&orig_es) +
> > +				orig_es.es_len - len2;
> > +			ext4_es_store_pblock(es, block);
> > +
> > +			conv_es.es_lblk = orig_es.es_lblk;
> > +			conv_es.es_len = orig_es.es_len - len2;
> > +			ext4_es_store_pblock(&conv_es,
> > +					     ext4_es_pblock(&orig_es));
> > +			ext4_es_store_status(&conv_es, EXTENT_STATUS_WRITTEN);
> > +			err = __es_insert_extent(inode, &conv_es);
> > +			if (err) {
> > +				es->es_lblk = orig_es.es_lblk;
> > +				es->es_len = orig_es.es_len;
> > +				es->es_pblk = orig_es.es_pblk;
> > +			}
> > +		}
> > +		goto out;
> > +	}
> > +
> > +	if (len1 > 0) {
> > +		node = rb_next(&es->rb_node);
> > +		if (node)
> > +			es = rb_entry(node, struct extent_status, rb_node);
> > +		else
> > +			es = NULL;
> > +	}
> > +
> > +	while (es && ext4_es_end(es) <= end) {
> > +		node = rb_next(&es->rb_node);
> > +		ext4_es_store_status(es, EXTENT_STATUS_WRITTEN);
> > +		if (!node) {
> > +			es = NULL;
> > +			break;
> > +		}
> > +		es = rb_entry(node, struct extent_status, rb_node);
> > +	}
> > +
> > +	if (es && es->es_lblk < end + 1) {
> > +		ext4_lblk_t orig_len = es->es_len;
> > +
> > +		/*
> > +		 * Here we first set conv_es just because of avoiding copy the
> > +		 * value of es to a temporary variable.
> > +		 */
> > +		len1 = ext4_es_end(es) - end;
> > +		conv_es.es_lblk = es->es_lblk;
> > +		conv_es.es_len = es->es_len - len1;
> > +		ext4_es_store_pblock(&conv_es, ext4_es_pblock(es));
> > +		ext4_es_store_status(&conv_es, EXTENT_STATUS_WRITTEN);
> > +
> > +		es->es_lblk = end + 1;
> > +		es->es_len = len1;
> > +		block = ext4_es_pblock(es) + orig_len - len1;
> > +		ext4_es_store_pblock(es, block);
> > +
> > +		err = __es_insert_extent(inode, &conv_es);
> > +		if (err)
> > +			goto out;
> > +	}
> > +
> > +out:
> > +	write_unlock_irqrestore(&EXT4_I(inode)->i_es_lock, flags);
> > +	return err;
> > +}
>   Is this really needed? Why don't you just use ext4_es_insert_extent() to
> insert new extent of proper type? Also the way you wrote it, we can return
> (freshly written) data to the user, then reclaim the extent status from
> memory and later return 0s because we read the original status from disk
> (conversion still hasn't happened on disk). That would certainly be
> confusing.

Yes, you are right.  So it seems that a new flag called
EXTENT_STATUS_DIRTY needs to be defined to prevent the shrinker from
reclaiming such extents.
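
Roughly what I have in mind, as a userspace sketch only.  The TOY_ES_*
flags, the array and the 'cached' field are assumptions for illustration;
the real code would walk the rb-tree under i_es_lock.

```c
#include <stddef.h>

/* Toy status bits; TOY_ES_DIRTY plays the role of the proposed flag. */
enum {
	TOY_ES_WRITTEN   = 1 << 0,
	TOY_ES_UNWRITTEN = 1 << 1,
	TOY_ES_DIRTY     = 1 << 2,	/* in-memory status is ahead of disk */
};

struct toy_es {
	unsigned int lblk;
	unsigned int len;
	unsigned int status;
	int cached;		/* 1 while the entry is still in the cache */
};

/*
 * Reclaim up to 'nr_to_scan' cached entries, but never a dirty one:
 * dropping it would make a later lookup fall back to the stale on-disk
 * status.  Returns the number of entries reclaimed.
 */
static size_t toy_es_shrink(struct toy_es *es, size_t n, size_t nr_to_scan)
{
	size_t reclaimed = 0;
	size_t i;

	for (i = 0; i < n && reclaimed < nr_to_scan; i++) {
		if (!es[i].cached)
			continue;
		if (es[i].status & TOY_ES_DIRTY)
			continue;	/* pinned until conversion hits disk */
		es[i].cached = 0;
		reclaimed++;
	}
	return reclaimed;
}

/* Called once the on-disk conversion is done; entry becomes reclaimable. */
static void toy_es_mark_clean(struct toy_es *es)
{
	es->status &= ~(unsigned int)TOY_ES_DIRTY;
}
```

The flag would be set when the extent is converted in the status tree and
cleared after the conversion has been performed on disk.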

Thanks,
                                                - Zheng


* Re: [PATCH 10/10 v5] ext4: remove bogus wait for unwritten extents in ext4_ind_direct_IO
  2013-02-12 12:58   ` Jan Kara
@ 2013-02-15  7:14     ` Zheng Liu
  0 siblings, 0 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-15  7:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Zheng Liu, Theodore Ts'o

On Tue, Feb 12, 2013 at 01:58:19PM +0100, Jan Kara wrote:
> On Fri 08-02-13 16:44:06, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > After converting unwritten extents from extent status tree in end_io, we
> > can safely remove this bogus wait and don't need to worry about reading
> > stale data, because we always try to look up a block mapping in the
> > extent status tree first and unwritten extents in the tree have been
> > converted by this time.
>   But you have to make sure the extent for the range really is in the
> status tree (you didn't in the previous patch I think) and cannot be
> reclaimed until the conversion is performed on disk.

Thanks for pointing that out.  I will fix this problem.

Regards,
                                                - Zheng


* Re: [PATCH 05/10 v5] ext4: lookup block mapping in extent status tree
  2013-02-15  7:06     ` Zheng Liu
@ 2013-02-15 16:47       ` Jan Kara
  2013-02-15 17:25       ` Theodore Ts'o
  1 sibling, 0 replies; 37+ messages in thread
From: Jan Kara @ 2013-02-15 16:47 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, linux-ext4, Zheng Liu, Theodore Ts'o

On Fri 15-02-13 15:06:26, Zheng Liu wrote:
> On Tue, Feb 12, 2013 at 01:31:42PM +0100, Jan Kara wrote:
> > On Fri 08-02-13 16:44:01, Zheng Liu wrote:
> > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > 
> > > > After tracking all extent status, we already have an extent cache in
> > > memory.  Every time we want to lookup a block mapping, we can first
> > > try to lookup it in extent status tree to avoid a potential disk I/O.
> > > 
> > > A new function called ext4_es_lookup_extent is defined to finish this
> > > work.  When we try to lookup a block mapping, we always call
> > > ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
> > > first try to lookup a block mapping in extent status tree.
> > > 
> > > A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
> > > in order not to put a hole into extent status tree because this hole
> > > will be converted to delayed extent in the tree immediately.
> >   It looks somewhat inconsistent that you put hole into the extent tree in
> > ext4_ext_map_blocks() but all other extent types are handled in
> > ext4_map_blocks() or ext4_da_map_blocks(). Can we put the handling in one
> > place?
> 
> It seems that handling everything in one place is too complex because
> ext4_da_map_blocks() calls ext4_ext_map_blocks() and ext4_ind_map_blocks()
> directly.  So now we insert all extents except holes in ext4_map_blocks()
> and ext4_da_map_blocks().  A hole is inserted into the status tree in
> ext4_ext_put_gap_in_cache(), because in that function we know the length
> of the hole.  If we handled it in ext4_da_map_blocks() or
> ext4_map_blocks(), we could only insert a hole of length 1, because those
> functions do not know the length of the hole.
  Ah right, thanks for explanation.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 05/10 v5] ext4: lookup block mapping in extent status tree
  2013-02-15  7:06     ` Zheng Liu
  2013-02-15 16:47       ` Jan Kara
@ 2013-02-15 17:25       ` Theodore Ts'o
  2013-02-16  2:32         ` Zheng Liu
  1 sibling, 1 reply; 37+ messages in thread
From: Theodore Ts'o @ 2013-02-15 17:25 UTC (permalink / raw)
  To: Jan Kara, linux-ext4, Zheng Liu

On Fri, Feb 15, 2013 at 03:06:26PM +0800, Zheng Liu wrote:
> 
> I am planning to refine the get_block_t and *map_blocks functions.  At
> that time I will try to fix this problem.

Note that get_block_t can't be changed without disrupting the Direct
I/O functions which are generic VFS functions.  There's been talk of
trying to clean up DIO, but it will probably require building a
parallel infrastructure in the generic layer, and then transitioning
individual file systems over to it.  It is definitely a mess, but it's
going to be a very tricky problem.  I suspect we'll be talking about
it at LSF/MM.

One thing which might be interesting to do that
wouldn't require wholesale changes to generic code would be to
transition ext4_readpages() to use fs/ext4/page-io.c.  Not for this
merge window, in all likelihood, but right now we are calling
ext4_get_block() for every single page that we read in, which is
wasteful.  It would be nice if ext4_readpages() called
ext4_map_blocks() for each extent, and then submitted it using the
page-io.c functions so we don't end up calling into ext4_map_blocks()
quite as much.

That will ease our scalability and remove locking overhead, in
addition to saving CPU for the buffered I/O readpages path.
Eventually it would be good to do this for DIO as well, but that's
going to require a lot more work, and coordination with the developers
of btrfs, xfs, etc.
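
To put a rough number on it, here is a toy userspace model, assuming one
block per page (4k blocks, 4k pages).  The toy_* names are invented for
this sketch and toy_map_blocks() is only loosely modelled on what
ext4_map_blocks() does: it maps as many contiguous blocks as the request
and the underlying extent (or hole) allow.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy extent: logical [lblk, lblk + len) maps to physical pblk onwards. */
struct toy_extent {
	uint32_t lblk;
	uint32_t len;
	uint64_t pblk;
};

static unsigned long toy_map_calls;	/* how often the mapper was entered */

/*
 * Map up to 'max_len' blocks starting at 'lblk' (max_len must be >= 1).
 * Returns how many blocks share one mapping: the rest of the extent that
 * contains 'lblk', or the run of unmapped blocks up to the next extent.
 * *pblk is 0 for a hole.
 */
static uint32_t toy_map_blocks(const struct toy_extent *ext, size_t n,
			       uint32_t lblk, uint32_t max_len, uint64_t *pblk)
{
	uint32_t next_start = lblk + max_len;
	size_t i;

	toy_map_calls++;
	for (i = 0; i < n; i++) {
		if (lblk >= ext[i].lblk && lblk < ext[i].lblk + ext[i].len) {
			uint32_t avail = ext[i].lblk + ext[i].len - lblk;

			*pblk = ext[i].pblk + (lblk - ext[i].lblk);
			return avail < max_len ? avail : max_len;
		}
		if (ext[i].lblk > lblk && ext[i].lblk < next_start)
			next_start = ext[i].lblk;
	}
	*pblk = 0;
	return next_start - lblk;
}

/* Today's pattern: one mapping call per page. */
static unsigned long toy_read_per_page(const struct toy_extent *ext, size_t n,
				       uint32_t start, uint32_t count)
{
	unsigned long before = toy_map_calls;
	uint64_t pblk;
	uint32_t i;

	for (i = 0; i < count; i++)
		toy_map_blocks(ext, n, start + i, 1, &pblk);
	return toy_map_calls - before;
}

/* Proposed pattern: one mapping call per extent (or hole) in the range. */
static unsigned long toy_read_per_extent(const struct toy_extent *ext,
					 size_t n, uint32_t start,
					 uint32_t count)
{
	unsigned long before = toy_map_calls;
	uint64_t pblk;

	while (count > 0) {
		uint32_t got = toy_map_blocks(ext, n, start, count, &pblk);

		start += got;
		count -= got;
	}
	return toy_map_calls - before;
}
```

For a 96-block read that spans two extents, the per-page loop enters the
mapper 96 times while the per-extent loop enters it twice.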

					- Ted


* Re: [PATCH 05/10 v5] ext4: lookup block mapping in extent status tree
  2013-02-15 17:25       ` Theodore Ts'o
@ 2013-02-16  2:32         ` Zheng Liu
  2013-02-16 16:18           ` Possible TODO projects for the map_blocks() code path (was: Re: [PATCH 05/10 v5] ext4: lookup block mapping in extent status tree) Theodore Ts'o
  0 siblings, 1 reply; 37+ messages in thread
From: Zheng Liu @ 2013-02-16  2:32 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4, Zheng Liu

On Fri, Feb 15, 2013 at 12:25:49PM -0500, Theodore Ts'o wrote:
> On Fri, Feb 15, 2013 at 03:06:26PM +0800, Zheng Liu wrote:
> > 
> > I am planning to refine the get_block_t and *map_blocks functions.  At
> > that time I will try to fix this problem.
> 
> Note that get_block_t can't be changed without disrupting the Direct
> I/O functions which are generic VFS functions.  There's been talk of
> trying to clean up DIO, but it will probably require building a
> parallel infrastructure in the generic layer, and then transitioning
> individual file systems over to it.  It is definitely a mess, but it's
> going to be a very tricky problem.  I suspect we'll be talking about
> it at LSF/MM.
> 
> One thing which might be interesting to do that
> wouldn't require wholesale changes to generic code would be to
> transition ext4_readpages() to use fs/ext4/page-io.c.  Not for this
> merge window, in all likelihood, but right now we are calling
> ext4_get_block() for every single page that we read in, which is
> wasteful.  It would be nice if ext4_readpages() called
> ext4_map_blocks() for each extent, and then submitted it using the
> page-io.c functions so we don't end up calling into ext4_map_blocks()
> quite as much.
> 
> That will ease our scalability and remove locking overhead, in
> addition to saving CPU for the buffered I/O readpages path.
> Eventually it would be good to do this for DIO as well, but that's
> going to require a lot more work, and coordination with the developers
> of btrfs, xfs, etc.

To be honest, my initial idea is only to split ext4_map_blocks into
ext4_map_blocks_read and ext4_map_blocks_write, and do some cleanups.
Thanks for your suggestions.  I will look at it carefully after the
patch series of extent status tree has been applied.

Thanks,
                                                - Zheng


* Possible TODO projects for the map_blocks() code path (was: Re: [PATCH 05/10 v5] ext4: lookup block mapping in extent status tree)
  2013-02-16  2:32         ` Zheng Liu
@ 2013-02-16 16:18           ` Theodore Ts'o
  2013-02-17  3:15             ` Zheng Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Theodore Ts'o @ 2013-02-16 16:18 UTC (permalink / raw)
  To: Jan Kara, linux-ext4, Zheng Liu

On Sat, Feb 16, 2013 at 10:32:51AM +0800, Zheng Liu wrote:
> 
> To be honest, my initial idea is only to split ext4_map_blocks into
> ext4_map_blocks_read and ext4_map_blocks_write, and do some cleanups.
> Thanks for your suggestions.  I will look at it carefully after the
> patch series of extent status tree has been applied.

Ah, when you said get_block_t functions, I had assumed you had meant
changing the function signature --- because the function signature
being fixed by the generic DIO code is one of the things holding back
a number of improvements in the map_blocks code paths.

For example:

1) Thanks to the DIO code, we are ab(using) a struct buffer_head data
structure to pass the mapping to the DIO code.  Normally the
buffer_head maps only a single block's worth of data, but here b_size
is repurposed to indicate the size of the logical to physical block
mapping, and b_data is invalid (since it isn't a real buffer head).
There are a number of other fields in the struct buffer_head which in
the DIO codepath are completely unused, which isn't just an
aesthetic issue --- it's also wasting valuable (and limited) kernel
stack space, since the struct buffer_head is allocated on the stack of
do_blockdev_direct_IO().

2) We are currently using inode flags to pass state flags between
different parts of the writepages code and the map_blocks code.  This
is bad because (a) it makes the code much harder to understand and
maintain, and (b) it blocks us from being able to call map_blocks() in
parallel.  If we fix this, it would be relatively trivial to add
support for parallel non-create map_block calls, and if we decide to
try to use the extent status tree for range locking, it might be
possible to do parallel block allocations as well.  (I believe some
locking may be needed in mballoc.c for the inode-specific
preallocation code, but that should be doable.)
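
A toy illustration of point 2, in userspace C.  Everything here is
hypothetical (the flag, the structs and the helpers are invented for the
example): when the caller's intent is stashed on the shared inode, a
second caller on the same inode sees the first caller's flag, whereas a
per-call request keeps each caller's state private.

```c
#include <stdint.h>

#define TOY_FLAG_RESERVED 0x1u	/* hypothetical "blocks already reserved" hint */

struct toy_inode {
	unsigned int state;	/* shared by every caller working on this inode */
};

struct toy_map_request {
	uint32_t lblk;
	uint32_t len;
	unsigned int flags;	/* private to this one call */
};

/* Style 1: the callee learns the caller's intent from shared inode state. */
static int toy_needs_reservation_shared(const struct toy_inode *inode)
{
	return !(inode->state & TOY_FLAG_RESERVED);
}

/* Style 2: the intent travels with the request itself. */
static int toy_needs_reservation_req(const struct toy_map_request *req)
{
	return !(req->flags & TOY_FLAG_RESERVED);
}
```

With style 1 the two callers have to be serialized around the flag; with
style 2 nothing about the call itself forces that.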

If we have multiple people interested in working on various different
projects, it might be useful to start documenting some of these
proposed enhancements on the wiki, and certainly these would be good
things for us to discuss at the ext4 developer's workshop in April.

Regards,

					- Ted


* Re: Possible TODO projects for the map_blocks() code path (was: Re: [PATCH 05/10 v5] ext4: lookup block mapping in extent status tree)
  2013-02-16 16:18           ` Possible TODO projects for the map_blocks() code path (was: Re: [PATCH 05/10 v5] ext4: lookup block mapping in extent status tree) Theodore Ts'o
@ 2013-02-17  3:15             ` Zheng Liu
  0 siblings, 0 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-17  3:15 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4, Zheng Liu

On Sat, Feb 16, 2013 at 11:18:46AM -0500, Theodore Ts'o wrote:
> On Sat, Feb 16, 2013 at 10:32:51AM +0800, Zheng Liu wrote:
> > 
> > To be honest, my initial idea is only to split ext4_map_blocks into
> > ext4_map_blocks_read and ext4_map_blocks_write, and do some cleanups.
> > Thanks for your suggestions.  I will look at it carefully after the
> > patch series of extent status tree has been applied.
> 
> Ah, when you said get_block_t functions, I had assumed you had meant
> changing the function signature --- because the function signature
> being fixed by the generic DIO code is one of the things holding back
> a number of improvements in the map_blocks code paths.
> 
> For example:
> 
> 1) Thanks to the DIO code, we are ab(using) a struct buffer_head data
> structure to pass the mapping to the DIO code.  Normally the
> buffer_head maps only a single block's worth of data, but here b_size
> is repurposed to indicate the size of the logical to physical block
> mapping, and b_data is invalid (since it isn't a real buffer head).
> There are a number of other fields in the struct buffer_head which in
> the DIO codepath are completely unused, which isn't just an
> aesthetic issue --- it's also wasting valuable (and limited) kernel
> stack space, since the struct buffer_head is allocated on the stack of
> do_blockdev_direct_IO().

Yes, I also think struct buffer_head could be dropped, and the
get_block_t signature should be changed.  But that requires modifying
the VFS layer.  We can discuss this topic with other developers at the
LSF/MM summit and on the fsdevel@ mailing list.  AFAIK, there is a topic
about this at this year's LSF/MM summit.

> 
> 2) We are currently using inode flags to pass state flags between
> different parts of the writepages code and the map_blocks code.  This
> is bad because (a) it makes the code much harder to understand and
> maintain, and (b) it blocks us from being able to call map_blocks() in
> parallel.  If we fix this, it would be relatively trivial to add
> support for parallel non-create map_block calls, and if we decide to
> try to use the extent status tree for range locking, it might be
> possible to do parallel block allocations as well.  (I believe some
> locking may be needed in mballoc.c for the inode-specific
> preallocation code, but that should be doable.)

I thought about extent-level locking while running xfstests.  It seems
that we need to improve the *_map_blocks() functions if we want to use
the status tree for range locking, because we always need to look up an
extent in this tree and lock its range before we do anything else.  That
is why I want to split the ext4_map_blocks function.  Certainly this is
only a simple idea, and I still need to think it over.  But I believe an
improvement and/or cleanup is a good start.

> 
> If we have multiple people interested in working on various different
> projects, it might be useful to start documenting some of these
> proposed enhancements on the wiki, and certainly these would be good
> things for us to discuss at the ext4 developer's workshop in April.

Yes, there are a lot of things on the ext4 wiki that are out of date.
It would be great if it could be updated.

Looking forward to discussing these topics. :-)

                                                - Zheng


* Re: [PATCH 04/10 v5] ext4: track all extent status in extent status tree
  2013-02-13  3:28   ` Theodore Ts'o
  2013-02-13  3:46     ` [PATCH 1/2] ext4: rename ext4_es_find_extent() to ext4_es_find_delayed_extent() Theodore Ts'o
  2013-02-15  6:53     ` [PATCH 04/10 v5] " Zheng Liu
@ 2013-02-17 16:26     ` Zheng Liu
  2 siblings, 0 replies; 37+ messages in thread
From: Zheng Liu @ 2013-02-17 16:26 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Zheng Liu, Jan kara

On Tue, Feb 12, 2013 at 10:28:19PM -0500, Theodore Ts'o wrote:
> On Fri, Feb 08, 2013 at 04:44:00PM +0800, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > By recording the physical block and status, the extent status tree is
> > able to track the status of every extent.  When we call the _map_blocks
> > functions to look up an extent or create a new written/unwritten/delayed
> > extent, this extent will be inserted into the extent status tree.  The
> > hole extent is inserted in ext4_ext_put_gap_in_cache().  If there are no
> > extents at all, we will not insert a hole extent [0, ~0] into the extent
> > status tree in order to reduce the complexity of the code.
> > 
> > We don't load all extents from disk in alloc_inode() because it costs
> > too much memory, and if a file is opened and closed frequently it will
> > take too much time to load all extent information.  So currently when
> > we create/lookup an extent, this extent will be inserted into extent
> > status tree.  Hence, the extent status tree may not comprehensively
> > contain all of the extents found in the file.
> > 
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > Cc: Jan kara <jack@suse.cz>
> 
> Unfortunately, this commit is apparently causing test failures with
> bigalloc:
> 
> --- 013.out	2013-01-01 22:52:04.000000000 -0500
> +++ 013.out.bad	2013-02-12 22:08:47.110766615 -0500
> @@ -8,7 +8,4 @@
>  -----------------------------------------------
>  fsstress.2 : -p 20 -r
>  -----------------------------------------------
> -
> ------------------------------------------------
> -fsstress.3 : -p 4 -z -f rmdir=10 -f link=10 -f creat=10 -f mkdir=10 -f rename=30 -f stat=30 -f unlink=30 -f truncate=20
> ------------------------------------------------
> +_check_generic_filesystem: filesystem on /dev/vdd is inconsistent (see 013.full)
> _check_generic_filesystem: filesystem on /dev/vdd is inconsistent (see 013.full)
> Ran: 013
> Failures: 013
> Failed 1 of 1 tests
> END TEST: Ext4 4k block w/bigalloc Tue Feb 12 22:08:49 EST 2013
> e2fsck 1.43-WIP (15-Jan-2013)
> Pass 1: Checking inodes, blocks, and sizes
> Inode 618, i_blocks is 1408, should be 1536.  Fix? yes
> 
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> 
> /dev/vdd: ***** FILE SYSTEM WAS MODIFIED *****
> /dev/vdd: 3969/81936 files (13.1% non-contiguous), 176208/1310720 blocks
> 
> 
> I haven't been able to figure out what is going on here, but if we
> can't figure this out I may need to push off this patch series to the
> next merge window.  I've tried splitting up this patch into two pieces
> to make it clearer what is going on, but I still can't see how this
> would be affecting the i_blocks calculation.

Hi Ted,

I have fixed this regression.  The reason is that we miss to release
reserved space in ext4_da_page_release_reservation().  The root cause is
that when an extent is delayed allocated and later it could be allocated
by fallocate.  In this condition this extent need to keep as delayed
extent until the extent is written out because we need to update
reserved space according to these delayed extent.

I have run xfstests #13 several times and the regression was never
triggered.  I will send out the latest patch set later.

BTW, when I ran xfstests to test the extent status tree patch series, I
hit a regression against the 'dev' branch of ext4 and a bug against 3.8-rc7.
I will report them in separate mails.

Regards,
                                                - Zheng

Thread overview: 37+ messages
2013-02-08  8:43 [PATCH 00/10 v5] ext4: extent status tree (step2) Zheng Liu
2013-02-08  8:43 ` [PATCH 01/10 v5] ext4: refine extent status tree Zheng Liu
2013-02-08 15:35   ` Jan Kara
2013-02-15  6:38     ` Zheng Liu
2013-02-08  8:43 ` [PATCH 02/10 v5] ext4: add physical block and status member into " Zheng Liu
2013-02-08 15:39   ` Jan Kara
2013-02-08  8:43 ` [PATCH 03/10 v5] ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag Zheng Liu
2013-02-08 15:41   ` Jan Kara
2013-02-08  8:44 ` [PATCH 04/10 v5] ext4: track all extent status in extent status tree Zheng Liu
2013-02-11 12:21   ` Jan Kara
2013-02-15  6:45     ` Zheng Liu
2013-02-13  3:28   ` Theodore Ts'o
2013-02-13  3:46     ` [PATCH 1/2] ext4: rename ext4_es_find_extent() to ext4_es_find_delayed_extent() Theodore Ts'o
2013-02-13  3:46       ` [PATCH 2/2] ext4: track all extent status in extent status tree Theodore Ts'o
2013-02-15  6:53     ` [PATCH 04/10 v5] " Zheng Liu
2013-02-17 16:26     ` Zheng Liu
2013-02-08  8:44 ` [PATCH 05/10 v5] ext4: lookup block mapping " Zheng Liu
2013-02-12 12:31   ` Jan Kara
2013-02-15  7:06     ` Zheng Liu
2013-02-15 16:47       ` Jan Kara
2013-02-15 17:25       ` Theodore Ts'o
2013-02-16  2:32         ` Zheng Liu
2013-02-16 16:18           ` Possible TODO projects for the map_blocks() code path (was: Re: [PATCH 05/10 v5] ext4: lookup block mapping in extent status tree) Theodore Ts'o
2013-02-17  3:15             ` Zheng Liu
2013-02-08  8:44 ` [PATCH 06/10 v5] ext4: remove single extent cache Zheng Liu
2013-02-08  8:44 ` [PATCH 07/10 v5] ext4: adjust some functions for reclaiming extents from extent status tree Zheng Liu
2013-02-08  8:44 ` [PATCH 08/10 v5] ext4: reclaim " Zheng Liu
2013-02-08  8:44 ` [PATCH 09/10 v5] ext4: convert unwritten extents from extent status tree in end_io Zheng Liu
2013-02-10  8:45   ` Zheng Liu
2013-02-11  1:52     ` Theodore Ts'o
2013-02-12 12:51   ` Jan Kara
2013-02-15  7:12     ` Zheng Liu
2013-02-08  8:44 ` [PATCH 10/10 v5] ext4: remove bogus wait for unwritten extents in ext4_ind_direct_IO Zheng Liu
2013-02-12 12:58   ` Jan Kara
2013-02-15  7:14     ` Zheng Liu
2013-02-10  1:38 ` [PATCH 00/10 v5] ext4: extent status tree (step2) Theodore Ts'o
2013-02-10  8:40   ` Zheng Liu
