All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7 v2] ext4: extent status tree (step2)
@ 2013-01-11 10:53 Zheng Liu
  2013-01-11 10:53 ` [PATCH 1/7 v2] ext4: refine extent status tree Zheng Liu
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: Zheng Liu @ 2013-01-11 10:53 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Jan kara, Theodore Ts'o

Hi all,

Here is my second try to implement the second step of extent status tree in
ext4.  The changelog is as below.

v2 <- v1:
 - drop patches that try to improve unwritten extent conversion
 - remove EXT4_MAP_FROM_CLUSTER flag
 - add tracepoint for ext4_es_lookup_extent()
 - drop a patch, which tries to fix a warning when bigalloc and delalloc are
   enabled
 - add a shrinker to reclaim memory from extent status tree
 - rebase against 3.8-rc2

Now the patch set makes extent status tree to track all extent status in memory
and makes it as a extent cache.

The patches that try to improve unwritten extent conversion are dropped because
Jan has worked on it and has a perfect solution.

Now when bigalloc and delalloc are enabled, there still has some works to be
done.  The key issue is delayed space reservation.  I tried to improve it using
extent status tree, but I don't have a good idea in my mind.  That would be
great if some one could give me some suggestions.  I think this work should be
as a new patch series.  So I drop a patch that is in the first version.

As Jan and Dave advised, fragmented extent tree will causes that status tree
costs too much memory.  So in this patch set a shrinker is defined to reclaim
memory.  When the status tree of an inode is accessed, the inode will be
inserted into the tail of lru list.  It will be dropped as ext4_clear_inode is
called.  When shrinker tries to reclaim some memory, written/unwritten extents
will be freed from status tree.  Delayed extent in the tree shouldn't be
reclaimed because they are used by fiemap, bigalloc, and seek_data/hole.  I am
worry about the lock contention because a lru lock is used to protect lru list.
This lock will be taken by all inodes in this file system.  So I do some
comparisons to measure this overhead.  The result shows that we don't need to
care this problem.

First I use fio to measure iops on a SSD in my desktop.  The config file is as
below:

== config file ==
[global]
ioengine=libaio
iodepth=8
bs=4k
filesize=1G
fallocate=none
size=8G
directory=/mnt/sda1
thread
group_reporting
runtime=600

[jobs]
numjobs=4
rw=randrw
nrfiles=16

== result of iops ==
    read    write
w/  2237    2233
w/o 2225    2227

In addition, I use 'perf lock' to re-run above test case to understand whether
there is a heavy contention, and the result shows that it is OK.

Other changes in this patch set are minor and straightforward.  Please review
it.  Any feedbacks or comments are welcome.

Thanks,
						- Zheng

Zheng Liu (7):
  ext4: refine extent status tree
  ext4: remove EXT4_MAP_FROM_CLUSTER flag
  ext4: add physical block and status member into extent status tree
  ext4: adjust interfaces of extent status tree
  ext4: track all extent status in extent status tree
  ext4: lookup block mapping in extent status tree
  ext4: reclaim extents from extent status tree

 fs/ext4/ext4.h              |  19 +-
 fs/ext4/extents.c           |  44 ++--
 fs/ext4/extents_status.c    | 505 ++++++++++++++++++++++++++++++++------------
 fs/ext4/extents_status.h    |  53 ++++-
 fs/ext4/file.c              |  14 +-
 fs/ext4/inode.c             | 138 +++++++++---
 fs/ext4/super.c             |   6 +
 include/trace/events/ext4.h | 118 ++++++++---
 8 files changed, 652 insertions(+), 245 deletions(-)

-- 
1.7.12.rc2.18.g61b472e


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/7 v2] ext4: refine extent status tree
  2013-01-11 10:53 [PATCH 0/7 v2] ext4: extent status tree (step2) Zheng Liu
@ 2013-01-11 10:53 ` Zheng Liu
  2013-01-23  4:20   ` Theodore Ts'o
  2013-01-11 10:53 ` [PATCH 2/7 v2] ext4: remove EXT4_MAP_FROM_CLUSTER flag Zheng Liu
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Zheng Liu @ 2013-01-11 10:53 UTC (permalink / raw)
  To: linux-ext4; +Cc: Jan kara, Theodore Ts'o, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

This patch tries to refine the extent status tree.  A prefix 'es_' is added to
make code clearly.

CC: Jan kara <jack@suse.cz>
CC: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents.c           |  21 ++--
 fs/ext4/extents_status.c    | 300 ++++++++++++++++++++++++--------------------
 fs/ext4/extents_status.h    |   8 +-
 fs/ext4/file.c              |  12 +-
 include/trace/events/ext4.h |  40 +++---
 5 files changed, 207 insertions(+), 174 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 5ae1674..f7bf616 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3525,13 +3525,14 @@ static int ext4_find_delalloc_range(struct inode *inode,
 {
 	struct extent_status es;
 
-	es.start = lblk_start;
-	ext4_es_find_extent(inode, &es);
-	if (es.len == 0)
+	es.es_lblk = lblk_start;
+	(void)ext4_es_find_extent(inode, &es);
+	if (es.es_len == 0)
 		return 0; /* there is no delay extent in this tree */
-	else if (es.start <= lblk_start && lblk_start < es.start + es.len)
+	else if (es.es_lblk <= lblk_start &&
+		 lblk_start < es.es_lblk + es.es_len)
 		return 1;
-	else if (lblk_start <= es.start && es.start <= lblk_end)
+	else if (lblk_start <= es.es_lblk && es.es_lblk <= lblk_end)
 		return 1;
 	else
 		return 0;
@@ -4567,7 +4568,7 @@ static int ext4_find_delayed_extent(struct inode *inode,
 	struct extent_status es;
 	ext4_lblk_t next_del;
 
-	es.start = newex->ec_block;
+	es.es_lblk = newex->ec_block;
 	next_del = ext4_es_find_extent(inode, &es);
 
 	if (newex->ec_start == 0) {
@@ -4575,18 +4576,18 @@ static int ext4_find_delayed_extent(struct inode *inode,
 		 * No extent in extent-tree contains block @newex->ec_start,
 		 * then the block may stay in 1)a hole or 2)delayed-extent.
 		 */
-		if (es.len == 0)
+		if (es.es_len == 0)
 			/* A hole found. */
 			return 0;
 
-		if (es.start > newex->ec_block) {
+		if (es.es_lblk > newex->ec_block) {
 			/* A hole found. */
-			newex->ec_len = min(es.start - newex->ec_block,
+			newex->ec_len = min(es.es_lblk - newex->ec_block,
 					    newex->ec_len);
 			return 0;
 		}
 
-		newex->ec_len = es.start + es.len - newex->ec_block;
+		newex->ec_len = es.es_lblk + es.es_len - newex->ec_block;
 	}
 
 	return next_del;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 564d981..0662369 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -26,9 +26,10 @@
  * extent tree, whose goal is only track delay extent in memory to
  * simplify the implementation of fiemap and bigalloc, and introduce
  * lseek SEEK_DATA/SEEK_HOLE support.  That is why it is still called
- * delay extent tree at the following comment.  But for better
- * understand what it does, it has been rename to extent status tree.
+ * delay extent tree at the first commit.  But for better understand
+ * what it does, it has been rename to extent status tree.
  *
+ * Step1:
  * Currently the first step has been done.  All delay extents are
  * tracked in the tree.  It maintains the delay extent when a delay
  * allocation is issued, and the delay extent is written out or
@@ -37,26 +38,38 @@
  *
  * The following comment describes the implemenmtation of extent
  * status tree and future works.
+ *
+ * Step2:
+ * In this step all extent status is tracked by extent status tree.
+ * Thus, we can first try to lookup a block mapping in this tree before
+ * find it in extent tree.  Unwritten extent conversion also can be
+ * improved.  Currently this conversion need to be done in a workqueue
+ * because this conversion can not be done in end_io function due to it
+ * needs to take i_data_sem locking in a irq context.  After looking block
+ * mapping in extent status tree, we can first convert unwritten extent in
+ * extent status tree, call aio_comlete() and inode_dio_done() in end_io
+ * function, and don't need to be worried about expose a stale data.
+ * Meanwhile when dioread_nolock is enabled, reader won't need to wait
+ * this unwritten extent conversion, and latency also is reduced.
  */
 
 /*
- * extents status tree implementation for ext4.
+ * Extent status tree implementation for ext4.
  *
  *
  * ==========================================================================
- * Extents status encompass delayed extents and extent locks
+ * Extent status tracks all extent status.
  *
- * 1. Why delayed extent implementation ?
+ * 1. Why we need to implement extent status tree?
  *
- * Without delayed extent, ext4 identifies a delayed extent by looking
+ * Without extent status tree, ext4 identifies a delayed extent by looking
  * up page cache, this has several deficiencies - complicated, buggy,
  * and inefficient code.
  *
- * FIEMAP, SEEK_HOLE/DATA, bigalloc, punch hole and writeout all need
- * to know if a block or a range of blocks are belonged to a delayed
- * extent.
+ * FIEMAP, SEEK_HOLE/DATA, bigalloc, and writeout all need to know if a
+ * block or a range of blocks are belonged to a delayed extent.
  *
- * Let us have a look at how they do without delayed extents implementation.
+ * Let us have a look at how they do without extent status tree.
  *   --	FIEMAP
  *	FIEMAP looks up page cache to identify delayed allocations from holes.
  *
@@ -68,47 +81,43 @@
  *	already under delayed allocation or not to determine whether
  *	quota reserving is needed for the cluster.
  *
- *   -- punch hole
- *	punch hole looks up page cache to identify a delayed extent.
- *
  *   --	writeout
  *	Writeout looks up whole page cache to see if a buffer is
  *	mapped, If there are not very many delayed buffers, then it is
  *	time comsuming.
  *
- * With delayed extents implementation, FIEMAP, SEEK_HOLE/DATA,
+ * With extent status tree implementation, FIEMAP, SEEK_HOLE/DATA,
  * bigalloc and writeout can figure out if a block or a range of
  * blocks is under delayed allocation(belonged to a delayed extent) or
- * not by searching the delayed extent tree.
+ * not by searching the extent tree.
  *
  *
  * ==========================================================================
- * 2. ext4 delayed extents impelmentation
+ * 2. Ext4 extent status tree impelmentation
  *
- *   --	delayed extent
- *	A delayed extent is a range of blocks which are contiguous
- *	logically and under delayed allocation.  Unlike extent in
- *	ext4, delayed extent in ext4 is a in-memory struct, there is
- *	no corresponding on-disk data.  There is no limit on length of
- *	delayed extent, so a delayed extent can contain as many blocks
- *	as they are contiguous logically.
+ *   --	extent
+ *	A extent is a range of blocks which are contiguous logically and
+ *	physically.  Unlike extent in extent tree, this extent in ext4 is
+ *	a in-memory struct, there is no corresponding on-disk data.  There
+ *	is no limit on length of extent, so an extent can contain as many
+ *	blocks as they are contiguous logically and physically.
  *
- *   --	delayed extent tree
- *	Every inode has a delayed extent tree and all under delayed
- *	allocation blocks are added to the tree as delayed extents.
- *	Delayed extents in the tree are ordered by logical block no.
+ *   --	extent status tree
+ *	Every inode has an extent status tree and all allocation blocks
+ *	are added to the tree with different status.  The extent in the
+ *	tree are ordered by logical block no.
  *
- *   --	operations on a delayed extent tree
- *	There are three operations on a delayed extent tree: find next
- *	delayed extent, adding a space(a range of blocks) and removing
- *	a space.
+ *   --	operations on a extent status tree
+ *	There are three important operations on a delayed extent tree: find
+ *	next extent, adding a extent(a range of blocks) and removing a extent.
  *
- *   --	race on a delayed extent tree
- *	Delayed extent tree is protected inode->i_es_lock.
+ *   --	race on a extent status tree
+ *	Extent status tree is protected inode->i_es_lock.
  *
  *
  * ==========================================================================
- * 3. performance analysis
+ * 3. Performance analysis
+ *
  *   --	overhead
  *	1. There is a cache extent for write access, so if writes are
  *	not very random, adding space operaions are in O(1) time.
@@ -120,15 +129,19 @@
  *
  * ==========================================================================
  * 4. TODO list
- *   -- Track all extent status
  *
- *   -- Improve get block process
+ *   -- Refactor delayed reserve space
  *
  *   -- Extent-level locking
  */
 
 static struct kmem_cache *ext4_es_cachep;
 
+static int __es_insert_extent(struct ext4_es_tree *tree,
+			      struct extent_status *newes);
+static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
+			      ext4_lblk_t end);
+
 int __init ext4_init_es(void)
 {
 	ext4_es_cachep = KMEM_CACHE(extent_status, SLAB_RECLAIM_ACCOUNT);
@@ -161,7 +174,7 @@ static void ext4_es_print_tree(struct inode *inode)
 	while (node) {
 		struct extent_status *es;
 		es = rb_entry(node, struct extent_status, rb_node);
-		printk(KERN_DEBUG " [%u/%u)", es->start, es->len);
+		printk(KERN_DEBUG " [%u/%u)", es->es_lblk, es->es_len);
 		node = rb_next(node);
 	}
 	printk(KERN_DEBUG "\n");
@@ -172,8 +185,8 @@ static void ext4_es_print_tree(struct inode *inode)
 
 static inline ext4_lblk_t extent_status_end(struct extent_status *es)
 {
-	BUG_ON(es->start + es->len < es->start);
-	return es->start + es->len - 1;
+	BUG_ON(es->es_lblk + es->es_len < es->es_lblk);
+	return es->es_lblk + es->es_len - 1;
 }
 
 /*
@@ -181,25 +194,25 @@ static inline ext4_lblk_t extent_status_end(struct extent_status *es)
  * it can't be found, try to find next extent.
  */
 static struct extent_status *__es_tree_search(struct rb_root *root,
-					      ext4_lblk_t offset)
+					      ext4_lblk_t lblk)
 {
 	struct rb_node *node = root->rb_node;
 	struct extent_status *es = NULL;
 
 	while (node) {
 		es = rb_entry(node, struct extent_status, rb_node);
-		if (offset < es->start)
+		if (lblk < es->es_lblk)
 			node = node->rb_left;
-		else if (offset > extent_status_end(es))
+		else if (lblk > extent_status_end(es))
 			node = node->rb_right;
 		else
 			return es;
 	}
 
-	if (es && offset < es->start)
+	if (es && lblk < es->es_lblk)
 		return es;
 
-	if (es && offset > extent_status_end(es)) {
+	if (es && lblk > extent_status_end(es)) {
 		node = rb_next(&es->rb_node);
 		return node ? rb_entry(node, struct extent_status, rb_node) :
 			      NULL;
@@ -209,8 +222,8 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
 }
 
 /*
- * ext4_es_find_extent: find the 1st delayed extent covering @es->start
- * if it exists, otherwise, the next extent after @es->start.
+ * ext4_es_find_extent: find the 1st delayed extent covering @es->lblk
+ * if it exists, otherwise, the next extent after @es->lblk.
  *
  * @inode: the inode which owns delayed extents
  * @es: delayed extent that we found
@@ -226,7 +239,7 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
 	struct rb_node *node;
 	ext4_lblk_t ret = EXT_MAX_BLOCKS;
 
-	trace_ext4_es_find_extent_enter(inode, es->start);
+	trace_ext4_es_find_extent_enter(inode, es->es_lblk);
 
 	read_lock(&EXT4_I(inode)->i_es_lock);
 	tree = &EXT4_I(inode)->i_es_tree;
@@ -234,25 +247,25 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
 	/* find delay extent in cache firstly */
 	if (tree->cache_es) {
 		es1 = tree->cache_es;
-		if (in_range(es->start, es1->start, es1->len)) {
+		if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
 			es_debug("%u cached by [%u/%u)\n",
-				 es->start, es1->start, es1->len);
+				 es->es_lblk, es1->es_lblk, es1->es_len);
 			goto out;
 		}
 	}
 
-	es->len = 0;
-	es1 = __es_tree_search(&tree->root, es->start);
+	es->es_len = 0;
+	es1 = __es_tree_search(&tree->root, es->es_lblk);
 
 out:
 	if (es1) {
 		tree->cache_es = es1;
-		es->start = es1->start;
-		es->len = es1->len;
+		es->es_lblk = es1->es_lblk;
+		es->es_len = es1->es_len;
 		node = rb_next(&es1->rb_node);
 		if (node) {
 			es1 = rb_entry(node, struct extent_status, rb_node);
-			ret = es1->start;
+			ret = es1->es_lblk;
 		}
 	}
 
@@ -263,14 +276,14 @@ out:
 }
 
 static struct extent_status *
-ext4_es_alloc_extent(ext4_lblk_t start, ext4_lblk_t len)
+ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
 {
 	struct extent_status *es;
 	es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
 	if (es == NULL)
 		return NULL;
-	es->start = start;
-	es->len = len;
+	es->es_lblk = lblk;
+	es->es_len = len;
 	return es;
 }
 
@@ -279,6 +292,20 @@ static void ext4_es_free_extent(struct extent_status *es)
 	kmem_cache_free(ext4_es_cachep, es);
 }
 
+/*
+ * Check whether or not two extents can be merged
+ * Condition:
+ *  - logical block number is contiguous
+ */
+static int ext4_es_can_be_merged(struct extent_status *es1,
+				 struct extent_status *es2)
+{
+	if (es1->es_lblk + es1->es_len != es2->es_lblk)
+		return 0;
+
+	return 1;
+}
+
 static struct extent_status *
 ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
 {
@@ -290,8 +317,8 @@ ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
 		return es;
 
 	es1 = rb_entry(node, struct extent_status, rb_node);
-	if (es->start == extent_status_end(es1) + 1) {
-		es1->len += es->len;
+	if (ext4_es_can_be_merged(es1, es)) {
+		es1->es_len += es->es_len;
 		rb_erase(&es->rb_node, &tree->root);
 		ext4_es_free_extent(es);
 		es = es1;
@@ -311,8 +338,8 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
 		return es;
 
 	es1 = rb_entry(node, struct extent_status, rb_node);
-	if (es1->start == extent_status_end(es) + 1) {
-		es->len += es1->len;
+	if (ext4_es_can_be_merged(es, es1)) {
+		es->es_len += es1->es_len;
 		rb_erase(node, &tree->root);
 		ext4_es_free_extent(es1);
 	}
@@ -320,92 +347,85 @@ ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
 	return es;
 }
 
-static int __es_insert_extent(struct ext4_es_tree *tree, ext4_lblk_t offset,
-			      ext4_lblk_t len)
+static int __es_insert_extent(struct ext4_es_tree *tree,
+			      struct extent_status *newes)
 {
 	struct rb_node **p = &tree->root.rb_node;
 	struct rb_node *parent = NULL;
 	struct extent_status *es;
-	ext4_lblk_t end = offset + len - 1;
-
-	BUG_ON(end < offset);
-	es = tree->cache_es;
-	if (es && offset == (extent_status_end(es) + 1)) {
-		es_debug("cached by [%u/%u)\n", es->start, es->len);
-		es->len += len;
-		es = ext4_es_try_to_merge_right(tree, es);
-		goto out;
-	} else if (es && es->start == end + 1) {
-		es_debug("cached by [%u/%u)\n", es->start, es->len);
-		es->start = offset;
-		es->len += len;
-		es = ext4_es_try_to_merge_left(tree, es);
-		goto out;
-	} else if (es && es->start <= offset &&
-		   end <= extent_status_end(es)) {
-		es_debug("cached by [%u/%u)\n", es->start, es->len);
-		goto out;
-	}
+
+	/* invalidate cache */
+	/* TODO: first try to lookup in cache */
+	tree->cache_es = NULL;
 
 	while (*p) {
 		parent = *p;
 		es = rb_entry(parent, struct extent_status, rb_node);
 
-		if (offset < es->start) {
-			if (es->start == end + 1) {
-				es->start = offset;
-				es->len += len;
+		if (newes->es_lblk < es->es_lblk) {
+			if (ext4_es_can_be_merged(newes, es)) {
+				es->es_lblk = newes->es_lblk;
+				es->es_len += newes->es_len;
 				es = ext4_es_try_to_merge_left(tree, es);
 				goto out;
 			}
 			p = &(*p)->rb_left;
-		} else if (offset > extent_status_end(es)) {
-			if (offset == extent_status_end(es) + 1) {
-				es->len += len;
+		} else if (newes->es_lblk > extent_status_end(es)) {
+			if (ext4_es_can_be_merged(es, newes)) {
+				es->es_len += newes->es_len;
 				es = ext4_es_try_to_merge_right(tree, es);
 				goto out;
 			}
 			p = &(*p)->rb_right;
 		} else {
-			if (extent_status_end(es) <= end)
-				es->len = offset - es->start + len;
-			goto out;
+			BUG_ON(1);
+			return -EINVAL;
 		}
 	}
 
-	es = ext4_es_alloc_extent(offset, len);
+	es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len);
 	if (!es)
 		return -ENOMEM;
 	rb_link_node(&es->rb_node, parent, p);
 	rb_insert_color(&es->rb_node, &tree->root);
 
 out:
-	tree->cache_es = es;
 	return 0;
 }
 
 /*
- * ext4_es_insert_extent() adds a space to a delayed extent tree.
- * Caller holds inode->i_es_lock.
+ * ext4_es_insert_extent() adds a space to a extent status tree.
  *
  * ext4_es_insert_extent is called by ext4_da_write_begin and
  * ext4_es_remove_extent.
  *
  * Return 0 on success, error code on failure.
  */
-int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t offset,
+int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 			  ext4_lblk_t len)
 {
 	struct ext4_es_tree *tree;
+	struct extent_status newes;
+	ext4_lblk_t end = lblk + len - 1;
 	int err = 0;
 
-	trace_ext4_es_insert_extent(inode, offset, len);
+	trace_ext4_es_insert_extent(inode, lblk, len);
 	es_debug("add [%u/%u) to extent status tree of inode %lu\n",
-		 offset, len, inode->i_ino);
+		 lblk, len, inode->i_ino);
+
+	BUG_ON(end < lblk);
+
+	newes.es_lblk = lblk;
+	newes.es_len = len;
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
 	tree = &EXT4_I(inode)->i_es_tree;
-	err = __es_insert_extent(tree, offset, len);
+	err = __es_remove_extent(tree, lblk, end);
+	if (err != 0)
+		goto error;
+	err = __es_insert_extent(tree, &newes);
+
+error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 
 	ext4_es_print_tree(inode);
@@ -413,57 +433,46 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t offset,
 	return err;
 }
 
-/*
- * ext4_es_remove_extent() removes a space from a delayed extent tree.
- * Caller holds inode->i_es_lock.
- *
- * Return 0 on success, error code on failure.
- */
-int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
-			  ext4_lblk_t len)
+static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
+				 ext4_lblk_t end)
 {
 	struct rb_node *node;
-	struct ext4_es_tree *tree;
 	struct extent_status *es;
 	struct extent_status orig_es;
-	ext4_lblk_t len1, len2, end;
+	ext4_lblk_t len1, len2;
 	int err = 0;
 
-	trace_ext4_es_remove_extent(inode, offset, len);
-	es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
-		 offset, len, inode->i_ino);
-
-	end = offset + len - 1;
-	BUG_ON(end < offset);
-	write_lock(&EXT4_I(inode)->i_es_lock);
-	tree = &EXT4_I(inode)->i_es_tree;
-	es = __es_tree_search(&tree->root, offset);
+	es = __es_tree_search(&tree->root, lblk);
 	if (!es)
 		goto out;
-	if (es->start > end)
+	if (es->es_lblk > end)
 		goto out;
 
 	/* Simply invalidate cache_es. */
 	tree->cache_es = NULL;
 
-	orig_es.start = es->start;
-	orig_es.len = es->len;
-	len1 = offset > es->start ? offset - es->start : 0;
+	orig_es.es_lblk = es->es_lblk;
+	orig_es.es_len = es->es_len;
+	len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
 	len2 = extent_status_end(es) > end ?
 	       extent_status_end(es) - end : 0;
 	if (len1 > 0)
-		es->len = len1;
+		es->es_len = len1;
 	if (len2 > 0) {
 		if (len1 > 0) {
-			err = __es_insert_extent(tree, end + 1, len2);
+			struct extent_status newes;
+
+			newes.es_lblk = end + 1;
+			newes.es_len = len2;
+			err = __es_insert_extent(tree, &newes);
 			if (err) {
-				es->start = orig_es.start;
-				es->len = orig_es.len;
+				es->es_lblk = orig_es.es_lblk;
+				es->es_len = orig_es.es_len;
 				goto out;
 			}
 		} else {
-			es->start = end + 1;
-			es->len = len2;
+			es->es_lblk = end + 1;
+			es->es_len = len2;
 		}
 		goto out;
 	}
@@ -487,13 +496,38 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
 		es = rb_entry(node, struct extent_status, rb_node);
 	}
 
-	if (es && es->start < end + 1) {
+	if (es && es->es_lblk < end + 1) {
 		len1 = extent_status_end(es) - end;
-		es->start = end + 1;
-		es->len = len1;
+		es->es_lblk = end + 1;
+		es->es_len = len1;
 	}
 
 out:
+	return err;
+}
+/*
+ * ext4_es_remove_extent() removes a space from a extent status tree.
+ *
+ * Return 0 on success, error code on failure.
+ */
+int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
+			  ext4_lblk_t len)
+{
+	struct ext4_es_tree *tree;
+	ext4_lblk_t end;
+	int err = 0;
+
+	trace_ext4_es_remove_extent(inode, lblk, len);
+	es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
+		 lblk, len, inode->i_ino);
+
+	end = lblk + len - 1;
+	BUG_ON(end < lblk);
+
+	tree = &EXT4_I(inode)->i_es_tree;
+
+	write_lock(&EXT4_I(inode)->i_es_lock);
+	err = __es_remove_extent(tree, lblk, end);
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 	ext4_es_print_tree(inode);
 	return err;
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 077f82d..81e9339 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -22,8 +22,8 @@
 
 struct extent_status {
 	struct rb_node rb_node;
-	ext4_lblk_t start;	/* first block extent covers */
-	ext4_lblk_t len;	/* length of extent in block */
+	ext4_lblk_t es_lblk;	/* first logical block extent covers */
+	ext4_lblk_t es_len;	/* length of extent in block */
 };
 
 struct ext4_es_tree {
@@ -35,9 +35,9 @@ extern int __init ext4_init_es(void);
 extern void ext4_exit_es(void);
 extern void ext4_es_init_tree(struct ext4_es_tree *tree);
 
-extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t start,
+extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
-extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t start,
+extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
 extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
 				struct extent_status *es);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 405565a..718c49f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -464,10 +464,9 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
 		 * If there is a delay extent at this offset,
 		 * it will be as a data.
 		 */
-		es.start = last;
+		es.es_lblk = last;
 		(void)ext4_es_find_extent(inode, &es);
-		if (last >= es.start &&
-		    last < es.start + es.len) {
+		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
 			if (last != start)
 				dataoff = last << blkbits;
 			break;
@@ -549,11 +548,10 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
 		 * If there is a delay extent at this offset,
 		 * we will skip this extent.
 		 */
-		es.start = last;
+		es.es_lblk = last;
 		(void)ext4_es_find_extent(inode, &es);
-		if (last >= es.start &&
-		    last < es.start + es.len) {
-			last = es.start + es.len;
+		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
+			last = es.es_lblk + es.es_len;
 			holeoff = last << blkbits;
 			continue;
 		}
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 7e8c36b..952628a 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2068,75 +2068,75 @@ TRACE_EVENT(ext4_ext_remove_space_done,
 );
 
 TRACE_EVENT(ext4_es_insert_extent,
-	TP_PROTO(struct inode *inode, ext4_lblk_t start, ext4_lblk_t len),
+	TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
 
-	TP_ARGS(inode, start, len),
+	TP_ARGS(inode, lblk, len),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(	ino_t,	ino			)
-		__field(	loff_t,	start			)
+		__field(	loff_t,	lblk			)
 		__field(	loff_t, len			)
 	),
 
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= start;
+		__entry->lblk	= lblk;
 		__entry->len	= len;
 	),
 
 	TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->start, __entry->len)
+		  __entry->lblk, __entry->len)
 );
 
 TRACE_EVENT(ext4_es_remove_extent,
-	TP_PROTO(struct inode *inode, ext4_lblk_t start, ext4_lblk_t len),
+	TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
 
-	TP_ARGS(inode, start, len),
+	TP_ARGS(inode, lblk, len),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(	ino_t,	ino			)
-		__field(	loff_t,	start			)
+		__field(	loff_t,	lblk			)
 		__field(	loff_t,	len			)
 	),
 
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= start;
+		__entry->lblk	= lblk;
 		__entry->len	= len;
 	),
 
 	TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->start, __entry->len)
+		  __entry->lblk, __entry->len)
 );
 
 TRACE_EVENT(ext4_es_find_extent_enter,
-	TP_PROTO(struct inode *inode, ext4_lblk_t start),
+	TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
 
-	TP_ARGS(inode, start),
+	TP_ARGS(inode, lblk),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,		dev		)
 		__field(	ino_t,		ino		)
-		__field(	ext4_lblk_t,	start		)
+		__field(	ext4_lblk_t,	lblk		)
 	),
 
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= start;
+		__entry->lblk	= lblk;
 	),
 
-	TP_printk("dev %d,%d ino %lu start %u",
+	TP_printk("dev %d,%d ino %lu lblk %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  (unsigned long) __entry->ino, __entry->start)
+		  (unsigned long) __entry->ino, __entry->lblk)
 );
 
 TRACE_EVENT(ext4_es_find_extent_exit,
@@ -2148,7 +2148,7 @@ TRACE_EVENT(ext4_es_find_extent_exit,
 	TP_STRUCT__entry(
 		__field(	dev_t,		dev		)
 		__field(	ino_t,		ino		)
-		__field(	ext4_lblk_t,	start		)
+		__field(	ext4_lblk_t,	lblk		)
 		__field(	ext4_lblk_t,	len		)
 		__field(	ext4_lblk_t,	ret		)
 	),
@@ -2156,15 +2156,15 @@ TRACE_EVENT(ext4_es_find_extent_exit,
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= es->start;
-		__entry->len	= es->len;
+		__entry->lblk	= es->es_lblk;
+		__entry->len	= es->es_len;
 		__entry->ret	= ret;
 	),
 
 	TP_printk("dev %d,%d ino %lu es [%u/%u) ret %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->start, __entry->len, __entry->ret)
+		  __entry->lblk, __entry->len, __entry->ret)
 );
 
 #endif /* _TRACE_EXT4_H */
-- 
1.7.12.rc2.18.g61b472e


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/7 v2] ext4: remove EXT4_MAP_FROM_CLUSTER flag
  2013-01-11 10:53 [PATCH 0/7 v2] ext4: extent status tree (step2) Zheng Liu
  2013-01-11 10:53 ` [PATCH 1/7 v2] ext4: refine extent status tree Zheng Liu
@ 2013-01-11 10:53 ` Zheng Liu
  2013-01-11 10:53 ` [PATCH 3/7 v2] ext4: add physical block and status member into extent status tree Zheng Liu
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Zheng Liu @ 2013-01-11 10:53 UTC (permalink / raw)
  To: linux-ext4; +Cc: Jan kara, Theodore Ts'o, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

EXT4_MAP_FROM_CLUSTER flag is only used in ext4_da_map_blocks() and
ext4_ext_map_blocks() to indicate whether a cluster has been reserved or
not.  But in ext4_da_map_blocks() we can check it directly.  Meanwhile
this flag couldn't appear on the buffer head.  So it can be replaced
with a local variable.

CC: Jan kara <jack@suse.cz>
CC: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/ext4.h    | 15 +--------------
 fs/ext4/extents.c | 18 ++++--------------
 fs/ext4/inode.c   | 11 ++---------
 3 files changed, 7 insertions(+), 37 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8462eb3..36145ef1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -155,17 +155,9 @@ struct ext4_allocation_request {
 #define EXT4_MAP_UNWRITTEN	(1 << BH_Unwritten)
 #define EXT4_MAP_BOUNDARY	(1 << BH_Boundary)
 #define EXT4_MAP_UNINIT		(1 << BH_Uninit)
-/* Sometimes (in the bigalloc case, from ext4_da_get_block_prep) the caller of
- * ext4_map_blocks wants to know whether or not the underlying cluster has
- * already been accounted for. EXT4_MAP_FROM_CLUSTER conveys to the caller that
- * the requested mapping was from previously mapped (or delayed allocated)
- * cluster. We use BH_AllocFromCluster only for this flag. BH_AllocFromCluster
- * should never appear on buffer_head's state flags.
- */
-#define EXT4_MAP_FROM_CLUSTER	(1 << BH_AllocFromCluster)
 #define EXT4_MAP_FLAGS		(EXT4_MAP_NEW | EXT4_MAP_MAPPED |\
 				 EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY |\
-				 EXT4_MAP_UNINIT | EXT4_MAP_FROM_CLUSTER)
+				 EXT4_MAP_UNINIT)
 
 struct ext4_map_blocks {
 	ext4_fsblk_t m_pblk;
@@ -2553,11 +2545,6 @@ extern int ext4_mmp_csum_verify(struct super_block *sb,
 enum ext4_state_bits {
 	BH_Uninit	/* blocks are allocated but uninitialized on disk */
 	  = BH_JBDPrivateStart,
-	BH_AllocFromCluster,	/* allocated blocks were part of already
-				 * allocated cluster. Note that this flag will
-				 * never, ever appear in a buffer_head's state
-				 * flag. See EXT4_MAP_FROM_CLUSTER to see where
-				 * this is used. */
 };
 
 BUFFER_FNS(Uninit, uninit)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f7bf616..aa9a6d2 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3894,6 +3894,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	ext4_io_end_t *io = ext4_inode_aio(inode);
 	ext4_lblk_t cluster_offset;
 	int set_unwritten = 0;
+	int from_cluster = 0;
 
 	ext_debug("blocks %u/%u requested for inode %lu\n",
 		  map->m_lblk, map->m_len, inode->i_ino);
@@ -3902,10 +3903,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	/* check in cache */
 	if (ext4_ext_in_cache(inode, map->m_lblk, &newex)) {
 		if (!newex.ee_start_lo && !newex.ee_start_hi) {
-			if ((sbi->s_cluster_ratio > 1) &&
-			    ext4_find_delalloc_cluster(inode, map->m_lblk))
-				map->m_flags |= EXT4_MAP_FROM_CLUSTER;
-
 			if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
 				/*
 				 * block isn't allocated yet and
@@ -3916,8 +3913,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 			/* we should allocate requested block */
 		} else {
 			/* block is already allocated */
-			if (sbi->s_cluster_ratio > 1)
-				map->m_flags |= EXT4_MAP_FROM_CLUSTER;
 			newblock = map->m_lblk
 				   - le32_to_cpu(newex.ee_block)
 				   + ext4_ext_pblock(&newex);
@@ -3990,10 +3985,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		}
 	}
 
-	if ((sbi->s_cluster_ratio > 1) &&
-	    ext4_find_delalloc_cluster(inode, map->m_lblk))
-		map->m_flags |= EXT4_MAP_FROM_CLUSTER;
-
 	/*
 	 * requested block isn't allocated yet;
 	 * we couldn't try to create block if create flag is zero
@@ -4010,7 +4001,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	/*
 	 * Okay, we need to do block allocation.
 	 */
-	map->m_flags &= ~EXT4_MAP_FROM_CLUSTER;
 	newex.ee_block = cpu_to_le32(map->m_lblk);
 	cluster_offset = map->m_lblk & (sbi->s_cluster_ratio-1);
 
@@ -4022,7 +4012,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	    get_implied_cluster_alloc(inode->i_sb, map, ex, path)) {
 		ar.len = allocated = map->m_len;
 		newblock = map->m_pblk;
-		map->m_flags |= EXT4_MAP_FROM_CLUSTER;
+		from_cluster = 1;
 		goto got_allocated_blocks;
 	}
 
@@ -4043,7 +4033,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	    get_implied_cluster_alloc(inode->i_sb, map, ex2, path)) {
 		ar.len = allocated = map->m_len;
 		newblock = map->m_pblk;
-		map->m_flags |= EXT4_MAP_FROM_CLUSTER;
+		from_cluster = 1;
 		goto got_allocated_blocks;
 	}
 
@@ -4168,7 +4158,7 @@ got_allocated_blocks:
 		 */
 		reserved_clusters = get_reserved_cluster_alloc(inode,
 						map->m_lblk, allocated);
-		if (map->m_flags & EXT4_MAP_FROM_CLUSTER) {
+		if (from_cluster) {
 			if (reserved_clusters) {
 				/*
 				 * We have clusters reserved for this range.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cbfe13b..4d066f3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1798,9 +1798,6 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		 * of mapping from cluster so that the reserved space
 		 * is calculated properly.
 		 */
-		if ((EXT4_SB(inode->i_sb)->s_cluster_ratio > 1) &&
-		    ext4_find_delalloc_cluster(inode, map->m_lblk))
-			map->m_flags |= EXT4_MAP_FROM_CLUSTER;
 		retval = 0;
 	} else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 		retval = ext4_ext_map_blocks(NULL, inode, map, 0);
@@ -1814,7 +1811,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		 */
 		/* If the block was allocated from previously allocated cluster,
 		 * then we dont need to reserve it again. */
-		if (!(map->m_flags & EXT4_MAP_FROM_CLUSTER)) {
+		if ((EXT4_SB(inode->i_sb)->s_cluster_ratio == 1) ||
+		    !ext4_find_delalloc_cluster(inode, map->m_lblk)) {
 			retval = ext4_da_reserve_space(inode, iblock);
 			if (retval)
 				/* not enough space to reserve */
@@ -1825,11 +1823,6 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		if (retval)
 			goto out_unlock;
 
-		/* Clear EXT4_MAP_FROM_CLUSTER flag since its purpose is served
-		 * and it should not appear on the bh->b_state.
-		 */
-		map->m_flags &= ~EXT4_MAP_FROM_CLUSTER;

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/7 v2] ext4: add physical block and status member into extent status tree
  2013-01-11 10:53 [PATCH 0/7 v2] ext4: extent status tree (step2) Zheng Liu
  2013-01-11 10:53 ` [PATCH 1/7 v2] ext4: refine extent status tree Zheng Liu
  2013-01-11 10:53 ` [PATCH 2/7 v2] ext4: remove EXT4_MAP_FROM_CLUSTER flag Zheng Liu
@ 2013-01-11 10:53 ` Zheng Liu
  2013-01-17  4:42   ` Theodore Ts'o
  2013-01-11 10:53 ` [PATCH 4/7 v2] ext4: adjust interfaces of " Zheng Liu
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Zheng Liu @ 2013-01-11 10:53 UTC (permalink / raw)
  To: linux-ext4; +Cc: Jan kara, Theodore Ts'o, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

es_pblk is used to record physical block that maps to the disk.  es_status is
used to record the status of the extent.  Three status are defined, which are
written, unwritten and delayed.

CC: Jan kara <jack@suse.cz>
CC: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents_status.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 81e9339..85115bb 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -20,10 +20,18 @@
 #define es_debug(fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 #endif
 
+enum {
+	EXTENT_STATUS_WRITTEN = 0,	/* written extent */
+	EXTENT_STATUS_UNWRITTEN = 1,	/* unwritten extent */
+	EXTENT_STATUS_DELAYED = 2,	/* delayed extent */
+};
+
 struct extent_status {
 	struct rb_node rb_node;
 	ext4_lblk_t es_lblk;	/* first logical block extent covers */
 	ext4_lblk_t es_len;	/* length of extent in block */
+	ext4_fsblk_t es_pblk;	/* first physical block */
+	int es_status;		/* record the status of extent */
 };
 
 struct ext4_es_tree {
-- 
1.7.12.rc2.18.g61b472e


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 4/7 v2] ext4: adjust interfaces of extent status tree
  2013-01-11 10:53 [PATCH 0/7 v2] ext4: extent status tree (step2) Zheng Liu
                   ` (2 preceding siblings ...)
  2013-01-11 10:53 ` [PATCH 3/7 v2] ext4: add physical block and status member into extent status tree Zheng Liu
@ 2013-01-11 10:53 ` Zheng Liu
  2013-01-11 10:53 ` [PATCH 5/7 v2] ext4: track all extent status in " Zheng Liu
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Zheng Liu @ 2013-01-11 10:53 UTC (permalink / raw)
  To: linux-ext4; +Cc: Jan kara, Theodore Ts'o, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

Due to two members are added into extent status tree, all interfaces need to be
adjusted.

CC: Jan kara <jack@suse.cz>
CC: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents_status.c    | 55 ++++++++++++++++++++++++++++++++++++---------
 fs/ext4/extents_status.h    | 24 +++++++++++++++++++-
 fs/ext4/inode.c             |  3 ++-
 include/trace/events/ext4.h | 34 ++++++++++++++++++----------
 4 files changed, 91 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 0662369..b5a5224 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -174,7 +174,8 @@ static void ext4_es_print_tree(struct inode *inode)
 	while (node) {
 		struct extent_status *es;
 		es = rb_entry(node, struct extent_status, rb_node);
-		printk(KERN_DEBUG " [%u/%u)", es->es_lblk, es->es_len);
+		printk(KERN_DEBUG " [%u/%u) %llu %d",
+		       es->es_lblk, es->es_len, es->es_pblk, es->es_status);
 		node = rb_next(node);
 	}
 	printk(KERN_DEBUG "\n");
@@ -229,7 +230,7 @@ static struct extent_status *__es_tree_search(struct rb_root *root,
  * @es: delayed extent that we found
  *
  * Returns the first block of the next extent after es, otherwise
- * EXT_MAX_BLOCKS if no delay extent is found.
+ * EXT_MAX_BLOCKS if no extent is found.
  * Delayed extent is returned via @es.
  */
 ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
@@ -244,12 +245,13 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
 	read_lock(&EXT4_I(inode)->i_es_lock);
 	tree = &EXT4_I(inode)->i_es_tree;
 
-	/* find delay extent in cache firstly */
+	/* find extent in cache firstly */
 	if (tree->cache_es) {
 		es1 = tree->cache_es;
 		if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
-			es_debug("%u cached by [%u/%u)\n",
-				 es->es_lblk, es1->es_lblk, es1->es_len);
+			es_debug("%u cached by [%u/%u) %llu %d\n",
+				 es->es_lblk, es1->es_lblk, es1->es_len,
+				 es1->es_pblk, es1->es_status);
 			goto out;
 		}
 	}
@@ -262,6 +264,8 @@ out:
 		tree->cache_es = es1;
 		es->es_lblk = es1->es_lblk;
 		es->es_len = es1->es_len;
+		es->es_pblk = es1->es_pblk;
+		es->es_status = es1->es_status;
 		node = rb_next(&es1->rb_node);
 		if (node) {
 			es1 = rb_entry(node, struct extent_status, rb_node);
@@ -276,7 +280,8 @@ out:
 }
 
 static struct extent_status *
-ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
+ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len,
+		     ext4_fsblk_t pblk, int status)
 {
 	struct extent_status *es;
 	es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
@@ -284,6 +289,8 @@ ext4_es_alloc_extent(ext4_lblk_t lblk, ext4_lblk_t len)
 		return NULL;
 	es->es_lblk = lblk;
 	es->es_len = len;
+	es->es_pblk = pblk;
+	es->es_status = status;
 	return es;
 }
 
@@ -296,6 +303,8 @@ static void ext4_es_free_extent(struct extent_status *es)
  * Check whether or not two extents can be merged
  * Condition:
  *  - logical block number is contiguous
+ *  - physical block number is contiguous
+ *  - status is equal
  */
 static int ext4_es_can_be_merged(struct extent_status *es1,
 				 struct extent_status *es2)
@@ -303,6 +312,13 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
 	if (es1->es_lblk + es1->es_len != es2->es_lblk)
 		return 0;
 
+	if (es1->es_status != es2->es_status)
+		return 0;
+
+	if (!ext4_es_is_delayed(es1) &&
+	    (es1->es_pblk + es1->es_len != es2->es_pblk))
+		return 0;
+
 	return 1;
 }
 
@@ -366,6 +382,8 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
 			if (ext4_es_can_be_merged(newes, es)) {
 				es->es_lblk = newes->es_lblk;
 				es->es_len += newes->es_len;
+				es->es_pblk = ext4_es_get_pblock(es,
+							newes->es_pblk);
 				es = ext4_es_try_to_merge_left(tree, es);
 				goto out;
 			}
@@ -383,7 +401,8 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
 		}
 	}
 
-	es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len);
+	es = ext4_es_alloc_extent(newes->es_lblk, newes->es_len,
+				  newes->es_pblk, newes->es_status);
 	if (!es)
 		return -ENOMEM;
 	rb_link_node(&es->rb_node, parent, p);
@@ -402,21 +421,23 @@ out:
  * Return 0 on success, error code on failure.
  */
 int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
-			  ext4_lblk_t len)
+			  ext4_lblk_t len, ext4_fsblk_t pblk, int status)
 {
 	struct ext4_es_tree *tree;
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
 	int err = 0;
 
-	trace_ext4_es_insert_extent(inode, lblk, len);
-	es_debug("add [%u/%u) to extent status tree of inode %lu\n",
-		 lblk, len, inode->i_ino);
+	es_debug("add [%u/%u) %llu %d to extent status tree of inode %lu\n",
+		 lblk, len, pblk, status, inode->i_ino);
 
 	BUG_ON(end < lblk);
 
 	newes.es_lblk = lblk;
 	newes.es_len = len;
+	newes.es_pblk = pblk;
+	newes.es_status = status;
+	trace_ext4_es_insert_extent(inode, &newes);
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
 	tree = &EXT4_I(inode)->i_es_tree;
@@ -453,6 +474,9 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 
 	orig_es.es_lblk = es->es_lblk;
 	orig_es.es_len = es->es_len;
+	orig_es.es_pblk = es->es_pblk;
+	orig_es.es_status = es->es_status;
+
 	len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
 	len2 = extent_status_end(es) > end ?
 	       extent_status_end(es) - end : 0;
@@ -464,6 +488,9 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 
 			newes.es_lblk = end + 1;
 			newes.es_len = len2;
+			newes.es_pblk = ext4_es_get_pblock(&orig_es,
+				orig_es.es_pblk + orig_es.es_len - len2);
+			newes.es_status = orig_es.es_status;
 			err = __es_insert_extent(tree, &newes);
 			if (err) {
 				es->es_lblk = orig_es.es_lblk;
@@ -473,6 +500,8 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 		} else {
 			es->es_lblk = end + 1;
 			es->es_len = len2;
+			es->es_pblk = ext4_es_get_pblock(es,
+				orig_es.es_pblk + orig_es.es_len - len2);
 		}
 		goto out;
 	}
@@ -497,9 +526,13 @@ static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 	}
 
 	if (es && es->es_lblk < end + 1) {
+		ext4_lblk_t orig_len = es->es_len;
+
 		len1 = extent_status_end(es) - end;
 		es->es_lblk = end + 1;
 		es->es_len = len1;
+		es->es_pblk = ext4_es_get_pblock(es,
+						 es->es_pblk + orig_len - len1);
 	}
 
 out:
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 85115bb..dad3b5e 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -44,10 +44,32 @@ extern void ext4_exit_es(void);
 extern void ext4_es_init_tree(struct ext4_es_tree *tree);
 
 extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
-				 ext4_lblk_t len);
+				 ext4_lblk_t len, ext4_fsblk_t pblk,
+				 int status);
 extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
 extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
 				struct extent_status *es);
 
+static inline int ext4_es_is_written(struct extent_status *es)
+{
+	return (es->es_status == EXTENT_STATUS_WRITTEN);
+}
+
+static inline int ext4_es_is_unwritten(struct extent_status *es)
+{
+	return (es->es_status == EXTENT_STATUS_UNWRITTEN);
+}
+
+static inline int ext4_es_is_delayed(struct extent_status *es)
+{
+	return (es->es_status == EXTENT_STATUS_DELAYED);
+}
+
+static inline ext4_fsblk_t ext4_es_get_pblock(struct extent_status *es,
+					      ext4_fsblk_t pb)
+{
+	return (ext4_es_is_delayed(es) ? ~0 : pb);
+}
+
 #endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4d066f3..e09c7cf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1819,7 +1819,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 				goto out_unlock;
 		}
 
-		retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len);
+		retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					       ~0, EXTENT_STATUS_DELAYED);
 		if (retval)
 			goto out_unlock;
 
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 952628a..5483f62 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2068,28 +2068,33 @@ TRACE_EVENT(ext4_ext_remove_space_done,
 );
 
 TRACE_EVENT(ext4_es_insert_extent,
-	TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
+	TP_PROTO(struct inode *inode, struct extent_status *es),
 
-	TP_ARGS(inode, lblk, len),
+	TP_ARGS(inode, es),
 
 	TP_STRUCT__entry(
-		__field(	dev_t,	dev			)
-		__field(	ino_t,	ino			)
-		__field(	loff_t,	lblk			)
-		__field(	loff_t, len			)
+		__field(	dev_t,		dev		)
+		__field(	ino_t,		ino		)
+		__field(	ext4_lblk_t,	lblk		)
+		__field(	ext4_lblk_t,	len		)
+		__field(	ext4_fsblk_t,	pblk		)
+		__field(	int,		status		)
 	),
 
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->lblk	= lblk;
-		__entry->len	= len;
+		__entry->lblk	= es->es_lblk;
+		__entry->len	= es->es_len;
+		__entry->pblk	= es->es_pblk;
+		__entry->status	= es->es_status;
 	),
 
-	TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
+	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->lblk, __entry->len)
+		  __entry->lblk, __entry->len,
+		  __entry->pblk, __entry->status)
 );
 
 TRACE_EVENT(ext4_es_remove_extent,
@@ -2150,6 +2155,8 @@ TRACE_EVENT(ext4_es_find_extent_exit,
 		__field(	ino_t,		ino		)
 		__field(	ext4_lblk_t,	lblk		)
 		__field(	ext4_lblk_t,	len		)
+		__field(	ext4_fsblk_t,	pblk		)
+		__field(	int,		status		)
 		__field(	ext4_lblk_t,	ret		)
 	),
 
@@ -2158,13 +2165,16 @@ TRACE_EVENT(ext4_es_find_extent_exit,
 		__entry->ino	= inode->i_ino;
 		__entry->lblk	= es->es_lblk;
 		__entry->len	= es->es_len;
+		__entry->pblk	= es->es_pblk;
+		__entry->status	= es->es_status;
 		__entry->ret	= ret;
 	),
 
-	TP_printk("dev %d,%d ino %lu es [%u/%u) ret %u",
+	TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %d ret %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino,
-		  __entry->lblk, __entry->len, __entry->ret)
+		  __entry->lblk, __entry->len,
+		  __entry->pblk, __entry->status, __entry->ret)
 );
 
 #endif /* _TRACE_EXT4_H */
-- 
1.7.12.rc2.18.g61b472e


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 5/7 v2] ext4: track all extent status in extent status tree
  2013-01-11 10:53 [PATCH 0/7 v2] ext4: extent status tree (step2) Zheng Liu
                   ` (3 preceding siblings ...)
  2013-01-11 10:53 ` [PATCH 4/7 v2] ext4: adjust interfaces of " Zheng Liu
@ 2013-01-11 10:53 ` Zheng Liu
  2013-01-23  4:31   ` Theodore Ts'o
  2013-01-11 10:53 ` [PATCH 6/7 v2] ext4: lookup block mapping " Zheng Liu
  2013-01-11 10:53 ` [PATCH 7/7 v2] ext4: reclaim extents from " Zheng Liu
  6 siblings, 1 reply; 29+ messages in thread
From: Zheng Liu @ 2013-01-11 10:53 UTC (permalink / raw)
  To: linux-ext4; +Cc: Jan kara, Theodore Ts'o, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

After record phycisal block and status, extent status tree is able to track the
status of every extents.  When we call _map_blocks functions to lookup an extent
or create a new written/unwritten/delayed extent, this extent will be inserted
into extent status tree.

We don't load all extents from disk in alloc_inode() because it costs too much
memory, and, when open/close a file very frequently, it will takes too much time
to load all extent information.  So currently when we create/lookup an extent,
this extent will be inserted into extent status tree.  Hence, the status in
extent status tree might be not completely.

CC: Jan kara <jack@suse.cz>
CC: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents.c |  5 +++-
 fs/ext4/file.c    |  6 +++--
 fs/ext4/inode.c   | 73 ++++++++++++++++++++++++++++++++++++-------------------
 3 files changed, 56 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index aa9a6d2..d23a654 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2074,7 +2074,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
 		}
 
 		/* This is possible iff next == next_del == EXT_MAX_BLOCKS */
-		if (next == next_del) {
+		if (next == next_del && next_del == EXT_MAX_BLOCKS) {
 			flags |= FIEMAP_EXTENT_LAST;
 			if (unlikely(next_del != EXT_MAX_BLOCKS ||
 				     next != EXT_MAX_BLOCKS)) {
@@ -4570,6 +4570,9 @@ static int ext4_find_delayed_extent(struct inode *inode,
 			/* A hole found. */
 			return 0;
 
+		if (!ext4_es_is_delayed(&es))
+			return 0;
+
 		if (es.es_lblk > newex->ec_block) {
 			/* A hole found. */
 			newex->ec_len = min(es.es_lblk - newex->ec_block,
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 718c49f..afaf9f15 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -466,7 +466,8 @@ static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
 		 */
 		es.es_lblk = last;
 		(void)ext4_es_find_extent(inode, &es);
-		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
+		if (ext4_es_is_delayed(&es) &&
+		    last >= es.es_lblk && last < es.es_lblk + es.es_len) {
 			if (last != start)
 				dataoff = last << blkbits;
 			break;
@@ -550,7 +551,8 @@ static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
 		 */
 		es.es_lblk = last;
 		(void)ext4_es_find_extent(inode, &es);
-		if (last >= es.es_lblk && last < es.es_lblk + es.es_len) {
+		if (ext4_es_is_delayed(&es) &&
+		    last >= es.es_lblk && last < es.es_lblk + es.es_len) {
 			last = es.es_lblk + es.es_len;
 			holeoff = last << blkbits;
 			continue;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e09c7cf..f0dda2a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -527,20 +527,20 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		retval = ext4_ind_map_blocks(handle, inode, map, flags &
 					     EXT4_GET_BLOCKS_KEEP_SIZE);
 	}
+	if (retval > 0) {
+		int ret, status;
+		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+		ret = ext4_es_insert_extent(inode, map->m_lblk,
+					    map->m_len, map->m_pblk, status);
+		if (ret < 0)
+			retval = ret;
+	}
 	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
 		up_read((&EXT4_I(inode)->i_data_sem));
 
 	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
-		int ret;
-		if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
-			/* delayed alloc may be allocated by fallocate and
-			 * coverted to initialized by directIO.
-			 * we need to handle delayed extent here.
-			 */
-			down_write((&EXT4_I(inode)->i_data_sem));
-			goto delayed_mapped;
-		}
-		ret = check_block_validity(inode, map);
+		int ret = check_block_validity(inode, map);
 		if (ret != 0)
 			return ret;
 	}
@@ -615,18 +615,27 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE))
 			ext4_da_update_reserve_space(inode, retval, 1);
 	}
-	if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
+	if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
 		ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
 
-		if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
-			int ret;
-delayed_mapped:
-			/* delayed allocation blocks has been allocated */
-			ret = ext4_es_remove_extent(inode, map->m_lblk,
-						    map->m_len);
-			if (ret < 0)
-				retval = ret;
-		}
+	if (retval > 0) {
+		int ret, status;
+
+		if (flags & EXT4_GET_BLOCKS_PRE_IO)
+			status = EXTENT_STATUS_UNWRITTEN;
+		else if (flags & EXT4_GET_BLOCKS_CONVERT)
+			status = EXTENT_STATUS_WRITTEN;
+		else if (flags & EXT4_GET_BLOCKS_UNINIT_EXT)
+			status = EXTENT_STATUS_UNWRITTEN;
+		else if (flags & EXT4_GET_BLOCKS_CREATE)
+			status = EXTENT_STATUS_WRITTEN;
+		else
+			BUG_ON(1);
+
+		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					    map->m_pblk, status);
+		if (ret < 0)
+			retval = ret;
 	}
 
 	up_write((&EXT4_I(inode)->i_data_sem));
@@ -1805,6 +1814,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
 
 	if (retval == 0) {
+		int ret;
 		/*
 		 * XXX: __block_prepare_write() unmaps passed block,
 		 * is it OK?
@@ -1813,20 +1823,33 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		 * then we dont need to reserve it again. */
 		if ((EXT4_SB(inode->i_sb)->s_cluster_ratio == 1) ||
 		    !ext4_find_delalloc_cluster(inode, map->m_lblk)) {
-			retval = ext4_da_reserve_space(inode, iblock);
-			if (retval)
+			ret = ext4_da_reserve_space(inode, iblock);
+			if (ret) {
 				/* not enough space to reserve */
+				retval = ret;
 				goto out_unlock;
+			}
 		}
 
-		retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-					       ~0, EXTENT_STATUS_DELAYED);
-		if (retval)
+		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					    ~0, EXTENT_STATUS_DELAYED);
+		if (ret) {
+			retval = ret;
 			goto out_unlock;
+		}
 
 		map_bh(bh, inode->i_sb, invalid_block);
 		set_buffer_new(bh);
 		set_buffer_delay(bh);
+	} else if (retval > 0) {
+		int ret, status;
+
+		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+		ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+					    map->m_pblk, status);
+		if (ret != 0)
+			retval = ret;
 	}
 
 out_unlock:
-- 
1.7.12.rc2.18.g61b472e


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 6/7 v2] ext4: lookup block mapping in extent status tree
  2013-01-11 10:53 [PATCH 0/7 v2] ext4: extent status tree (step2) Zheng Liu
                   ` (4 preceding siblings ...)
  2013-01-11 10:53 ` [PATCH 5/7 v2] ext4: track all extent status in " Zheng Liu
@ 2013-01-11 10:53 ` Zheng Liu
  2013-01-18  4:00   ` Theodore Ts'o
  2013-01-11 10:53 ` [PATCH 7/7 v2] ext4: reclaim extents from " Zheng Liu
  6 siblings, 1 reply; 29+ messages in thread
From: Zheng Liu @ 2013-01-11 10:53 UTC (permalink / raw)
  To: linux-ext4; +Cc: Jan kara, Theodore Ts'o, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

After tracking all extent status, we already have a extent cache in memory.
Every time we want to lookup a block mapping, we can first try to lookup it in
extent status tree to avoid a potential disk I/O.

A new function called ext4_es_lookup_extent is defined to finish this work.
When we try to lookup a block mapping, we always call ext4_map_blocks and/or
ext4_da_map_blocks.  So in these functions we first try to lookup a block
mapping in extent status tree.

CC: Jan kara <jack@suse.cz>
CC: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
v2 <- v1:
 * add tracepoint for ext4_es_lookup_extent()

 fs/ext4/extents_status.c    | 59 +++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/extents_status.h    |  1 +
 fs/ext4/inode.c             | 55 ++++++++++++++++++++++++++++++++++++++++++
 include/trace/events/ext4.h | 56 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 171 insertions(+)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index b5a5224..ac9bb80 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -454,6 +454,65 @@ error:
 	return err;
 }
 
+/*
+ * ext4_es_lookup_extent() looks up an extent in extent status tree.
+ *
+ * ext4_es_lookup_extent is called by ext4_map_blocks/ext4_da_map_blocks.
+ *
+ * Return: 1 on found, 0 on not
+ */
+int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es)
+{
+	struct ext4_es_tree *tree;
+	struct extent_status *es1;
+	struct rb_node *node;
+	int found = 0;
+
+	trace_ext4_es_lookup_extent_enter(inode, es->es_lblk);
+	es_debug("lookup extent in block %u\n", es->es_lblk);
+
+	tree = &EXT4_I(inode)->i_es_tree;
+	read_lock(&EXT4_I(inode)->i_es_lock);
+
+	/* find extent in cache firstly */
+	if (tree->cache_es) {
+		es1 = tree->cache_es;
+		if (in_range(es->es_lblk, es1->es_lblk, es1->es_len)) {
+			es_debug("%u cached by [%u/%u)\n",
+				 es->es_lblk, es1->es_lblk, es1->es_len);
+			found = 1;
+			goto out;
+		}
+	}
+
+	es->es_len = 0;
+	node = tree->root.rb_node;
+	while (node) {
+		es1 = rb_entry(node, struct extent_status, rb_node);
+		if (es->es_lblk < es1->es_lblk)
+			node = node->rb_left;
+		else if (es->es_lblk > extent_status_end(es1))
+			node = node->rb_right;
+		else {
+			found = 1;
+			break;
+		}
+	}
+
+out:
+	if (found) {
+		es->es_lblk = es1->es_lblk;
+		es->es_len = es1->es_len;
+		es->es_pblk = es1->es_pblk;
+		es->es_status = es1->es_status;
+	}
+
+	read_unlock(&EXT4_I(inode)->i_es_lock);
+
+	trace_ext4_es_lookup_extent_exit(inode, es, found);
+	return found;
+}
+
 static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 				 ext4_lblk_t end)
 {
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index dad3b5e..45e623e 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -50,6 +50,7 @@ extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len);
 extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
 				struct extent_status *es);
+extern int ext4_es_lookup_extent(struct inode *inode, struct extent_status *es);
 
 static inline int ext4_es_is_written(struct extent_status *es)
 {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f0dda2a..c86b086 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -508,12 +508,39 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
 int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		    struct ext4_map_blocks *map, int flags)
 {
+	struct extent_status es;
 	int retval;
 
 	map->m_flags = 0;
 	ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
 		  "logical block %lu\n", inode->i_ino, flags, map->m_len,
 		  (unsigned long) map->m_lblk);
+
+	/* Lookup extent status tree firstly */
+	es.es_lblk = map->m_lblk;
+	if (ext4_es_lookup_extent(inode, &es)) {
+		if (ext4_es_is_written(&es)) {
+			map->m_pblk = es.es_pblk + map->m_lblk - es.es_lblk;
+			map->m_flags |= EXT4_MAP_MAPPED;
+			retval = es.es_len - (map->m_lblk - es.es_lblk);
+			if (retval > map->m_len)
+				retval = map->m_len;
+			map->m_len = retval;
+		} else if (ext4_es_is_unwritten(&es)) {
+			map->m_pblk = es.es_pblk + map->m_lblk - es.es_lblk;
+			map->m_flags |= EXT4_MAP_UNWRITTEN;
+			retval = es.es_len - (map->m_lblk - es.es_lblk);
+			if (retval > map->m_len)
+				retval = map->m_len;
+			map->m_len = retval;
+		} else if (ext4_es_is_delayed(&es)) {
+			retval = 0;
+		} else {
+			BUG_ON(1);
+		}
+		goto found;
+	}
+
 	/*
 	 * Try to see if we can get the block without requesting a new
 	 * file system block.
@@ -539,6 +566,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
 		up_read((&EXT4_I(inode)->i_data_sem));
 
+found:
 	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
 		int ret = check_block_validity(inode, map);
 		if (ret != 0)
@@ -1784,6 +1812,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 			      struct ext4_map_blocks *map,
 			      struct buffer_head *bh)
 {
+	struct extent_status es;
 	int retval;
 	sector_t invalid_block = ~((sector_t) 0xffff);
 
@@ -1794,6 +1823,32 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 	ext_debug("ext4_da_map_blocks(): inode %lu, max_blocks %u,"
 		  "logical block %lu\n", inode->i_ino, map->m_len,
 		  (unsigned long) map->m_lblk);
+
+	/* Lookup extent status tree firstly */
+	es.es_lblk = iblock;
+	if (ext4_es_lookup_extent(inode, &es)) {
+		map->m_pblk = es.es_pblk + iblock - es.es_lblk;
+		retval = es.es_len - (iblock - es.es_lblk);
+		if (retval > map->m_len)
+			retval = map->m_len;
+		map->m_len = retval;
+		if (ext4_es_is_written(&es)) {
+			map->m_flags |= EXT4_MAP_MAPPED;
+		} else if (ext4_es_is_unwritten(&es)) {
+			map->m_flags |= EXT4_MAP_UNWRITTEN;
+		} else if (ext4_es_is_delayed(&es)) {
+			map_bh(bh, inode->i_sb, invalid_block);
+			set_buffer_new(bh);
+			set_buffer_delay(bh);
+
+			return 0;
+		} else {
+			BUG_ON(1);
+		}
+
+		return retval;
+	}
+
 	/*
 	 * Try to see if we can get the block without requesting a new
 	 * file system block.
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 5483f62..e14f541 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2177,6 +2177,62 @@ TRACE_EVENT(ext4_es_find_extent_exit,
 		  __entry->pblk, __entry->status, __entry->ret)
 );
 
+TRACE_EVENT(ext4_es_lookup_extent_enter,
+	TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
+
+	TP_ARGS(inode, lblk),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,		dev		)
+		__field(	ino_t,		ino		)
+		__field(	ext4_lblk_t,	lblk		)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= inode->i_sb->s_dev;
+		__entry->ino	= inode->i_ino;
+		__entry->lblk	= lblk;
+	),
+
+	TP_printk("dev %d,%d ino %lu lblk %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long) __entry->ino, __entry->lblk)
+);
+
+TRACE_EVENT(ext4_es_lookup_extent_exit,
+	TP_PROTO(struct inode *inode, struct extent_status *es,
+		 int found),
+
+	TP_ARGS(inode, es, found),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,		dev		)
+		__field(	ino_t,		ino		)
+		__field(	ext4_lblk_t,	lblk		)
+		__field(	ext4_lblk_t,	len		)
+		__field(	ext4_fsblk_t,	pblk		)
+		__field(	int,		status		)
+		__field(	int,		found		)
+	),
+
+	TP_fast_assign(
+		__entry->dev	= inode->i_sb->s_dev;
+		__entry->ino	= inode->i_ino;
+		__entry->lblk	= es->es_lblk;
+		__entry->len	= es->es_len;
+		__entry->pblk	= es->es_pblk;
+		__entry->status	= es->es_status;
+		__entry->found	= found;
+	),
+
+	TP_printk("dev %d,%d ino %lu found %d [%u/%u) %llu %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long) __entry->ino, __entry->found,
+		  __entry->lblk, __entry->len,
+		  __entry->found ? __entry->pblk : 0,
+		  __entry->found ? __entry->status : 0)
+);
+
 #endif /* _TRACE_EXT4_H */
 
 /* This part must be outside protection */
-- 
1.7.12.rc2.18.g61b472e


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 7/7 v2] ext4: reclaim extents from extent status tree
  2013-01-11 10:53 [PATCH 0/7 v2] ext4: extent status tree (step2) Zheng Liu
                   ` (5 preceding siblings ...)
  2013-01-11 10:53 ` [PATCH 6/7 v2] ext4: lookup block mapping " Zheng Liu
@ 2013-01-11 10:53 ` Zheng Liu
  2013-01-18  5:19   ` Theodore Ts'o
  6 siblings, 1 reply; 29+ messages in thread
From: Zheng Liu @ 2013-01-11 10:53 UTC (permalink / raw)
  To: linux-ext4; +Cc: Jan kara, Theodore Ts'o, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

Although extent status is loaded on-demand, we also need to reclaim extent
from the tree when we are under a heavy memory pressure because in some cases
fragmented extent tree causes status tree costs too much memory.

Here shrinker framework is used to do this work.  When the tree of an inode is
accessed, this inode will be inserted into the tail of lru list.  The inode is
removed from lru list as clear_inode is called.  When shrinker tries to reclaim
some memory, we will try to reclaim written/unwritten extents from the tree.
Delayed extent shouldn't be reclaimed because they are used by fiemap, bigalloc,
and seek_data/hole.

CC: Jan kara <jack@suse.cz>
CC: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/ext4.h           |   4 ++
 fs/ext4/extents_status.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/extents_status.h |  12 ++++++
 fs/ext4/super.c          |   6 +++
 4 files changed, 131 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 36145ef1..d31f254 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -893,6 +893,7 @@ struct ext4_inode_info {
 	/* extents status tree */
 	struct ext4_es_tree i_es_tree;
 	rwlock_t i_es_lock;
+	struct list_head i_es_lru;
 
 	/* ialloc */
 	ext4_group_t	i_last_alloc_group;
@@ -1308,6 +1309,9 @@ struct ext4_sb_info {
 
 	/* Precomputed FS UUID checksum for seeding other checksums */
 	__u32 s_csum_seed;
+
+	/* Extent status tree shrinker */
+	struct ext4_es_shrinker s_es_shrinker;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index ac9bb80..f9e7867 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -141,6 +141,7 @@ static int __es_insert_extent(struct ext4_es_tree *tree,
 			      struct extent_status *newes);
 static int __es_remove_extent(struct ext4_es_tree *tree, ext4_lblk_t lblk,
 			      ext4_lblk_t end);
+static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc);
 
 int __init ext4_init_es(void)
 {
@@ -275,6 +276,8 @@ out:
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
 
+	ext4_es_lru_add(inode);
+
 	trace_ext4_es_find_extent_exit(inode, es, ret);
 	return ret;
 }
@@ -449,6 +452,8 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 
+	ext4_es_lru_add(inode);
+
 	ext4_es_print_tree(inode);
 
 	return err;
@@ -509,6 +514,8 @@ out:
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
 
+	ext4_es_lru_add(inode);
+
 	trace_ext4_es_lookup_extent_exit(inode, es, found);
 	return found;
 }
@@ -621,6 +628,108 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	write_lock(&EXT4_I(inode)->i_es_lock);
 	err = __es_remove_extent(tree, lblk, end);
 	write_unlock(&EXT4_I(inode)->i_es_lock);
+	ext4_es_lru_add(inode);
 	ext4_es_print_tree(inode);
 	return err;
 }
+
+void ext4_es_register_shrinker(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi;
+
+	sbi = EXT4_SB(sb);
+	INIT_LIST_HEAD(&sbi->s_es_shrinker.es_lru);
+	spin_lock_init(&sbi->s_es_shrinker.es_lru_lock);
+	sbi->s_es_shrinker.es_shrinker.shrink = ext4_es_shrink;
+	sbi->s_es_shrinker.es_shrinker.seeks = DEFAULT_SEEKS;
+	register_shrinker(&sbi->s_es_shrinker.es_shrinker);
+}
+
+void ext4_es_unregister_shrinker(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi;
+
+	sbi = EXT4_SB(sb);
+	unregister_shrinker(&sbi->s_es_shrinker.es_shrinker);
+}
+
+static int ext4_es_do_shrink(struct ext4_inode_info *ei)
+{
+	struct ext4_es_tree *tree;
+	struct rb_node *node;
+	struct extent_status *es;
+	int shrunk_nr = 0;
+
+	tree = &ei->i_es_tree;
+
+	write_lock(&ei->i_es_lock);
+	node = rb_first(&tree->root);
+	while (node != NULL) {
+		es = rb_entry(node, struct extent_status, rb_node);
+		node = rb_next(&es->rb_node);
+		/*
+		 * We can't reclaim delayed extent from status tree because
+		 * fiemap, bigalloc, and seek_data/hole need to use it.
+		 */
+		if (!ext4_es_is_delayed(es)) {
+			rb_erase(&es->rb_node, &tree->root);
+			ext4_es_free_extent(es);
+			shrunk_nr++;
+		}
+	}
+	tree->cache_es = NULL;
+	write_unlock(&ei->i_es_lock);
+	return shrunk_nr;
+}
+
+static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
+{
+	struct ext4_es_shrinker *es_shrinker = container_of(shrink,
+				struct ext4_es_shrinker, es_shrinker);
+	struct ext4_inode_info *ei;
+	int nr_to_scan = sc->nr_to_scan;
+	int ret, shrunk_nr = 0;
+
+	if (!nr_to_scan)
+		return shrunk_nr;
+
+	spin_lock(&es_shrinker->es_lru_lock);
+	while (!list_empty(&es_shrinker->es_lru)) {
+		if (nr_to_scan-- <= 0)
+			break;
+
+		ei = list_first_entry(&es_shrinker->es_lru,
+				      struct ext4_inode_info, i_es_lru);
+		ret = ext4_es_do_shrink(ei);
+		shrunk_nr += ret;
+		nr_to_scan -= ret;
+	}
+	spin_unlock(&es_shrinker->es_lru_lock);
+	return shrunk_nr;
+}
+
+void ext4_es_lru_add(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	struct ext4_es_shrinker *es_shrinker = &sbi->s_es_shrinker;
+
+	spin_lock(&es_shrinker->es_lru_lock);
+	if (list_empty(&ei->i_es_lru))
+		list_add_tail(&ei->i_es_lru, &es_shrinker->es_lru);
+	else
+		list_move_tail(&ei->i_es_lru, &es_shrinker->es_lru);
+	spin_unlock(&es_shrinker->es_lru_lock);
+}
+
+void ext4_es_lru_del(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	struct ext4_es_shrinker *es_shrinker = &sbi->s_es_shrinker;
+
+	spin_lock(&es_shrinker->es_lru_lock);
+	if (!list_empty(&ei->i_es_lru))
+		list_del_init(&ei->i_es_lru);
+	spin_unlock(&es_shrinker->es_lru_lock);
+}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 45e623e..422a11e 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -39,6 +39,13 @@ struct ext4_es_tree {
 	struct extent_status *cache_es;	/* recently accessed extent */
 };
 
+struct ext4_es_shrinker {
+	struct shrinker es_shrinker;
+	struct list_head es_lru;
+	spinlock_t es_lru_lock ____cacheline_aligned_in_smp;
+	unsigned int es_lru_nr;
+};
+
 extern int __init ext4_init_es(void);
 extern void ext4_exit_es(void);
 extern void ext4_es_init_tree(struct ext4_es_tree *tree);
@@ -73,4 +80,9 @@ static inline ext4_fsblk_t ext4_es_get_pblock(struct extent_status *es,
 	return (ext4_es_is_delayed(es) ? ~0 : pb);
 }
 
+extern void ext4_es_register_shrinker(struct super_block *sb);
+extern void ext4_es_unregister_shrinker(struct super_block *sb);
+extern void ext4_es_lru_add(struct inode *inode);
+extern void ext4_es_lru_del(struct inode *inode);
+
 #endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3d4fb81..baedc75 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -858,6 +858,7 @@ static void ext4_put_super(struct super_block *sb)
 			ext4_abort(sb, "Couldn't clean up the journal");
 	}
 
+	ext4_es_unregister_shrinker(sb);
 	del_timer(&sbi->s_err_report);
 	ext4_release_system_zone(sb);
 	ext4_mb_release(sb);
@@ -944,6 +945,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	spin_lock_init(&ei->i_prealloc_lock);
 	ext4_es_init_tree(&ei->i_es_tree);
 	rwlock_init(&ei->i_es_lock);
+	INIT_LIST_HEAD(&ei->i_es_lru);
 	ei->i_reserved_data_blocks = 0;
 	ei->i_reserved_meta_blocks = 0;
 	ei->i_allocated_meta_blocks = 0;
@@ -1031,6 +1033,7 @@ void ext4_clear_inode(struct inode *inode)
 	dquot_drop(inode);
 	ext4_discard_preallocations(inode);
 	ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS);
+	ext4_es_lru_del(inode);
 	if (EXT4_I(inode)->jinode) {
 		jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode),
 					       EXT4_I(inode)->jinode);
@@ -3324,6 +3327,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		goto cantfind_ext4;
 	sbi->s_kbytes_written = le64_to_cpu(es->s_kbytes_written);
 
+	/* Register extent status tree shrinker */
+	ext4_es_register_shrinker(sb);
+
 	/* Warn if metadata_csum and gdt_csum are both set. */
 	if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
 				       EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) &&
-- 
1.7.12.rc2.18.g61b472e


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7 v2] ext4: add physical block and status member into extent status tree
  2013-01-11 10:53 ` [PATCH 3/7 v2] ext4: add physical block and status member into extent status tree Zheng Liu
@ 2013-01-17  4:42   ` Theodore Ts'o
  2013-01-18  9:49     ` Zheng Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2013-01-17  4:42 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Jan kara, Zheng Liu

On Fri, Jan 11, 2013 at 06:53:43PM +0800, Zheng Liu wrote:
>  struct extent_status {
>  	struct rb_node rb_node;
>  	ext4_lblk_t es_lblk;	/* first logical block extent covers */
>  	ext4_lblk_t es_len;	/* length of extent in block */
> +	ext4_fsblk_t es_pblk;	/* first physical block */
> +	int es_status;		/* record the status of extent */
>  };

Given that we only support 48 bits for the physical block number, and
we only need two bits for es_status, one thing we could do to save
memory is to stash the es_status bits in the top two bits of es_pblk.

       	     	       		      	     - Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/7 v2] ext4: lookup block mapping in extent status tree
  2013-01-11 10:53 ` [PATCH 6/7 v2] ext4: lookup block mapping " Zheng Liu
@ 2013-01-18  4:00   ` Theodore Ts'o
  2013-01-18  9:52     ` Zheng Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2013-01-18  4:00 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Jan kara, Zheng Liu

On Fri, Jan 11, 2013 at 06:53:46PM +0800, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> After tracking all extent status, we already have a extent cache in memory.
> Every time we want to lookup a block mapping, we can first try to lookup it in
> extent status tree to avoid a potential disk I/O.
> 
> A new function called ext4_es_lookup_extent is defined to finish this work.
> When we try to lookup a block mapping, we always call ext4_map_blocks and/or
> ext4_da_map_blocks.  So in these functions we first try to lookup a block
> mapping in extent status tree.
> 
> CC: Jan kara <jack@suse.cz>
> CC: "Theodore Ts'o" <tytso@mit.edu>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>

Once we apply this this patch, we should be able to remove the the
single-entry extent cache in fs/ext4/extents.c ---
ext4_ext_put_in_cache(), ext4_ext_put_gap_in_cache(),
ext4_ext_in_cache() --- since the extent status tree makes this code
redundant (and will do a better job).  This would be a good follow up,
cleanup patch.

	       	       	 	- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7 v2] ext4: reclaim extents from extent status tree
  2013-01-11 10:53 ` [PATCH 7/7 v2] ext4: reclaim extents from " Zheng Liu
@ 2013-01-18  5:19   ` Theodore Ts'o
  2013-01-18  5:39     ` Theodore Ts'o
  0 siblings, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2013-01-18  5:19 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Jan kara, Zheng Liu

On Fri, Jan 11, 2013 at 06:53:47PM +0800, Zheng Liu wrote:
> +
> +static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	struct ext4_es_shrinker *es_shrinker = container_of(shrink,
> +				struct ext4_es_shrinker, es_shrinker);
> +	struct ext4_inode_info *ei;
> +	int nr_to_scan = sc->nr_to_scan;
> +	int ret, shrunk_nr = 0;
> +
> +	if (!nr_to_scan)
> +		return shrunk_nr;

This doesn't look right.  To quote from include/linux/shrinker.h:

/*
 * A callback you can register to apply pressure to ageable caches.
 *
 * 'sc' is passed shrink_control which includes a count 'nr_to_scan'
 * and a 'gfpmask'.  It should look through the least-recently-used
 * 'nr_to_scan' entries and attempt to free them up.  It should return
 * the number of objects which remain in the cache.  If it returns -1, it means
 * it cannot do any scanning at this time (eg. there is a risk of deadlock).
 *
 * ...
 *
 * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
 * querying the cache size, so a fastpath for that case is appropriate.
 */

The first thing the shrink_slab() function will do is call the
shrinker with nr_to_scan set to zero.  Since the shrinker function is
currently returning the number of items that were discarded, instead
of the number of objects that were deleted, when nr_to_scan is zero,
the function returns zero.  This will cause shrink_slab() to bail out,
which means the shrinker code isn't actually going to release any
objects.  (i.e., at the moment it is a no-op).

It might also be a good idea to add a trace point so we can debug what
is going on with the shrinker, so we can known when its called, and
how much progress it has made in releasing objcts when the system is
under memory pressure.


Also, one of the things that we need to think about is making sure we
have the right balance.  We don't want to be too aggressive in
shrinking the extent status tree cache, but we want to be a good
citizen as well.  I'm a bit concerned we might be too aggressive,
because there are two ways that items can be freed from the
extent_status tree.  One is if the inode is not used at all, and when
we release the inode, we'll drop all of the entries in the
extent_status_tree for that inode.  The second way is via the shrinker
which we've registered.

So I am a bit concerned that we may end up giving twice.  There's also
a place where we can register a fs-specific shrinker via
sb->s_op->nr_cached_objects() and sb->s_op->free_cached_objects().
That might be better since it will allow us to balance across file
systems a bit more fairly.

Anyway, we're going to have to do some testing to make sure we're
doing something sane in low memory situations.  Not doing any
shrinking is clearly bad, but I'm a bit worried that we could end up
doing too much shrinking, and our performance in memory constrained
scenarios might suffer as a result.

					- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7 v2] ext4: reclaim extents from extent status tree
  2013-01-18  5:19   ` Theodore Ts'o
@ 2013-01-18  5:39     ` Theodore Ts'o
  2013-01-21  7:24       ` Zheng Liu
  2013-01-21 14:43       ` [PATCH 7/7 v2] ext4: reclaim extents from extent status tree Jan Kara
  0 siblings, 2 replies; 29+ messages in thread
From: Theodore Ts'o @ 2013-01-18  5:39 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Jan kara, Zheng Liu

On Fri, Jan 18, 2013 at 12:19:21AM -0500, Theodore Ts'o wrote:
> I'm a bit concerned we might be too aggressive,
> because there are two ways that items can be freed from the
> extent_status tree.  One is if the inode is not used at all, and when
> we release the inode, we'll drop all of the entries in the
> extent_status_tree for that inode.  The second way is via the shrinker
> which we've registered.

If we use the sb->s_op->free_cached_objects() approach, something like
the following change to prune_super() in fs/super.c might address the
above concern:

diff --git a/fs/super.c b/fs/super.c
index 12f1237..fb57bd2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -80,6 +80,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	if (sc->nr_to_scan) {
 		int	dentries;
 		int	inodes;
+		int	fs_to_scan = 0;
 
 		/* proportion the scan between the caches */
 		dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
@@ -87,7 +88,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 		inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
 							total_objects;
 		if (fs_objects)
-			fs_objects = (sc->nr_to_scan * fs_objects) /
+			fs_to_scan = (sc->nr_to_scan * fs_objects) /
 							total_objects;
 		/*
 		 * prune the dcache first as the icache is pinned by it, then
@@ -96,8 +97,23 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 		prune_dcache_sb(sb, dentries);
 		prune_icache_sb(sb, inodes);
 
-		if (fs_objects && sb->s_op->free_cached_objects) {
-			sb->s_op->free_cached_objects(sb, fs_objects);
+		/*
+		 * If as a result of pruning the icache, we released some
+		 * of the fs_objects, give credit to the fact and
+		 * reduce the number of fs objects that we should try
+		 * to release.
+		 */
+		if (fs_to_scan) {
+			int fs_objects_now = sb->s_op->nr_cached_objects(sb);
+
+			if (fs_objects_now < fs_objects)
+				fs_to_scan -= fs_objects - fs_objects_now;
+			if (fs_to_scan < 0)
+				fs_to_scan = 0;
+		}
+
+		if (fs_to_scan && sb->s_op->free_cached_objects) {
+			sb->s_op->free_cached_objects(sb, fs_to_scan);
 			fs_objects = sb->s_op->nr_cached_objects(sb);
 		}
 		total_objects = sb->s_nr_dentry_unused +

What do folks think?

						- Ted

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7 v2] ext4: add physical block and status member into extent status tree
  2013-01-17  4:42   ` Theodore Ts'o
@ 2013-01-18  9:49     ` Zheng Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Zheng Liu @ 2013-01-18  9:49 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Jan kara, Zheng Liu

On Wed, Jan 16, 2013 at 11:42:13PM -0500, Theodore Ts'o wrote:
> On Fri, Jan 11, 2013 at 06:53:43PM +0800, Zheng Liu wrote:
> >  struct extent_status {
> >  	struct rb_node rb_node;
> >  	ext4_lblk_t es_lblk;	/* first logical block extent covers */
> >  	ext4_lblk_t es_len;	/* length of extent in block */
> > +	ext4_fsblk_t es_pblk;	/* first physical block */
> > +	int es_status;		/* record the status of extent */
> >  };
> 
> Given that we only support 48 bits for the physical block number, and
> we only need two bits for es_status, one thing we could do to save
> memory is to stash the es_status bits in the top two bits of es_pblk.

Thanks for the comment.  In next version, es_status will be stashed into
es_pblk.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/7 v2] ext4: lookup block mapping in extent status tree
  2013-01-18  4:00   ` Theodore Ts'o
@ 2013-01-18  9:52     ` Zheng Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Zheng Liu @ 2013-01-18  9:52 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Jan kara, Zheng Liu

On Thu, Jan 17, 2013 at 11:00:06PM -0500, Theodore Ts'o wrote:
> On Fri, Jan 11, 2013 at 06:53:46PM +0800, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > After tracking all extent status, we already have a extent cache in memory.
> > Every time we want to lookup a block mapping, we can first try to lookup it in
> > extent status tree to avoid a potential disk I/O.
> > 
> > A new function called ext4_es_lookup_extent is defined to finish this work.
> > When we try to lookup a block mapping, we always call ext4_map_blocks and/or
> > ext4_da_map_blocks.  So in these functions we first try to lookup a block
> > mapping in extent status tree.
> > 
> > CC: Jan kara <jack@suse.cz>
> > CC: "Theodore Ts'o" <tytso@mit.edu>
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> 
> Once we apply this this patch, we should be able to remove the the
> single-entry extent cache in fs/ext4/extents.c ---
> ext4_ext_put_in_cache(), ext4_ext_put_gap_in_cache(),
> ext4_ext_in_cache() --- since the extent status tree makes this code
> redundant (and will do a better job).  This would be a good follow up,
> cleanup patch.

Fair enough.  The patch that tries to remove extent cache will be sent out
in next version.

Thanks,
                                                - Zheng

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7 v2] ext4: reclaim extents from extent status tree
  2013-01-18  5:39     ` Theodore Ts'o
@ 2013-01-21  7:24       ` Zheng Liu
  2013-01-21 15:00         ` Theodore Ts'o
  2013-01-21 14:43       ` [PATCH 7/7 v2] ext4: reclaim extents from extent status tree Jan Kara
  1 sibling, 1 reply; 29+ messages in thread
From: Zheng Liu @ 2013-01-21  7:24 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Jan kara, Zheng Liu

On Fri, Jan 18, 2013 at 12:39:47AM -0500, Theodore Ts'o wrote:
> On Fri, Jan 18, 2013 at 12:19:21AM -0500, Theodore Ts'o wrote:
> > I'm a bit concerned we might be too aggressive,
> > because there are two ways that items can be freed from the
> > extent_status tree.  One is if the inode is not used at all, and when
> > we release the inode, we'll drop all of the entries in the
> > extent_status_tree for that inode.  The second way is via the shrinker
> > which we've registered.
> 
> If we use the sb->s_op->free_cached_objects() approach, something like
> the following change to prune_super() in fs/super.c might address the
> above concern:

Sorry for delay reply.  I believe that sb->s_op->free_cached_objbects()
approach is better.  So in next version I will try to implement this approach.

> 
> diff --git a/fs/super.c b/fs/super.c
> index 12f1237..fb57bd2 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -80,6 +80,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  	if (sc->nr_to_scan) {
>  		int	dentries;
>  		int	inodes;
> +		int	fs_to_scan = 0;
>  
>  		/* proportion the scan between the caches */
>  		dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
> @@ -87,7 +88,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  		inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
>  							total_objects;
>  		if (fs_objects)
> -			fs_objects = (sc->nr_to_scan * fs_objects) /
> +			fs_to_scan = (sc->nr_to_scan * fs_objects) /
>  							total_objects;
>  		/*
>  		 * prune the dcache first as the icache is pinned by it, then
> @@ -96,8 +97,23 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  		prune_dcache_sb(sb, dentries);
>  		prune_icache_sb(sb, inodes);
>  
> -		if (fs_objects && sb->s_op->free_cached_objects) {
> -			sb->s_op->free_cached_objects(sb, fs_objects);
> +		/*
> +		 * If as a result of pruning the icache, we released some
> +		 * of the fs_objects, give credit to the fact and
> +		 * reduce the number of fs objects that we should try
> +		 * to release.
> +		 */
> +		if (fs_to_scan) {
> +			int fs_objects_now = sb->s_op->nr_cached_objects(sb);
> +
> +			if (fs_objects_now < fs_objects)
> +				fs_to_scan -= fs_objects - fs_objects_now;
> +			if (fs_to_scan < 0)
> +				fs_to_scan = 0;
> +		}
> +
> +		if (fs_to_scan && sb->s_op->free_cached_objects) {
> +			sb->s_op->free_cached_objects(sb, fs_to_scan);
>  			fs_objects = sb->s_op->nr_cached_objects(sb);
>  		}
>  		total_objects = sb->s_nr_dentry_unused +
> 
> What do folks think?

Do we need to CC' linux-fsdevel mailling list to let other folks review
this patch?

Thanks,
                                                - Zheng

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7 v2] ext4: reclaim extents from extent status tree
  2013-01-18  5:39     ` Theodore Ts'o
  2013-01-21  7:24       ` Zheng Liu
@ 2013-01-21 14:43       ` Jan Kara
  2013-01-21 15:12         ` Zheng Liu
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Kara @ 2013-01-21 14:43 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Zheng Liu, linux-ext4, Jan kara, Zheng Liu

On Fri 18-01-13 00:39:47, Ted Tso wrote:
> On Fri, Jan 18, 2013 at 12:19:21AM -0500, Theodore Ts'o wrote:
> > I'm a bit concerned we might be too aggressive,
> > because there are two ways that items can be freed from the
> > extent_status tree.  One is if the inode is not used at all, and when
> > we release the inode, we'll drop all of the entries in the
> > extent_status_tree for that inode.  The second way is via the shrinker
> > which we've registered.
> 
> If we use the sb->s_op->free_cached_objects() approach, something like
> the following change to prune_super() in fs/super.c might address the
> above concern:
  Yeah, this would make sence to me. When you submit the final patch don't
forget to include Dave Chinner. He's the author of the shrinker framework
and XFS uses nr_cached_objects / free_cached_objects. AFAICS it uses it for
its separate inode cache so your change shouldn't affect it but better be
sure.

								Honza
> 
> diff --git a/fs/super.c b/fs/super.c
> index 12f1237..fb57bd2 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -80,6 +80,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  	if (sc->nr_to_scan) {
>  		int	dentries;
>  		int	inodes;
> +		int	fs_to_scan = 0;
>  
>  		/* proportion the scan between the caches */
>  		dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
> @@ -87,7 +88,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  		inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
>  							total_objects;
>  		if (fs_objects)
> -			fs_objects = (sc->nr_to_scan * fs_objects) /
> +			fs_to_scan = (sc->nr_to_scan * fs_objects) /
>  							total_objects;
>  		/*
>  		 * prune the dcache first as the icache is pinned by it, then
> @@ -96,8 +97,23 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  		prune_dcache_sb(sb, dentries);
>  		prune_icache_sb(sb, inodes);
>  
> -		if (fs_objects && sb->s_op->free_cached_objects) {
> -			sb->s_op->free_cached_objects(sb, fs_objects);
> +		/*
> +		 * If as a result of pruning the icache, we released some
> +		 * of the fs_objects, give credit to the fact and
> +		 * reduce the number of fs objects that we should try
> +		 * to release.
> +		 */
> +		if (fs_to_scan) {
> +			int fs_objects_now = sb->s_op->nr_cached_objects(sb);
> +
> +			if (fs_objects_now < fs_objects)
> +				fs_to_scan -= fs_objects - fs_objects_now;
> +			if (fs_to_scan < 0)
> +				fs_to_scan = 0;
> +		}
> +
> +		if (fs_to_scan && sb->s_op->free_cached_objects) {
> +			sb->s_op->free_cached_objects(sb, fs_to_scan);
>  			fs_objects = sb->s_op->nr_cached_objects(sb);
>  		}
>  		total_objects = sb->s_nr_dentry_unused +
> 
> What do folks think?
> 
> 						- Ted
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7 v2] ext4: reclaim extents from extent status tree
  2013-01-21  7:24       ` Zheng Liu
@ 2013-01-21 15:00         ` Theodore Ts'o
  2013-01-21 17:09           ` Zheng Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2013-01-21 15:00 UTC (permalink / raw)
  To: linux-ext4, Jan kara, Zheng Liu

On Mon, Jan 21, 2013 at 03:24:43PM +0800, Zheng Liu wrote:
> 
> Do we need to CC' linux-fsdevel mailling list to let other folks review
> this patch?

For this patch, yes.  And we'll need a commit description which
explains why it's needed.  Since I wrote the patch, I'm happy to
supply the commit description.  We can include it as a separate patch
in your patch series, and I'll cc linux-fsdevel.

As Jan mentioned, the only other user of it is Dave Chinner for the
XFS code, and I had alreday checked to make sure that it shouldn't
affect him --- but it would be good to cc him so he's in the loop.

Speaking of commit descriptions, I had made some other minor
adjustments to some of the commits in your last version of the patch
series, since I had already been assuming that I could take them until
I saw the potential problems with the memory shrinker patch.  So you
can just provide a new version of the memory shrinker patch --- or if
you have other changes you want to make to the earlier patches in that
patch series, let me know and I can extract out the various comments
and whitespace fixes that from the patches I have in my private tree,
so we can merge them with any additional updates to your patches.

Cheers, and thanks for working on this!

					- Ted


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7 v2] ext4: reclaim extents from extent status tree
  2013-01-21 14:43       ` [PATCH 7/7 v2] ext4: reclaim extents from extent status tree Jan Kara
@ 2013-01-21 15:12         ` Zheng Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Zheng Liu @ 2013-01-21 15:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Ts'o, linux-ext4, Zheng Liu

On Mon, Jan 21, 2013 at 03:43:36PM +0100, Jan Kara wrote:
> On Fri 18-01-13 00:39:47, Ted Tso wrote:
> > On Fri, Jan 18, 2013 at 12:19:21AM -0500, Theodore Ts'o wrote:
> > > I'm a bit concerned we might be too aggressive,
> > > because there are two ways that items can be freed from the
> > > extent_status tree.  One is if the inode is not used at all, and when
> > > we release the inode, we'll drop all of the entries in the
> > > extent_status_tree for that inode.  The second way is via the shrinker
> > > which we've registered.
> > 
> > If we use the sb->s_op->free_cached_objects() approach, something like
> > the following change to prune_super() in fs/super.c might address the
> > above concern:
>   Yeah, this would make sence to me. When you submit the final patch don't
> forget to include Dave Chinner. He's the author of the shrinker framework
> and XFS uses nr_cached_objects / free_cached_objects. AFAICS it uses it for
> its separate inode cache so your change shouldn't affect it but better be
> sure.

Thanks for reminding.  This patch has been added in my patch series, and
I will CC' it to Dave.

Regards,
                                                - Zheng

> > 
> > diff --git a/fs/super.c b/fs/super.c
> > index 12f1237..fb57bd2 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -80,6 +80,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
> >  	if (sc->nr_to_scan) {
> >  		int	dentries;
> >  		int	inodes;
> > +		int	fs_to_scan = 0;
> >  
> >  		/* proportion the scan between the caches */
> >  		dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
> > @@ -87,7 +88,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
> >  		inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
> >  							total_objects;
> >  		if (fs_objects)
> > -			fs_objects = (sc->nr_to_scan * fs_objects) /
> > +			fs_to_scan = (sc->nr_to_scan * fs_objects) /
> >  							total_objects;
> >  		/*
> >  		 * prune the dcache first as the icache is pinned by it, then
> > @@ -96,8 +97,23 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
> >  		prune_dcache_sb(sb, dentries);
> >  		prune_icache_sb(sb, inodes);
> >  
> > -		if (fs_objects && sb->s_op->free_cached_objects) {
> > -			sb->s_op->free_cached_objects(sb, fs_objects);
> > +		/*
> > +		 * If as a result of pruning the icache, we released some
> > +		 * of the fs_objects, give credit to the fact and
> > +		 * reduce the number of fs objects that we should try
> > +		 * to release.
> > +		 */
> > +		if (fs_to_scan) {
> > +			int fs_objects_now = sb->s_op->nr_cached_objects(sb);
> > +
> > +			if (fs_objects_now < fs_objects)
> > +				fs_to_scan -= fs_objects - fs_objects_now;
> > +			if (fs_to_scan < 0)
> > +				fs_to_scan = 0;
> > +		}
> > +
> > +		if (fs_to_scan && sb->s_op->free_cached_objects) {
> > +			sb->s_op->free_cached_objects(sb, fs_to_scan);
> >  			fs_objects = sb->s_op->nr_cached_objects(sb);
> >  		}
> >  		total_objects = sb->s_nr_dentry_unused +
> > 
> > What do folks think?
> > 
> > 						- Ted
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7 v2] ext4: reclaim extents from extent status tree
  2013-01-21 15:00         ` Theodore Ts'o
@ 2013-01-21 17:09           ` Zheng Liu
  2013-01-23  6:06             ` [PATCH] fs: allow for fs-specific objects to be pruned as part of pruning inodes Theodore Ts'o
  0 siblings, 1 reply; 29+ messages in thread
From: Zheng Liu @ 2013-01-21 17:09 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Jan kara, Zheng Liu

On Mon, Jan 21, 2013 at 10:00:20AM -0500, Theodore Ts'o wrote:
> On Mon, Jan 21, 2013 at 03:24:43PM +0800, Zheng Liu wrote:
> > 
> > Do we need to CC' linux-fsdevel mailling list to let other folks review
> > this patch?
> 
> For this patch, yes.  And we'll need a commit description which
> explains why it's needed.  Since I wrote the patch, I'm happy to
> supply the commit description.  We can include it as a separate patch
> in your patch series, and I'll cc linux-fsdevel.

That would be great if you could give a commit description.

> 
> As Jan mentioned, the only other user of it is Dave Chinner for the
> XFS code, and I had alreday checked to make sure that it shouldn't
> affect him --- but it would be good to cc him so he's in the loop.

As I metioned at another mail, I will cc him.

> 
> Speaking of commit descriptions, I had made some other minor
> adjustments to some of the commits in your last version of the patch
> series, since I had already been assuming that I could take them until
> I saw the potential problems with the memory shrinker patch.  So you
> can just provide a new version of the memory shrinker patch --- or if
> you have other changes you want to make to the earlier patches in that
> patch series, let me know and I can extract out the various comments
> and whitespace fixes that from the patches I have in my private tree,
> so we can merge them with any additional updates to your patches.

Your meaning is that you have made some changes in my patches of extent
status tree, aren't you?  Now I have stashed es_status into es_pblk in
patch 3 (ext4: add physical block and status member into extent status
tree), and added a new patch that removes single extent cache.  I am
happy to merge your changes if you could give them to me.  Please let me
know if I could provide some helps.

Thanks,
                                                - Zheng

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/7 v2] ext4: refine extent status tree
  2013-01-11 10:53 ` [PATCH 1/7 v2] ext4: refine extent status tree Zheng Liu
@ 2013-01-23  4:20   ` Theodore Ts'o
  0 siblings, 0 replies; 29+ messages in thread
From: Theodore Ts'o @ 2013-01-23  4:20 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Jan kara, Zheng Liu

On Fri, Jan 11, 2013 at 06:53:41PM +0800, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> This patch tries to refine the extent status tree.  A prefix 'es_' is added to
> make code clearly.

I suggest the following commit description:

    ext4: refine extent status tree
    
    This commit refines the extent status tree code.
    
    1) A prefix 'es_' is added to to the extent status tree structure
    members.
    
    2) Refactored es_remove_extent() so that __es_remove_extent() can be
    used by es_insert_extent() to remove the old extent entry(-ies) before
    inserting a new one.
    
    3)  Update and clarified comments.

> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index 564d981..0662369 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> + * Step2:
> + * In this step all extent status is tracked by extent status tree.
> + * Thus, we can first try to lookup a block mapping in this tree before
> + * find it in extent tree.  Unwritten extent conversion also can be
> + * improved.  Currently this conversion need to be done in a workqueue
> + * because this conversion can not be done in end_io function due to it
> + * needs to take i_data_sem locking in a irq context.  After looking block
> + * mapping in extent status tree, we can first convert unwritten extent in
> + * extent status tree, call aio_comlete() and inode_dio_done() in end_io
> + * function, and don't need to be worried about expose a stale data.

Grammar nit:  The last line should read:

"don't need to be worried about exposing stale data"

						- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/7 v2] ext4: track all extent status in extent status tree
  2013-01-11 10:53 ` [PATCH 5/7 v2] ext4: track all extent status in " Zheng Liu
@ 2013-01-23  4:31   ` Theodore Ts'o
  0 siblings, 0 replies; 29+ messages in thread
From: Theodore Ts'o @ 2013-01-23  4:31 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Jan kara, Zheng Liu

On Fri, Jan 11, 2013 at 06:53:45PM +0800, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 

Please line-wrap the commit description to 72 columns.  (This way the
output of "git log" will look nice even on 80 column terminal windows)".

> After record phycisal block and status, extent status tree is able to track the

"By recording the physical block and status, ...."

> status of every extents.  When we call _map_blocks functions to lookup an extent
> or create a new written/unwritten/delayed extent, this extent will be inserted
> into extent status tree.
> 
> We don't load all extents from disk in alloc_inode() because it costs too much
> memory, and, when open/close a file very frequently, it will takes too much time

"...and if a file is opened and closed frequently"..."

> to load all extent information.  So currently when we create/lookup an extent,
> this extent will be inserted into extent status tree.  Hence, the status in
> extent status tree might be not completely.

"Hence, the extent status tree may not comprehensively contain all of
the extents found in the file."

						- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] fs: allow for fs-specific objects to be pruned as part of pruning inodes
  2013-01-21 17:09           ` Zheng Liu
@ 2013-01-23  6:06             ` Theodore Ts'o
  2013-01-23 10:52               ` Jan Kara
                                 ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Theodore Ts'o @ 2013-01-23  6:06 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: gnehzuil.liu, Theodore Ts'o, linux-fsdevel

The VFS's prune_super() function allows for the file system to prune
file-system specific objects.  Ext4 would like to use this to prune
parts of the inode's extent cache.  The object lifetime rules used by
ext4 is somewhat different from the those of the dentry and inode in
the VFS.  Ext4's extent cache objects can be pruned without removing
the inode; however if an inode is pruned, all of the extent cache
objects associated with the inode are immediately removed.

To accomodate this rule, we measure the number of fs-specific objects
before the dentry and inodes are pruned, and then measure them again
afterwards.  If the number of fs-specific objects have decreased, we
credit that decrease as part of the shrink operation, so that we do
not end up removing too many fs-specific objects.

In the case where fs-specific objects are not removed when inodes are
removed, this will not change the behavior of prune_super() in any
appreciable way.  (Currently the only other user of this facility is
XFS, and this change should not affect XFS's usage of this facility
for this reason.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/super.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 12f1237..fb57bd2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -80,6 +80,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	if (sc->nr_to_scan) {
 		int	dentries;
 		int	inodes;
+		int	fs_to_scan = 0;
 
 		/* proportion the scan between the caches */
 		dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
@@ -87,7 +88,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 		inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
 							total_objects;
 		if (fs_objects)
-			fs_objects = (sc->nr_to_scan * fs_objects) /
+			fs_to_scan = (sc->nr_to_scan * fs_objects) /
 							total_objects;
 		/*
 		 * prune the dcache first as the icache is pinned by it, then
@@ -96,8 +97,23 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 		prune_dcache_sb(sb, dentries);
 		prune_icache_sb(sb, inodes);
 
-		if (fs_objects && sb->s_op->free_cached_objects) {
-			sb->s_op->free_cached_objects(sb, fs_objects);
+		/*
+		 * If as a result of pruning the icache, we released some
+		 * of the fs_objects, give credit to the fact and
+		 * reduce the number of fs objects that we should try
+		 * to release.
+		 */
+		if (fs_to_scan) {
+			int fs_objects_now = sb->s_op->nr_cached_objects(sb);
+
+			if (fs_objects_now < fs_objects)
+				fs_to_scan -= fs_objects - fs_objects_now;
+			if (fs_to_scan < 0)
+				fs_to_scan = 0;
+		}
+
+		if (fs_to_scan && sb->s_op->free_cached_objects) {
+			sb->s_op->free_cached_objects(sb, fs_to_scan);
 			fs_objects = sb->s_op->nr_cached_objects(sb);
 		}
 		total_objects = sb->s_nr_dentry_unused +
-- 
1.7.12.rc0.22.gcdd159b


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] fs: allow for fs-specific objects to be pruned as part of pruning inodes
  2013-01-23  6:06             ` [PATCH] fs: allow for fs-specific objects to be pruned as part of pruning inodes Theodore Ts'o
@ 2013-01-23 10:52               ` Jan Kara
  2013-01-23 13:06               ` Carlos Maiolino
  2013-01-23 13:32               ` Dave Chinner
  2 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2013-01-23 10:52 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ext4 Developers List, gnehzuil.liu, linux-fsdevel, dchinner

On Wed 23-01-13 01:06:08, Ted Tso wrote:
> The VFS's prune_super() function allows for the file system to prune
> file-system specific objects.  Ext4 would like to use this to prune
> parts of the inode's extent cache.  The object lifetime rules used by
> ext4 is somewhat different from the those of the dentry and inode in
> the VFS.  Ext4's extent cache objects can be pruned without removing
> the inode; however if an inode is pruned, all of the extent cache
> objects associated with the inode are immediately removed.
> 
> To accomodate this rule, we measure the number of fs-specific objects
> before the dentry and inodes are pruned, and then measure them again
> afterwards.  If the number of fs-specific objects have decreased, we
> credit that decrease as part of the shrink operation, so that we do
> not end up removing too many fs-specific objects.
> 
> In the case where fs-specific objects are not removed when inodes are
> removed, this will not change the behavior of prune_super() in any
> appreciable way.  (Currently the only other user of this facility is
> XFS, and this change should not affect XFS's usage of this facility
> for this reason.)
  The patch looks good to me. So you can add:
Reviewed-by: Jan Kara <jack@suse.cz>

I've also added Dave to CC.

								Honza

> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Cc: linux-fsdevel@vger.kernel.org
> ---
>  fs/super.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 12f1237..fb57bd2 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -80,6 +80,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  	if (sc->nr_to_scan) {
>  		int	dentries;
>  		int	inodes;
> +		int	fs_to_scan = 0;
>  
>  		/* proportion the scan between the caches */
>  		dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
> @@ -87,7 +88,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  		inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
>  							total_objects;
>  		if (fs_objects)
> -			fs_objects = (sc->nr_to_scan * fs_objects) /
> +			fs_to_scan = (sc->nr_to_scan * fs_objects) /
>  							total_objects;
>  		/*
>  		 * prune the dcache first as the icache is pinned by it, then
> @@ -96,8 +97,23 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  		prune_dcache_sb(sb, dentries);
>  		prune_icache_sb(sb, inodes);
>  
> -		if (fs_objects && sb->s_op->free_cached_objects) {
> -			sb->s_op->free_cached_objects(sb, fs_objects);
> +		/*
> +		 * If as a result of pruning the icache, we released some
> +		 * of the fs_objects, give credit to the fact and
> +		 * reduce the number of fs objects that we should try
> +		 * to release.
> +		 */
> +		if (fs_to_scan) {
> +			int fs_objects_now = sb->s_op->nr_cached_objects(sb);
> +
> +			if (fs_objects_now < fs_objects)
> +				fs_to_scan -= fs_objects - fs_objects_now;
> +			if (fs_to_scan < 0)
> +				fs_to_scan = 0;
> +		}
> +
> +		if (fs_to_scan && sb->s_op->free_cached_objects) {
> +			sb->s_op->free_cached_objects(sb, fs_to_scan);
>  			fs_objects = sb->s_op->nr_cached_objects(sb);
>  		}
>  		total_objects = sb->s_nr_dentry_unused +
> -- 
> 1.7.12.rc0.22.gcdd159b
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] fs: allow for fs-specific objects to be pruned as part of pruning inodes
  2013-01-23  6:06             ` [PATCH] fs: allow for fs-specific objects to be pruned as part of pruning inodes Theodore Ts'o
  2013-01-23 10:52               ` Jan Kara
@ 2013-01-23 13:06               ` Carlos Maiolino
  2013-01-23 13:32               ` Dave Chinner
  2 siblings, 0 replies; 29+ messages in thread
From: Carlos Maiolino @ 2013-01-23 13:06 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List, gnehzuil.liu, linux-fsdevel

Looks good to me:

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

On Wed, Jan 23, 2013 at 01:06:08AM -0500, Theodore Ts'o wrote:
> The VFS's prune_super() function allows for the file system to prune
> file-system specific objects.  Ext4 would like to use this to prune
> parts of the inode's extent cache.  The object lifetime rules used by
> ext4 is somewhat different from the those of the dentry and inode in
> the VFS.  Ext4's extent cache objects can be pruned without removing
> the inode; however if an inode is pruned, all of the extent cache
> objects associated with the inode are immediately removed.
> 
> To accomodate this rule, we measure the number of fs-specific objects
> before the dentry and inodes are pruned, and then measure them again
> afterwards.  If the number of fs-specific objects have decreased, we
> credit that decrease as part of the shrink operation, so that we do
> not end up removing too many fs-specific objects.
> 
> In the case where fs-specific objects are not removed when inodes are
> removed, this will not change the behavior of prune_super() in any
> appreciable way.  (Currently the only other user of this facility is
> XFS, and this change should not affect XFS's usage of this facility
> for this reason.)
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Cc: linux-fsdevel@vger.kernel.org
> ---
>  fs/super.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 12f1237..fb57bd2 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -80,6 +80,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  	if (sc->nr_to_scan) {
>  		int	dentries;
>  		int	inodes;
> +		int	fs_to_scan = 0;
>  
>  		/* proportion the scan between the caches */
>  		dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
> @@ -87,7 +88,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  		inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
>  							total_objects;
>  		if (fs_objects)
> -			fs_objects = (sc->nr_to_scan * fs_objects) /
> +			fs_to_scan = (sc->nr_to_scan * fs_objects) /
>  							total_objects;
>  		/*
>  		 * prune the dcache first as the icache is pinned by it, then
> @@ -96,8 +97,23 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  		prune_dcache_sb(sb, dentries);
>  		prune_icache_sb(sb, inodes);
>  
> -		if (fs_objects && sb->s_op->free_cached_objects) {
> -			sb->s_op->free_cached_objects(sb, fs_objects);
> +		/*
> +		 * If as a result of pruning the icache, we released some
> +		 * of the fs_objects, give credit to the fact and
> +		 * reduce the number of fs objects that we should try
> +		 * to release.
> +		 */
> +		if (fs_to_scan) {
> +			int fs_objects_now = sb->s_op->nr_cached_objects(sb);
> +
> +			if (fs_objects_now < fs_objects)
> +				fs_to_scan -= fs_objects - fs_objects_now;
> +			if (fs_to_scan < 0)
> +				fs_to_scan = 0;
> +		}
> +
> +		if (fs_to_scan && sb->s_op->free_cached_objects) {
> +			sb->s_op->free_cached_objects(sb, fs_to_scan);
>  			fs_objects = sb->s_op->nr_cached_objects(sb);
>  		}
>  		total_objects = sb->s_nr_dentry_unused +
> -- 
> 1.7.12.rc0.22.gcdd159b
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Carlos

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] fs: allow for fs-specific objects to be pruned as part of pruning inodes
  2013-01-23  6:06             ` [PATCH] fs: allow for fs-specific objects to be pruned as part of pruning inodes Theodore Ts'o
  2013-01-23 10:52               ` Jan Kara
  2013-01-23 13:06               ` Carlos Maiolino
@ 2013-01-23 13:32               ` Dave Chinner
  2013-01-23 16:34                 ` Theodore Ts'o
  2013-01-24  8:58                 ` Andrew Morton
  2 siblings, 2 replies; 29+ messages in thread
From: Dave Chinner @ 2013-01-23 13:32 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List, gnehzuil.liu, linux-fsdevel

On Wed, Jan 23, 2013 at 01:06:08AM -0500, Theodore Ts'o wrote:
> The VFS's prune_super() function allows for the file system to prune
> file-system specific objects.  Ext4 would like to use this to prune
> parts of the inode's extent cache.  The object lifetime rules used by
> ext4 is somewhat different from the those of the dentry and inode in
> the VFS.  Ext4's extent cache objects can be pruned without removing
> the inode; however if an inode is pruned, all of the extent cache
> objects associated with the inode are immediately removed.
> 
> To accomodate this rule, we measure the number of fs-specific objects
> before the dentry and inodes are pruned, and then measure them again
> afterwards.  If the number of fs-specific objects have decreased, we
> credit that decrease as part of the shrink operation, so that we do
> not end up removing too many fs-specific objects.

Doesn't work. Shrinkers run concurrently on the same context, so a
shrinker can be running on multiple CPUs and hence "interfere" with
each other. i.e. A shrinker call on CPU 2 could see a reduction in
a cache as a result of the same shrinker running on CPU 1 in the
same context, and that would mean the shrinker on CPU 2 doesn't do
the work it was asked (and needs) to do to reclaim memory.

It also seems somewhat incompatible with the proposed memcg/NUMA
aware shrinker infrastructure, where the shrinker has much more
fine-grained context for operation than the current infrastructure.
This seems to assume that there is global context relationship
between inode cache and the fs specific cache.

Also, the superblock shrinker is designed around a direct 1:1:1
dependency relationship between the superblock dentry, inode and "fs
cache" objects. i.e. dentry pins inode pins fs cache object.  It is
designed to keep a direct balance of the three caches by ensuring
they get scanned in amounts directly proportional to the relative
differences in their object counts.  That can't be done with
separate shrinkers, hence the use of the superblock shrinker to
define the dependent relationship between the caches.

i.e. the relationship between the VFS inode and the XFS inode is
that they are the same object, but the "XFS inode" lives for some
time after the VFS is done with it. IOWs, the VFS inode reclaim does
not free any memory, so the reclaim work needs to be transferred
directly to the "fscache shrinker" to free the inode objects.

e.g.  if we have a batch of 100 objects to scan and al the caches
are of equal sizes, 33 dentries are scanned, 33 VFS indoes are
scanned, and 33 XFS inodes are scanned, The result is that for every
100 objects scanned, we'll free 33 dentries and 33 inodes. And if
the caches are out of balance, the biasing of the reclaim towards
different caches will pull them back into even proportions of object
counts.  i.e. the proportioning is very carefully balanced around
maintaining the fixed relationship between the different types
objects...

In your proposed use case, the ext4 extent cache size has no direct
relationship to the size of the VFS inode cache - the can both
change size independently and not impact the balance of the system
as long as the hot objects are kept in their respective caches when
under memory pressure.

When the cache size proportion varies with workload, you want
separate shrinkers for the caches so that the memory pressure and
number of active objects the workload generates determines the cache
size. That's exactly what I'd say is necessary for an extent cache -
it will balloon out massively larger than the inode cache when you
have fragmented files, but if you have well formed files, it will
stay relatively small. However, the number of inodes doesn't change,
and hence what we have here is the optimal cache size proportions
change with workload...

i.e. the superblock fscache shrinker callout is the wrong thing to
use here asit doesn't model the relationship between objects at all
well. A separate shrinker instance for the extent cache is a much
better match....

> In the case where fs-specific objects are not removed when inodes are
> removed, this will not change the behavior of prune_super() in any
> appreciable way.  (Currently the only other user of this facility is
> XFS, and this change should not affect XFS's usage of this facility
> for this reason.)

It can change behaviour is surprisingly subtle, nasty ways.

The XFS superblock shrinker is what provides that memory reclaim
rate throttling for XFS, similar to the way the page cache throttles
writes to the rate at which dirty data can be written to disk. In
effect, it throttles the rate of cache allocation to the rate at
which we clean and free inodes and hence maintains a good system
balance.

The shrinker is also designed to prevent overloading and contention
in the case of concurent execution on large node count machines. It
prevents different CPUs/nodes executing reclaim on the same caches
and hence contending on locks. This also further throttles the rate
of reclaim by blocking shrinkers until they can do the work they
were asked to do by reclaim. IOWs, there are several layers of
throttling in the XFS shrinker, and the system balance is dependent
on the XFS inode shrinker being called appropriately.

With these two cache counters, XFS is guaranteed to see different
inode counts as shrinker reclaim always happens concurrently, not to
mention there is also background inode reclaim also running
concurrently. The result of decreasing counts (as will happen
frequently) is that the cwthis shrinker-based reclaim will not be
run, and hence inode reclaim will not get throttled to the rate at
which we can clean inodes or relcaim them without contention.

The result is that the system becomes unstable and unpredictable
under memory pressure, especially under workloads that dirty a lot
of metadata. Think "random OOM conditions" and "my system stopped
responding for minutes" type of issues....

Like I said, surprisingly subtle and nasty....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] fs: allow for fs-specific objects to be pruned as part of pruning inodes
  2013-01-23 13:32               ` Dave Chinner
@ 2013-01-23 16:34                 ` Theodore Ts'o
  2013-01-23 23:35                   ` Dave Chinner
  2013-01-24  8:58                 ` Andrew Morton
  1 sibling, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2013-01-23 16:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ext4 Developers List, gnehzuil.liu, linux-fsdevel

On Thu, Jan 24, 2013 at 12:32:31AM +1100, Dave Chinner wrote:
> Doesn't work. Shrinkers run concurrently on the same context, so a
> shrinker can be running on multiple CPUs and hence "interfere" with
> each other. i.e. A shrinker call on CPU 2 could see a reduction in
> a cache as a result of the same shrinker running on CPU 1 in the
> same context, and that would mean the shrinker on CPU 2 doesn't do
> the work it was asked (and needs) to do to reclaim memory.

Hmm, I had assumed that a fs would only have a single prune_super()
running at a time.  So you're telling me that was a bad assumption....

> It also seems somewhat incompatible with the proposed memcg/NUMA
> aware shrinker infrastructure, where the shrinker has much more
> fine-grained context for operation than the current infrastructure.
> This seems to assume that there is global context relationship
> between inode cache and the fs specific cache.

Can you point me at a mail archive with this proposed memcg-aware
shrinker?  I was noticing that that at time moment we're not doing any
shrinking at all on a per-memcg basis, and was reflecting on what a
mess that could cause....  I agree that's a problem that needs fixing,
although it seems fundamentally, hard, especially given that we
currently account for memcg memory usage on a per-page basis, and a
single object owned by a different memcg could prevent a page which
was originally allocated (and hence charged) to the first memcg....

> In your proposed use case, the ext4 extent cache size has no direct
> relationship to the size of the VFS inode cache - the can both
> change size independently and not impact the balance of the system
> as long as the hot objects are kept in their respective caches when
> under memory pressure.
> 
> i.e. the superblock fscache shrinker callout is the wrong thing to
> use here asit doesn't model the relationship between objects at all
> well. A separate shrinker instance for the extent cache is a much
> better match....

Yeah, that was Zheng's original implementation.  My concern was that
could cause the extent cache to get charged twice.  It would get hit
one time when we shrank the number of inodes, since the extent cache
currently does not have a lifetime independent of inodes (rather they
are linked to the inode via a tree structure), and then if we had a
separate extent cache shrinker, they would get reduced a second time.

The reason why we need the second shrinker, of course, is because of
the issue you raised; we could have some files which are heavily
fragmented, and hence would have many more extent cache objects, and
so we can't just rely on shrinking the inode cache to keep the growth
of the extent caches in check in a high memory pressure situation.

Hmm....  this is going to require more thought.  Do you have any
sugestions about what might be a better strategy?

Thanks,

						- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] fs: allow for fs-specific objects to be pruned as part of pruning inodes
  2013-01-23 16:34                 ` Theodore Ts'o
@ 2013-01-23 23:35                   ` Dave Chinner
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2013-01-23 23:35 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List, gnehzuil.liu, linux-fsdevel

On Wed, Jan 23, 2013 at 11:34:21AM -0500, Theodore Ts'o wrote:
> On Thu, Jan 24, 2013 at 12:32:31AM +1100, Dave Chinner wrote:
> > Doesn't work. Shrinkers run concurrently on the same context, so a
> > shrinker can be running on multiple CPUs and hence "interfere" with
> > each other. i.e. A shrinker call on CPU 2 could see a reduction in
> > a cache as a result of the same shrinker running on CPU 1 in the
> > same context, and that would mean the shrinker on CPU 2 doesn't do
> > the work it was asked (and needs) to do to reclaim memory.
> 
> Hmm, I had assumed that a fs would only have a single prune_super()
> running at a time.  So you're telling me that was a bad assumption....

Yes.

> > It also seems somewhat incompatible with the proposed memcg/NUMA
> > aware shrinker infrastructure, where the shrinker has much more
> > fine-grained context for operation than the current infrastructure.
> > This seems to assume that there is global context relationship
> > between inode cache and the fs specific cache.
> 
> Can you point me at a mail archive with this proposed memcg-aware
> shrinker?  I was noticing that that at time moment we're not doing any
> shrinking at all on a per-memcg basis, and was reflecting on what a
> mess that could cause....  I agree that's a problem that needs fixing,
> although it seems fundamentally, hard, especially given that we
> currently account for memcg memory usage on a per-page basis, and a
> single object owned by a different memcg could prevent a page which
> was originally allocated (and hence charged) to the first memcg....

http://oss.sgi.com/archives/xfs/2012-11/msg00643.html

The posting is for numa aware LRUs and shrinkers, and the discussion
follows on how to build memcg awareness on top of that generic
LRU/shrinker infrastructure

> > In your proposed use case, the ext4 extent cache size has no direct
> > relationship to the size of the VFS inode cache - the can both
> > change size independently and not impact the balance of the system
> > as long as the hot objects are kept in their respective caches when
> > under memory pressure.
> > 
> > i.e. the superblock fscache shrinker callout is the wrong thing to
> > use here asit doesn't model the relationship between objects at all
> > well. A separate shrinker instance for the extent cache is a much
> > better match....
> 
> Yeah, that was Zheng's original implementation.  My concern was that
> could cause the extent cache to get charged twice.  It would get hit
> one time when we shrank the number of inodes, since the extent cache
> currently does not have a lifetime independent of inodes (rather they
> are linked to the inode via a tree structure), and then if we had a
> separate extent cache shrinker, they would get reduced a second time.

The decision of how much to shrink a cache is made at the time the
shrinker is invoked, not for each call to the shrinker function. The
number to scan from each cache is based on a fixed value, and hence
all caches are put under the same pressure. The amount of objects to
scan is therefore dependent on the relative difference in the number
of objects in each cache. Hence if we remove objects from cache B
while scanning cache A, the shrinker for cache B will see less
objects in the cache and apply less pressure (i.e. scan less).

However, what you have to consider is that the micro-level
behaviour of a single shrinker call is not important. Shrinkers
often run at thousands of scan cycles per second, and so it's the
macro-level behaviour that results from the interactions of multiple
shrinkers that determines the system balance under memory pressure.
Design and tune for macro-level behaviour, not what seems right for
a single shrinker scan call...

> The reason why we need the second shrinker, of course, is because of
> the issue you raised; we could have some files which are heavily
> fragmented, and hence would have many more extent cache objects, and
> so we can't just rely on shrinking the inode cache to keep the growth
> of the extent caches in check in a high memory pressure situation.
> 
> Hmm....  this is going to require more thought.  Do you have any
> sugestions about what might be a better strategy?

In general, the shrinker mechanism balances separate caches pretty
well, so I'd just use a standard shrinker first. Observe the
behaviour under different workloads to see if the standard cache
balancing causes problems. If you see obvious high level imbalances
or performance problems then you need to start considering "special"
solutions.

The coarse knob the shrinkers have to affect this balance is the
"seeks" parameter of the shrinker. That tells the shrinker the
relative cost of replacing the object in the cache, and so has a
high level bias on the pressure the infrastructure places on the
cache. What you need to decide is whether the cost of replacing
objects is more or less expensive than the cost of replaing an inode
in cache, and bias from there.

The filesystem caches also have another "big hammer" knob in the
form of the /proc/sys/vm/vfs_cache_pressure sysctl. This makes the
caches look larger or smaller w.r.t. the page cache and hence biases
reclaim towards or away from the VFS caches. YOu can use this method
in individual shrinkers to cause the shrinker infrastructure to have
different reclaim characterisitics. Hence if you don't want to
reclaim from a cache, then just tell the shrinker it's size is
zero. (FWIW, the changed API in the above patch set makes this
biasing technique much easier and more reliable.)

I guess what I'm trying to say is just use a standard, stand-alone
shrinker and see how it behaves under real world conditions before
trying anything fancy. Often they "just work". :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] fs: allow for fs-specific objects to be pruned as part of pruning inodes
  2013-01-23 13:32               ` Dave Chinner
  2013-01-23 16:34                 ` Theodore Ts'o
@ 2013-01-24  8:58                 ` Andrew Morton
  2013-01-24 20:03                   ` Dave Chinner
  1 sibling, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2013-01-24  8:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Ts'o, Ext4 Developers List, gnehzuil.liu, linux-fsdevel

On Thu, 24 Jan 2013 00:32:31 +1100 Dave Chinner <david@fromorbit.com> wrote:

> Also, the superblock shrinker is designed around a direct 1:1:1
> dependency relationship between the superblock dentry, inode and "fs
> cache" objects. i.e. dentry pins inode pins fs cache object.  It is
> designed to keep a direct balance of the three caches by ensuring
> they get scanned in amounts directly proportional to the relative
> differences in their object counts.  That can't be done with
> separate shrinkers, hence the use of the superblock shrinker to
> define the dependent relationship between the caches.

I was staring at the code and at the 0e1fdafd9 changelog trying to work
out why prune_super() does its weird shrinker-in-a-shrinker thing.  And
failing.

IOW it needs a code comment, please.  Ideally one which explains *why*
"It is designed to keep a direct balance of the three caches...".  What
would go wrong if the fs were to just register its own shrinker in the
expected manner?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] fs: allow for fs-specific objects to be pruned as part of pruning inodes
  2013-01-24  8:58                 ` Andrew Morton
@ 2013-01-24 20:03                   ` Dave Chinner
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2013-01-24 20:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Ts'o, Ext4 Developers List, gnehzuil.liu, linux-fsdevel

On Thu, Jan 24, 2013 at 12:58:16AM -0800, Andrew Morton wrote:
> On Thu, 24 Jan 2013 00:32:31 +1100 Dave Chinner <david@fromorbit.com> wrote:
> 
> > Also, the superblock shrinker is designed around a direct 1:1:1
> > dependency relationship between the superblock dentry, inode and "fs
> > cache" objects. i.e. dentry pins inode pins fs cache object.  It is
> > designed to keep a direct balance of the three caches by ensuring
> > they get scanned in amounts directly proportional to the relative
> > differences in their object counts.  That can't be done with
> > separate shrinkers, hence the use of the superblock shrinker to
> > define the dependent relationship between the caches.
> 
> I was staring at the code and at the 0e1fdafd9 changelog trying to work
> out why prune_super() does its weird shrinker-in-a-shrinker thing.  And
> failing.

commit 8daaa83145ef1f0a146680618328dbbd0fa76939
Author: Dave Chinner <dchinner@redhat.com>
Date:   Fri Jul 8 14:14:46 2011 +1000

    xfs: make use of new shrinker callout for the inode cache
    
    Convert the inode reclaim shrinker to use the new per-sb shrinker
    operations. This allows much bigger reclaim batches to be used, and
    allows the XFS inode cache to be shrunk in proportion with the VFS
    dentry and inode caches. This avoids the problem of the VFS caches
    being shrunk significantly before the XFS inode cache is shrunk
    resulting in imbalances in the caches during reclaim.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Basically, it's to allow filesystems with their own inode cache
management to shrink the cache in direct proportion to the VFS inode
cache so that we don't get situations where the VFS caches get
shrunk to nothing while the filesystem inode cache stays large (and
hence no memory is reclaimed) and vice-versa. The reclaim imbalance
problem ended up so bad I could OOM machines simply by running a
metadata intensive workload....

> IOW it needs a code comment, please.  Ideally one which explains *why*
> "It is designed to keep a direct balance of the three caches...".  What
> would go wrong if the fs were to just register its own shrinker in the
> expected manner?

Nothing, unless there is a direct reclaim relationship between the
filesystem cache and the VFS caches and then independent shrinkers
cannot maintain the balanced relationship between the dependent
caches.  XFS uses normal shrinkers for it's other caches that don't
have a direct 1:1 relationship to the VFS inode and dentry caches,
and that works just fine.

As it is, if you just want to keep random caches at the same object
count as the inodes and dentries, then the fs callout will work just
fine. But if you want something that does not have an equivalent
object count relationship, then a separate shrinker is what is
needed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2013-01-24 20:03 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-11 10:53 [PATCH 0/7 v2] ext4: extent status tree (step2) Zheng Liu
2013-01-11 10:53 ` [PATCH 1/7 v2] ext4: refine extent status tree Zheng Liu
2013-01-23  4:20   ` Theodore Ts'o
2013-01-11 10:53 ` [PATCH 2/7 v2] ext4: remove EXT4_MAP_FROM_CLUSTER flag Zheng Liu
2013-01-11 10:53 ` [PATCH 3/7 v2] ext4: add physical block and status member into extent status tree Zheng Liu
2013-01-17  4:42   ` Theodore Ts'o
2013-01-18  9:49     ` Zheng Liu
2013-01-11 10:53 ` [PATCH 4/7 v2] ext4: adjust interfaces of " Zheng Liu
2013-01-11 10:53 ` [PATCH 5/7 v2] ext4: track all extent status in " Zheng Liu
2013-01-23  4:31   ` Theodore Ts'o
2013-01-11 10:53 ` [PATCH 6/7 v2] ext4: lookup block mapping " Zheng Liu
2013-01-18  4:00   ` Theodore Ts'o
2013-01-18  9:52     ` Zheng Liu
2013-01-11 10:53 ` [PATCH 7/7 v2] ext4: reclaim extents from " Zheng Liu
2013-01-18  5:19   ` Theodore Ts'o
2013-01-18  5:39     ` Theodore Ts'o
2013-01-21  7:24       ` Zheng Liu
2013-01-21 15:00         ` Theodore Ts'o
2013-01-21 17:09           ` Zheng Liu
2013-01-23  6:06             ` [PATCH] fs: allow for fs-specific objects to be pruned as part of pruning inodes Theodore Ts'o
2013-01-23 10:52               ` Jan Kara
2013-01-23 13:06               ` Carlos Maiolino
2013-01-23 13:32               ` Dave Chinner
2013-01-23 16:34                 ` Theodore Ts'o
2013-01-23 23:35                   ` Dave Chinner
2013-01-24  8:58                 ` Andrew Morton
2013-01-24 20:03                   ` Dave Chinner
2013-01-21 14:43       ` [PATCH 7/7 v2] ext4: reclaim extents from extent status tree Jan Kara
2013-01-21 15:12         ` Zheng Liu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.