* [PATCH 0/6 v3] ext4: Extent status tree shrinker improvements
@ 2014-11-14 10:45 Jan Kara
  2014-11-14 10:45 ` [PATCH 1/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks() Jan Kara
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Jan Kara @ 2014-11-14 10:45 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Zheng Liu, Jan Kara

  Hello,

  this is the next revision of Zheng's patches for improving the latency of
the ext4 extent status tree shrinker. Most notably, I have modified the
patches to avoid growing the structure of a cached extent, and addressed
other feedback from me and Ted.

Here are the measurements of the extent status tree shrinker (done in my
test VM with 6 CPUs and 2GB of RAM).

For my synthetic test maintaining lots of fragmented files, the results were
as follows:
Baseline:
stats:
  2194 objects
  1206 reclaimable objects
  299726/14244181 cache hits/misses
  7796 ms last sorted interval
  2002 inodes on lru list
average:
  858 us scan time
  116 shrunk objects
maximum:
  601 inode (10 objects, 10 reclaimable)
  332987 us max scan time

Patched:
stats:
  4351 objects
  2176 reclaimable objects
  21183492/2809261 cache hits/misses
  220 inodes on list
max nr_to_scan: 128 max inodes walked: 15
average:
  148 us scan time
  125 shrunk objects
maximum:
  1494 inode (20 objects, 10 reclaimable)
  4173 us max scan time

Also for Zheng's write fio workload we get noticeable improvements:
Baseline:
stats:
  261094 objects
  261094 reclaimable objects
  4030024/1063081 cache hits/misses
  366508 ms last sorted interval
  15001 inodes on lru list
average:
  330 us scan time
  125 shrunk objects
maximum:
  9217 inode (46 objects, 46 reclaimable)
  19525 us max scan time

Patched:
stats:
  496796 objects
  466436 reclaimable objects
  1322023/119385 cache hits/misses
  14825 inodes on list
max nr_to_scan: 128 max inodes walked: 3
average:
  112 us scan time
  125 shrunk objects
maximum:
  2 inode (87 objects, 87 reclaimable)
  7158 us max scan time

OTOH I can see a regression in max latency for the randwrite workload:
Baseline:
stats:
  35953 objects
  33264 reclaimable objects
  110208/1794665 cache hits/misses
  56396 ms last sorted interval
  251 inodes on lru list
average:
  286 us scan time
  125 shrunk objects
maximum:
  225 inode (849 objects, 838 reclaimable)
  4220 us max scan time

Patched:
stats:
  193489 objects
  193489 reclaimable objects
  1133707/65535 cache hits/misses
  251 inodes on list
max nr_to_scan: 128 max inodes walked: 6
average:
  180 us scan time
  125 shrunk objects
maximum:
  15123 inode (1931 objects, 1931 reclaimable)
  6458 us max scan time

  In general you can see there are many more objects cached in the patched
kernel. This is because of the first patch - we now have all the holes in the
files cached as well. This is also the reason for the much higher cache hit
ratio. I'm still somewhat unsatisfied with the worst case latency of the
shrinker, which is on the order of milliseconds for all the workloads. I did
some more instrumentation of the code and the reason for this is s_es_lock.
Under heavy load the wait time for this lock is on average ~100 us (3300 us
max). I have some ideas on how to improve on this but I didn't want to delay
posting of the series even more...

								Honza


* [PATCH 1/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  2014-11-14 10:45 [PATCH 0/6 v3] ext4: Extent status tree shrinker improvements Jan Kara
@ 2014-11-14 10:45 ` Jan Kara
  2014-11-23  5:43   ` Theodore Ts'o
  2014-11-14 10:45 ` [PATCH 2/6] ext4: Move LRU list handling into extent status code Jan Kara
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2014-11-14 10:45 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Zheng Liu, Zheng Liu, Andreas Dilger, Jan Kara

From: Zheng Liu <wenqing.lz@taobao.com>

Currently the extent status tree doesn't cache an extent hole when a write
looks it up in the extent tree to check whether a block has been allocated
or not.  We skip caching the hole in this case because the extent might
later be removed and a new delayed extent might be added back.  But this
causes a defect when we do a lot of writes: if the hole isn't in the extent
cache, each subsequent write also needs to access the extent tree to find
out whether a block has been allocated, causing a cache miss.  This commit
fixes the defect by caching the hole.  If the inode doesn't have any extent
yet, the resulting whole-file hole is cached as well.
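The effect can be illustrated with a toy model (hypothetical C, simplified
names, not the actual ext4 structures): once a lookup falls into a hole and
the hole is cached, later lookups in the same range hit the cache instead of
walking the extent tree again.

```c
#include <assert.h>

/* One-slot toy cache standing in for the extent status tree. */
struct toy_extent {
	unsigned int lblk;	/* first logical block of the cached range */
	unsigned int len;	/* number of blocks in the range */
	int hole;		/* 1 if this entry caches a hole */
	int valid;
};

static struct toy_extent toy_cache;
static int cache_hits, cache_misses;

/* Returns 1 on a cache hit, 0 on a miss (real code would walk the tree). */
static int toy_lookup(unsigned int lblk)
{
	if (toy_cache.valid && lblk >= toy_cache.lblk &&
	    lblk < toy_cache.lblk + toy_cache.len) {
		cache_hits++;
		return 1;
	}
	cache_misses++;
	return 0;
}

/* What the patch makes the write path do: remember the hole it found. */
static void toy_cache_hole(unsigned int lblk, unsigned int len)
{
	toy_cache.lblk = lblk;
	toy_cache.len = len;
	toy_cache.hole = 1;
	toy_cache.valid = 1;
}

/* A write into an unallocated range; put_hole mirrors the old vs new flow. */
static int toy_write_blocks(unsigned int lblk, int put_hole)
{
	if (toy_lookup(lblk))
		return 1;
	/* Miss: pretend the extent tree walk found a hole at [0; 1023]. */
	if (put_hole)
		toy_cache_hole(0, 1024);
	return 0;
}
```

With `put_hole` enabled, eight writes into the hole produce one miss and
seven hits; with the pre-patch behavior every write would miss.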

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h              |  4 +---
 fs/ext4/extents.c           | 31 ++++++++++++++++---------------
 fs/ext4/inode.c             |  6 ++----
 include/trace/events/ext4.h |  3 +--
 4 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c55a1faaed58..0718fc9743fb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -565,10 +565,8 @@ enum {
 #define EXT4_GET_BLOCKS_KEEP_SIZE		0x0080
 	/* Do not take i_data_sem locking in ext4_map_blocks */
 #define EXT4_GET_BLOCKS_NO_LOCK			0x0100
-	/* Do not put hole in extent cache */
-#define EXT4_GET_BLOCKS_NO_PUT_HOLE		0x0200
 	/* Convert written extents to unwritten */
-#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN	0x0400
+#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN	0x0200
 
 /*
  * The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 37043d0b2be8..0a97e039d0a0 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2306,16 +2306,16 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
 				ext4_lblk_t block)
 {
 	int depth = ext_depth(inode);
-	unsigned long len = 0;
-	ext4_lblk_t lblock = 0;
+	ext4_lblk_t len;
+	ext4_lblk_t lblock;
 	struct ext4_extent *ex;
+	struct extent_status es;
 
 	ex = path[depth].p_ext;
 	if (ex == NULL) {
-		/*
-		 * there is no extent yet, so gap is [0;-] and we
-		 * don't cache it
-		 */
+		/* there is no extent yet, so gap is [0;-] */
+		lblock = 0;
+		len = EXT_MAX_BLOCKS;
 		ext_debug("cache gap(whole file):");
 	} else if (block < le32_to_cpu(ex->ee_block)) {
 		lblock = block;
@@ -2324,9 +2324,6 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
 				block,
 				le32_to_cpu(ex->ee_block),
 				 ext4_ext_get_actual_len(ex));
-		if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
-			ext4_es_insert_extent(inode, lblock, len, ~0,
-					      EXTENT_STATUS_HOLE);
 	} else if (block >= le32_to_cpu(ex->ee_block)
 			+ ext4_ext_get_actual_len(ex)) {
 		ext4_lblk_t next;
@@ -2340,14 +2337,19 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
 				block);
 		BUG_ON(next == lblock);
 		len = next - lblock;
-		if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
-			ext4_es_insert_extent(inode, lblock, len, ~0,
-					      EXTENT_STATUS_HOLE);
 	} else {
 		BUG();
 	}
 
-	ext_debug(" -> %u:%lu\n", lblock, len);
+	ext4_es_find_delayed_extent_range(inode, lblock, lblock + len - 1, &es);
+	if (es.es_len) {
+		/* There's delayed extent containing lblock? */
+		if (es.es_lblk <= lblock)
+			return;
+		len = min(es.es_lblk - lblock, len);
+	}
+	ext_debug(" -> %u:%u\n", lblock, len);
+	ext4_es_insert_extent(inode, lblock, len, ~0, EXTENT_STATUS_HOLE);
 }
 
 /*
@@ -4357,8 +4359,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		 * put just found gap into cache to speed up
 		 * subsequent requests
 		 */
-		if ((flags & EXT4_GET_BLOCKS_NO_PUT_HOLE) == 0)
-			ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
+		ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
 		goto out2;
 	}
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e9777f93cf05..69601d288119 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1447,11 +1447,9 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 			map->m_flags |= EXT4_MAP_FROM_CLUSTER;
 		retval = 0;
 	} else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
-		retval = ext4_ext_map_blocks(NULL, inode, map,
-					     EXT4_GET_BLOCKS_NO_PUT_HOLE);
+		retval = ext4_ext_map_blocks(NULL, inode, map, 0);
 	else
-		retval = ext4_ind_map_blocks(NULL, inode, map,
-					     EXT4_GET_BLOCKS_NO_PUT_HOLE);
+		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
 
 add_delayed:
 	if (retval == 0) {
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index ff4bd1b35246..9337d36ff8c9 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -43,8 +43,7 @@ struct extent_status;
 	{ EXT4_GET_BLOCKS_METADATA_NOFAIL,	"METADATA_NOFAIL" },	\
 	{ EXT4_GET_BLOCKS_NO_NORMALIZE,		"NO_NORMALIZE" },	\
 	{ EXT4_GET_BLOCKS_KEEP_SIZE,		"KEEP_SIZE" },		\
-	{ EXT4_GET_BLOCKS_NO_LOCK,		"NO_LOCK" },		\
-	{ EXT4_GET_BLOCKS_NO_PUT_HOLE,		"NO_PUT_HOLE" })
+	{ EXT4_GET_BLOCKS_NO_LOCK,		"NO_LOCK" })
 
 #define show_mflags(flags) __print_flags(flags, "",	\
 	{ EXT4_MAP_NEW,		"N" },			\
-- 
1.8.1.4



* [PATCH 2/6] ext4: Move LRU list handling into extent status code
  2014-11-14 10:45 [PATCH 0/6 v3] ext4: Extent status tree shrinker improvements Jan Kara
  2014-11-14 10:45 ` [PATCH 1/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks() Jan Kara
@ 2014-11-14 10:45 ` Jan Kara
  2014-11-23  5:48   ` Theodore Ts'o
  2014-11-14 10:45 ` [PATCH 3/6] ext4: change LRU to round-robin in extent status tree shrinker Jan Kara
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2014-11-14 10:45 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Zheng Liu, Jan Kara

Currently, callers adding extents to the extent status tree are responsible
for adding the inode to the LRU list. This is error-prone and spreads LRU
list handling over unnecessarily many places.

Instead, add the inode to the LRU list automatically when the first
non-delayed extent is added to the tree, and remove it from the LRU list
when the last non-delayed extent is removed.
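The counting rule can be sketched like this (a standalone toy model; the
`toy_*` names are illustrative, not ext4 APIs - the real code keys off
`i_es_lru_nr` under `i_es_lock`):

```c
#include <assert.h>

static int es_shk_nr;	/* per-inode count of reclaimable (non-delayed) extents */
static int on_list;	/* 1 while the inode sits on the shrinker's list */

static void toy_alloc_extent(int delayed)
{
	if (!delayed) {
		/* 0 -> 1 transition: first reclaimable extent, join the list */
		if (!es_shk_nr++)
			on_list = 1;
	}
	/* delayed extents are never reclaimed, so they don't count */
}

static void toy_free_extent(int delayed)
{
	if (!delayed) {
		/* 1 -> 0 transition: last reclaimable extent, leave the list */
		if (!--es_shk_nr)
			on_list = 0;
	}
}
```

Because membership follows the counter transitions, no caller outside the
extent status code ever needs to touch the list.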

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/extents.c        |  2 --
 fs/ext4/extents_status.c | 10 ++++++----
 fs/ext4/extents_status.h |  2 --
 fs/ext4/inode.c          |  2 --
 fs/ext4/ioctl.c          |  2 --
 fs/ext4/super.c          |  1 -
 6 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0a97e039d0a0..923327719507 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4622,7 +4622,6 @@ out2:
 
 	trace_ext4_ext_map_blocks_exit(inode, flags, map,
 				       err ? err : allocated);
-	ext4_es_lru_add(inode);
 	return err ? err : allocated;
 }
 
@@ -5181,7 +5180,6 @@ int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		error = ext4_fill_fiemap_extents(inode, start_blk,
 						 len_blks, fieinfo);
 	}
-	ext4_es_lru_add(inode);
 	return error;
 }
 
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 94e7855ae71b..339a9e1678e6 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -314,7 +314,8 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
 	 * We don't count delayed extent because we never try to reclaim them
 	 */
 	if (!ext4_es_is_delayed(es)) {
-		EXT4_I(inode)->i_es_lru_nr++;
+		if (!EXT4_I(inode)->i_es_lru_nr++)
+			ext4_es_lru_add(inode);
 		percpu_counter_inc(&EXT4_SB(inode->i_sb)->
 					s_es_stats.es_stats_lru_cnt);
 	}
@@ -333,7 +334,8 @@ static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
 	/* Decrease the lru counter when this es is not delayed */
 	if (!ext4_es_is_delayed(es)) {
 		BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
-		EXT4_I(inode)->i_es_lru_nr--;
+		if (!--EXT4_I(inode)->i_es_lru_nr)
+			ext4_es_lru_del(inode);
 		percpu_counter_dec(&EXT4_SB(inode->i_sb)->
 					s_es_stats.es_stats_lru_cnt);
 	}
@@ -1225,7 +1227,7 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
 	unregister_shrinker(&sbi->s_es_shrinker);
 }
 
-void ext4_es_lru_add(struct inode *inode)
+static void ext4_es_lru_add(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -1241,7 +1243,7 @@ void ext4_es_lru_add(struct inode *inode)
 	spin_unlock(&sbi->s_es_lru_lock);
 }
 
-void ext4_es_lru_del(struct inode *inode)
+static void ext4_es_lru_del(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index efd5f970b501..9b5e7616b116 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -151,7 +151,5 @@ static inline void ext4_es_store_pblock_status(struct extent_status *es,
 
 extern int ext4_es_register_shrinker(struct ext4_sb_info *sbi);
 extern void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi);
-extern void ext4_es_lru_add(struct inode *inode);
-extern void ext4_es_lru_del(struct inode *inode);
 
 #endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 69601d288119..e9e580152720 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -491,7 +491,6 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 
 	/* Lookup extent status tree firstly */
 	if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
-		ext4_es_lru_add(inode);
 		if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
 			map->m_pblk = ext4_es_pblock(&es) +
 					map->m_lblk - es.es_lblk;
@@ -1393,7 +1392,6 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 
 	/* Lookup extent status tree firstly */
 	if (ext4_es_lookup_extent(inode, iblock, &es)) {
-		ext4_es_lru_add(inode);
 		if (ext4_es_is_hole(&es)) {
 			retval = 0;
 			down_read(&EXT4_I(inode)->i_data_sem);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index bfda18a15592..f58a0d106726 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -78,8 +78,6 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2)
 	memswap(&ei1->i_disksize, &ei2->i_disksize, sizeof(ei1->i_disksize));
 	ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS);
 	ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS);
-	ext4_es_lru_del(inode1);
-	ext4_es_lru_del(inode2);
 
 	isize = i_size_read(inode1);
 	i_size_write(inode1, i_size_read(inode2));
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 1eda6ab0ef9d..9f052d49553e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -972,7 +972,6 @@ void ext4_clear_inode(struct inode *inode)
 	dquot_drop(inode);
 	ext4_discard_preallocations(inode);
 	ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS);
-	ext4_es_lru_del(inode);
 	if (EXT4_I(inode)->jinode) {
 		jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode),
 					       EXT4_I(inode)->jinode);
-- 
1.8.1.4



* [PATCH 3/6] ext4: change LRU to round-robin in extent status tree shrinker
  2014-11-14 10:45 [PATCH 0/6 v3] ext4: Extent status tree shrinker improvements Jan Kara
  2014-11-14 10:45 ` [PATCH 1/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks() Jan Kara
  2014-11-14 10:45 ` [PATCH 2/6] ext4: Move LRU list handling into extent status code Jan Kara
@ 2014-11-14 10:45 ` Jan Kara
  2014-11-14 10:45 ` [PATCH 4/6] ext4: Limit number of scanned extents in " Jan Kara
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2014-11-14 10:45 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Zheng Liu, Zheng Liu, Andreas Dilger, Jan Kara

From: Zheng Liu <wenqing.lz@taobao.com>

In this commit we discard the LRU algorithm for inodes with an extent
status tree, because it takes significant effort to maintain an LRU list in
the extent status tree shrinker, and the shrinker can take a long time to
scan this LRU list in order to reclaim some objects.

We replace the LRU ordering with a simple round-robin.  After that we never
need to keep an LRU list, which also means the list needn't be sorted when
the shrinker cannot reclaim any objects in the first round.
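The round-robin idea reduces to "take the head, scan it, rotate it to the
tail", so successive shrinker invocations cycle through all inodes with no
sorting at all. A minimal array-as-ring sketch (toy model, not the kernel's
`list_head` implementation):

```c
#include <assert.h>

#define TOY_NR_INODES 4

/* Stand-in for sbi->s_es_list; values are fake inode numbers. */
static int toy_list[TOY_NR_INODES] = { 0, 1, 2, 3 };

/* Pick the head as the scan victim and rotate it to the tail. */
static int toy_shrink_one(void)
{
	int victim = toy_list[0];

	for (int i = 0; i < TOY_NR_INODES - 1; i++)
		toy_list[i] = toy_list[i + 1];
	toy_list[TOY_NR_INODES - 1] = victim;
	return victim;
}
```

Each call scans a different inode, and after one full cycle the walk wraps
around to the first inode again, so no inode is starved and no ordering
metadata (like `i_touch_when`) has to be maintained.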

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h              |  10 +-
 fs/ext4/extents_status.c    | 228 +++++++++++++++++++-------------------------
 fs/ext4/extents_status.h    |   3 +-
 fs/ext4/super.c             |   5 +-
 include/trace/events/ext4.h |  11 +--
 5 files changed, 111 insertions(+), 146 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0718fc9743fb..653d262e2a34 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -887,10 +887,9 @@ struct ext4_inode_info {
 	/* extents status tree */
 	struct ext4_es_tree i_es_tree;
 	rwlock_t i_es_lock;
-	struct list_head i_es_lru;
+	struct list_head i_es_list;
 	unsigned int i_es_all_nr;	/* protected by i_es_lock */
-	unsigned int i_es_lru_nr;	/* protected by i_es_lock */
-	unsigned long i_touch_when;	/* jiffies of last accessing */
+	unsigned int i_es_shk_nr;	/* protected by i_es_lock */
 
 	/* ialloc */
 	ext4_group_t	i_last_alloc_group;
@@ -1331,10 +1330,11 @@ struct ext4_sb_info {
 
 	/* Reclaim extents from extent status tree */
 	struct shrinker s_es_shrinker;
-	struct list_head s_es_lru;
+	struct list_head s_es_list;
+	long s_es_nr_inode;
 	struct ext4_es_stats s_es_stats;
 	struct mb_cache *s_mb_cache;
-	spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
+	spinlock_t s_es_lock ____cacheline_aligned_in_smp;
 
 	/* Ratelimit ext4 messages. */
 	struct ratelimit_state s_err_ratelimit_state;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 339a9e1678e6..de2d9d8bf22f 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -149,8 +149,8 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			      ext4_lblk_t end);
 static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 				       int nr_to_scan);
-static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
-			    struct ext4_inode_info *locked_ei);
+static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
+		       struct ext4_inode_info *locked_ei);
 
 int __init ext4_init_es(void)
 {
@@ -298,6 +298,36 @@ out:
 	trace_ext4_es_find_delayed_extent_range_exit(inode, es);
 }
 
+static void ext4_es_list_add(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+	if (!list_empty(&ei->i_es_list))
+		return;
+
+	spin_lock(&sbi->s_es_lock);
+	if (list_empty(&ei->i_es_list)) {
+		list_add_tail(&ei->i_es_list, &sbi->s_es_list);
+		sbi->s_es_nr_inode++;
+	}
+	spin_unlock(&sbi->s_es_lock);
+}
+
+static void ext4_es_list_del(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+	spin_lock(&sbi->s_es_lock);
+	if (!list_empty(&ei->i_es_list)) {
+		list_del_init(&ei->i_es_list);
+		sbi->s_es_nr_inode--;
+		WARN_ON_ONCE(sbi->s_es_nr_inode < 0);
+	}
+	spin_unlock(&sbi->s_es_lock);
+}
+
 static struct extent_status *
 ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
 		     ext4_fsblk_t pblk)
@@ -314,10 +344,10 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
 	 * We don't count delayed extent because we never try to reclaim them
 	 */
 	if (!ext4_es_is_delayed(es)) {
-		if (!EXT4_I(inode)->i_es_lru_nr++)
-			ext4_es_lru_add(inode);
+		if (!EXT4_I(inode)->i_es_shk_nr++)
+			ext4_es_list_add(inode);
 		percpu_counter_inc(&EXT4_SB(inode->i_sb)->
-					s_es_stats.es_stats_lru_cnt);
+					s_es_stats.es_stats_shk_cnt);
 	}
 
 	EXT4_I(inode)->i_es_all_nr++;
@@ -331,13 +361,13 @@ static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
 	EXT4_I(inode)->i_es_all_nr--;
 	percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
 
-	/* Decrease the lru counter when this es is not delayed */
+	/* Decrease the shrink counter when this es is not delayed */
 	if (!ext4_es_is_delayed(es)) {
-		BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
-		if (!--EXT4_I(inode)->i_es_lru_nr)
-			ext4_es_lru_del(inode);
+		BUG_ON(EXT4_I(inode)->i_es_shk_nr == 0);
+		if (!--EXT4_I(inode)->i_es_shk_nr)
+			ext4_es_list_del(inode);
 		percpu_counter_dec(&EXT4_SB(inode->i_sb)->
-					s_es_stats.es_stats_lru_cnt);
+					s_es_stats.es_stats_shk_cnt);
 	}
 
 	kmem_cache_free(ext4_es_cachep, es);
@@ -685,8 +715,8 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 		goto error;
 retry:
 	err = __es_insert_extent(inode, &newes);
-	if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
-					       EXT4_I(inode)))
+	if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb),
+					  1, EXT4_I(inode)))
 		goto retry;
 	if (err == -ENOMEM && !ext4_es_is_delayed(&newes))
 		err = 0;
@@ -843,8 +873,8 @@ retry:
 				es->es_lblk = orig_es.es_lblk;
 				es->es_len = orig_es.es_len;
 				if ((err == -ENOMEM) &&
-				    __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
-						     EXT4_I(inode)))
+				    __es_shrink(EXT4_SB(inode->i_sb),
+							1, EXT4_I(inode)))
 					goto retry;
 				goto out;
 			}
@@ -916,6 +946,11 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	end = lblk + len - 1;
 	BUG_ON(end < lblk);
 
+	/*
+	 * ext4_clear_inode() depends on us taking i_es_lock unconditionally
+	 * so that we are sure __es_shrink() is done with the inode before it
+	 * is reclaimed.
+	 */
 	write_lock(&EXT4_I(inode)->i_es_lock);
 	err = __es_remove_extent(inode, lblk, end);
 	write_unlock(&EXT4_I(inode)->i_es_lock);
@@ -923,114 +958,80 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	return err;
 }
 
-static int ext4_inode_touch_time_cmp(void *priv, struct list_head *a,
-				     struct list_head *b)
-{
-	struct ext4_inode_info *eia, *eib;
-	eia = list_entry(a, struct ext4_inode_info, i_es_lru);
-	eib = list_entry(b, struct ext4_inode_info, i_es_lru);
-
-	if (ext4_test_inode_state(&eia->vfs_inode, EXT4_STATE_EXT_PRECACHED) &&
-	    !ext4_test_inode_state(&eib->vfs_inode, EXT4_STATE_EXT_PRECACHED))
-		return 1;
-	if (!ext4_test_inode_state(&eia->vfs_inode, EXT4_STATE_EXT_PRECACHED) &&
-	    ext4_test_inode_state(&eib->vfs_inode, EXT4_STATE_EXT_PRECACHED))
-		return -1;
-	if (eia->i_touch_when == eib->i_touch_when)
-		return 0;
-	if (time_after(eia->i_touch_when, eib->i_touch_when))
-		return 1;
-	else
-		return -1;
-}
-
-static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
-			    struct ext4_inode_info *locked_ei)
+static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
+		       struct ext4_inode_info *locked_ei)
 {
 	struct ext4_inode_info *ei;
 	struct ext4_es_stats *es_stats;
-	struct list_head *cur, *tmp;
-	LIST_HEAD(skipped);
 	ktime_t start_time;
 	u64 scan_time;
+	int nr_to_walk;
 	int nr_shrunk = 0;
-	int retried = 0, skip_precached = 1, nr_skipped = 0;
+	int retried = 0, nr_skipped = 0;
 
 	es_stats = &sbi->s_es_stats;
 	start_time = ktime_get();
-	spin_lock(&sbi->s_es_lru_lock);
 
 retry:
-	list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
+	spin_lock(&sbi->s_es_lock);
+	nr_to_walk = sbi->s_es_nr_inode;
+	while (nr_to_walk-- > 0) {
 		int shrunk;
 
-		/*
-		 * If we have already reclaimed all extents from extent
-		 * status tree, just stop the loop immediately.
-		 */
-		if (percpu_counter_read_positive(
-				&es_stats->es_stats_lru_cnt) == 0)
-			break;
-
-		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
+		if (list_empty(&sbi->s_es_list)) {
+			spin_unlock(&sbi->s_es_lock);
+			goto out;
+		}
+		ei = list_first_entry(&sbi->s_es_list, struct ext4_inode_info,
+				      i_es_list);
+		/* Move the inode to the tail */
+		list_move(&ei->i_es_list, sbi->s_es_list.prev);
 
 		/*
-		 * Skip the inode that is newer than the last_sorted
-		 * time.  Normally we try hard to avoid shrinking
-		 * precached inodes, but we will as a last resort.
+		 * Normally we try hard to avoid shrinking precached inodes,
+		 * but we will as a last resort.
 		 */
-		if ((es_stats->es_stats_last_sorted < ei->i_touch_when) ||
-		    (skip_precached && ext4_test_inode_state(&ei->vfs_inode,
-						EXT4_STATE_EXT_PRECACHED))) {
+		if (!retried && ext4_test_inode_state(&ei->vfs_inode,
+						EXT4_STATE_EXT_PRECACHED)) {
 			nr_skipped++;
-			list_move_tail(cur, &skipped);
 			continue;
 		}
 
-		if (ei->i_es_lru_nr == 0 || ei == locked_ei ||
-		    !write_trylock(&ei->i_es_lock))
+		if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
+			nr_skipped++;
 			continue;
+		}
+		/*
+		 * Now we hold i_es_lock which protects us from inode reclaim
+		 * freeing inode under us
+		 */
+		spin_unlock(&sbi->s_es_lock);
 
 		shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
-		if (ei->i_es_lru_nr == 0)
-			list_del_init(&ei->i_es_lru);
 		write_unlock(&ei->i_es_lock);
 
 		nr_shrunk += shrunk;
 		nr_to_scan -= shrunk;
+
 		if (nr_to_scan == 0)
-			break;
+			goto out;
+		spin_lock(&sbi->s_es_lock);
 	}
-
-	/* Move the newer inodes into the tail of the LRU list. */
-	list_splice_tail(&skipped, &sbi->s_es_lru);
-	INIT_LIST_HEAD(&skipped);
+	spin_unlock(&sbi->s_es_lock);
 
 	/*
 	 * If we skipped any inodes, and we weren't able to make any
-	 * forward progress, sort the list and try again.
+	 * forward progress, try again to scan precached inodes.
 	 */
 	if ((nr_shrunk == 0) && nr_skipped && !retried) {
 		retried++;
-		list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
-		es_stats->es_stats_last_sorted = jiffies;
-		ei = list_first_entry(&sbi->s_es_lru, struct ext4_inode_info,
-				      i_es_lru);
-		/*
-		 * If there are no non-precached inodes left on the
-		 * list, start releasing precached extents.
-		 */
-		if (ext4_test_inode_state(&ei->vfs_inode,
-					  EXT4_STATE_EXT_PRECACHED))
-			skip_precached = 0;
 		goto retry;
 	}
 
-	spin_unlock(&sbi->s_es_lru_lock);
-
 	if (locked_ei && nr_shrunk == 0)
 		nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
 
+out:
 	scan_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
 	if (likely(es_stats->es_stats_scan_time))
 		es_stats->es_stats_scan_time = (scan_time +
@@ -1045,7 +1046,7 @@ retry:
 	else
 		es_stats->es_stats_shrunk = nr_shrunk;
 
-	trace_ext4_es_shrink(sbi->s_sb, nr_shrunk, scan_time, skip_precached,
+	trace_ext4_es_shrink(sbi->s_sb, nr_shrunk, scan_time,
 			     nr_skipped, retried);
 	return nr_shrunk;
 }
@@ -1057,7 +1058,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
 	struct ext4_sb_info *sbi;
 
 	sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
-	nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_lru_cnt);
+	nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
 	trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
 	return nr;
 }
@@ -1070,13 +1071,13 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
 	int nr_to_scan = sc->nr_to_scan;
 	int ret, nr_shrunk;
 
-	ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_lru_cnt);
+	ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
 	trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);
 
 	if (!nr_to_scan)
 		return ret;
 
-	nr_shrunk = __ext4_es_shrink(sbi, nr_to_scan, NULL);
+	nr_shrunk = __es_shrink(sbi, nr_to_scan, NULL);
 
 	trace_ext4_es_shrink_scan_exit(sbi->s_sb, nr_shrunk, ret);
 	return nr_shrunk;
@@ -1104,28 +1105,24 @@ static int ext4_es_seq_shrinker_info_show(struct seq_file *seq, void *v)
 		return 0;
 
 	/* here we just find an inode that has the max nr. of objects */
-	spin_lock(&sbi->s_es_lru_lock);
-	list_for_each_entry(ei, &sbi->s_es_lru, i_es_lru) {
+	spin_lock(&sbi->s_es_lock);
+	list_for_each_entry(ei, &sbi->s_es_list, i_es_list) {
 		inode_cnt++;
 		if (max && max->i_es_all_nr < ei->i_es_all_nr)
 			max = ei;
 		else if (!max)
 			max = ei;
 	}
-	spin_unlock(&sbi->s_es_lru_lock);
+	spin_unlock(&sbi->s_es_lock);
 
 	seq_printf(seq, "stats:\n  %lld objects\n  %lld reclaimable objects\n",
 		   percpu_counter_sum_positive(&es_stats->es_stats_all_cnt),
-		   percpu_counter_sum_positive(&es_stats->es_stats_lru_cnt));
+		   percpu_counter_sum_positive(&es_stats->es_stats_shk_cnt));
 	seq_printf(seq, "  %lu/%lu cache hits/misses\n",
 		   es_stats->es_stats_cache_hits,
 		   es_stats->es_stats_cache_misses);
-	if (es_stats->es_stats_last_sorted != 0)
-		seq_printf(seq, "  %u ms last sorted interval\n",
-			   jiffies_to_msecs(jiffies -
-					    es_stats->es_stats_last_sorted));
 	if (inode_cnt)
-		seq_printf(seq, "  %d inodes on lru list\n", inode_cnt);
+		seq_printf(seq, "  %d inodes on list\n", inode_cnt);
 
 	seq_printf(seq, "average:\n  %llu us scan time\n",
 	    div_u64(es_stats->es_stats_scan_time, 1000));
@@ -1134,7 +1131,7 @@ static int ext4_es_seq_shrinker_info_show(struct seq_file *seq, void *v)
 		seq_printf(seq,
 		    "maximum:\n  %lu inode (%u objects, %u reclaimable)\n"
 		    "  %llu us max scan time\n",
-		    max->vfs_inode.i_ino, max->i_es_all_nr, max->i_es_lru_nr,
+		    max->vfs_inode.i_ino, max->i_es_all_nr, max->i_es_shk_nr,
 		    div_u64(es_stats->es_stats_max_scan_time, 1000));
 
 	return 0;
@@ -1183,9 +1180,9 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
 {
 	int err;
 
-	INIT_LIST_HEAD(&sbi->s_es_lru);
-	spin_lock_init(&sbi->s_es_lru_lock);
-	sbi->s_es_stats.es_stats_last_sorted = 0;
+	INIT_LIST_HEAD(&sbi->s_es_list);
+	sbi->s_es_nr_inode = 0;
+	spin_lock_init(&sbi->s_es_lock);
 	sbi->s_es_stats.es_stats_shrunk = 0;
 	sbi->s_es_stats.es_stats_cache_hits = 0;
 	sbi->s_es_stats.es_stats_cache_misses = 0;
@@ -1194,7 +1191,7 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
 	err = percpu_counter_init(&sbi->s_es_stats.es_stats_all_cnt, 0, GFP_KERNEL);
 	if (err)
 		return err;
-	err = percpu_counter_init(&sbi->s_es_stats.es_stats_lru_cnt, 0, GFP_KERNEL);
+	err = percpu_counter_init(&sbi->s_es_stats.es_stats_shk_cnt, 0, GFP_KERNEL);
 	if (err)
 		goto err1;
 
@@ -1212,7 +1209,7 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
 	return 0;
 
 err2:
-	percpu_counter_destroy(&sbi->s_es_stats.es_stats_lru_cnt);
+	percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
 err1:
 	percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
 	return err;
@@ -1223,37 +1220,10 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
 	if (sbi->s_proc)
 		remove_proc_entry("es_shrinker_info", sbi->s_proc);
 	percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
-	percpu_counter_destroy(&sbi->s_es_stats.es_stats_lru_cnt);
+	percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
 	unregister_shrinker(&sbi->s_es_shrinker);
 }
 
-static void ext4_es_lru_add(struct inode *inode)
-{
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-
-	ei->i_touch_when = jiffies;
-
-	if (!list_empty(&ei->i_es_lru))
-		return;
-
-	spin_lock(&sbi->s_es_lru_lock);
-	if (list_empty(&ei->i_es_lru))
-		list_add_tail(&ei->i_es_lru, &sbi->s_es_lru);
-	spin_unlock(&sbi->s_es_lru_lock);
-}
-
-static void ext4_es_lru_del(struct inode *inode)
-{
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-
-	spin_lock(&sbi->s_es_lru_lock);
-	if (!list_empty(&ei->i_es_lru))
-		list_del_init(&ei->i_es_lru);
-	spin_unlock(&sbi->s_es_lru_lock);
-}
-
 static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 				       int nr_to_scan)
 {
@@ -1265,7 +1235,7 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
 				      DEFAULT_RATELIMIT_BURST);
 
-	if (ei->i_es_lru_nr == 0)
+	if (ei->i_es_shk_nr == 0)
 		return 0;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_EXT_PRECACHED) &&
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 9b5e7616b116..b0b78b95f481 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -65,14 +65,13 @@ struct ext4_es_tree {
 };
 
 struct ext4_es_stats {
-	unsigned long es_stats_last_sorted;
 	unsigned long es_stats_shrunk;
 	unsigned long es_stats_cache_hits;
 	unsigned long es_stats_cache_misses;
 	u64 es_stats_scan_time;
 	u64 es_stats_max_scan_time;
 	struct percpu_counter es_stats_all_cnt;
-	struct percpu_counter es_stats_lru_cnt;
+	struct percpu_counter es_stats_shk_cnt;
 };
 
 extern int __init ext4_init_es(void);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9f052d49553e..f108d84e7da2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -880,10 +880,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	spin_lock_init(&ei->i_prealloc_lock);
 	ext4_es_init_tree(&ei->i_es_tree);
 	rwlock_init(&ei->i_es_lock);
-	INIT_LIST_HEAD(&ei->i_es_lru);
+	INIT_LIST_HEAD(&ei->i_es_list);
 	ei->i_es_all_nr = 0;
-	ei->i_es_lru_nr = 0;
-	ei->i_touch_when = 0;
+	ei->i_es_shk_nr = 0;
 	ei->i_reserved_data_blocks = 0;
 	ei->i_reserved_meta_blocks = 0;
 	ei->i_allocated_meta_blocks = 0;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 9337d36ff8c9..27918caa8621 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2451,15 +2451,14 @@ TRACE_EVENT(ext4_collapse_range,
 
 TRACE_EVENT(ext4_es_shrink,
 	TP_PROTO(struct super_block *sb, int nr_shrunk, u64 scan_time,
-		 int skip_precached, int nr_skipped, int retried),
+		 int nr_skipped, int retried),
 
-	TP_ARGS(sb, nr_shrunk, scan_time, skip_precached, nr_skipped, retried),
+	TP_ARGS(sb, nr_shrunk, scan_time, nr_skipped, retried),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,		dev		)
 		__field(	int,		nr_shrunk	)
 		__field(	unsigned long long, scan_time	)
-		__field(	int,		skip_precached	)
 		__field(	int,		nr_skipped	)
 		__field(	int,		retried		)
 	),
@@ -2468,16 +2467,14 @@ TRACE_EVENT(ext4_es_shrink,
 		__entry->dev		= sb->s_dev;
 		__entry->nr_shrunk	= nr_shrunk;
 		__entry->scan_time	= div_u64(scan_time, 1000);
-		__entry->skip_precached = skip_precached;
 		__entry->nr_skipped	= nr_skipped;
 		__entry->retried	= retried;
 	),
 
-	TP_printk("dev %d,%d nr_shrunk %d, scan_time %llu skip_precached %d "
+	TP_printk("dev %d,%d nr_shrunk %d, scan_time %llu "
 		  "nr_skipped %d retried %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->nr_shrunk,
-		  __entry->scan_time, __entry->skip_precached,
-		  __entry->nr_skipped, __entry->retried)
+		  __entry->scan_time, __entry->nr_skipped, __entry->retried)
 );
 
 #endif /* _TRACE_EXT4_H */
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/6] ext4: Limit number of scanned extents in status tree shrinker
  2014-11-14 10:45 [PATCH 0/6 v3] ext4: Extent status tree shrinker improvements Jan Kara
                   ` (2 preceding siblings ...)
  2014-11-14 10:45 ` [PATCH 3/6] ext4: change LRU to round-robin in extent status tree shrinker Jan Kara
@ 2014-11-14 10:45 ` Jan Kara
  2014-11-14 10:45 ` [PATCH 5/6] ext4: Cleanup flag definitions for extent status tree Jan Kara
  2014-11-14 10:45 ` [PATCH 6/6] ext4: Introduce aging to " Jan Kara
  5 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2014-11-14 10:45 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Zheng Liu, Jan Kara

Currently we scan extent status trees of inodes until we reclaim nr_to_scan
extents. This can, however, require a lot of scanning when there are lots
of delayed extents (as those cannot be reclaimed).

Change the shrinker to work as shrinkers are supposed to and *scan* only
nr_to_scan extents regardless of how many extents we actually reclaimed.
However, we need to be careful to avoid scanning each status tree from the
beginning - that could lead to a situation where we would never be able to
reclaim anything when the first nr_to_scan extents in the tree are always
unreclaimable. We therefore remember, for each inode, the offset at which
we stopped scanning and continue from there the next time we come across
the inode.

Note that we also need to update the places calling __es_shrink() manually
to pass a reasonable nr_to_scan so that they have a chance of reclaiming
anything, instead of just 1 extent.
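
For illustration, the resume-and-wrap scan described above can be sketched
with a toy model. This is not the kernel code: `toy_tree`, `reclaimable[]`
and `toy_reclaim()` are made-up stand-ins for the per-inode rb-tree, the
delayed/non-delayed distinction and es_reclaim_extents().

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the per-inode extent status tree: a flat array of
 * entries, where reclaimable[i] == 0 models a delayed extent that must
 * be skipped.  resume_idx plays the role of i_es_shrink_lblk. */
struct toy_tree {
	int reclaimable[8];
	size_t nentries;
	size_t resume_idx;	/* where the next scan starts */
};

/*
 * Scan at most *nr_to_scan entries starting at resume_idx, wrapping past
 * the end (the kernel does this as two passes: [start, EXT_MAX_BLOCKS]
 * and then [0, start - 1]).  Returns the number of entries reclaimed and
 * remembers where to continue next time.
 */
static int toy_reclaim(struct toy_tree *t, int *nr_to_scan)
{
	int shrunk = 0;
	size_t scanned;

	for (scanned = 0; scanned < t->nentries && *nr_to_scan > 0; scanned++) {
		size_t i = (t->resume_idx + scanned) % t->nentries;

		(*nr_to_scan)--;
		if (t->reclaimable[i]) {
			t->reclaimable[i] = 0;	/* "free" the extent */
			shrunk++;
		}
	}
	t->resume_idx = (t->resume_idx + scanned) % t->nentries;
	return shrunk;
}
```

Even when a tree starts with a long run of unreclaimable entries, a fixed
nr_to_scan budget still makes progress, because each call resumes where the
previous one stopped instead of rescanning the same prefix.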

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h           |  5 ++-
 fs/ext4/extents_status.c | 91 +++++++++++++++++++++++++++++++-----------------
 fs/ext4/super.c          |  1 +
 3 files changed, 64 insertions(+), 33 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 653d262e2a34..1515486c671b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -890,6 +890,9 @@ struct ext4_inode_info {
 	struct list_head i_es_list;
 	unsigned int i_es_all_nr;	/* protected by i_es_lock */
 	unsigned int i_es_shk_nr;	/* protected by i_es_lock */
+	ext4_lblk_t i_es_shrink_lblk;	/* Offset where we start searching for
+					   extents to shrink. Protected by
+					   i_es_lock  */
 
 	/* ialloc */
 	ext4_group_t	i_last_alloc_group;
@@ -1330,7 +1333,7 @@ struct ext4_sb_info {
 
 	/* Reclaim extents from extent status tree */
 	struct shrinker s_es_shrinker;
-	struct list_head s_es_list;
+	struct list_head s_es_list;	/* List of inodes with reclaimable extents */
 	long s_es_nr_inode;
 	struct ext4_es_stats s_es_stats;
 	struct mb_cache *s_mb_cache;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index de2d9d8bf22f..8f2aac4006d2 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -147,8 +147,7 @@ static struct kmem_cache *ext4_es_cachep;
 static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			      ext4_lblk_t end);
-static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
-				       int nr_to_scan);
+static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan);
 static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
 		       struct ext4_inode_info *locked_ei);
 
@@ -716,7 +715,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 retry:
 	err = __es_insert_extent(inode, &newes);
 	if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb),
-					  1, EXT4_I(inode)))
+					  128, EXT4_I(inode)))
 		goto retry;
 	if (err == -ENOMEM && !ext4_es_is_delayed(&newes))
 		err = 0;
@@ -874,7 +873,7 @@ retry:
 				es->es_len = orig_es.es_len;
 				if ((err == -ENOMEM) &&
 				    __es_shrink(EXT4_SB(inode->i_sb),
-							1, EXT4_I(inode)))
+							128, EXT4_I(inode)))
 					goto retry;
 				goto out;
 			}
@@ -976,8 +975,6 @@ retry:
 	spin_lock(&sbi->s_es_lock);
 	nr_to_walk = sbi->s_es_nr_inode;
 	while (nr_to_walk-- > 0) {
-		int shrunk;
-
 		if (list_empty(&sbi->s_es_list)) {
 			spin_unlock(&sbi->s_es_lock);
 			goto out;
@@ -985,7 +982,7 @@ retry:
 		ei = list_first_entry(&sbi->s_es_list, struct ext4_inode_info,
 				      i_es_list);
 		/* Move the inode to the tail */
-		list_move(&ei->i_es_list, sbi->s_es_list.prev);
+		list_move_tail(&ei->i_es_list, &sbi->s_es_list);
 
 		/*
 		 * Normally we try hard to avoid shrinking precached inodes,
@@ -1007,13 +1004,10 @@ retry:
 		 */
 		spin_unlock(&sbi->s_es_lock);
 
-		shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
+		nr_shrunk += es_reclaim_extents(ei, &nr_to_scan);
 		write_unlock(&ei->i_es_lock);
 
-		nr_shrunk += shrunk;
-		nr_to_scan -= shrunk;
-
-		if (nr_to_scan == 0)
+		if (nr_to_scan <= 0)
 			goto out;
 		spin_lock(&sbi->s_es_lock);
 	}
@@ -1029,7 +1023,7 @@ retry:
 	}
 
 	if (locked_ei && nr_shrunk == 0)
-		nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
+		nr_shrunk = es_reclaim_extents(locked_ei, &nr_to_scan);
 
 out:
 	scan_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
@@ -1224,27 +1218,33 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
 	unregister_shrinker(&sbi->s_es_shrinker);
 }
 
-static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
-				       int nr_to_scan)
+/*
+ * Shrink extents in given inode from ei->i_es_shrink_lblk till end. Scan at
+ * most *nr_to_scan extents, update *nr_to_scan accordingly.
+ *
+ * Return 0 if we hit end of tree / interval, 1 if we exhausted nr_to_scan.
+ * Increment *nr_shrunk by the number of reclaimed extents. Also update
+ * ei->i_es_shrink_lblk to where we should continue scanning.
+ */
+static int es_do_reclaim_extents(struct ext4_inode_info *ei, ext4_lblk_t end,
+				 int *nr_to_scan, int *nr_shrunk)
 {
 	struct inode *inode = &ei->vfs_inode;
 	struct ext4_es_tree *tree = &ei->i_es_tree;
-	struct rb_node *node;
 	struct extent_status *es;
-	unsigned long nr_shrunk = 0;
-	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
-				      DEFAULT_RATELIMIT_BURST);
-
-	if (ei->i_es_shk_nr == 0)
-		return 0;
+	struct rb_node *node;
 
-	if (ext4_test_inode_state(inode, EXT4_STATE_EXT_PRECACHED) &&
-	    __ratelimit(&_rs))
-		ext4_warning(inode->i_sb, "forced shrink of precached extents");
+	es = __es_tree_search(&tree->root, ei->i_es_shrink_lblk);
+	if (!es)
+		goto out_wrap;
+	node = &es->rb_node;
+	while (*nr_to_scan > 0) {
+		if (es->es_lblk > end) {
+			ei->i_es_shrink_lblk = end + 1;
+			return 0;
+		}
 
-	node = rb_first(&tree->root);
-	while (node != NULL) {
-		es = rb_entry(node, struct extent_status, rb_node);
+		(*nr_to_scan)--;
 		node = rb_next(&es->rb_node);
 		/*
 		 * We can't reclaim delayed extent from status tree because
@@ -1253,11 +1253,38 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 		if (!ext4_es_is_delayed(es)) {
 			rb_erase(&es->rb_node, &tree->root);
 			ext4_es_free_extent(inode, es);
-			nr_shrunk++;
-			if (--nr_to_scan == 0)
-				break;
+			(*nr_shrunk)++;
 		}
+		if (!node)
+			goto out_wrap;
+		es = rb_entry(node, struct extent_status, rb_node);
 	}
-	tree->cache_es = NULL;
+	ei->i_es_shrink_lblk = es->es_lblk;
+	return 1;
+out_wrap:
+	ei->i_es_shrink_lblk = 0;
+	return 0;
+}
+
+static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan)
+{
+	struct inode *inode = &ei->vfs_inode;
+	int nr_shrunk = 0;
+	ext4_lblk_t start = ei->i_es_shrink_lblk;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+
+	if (ei->i_es_shk_nr == 0)
+		return 0;
+
+	if (ext4_test_inode_state(inode, EXT4_STATE_EXT_PRECACHED) &&
+	    __ratelimit(&_rs))
+		ext4_warning(inode->i_sb, "forced shrink of precached extents");
+
+	if (!es_do_reclaim_extents(ei, EXT_MAX_BLOCKS, nr_to_scan, &nr_shrunk) &&
+	    start != 0)
+		es_do_reclaim_extents(ei, start - 1, nr_to_scan, &nr_shrunk);
+
+	ei->i_es_tree.cache_es = NULL;
 	return nr_shrunk;
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f108d84e7da2..b53c243a142b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -883,6 +883,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	INIT_LIST_HEAD(&ei->i_es_list);
 	ei->i_es_all_nr = 0;
 	ei->i_es_shk_nr = 0;
+	ei->i_es_shrink_lblk = 0;
 	ei->i_reserved_data_blocks = 0;
 	ei->i_reserved_meta_blocks = 0;
 	ei->i_allocated_meta_blocks = 0;
-- 
1.8.1.4



* [PATCH 5/6] ext4: Cleanup flag definitions for extent status tree
  2014-11-14 10:45 [PATCH 0/6 v3] ext4: Extent status tree shrinker improvements Jan Kara
                   ` (3 preceding siblings ...)
  2014-11-14 10:45 ` [PATCH 4/6] ext4: Limit number of scanned extents in " Jan Kara
@ 2014-11-14 10:45 ` Jan Kara
  2014-11-14 10:45 ` [PATCH 6/6] ext4: Introduce aging to " Jan Kara
  5 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2014-11-14 10:45 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Zheng Liu, Jan Kara

Currently the flags for the extent status tree are defined twice, once
shifted and once without being shifted. Consolidate these definitions into
one place and make some computations automatic to make adding flags less
error prone. The compiler should be clever enough to figure out these are
constants and generate the same code.
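
The idea can be illustrated with a small self-contained sketch (hypothetical
names mirroring the patch, not the actual ext4 headers): the enum carries
only bit numbers, and the shift, mask and flag values fall out automatically,
so adding a flag means adding a single enum entry.

```c
#include <stdint.h>

typedef uint64_t fsblk_t;	/* stand-in for ext4_fsblk_t */

/* Only the bit numbers are written by hand; ES_FLAGS counts them. */
enum { ES_WRITTEN_B, ES_UNWRITTEN_B, ES_DELAYED_B, ES_HOLE_B, ES_FLAGS };

#define ES_SHIFT (sizeof(fsblk_t) * 8 - ES_FLAGS)	/* 60 for 4 flags */
#define ES_MASK  (~((fsblk_t)0) << ES_SHIFT)

#define STATUS_WRITTEN	(1 << ES_WRITTEN_B)

/* Mirrors the BUILD_BUG_ON in the patch: the low bits left over for the
 * physical block number must still hold a 48-bit value. */
_Static_assert(ES_SHIFT >= 48, "not enough bits for physical block number");

/* Pack flags into the high bits, physical block into the low bits. */
static fsblk_t store_pblock_status(fsblk_t pb, unsigned int status)
{
	return (((fsblk_t)status << ES_SHIFT) & ES_MASK) | (pb & ~ES_MASK);
}

static unsigned int get_status(fsblk_t es_pblk)
{
	return (unsigned int)(es_pblk >> ES_SHIFT);
}
```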

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/extents_status.c |  2 ++
 fs/ext4/extents_status.h | 58 ++++++++++++++++++++++--------------------------
 2 files changed, 28 insertions(+), 32 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 8f2aac4006d2..30596498ed0b 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1174,6 +1174,8 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
 {
 	int err;
 
+	/* Make sure we have enough bits for physical block number */
+	BUILD_BUG_ON(ES_SHIFT < 48);
 	INIT_LIST_HEAD(&sbi->s_es_list);
 	sbi->s_es_nr_inode = 0;
 	spin_lock_init(&sbi->s_es_lock);
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index b0b78b95f481..e86b1f34cfec 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -29,25 +29,21 @@
 /*
  * These flags live in the high bits of extent_status.es_pblk
  */
-#define ES_SHIFT	60
-
-#define EXTENT_STATUS_WRITTEN	(1 << 3)
-#define EXTENT_STATUS_UNWRITTEN (1 << 2)
-#define EXTENT_STATUS_DELAYED	(1 << 1)
-#define EXTENT_STATUS_HOLE	(1 << 0)
-
-#define EXTENT_STATUS_FLAGS	(EXTENT_STATUS_WRITTEN | \
-				 EXTENT_STATUS_UNWRITTEN | \
-				 EXTENT_STATUS_DELAYED | \
-				 EXTENT_STATUS_HOLE)
+enum {
+	ES_WRITTEN_B,
+	ES_UNWRITTEN_B,
+	ES_DELAYED_B,
+	ES_HOLE_B,
+	ES_FLAGS
+};
 
-#define ES_WRITTEN		(1ULL << 63)
-#define ES_UNWRITTEN		(1ULL << 62)
-#define ES_DELAYED		(1ULL << 61)
-#define ES_HOLE			(1ULL << 60)
+#define ES_SHIFT (sizeof(ext4_fsblk_t)*8 - ES_FLAGS)
+#define ES_MASK (~((ext4_fsblk_t)0) << ES_SHIFT)
 
-#define ES_MASK			(ES_WRITTEN | ES_UNWRITTEN | \
-				 ES_DELAYED | ES_HOLE)
+#define EXTENT_STATUS_WRITTEN	(1 << ES_WRITTEN_B)
+#define EXTENT_STATUS_UNWRITTEN (1 << ES_UNWRITTEN_B)
+#define EXTENT_STATUS_DELAYED	(1 << ES_DELAYED_B)
+#define EXTENT_STATUS_HOLE	(1 << ES_HOLE_B)
 
 struct ext4_sb_info;
 struct ext4_extent;
@@ -92,29 +88,29 @@ extern void ext4_es_find_delayed_extent_range(struct inode *inode,
 extern int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
 				 struct extent_status *es);
 
+static inline unsigned int ext4_es_status(struct extent_status *es)
+{
+	return es->es_pblk >> ES_SHIFT;
+}
+
 static inline int ext4_es_is_written(struct extent_status *es)
 {
-	return (es->es_pblk & ES_WRITTEN) != 0;
+	return (ext4_es_status(es) & EXTENT_STATUS_WRITTEN) != 0;
 }
 
 static inline int ext4_es_is_unwritten(struct extent_status *es)
 {
-	return (es->es_pblk & ES_UNWRITTEN) != 0;
+	return (ext4_es_status(es) & EXTENT_STATUS_UNWRITTEN) != 0;
 }
 
 static inline int ext4_es_is_delayed(struct extent_status *es)
 {
-	return (es->es_pblk & ES_DELAYED) != 0;
+	return (ext4_es_status(es) & EXTENT_STATUS_DELAYED) != 0;
 }
 
 static inline int ext4_es_is_hole(struct extent_status *es)
 {
-	return (es->es_pblk & ES_HOLE) != 0;
-}
-
-static inline unsigned int ext4_es_status(struct extent_status *es)
-{
-	return es->es_pblk >> ES_SHIFT;
+	return (ext4_es_status(es) & EXTENT_STATUS_HOLE) != 0;
 }
 
 static inline ext4_fsblk_t ext4_es_pblock(struct extent_status *es)
@@ -134,18 +130,16 @@ static inline void ext4_es_store_pblock(struct extent_status *es,
 static inline void ext4_es_store_status(struct extent_status *es,
 					unsigned int status)
 {
-	es->es_pblk = (((ext4_fsblk_t)
-			(status & EXTENT_STATUS_FLAGS) << ES_SHIFT) |
-		       (es->es_pblk & ~ES_MASK));
+	es->es_pblk = (((ext4_fsblk_t)status << ES_SHIFT) & ES_MASK) |
+		      (es->es_pblk & ~ES_MASK);
 }
 
 static inline void ext4_es_store_pblock_status(struct extent_status *es,
 					       ext4_fsblk_t pb,
 					       unsigned int status)
 {
-	es->es_pblk = (((ext4_fsblk_t)
-			(status & EXTENT_STATUS_FLAGS) << ES_SHIFT) |
-		       (pb & ~ES_MASK));
+	es->es_pblk = (((ext4_fsblk_t)status << ES_SHIFT) & ES_MASK) |
+		      (pb & ~ES_MASK);
 }
 
 extern int ext4_es_register_shrinker(struct ext4_sb_info *sbi);
-- 
1.8.1.4



* [PATCH 6/6] ext4: Introduce aging to extent status tree
  2014-11-14 10:45 [PATCH 0/6 v3] ext4: Extent status tree shrinker improvements Jan Kara
                   ` (4 preceding siblings ...)
  2014-11-14 10:45 ` [PATCH 5/6] ext4: Cleanup flag definitions for extent status tree Jan Kara
@ 2014-11-14 10:45 ` Jan Kara
  5 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2014-11-14 10:45 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Zheng Liu, Jan Kara

Introduce simple aging to the extent status tree. Each extent has a
REFERENCED bit which gets set when the extent is used. The shrinker then
skips entries with the referenced bit set and clears the bit. Thus
frequently used extents have a higher chance of staying in memory.
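
This is essentially a second-chance (clock) policy. A toy sketch of the
per-entry decision (illustrative only, names made up, not the kernel code):

```c
/* Toy second-chance aging: a lookup marks the entry referenced; the
 * reclaim pass then clears the bit instead of freeing, so an entry is
 * only freed once it has not been used since the previous pass. */
struct toy_extent {
	int in_use;
	int referenced;
};

static void toy_lookup(struct toy_extent *e)
{
	e->referenced = 1;
}

/* Returns 1 if the entry was freed, 0 if it was given a second chance. */
static int toy_reclaim_one(struct toy_extent *e)
{
	if (e->referenced) {
		e->referenced = 0;	/* age it, skip for now */
		return 0;
	}
	e->in_use = 0;			/* cold since last pass: free it */
	return 1;
}
```

An extent that keeps being looked up keeps having its bit re-set between
passes and is never freed; an idle one survives at most one extra pass.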

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/extents_status.c | 22 +++++++++++++++++-----
 fs/ext4/extents_status.h | 35 +++++++++++++++++++++++++++++++----
 2 files changed, 48 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 30596498ed0b..e04d45733976 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -382,7 +382,7 @@ static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
 static int ext4_es_can_be_merged(struct extent_status *es1,
 				 struct extent_status *es2)
 {
-	if (ext4_es_status(es1) != ext4_es_status(es2))
+	if (ext4_es_type(es1) != ext4_es_type(es2))
 		return 0;
 
 	if (((__u64) es1->es_len) + es2->es_len > EXT_MAX_BLOCKS) {
@@ -425,6 +425,8 @@ ext4_es_try_to_merge_left(struct inode *inode, struct extent_status *es)
 	es1 = rb_entry(node, struct extent_status, rb_node);
 	if (ext4_es_can_be_merged(es1, es)) {
 		es1->es_len += es->es_len;
+		if (ext4_es_is_referenced(es))
+			ext4_es_set_referenced(es1);
 		rb_erase(&es->rb_node, &tree->root);
 		ext4_es_free_extent(inode, es);
 		es = es1;
@@ -447,6 +449,8 @@ ext4_es_try_to_merge_right(struct inode *inode, struct extent_status *es)
 	es1 = rb_entry(node, struct extent_status, rb_node);
 	if (ext4_es_can_be_merged(es, es1)) {
 		es->es_len += es1->es_len;
+		if (ext4_es_is_referenced(es1))
+			ext4_es_set_referenced(es);
 		rb_erase(node, &tree->root);
 		ext4_es_free_extent(inode, es1);
 	}
@@ -813,6 +817,8 @@ out:
 		es->es_lblk = es1->es_lblk;
 		es->es_len = es1->es_len;
 		es->es_pblk = es1->es_pblk;
+		if (!ext4_es_is_referenced(es))
+			ext4_es_set_referenced(es);
 		stats->es_stats_cache_hits++;
 	} else {
 		stats->es_stats_cache_misses++;
@@ -1252,11 +1258,17 @@ static int es_do_reclaim_extents(struct ext4_inode_info *ei, ext4_lblk_t end,
 		 * We can't reclaim delayed extent from status tree because
 		 * fiemap, bigalloc, and seek_data/hole need to use it.
 		 */
-		if (!ext4_es_is_delayed(es)) {
-			rb_erase(&es->rb_node, &tree->root);
-			ext4_es_free_extent(inode, es);
-			(*nr_shrunk)++;
+		if (ext4_es_is_delayed(es))
+			goto next;
+		if (ext4_es_is_referenced(es)) {
+			ext4_es_clear_referenced(es);
+			goto next;
 		}
+
+		rb_erase(&es->rb_node, &tree->root);
+		ext4_es_free_extent(inode, es);
+		(*nr_shrunk)++;
+next:
 		if (!node)
 			goto out_wrap;
 		es = rb_entry(node, struct extent_status, rb_node);
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index e86b1f34cfec..691b52613ce4 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -34,6 +34,7 @@ enum {
 	ES_UNWRITTEN_B,
 	ES_DELAYED_B,
 	ES_HOLE_B,
+	ES_REFERENCED_B,
 	ES_FLAGS
 };
 
@@ -44,6 +45,12 @@ enum {
 #define EXTENT_STATUS_UNWRITTEN (1 << ES_UNWRITTEN_B)
 #define EXTENT_STATUS_DELAYED	(1 << ES_DELAYED_B)
 #define EXTENT_STATUS_HOLE	(1 << ES_HOLE_B)
+#define EXTENT_STATUS_REFERENCED	(1 << ES_REFERENCED_B)
+
+#define ES_TYPE_MASK	((ext4_fsblk_t)(EXTENT_STATUS_WRITTEN | \
+			  EXTENT_STATUS_UNWRITTEN | \
+			  EXTENT_STATUS_DELAYED | \
+			  EXTENT_STATUS_HOLE) << ES_SHIFT)
 
 struct ext4_sb_info;
 struct ext4_extent;
@@ -93,24 +100,44 @@ static inline unsigned int ext4_es_status(struct extent_status *es)
 	return es->es_pblk >> ES_SHIFT;
 }
 
+static inline unsigned int ext4_es_type(struct extent_status *es)
+{
+	return (es->es_pblk & ES_TYPE_MASK) >> ES_SHIFT;
+}
+
 static inline int ext4_es_is_written(struct extent_status *es)
 {
-	return (ext4_es_status(es) & EXTENT_STATUS_WRITTEN) != 0;
+	return (ext4_es_type(es) & EXTENT_STATUS_WRITTEN) != 0;
 }
 
 static inline int ext4_es_is_unwritten(struct extent_status *es)
 {
-	return (ext4_es_status(es) & EXTENT_STATUS_UNWRITTEN) != 0;
+	return (ext4_es_type(es) & EXTENT_STATUS_UNWRITTEN) != 0;
 }
 
 static inline int ext4_es_is_delayed(struct extent_status *es)
 {
-	return (ext4_es_status(es) & EXTENT_STATUS_DELAYED) != 0;
+	return (ext4_es_type(es) & EXTENT_STATUS_DELAYED) != 0;
 }
 
 static inline int ext4_es_is_hole(struct extent_status *es)
 {
-	return (ext4_es_status(es) & EXTENT_STATUS_HOLE) != 0;
+	return (ext4_es_type(es) & EXTENT_STATUS_HOLE) != 0;
+}
+
+static inline void ext4_es_set_referenced(struct extent_status *es)
+{
+	es->es_pblk |= ((ext4_fsblk_t)EXTENT_STATUS_REFERENCED) << ES_SHIFT;
+}
+
+static inline void ext4_es_clear_referenced(struct extent_status *es)
+{
+	es->es_pblk &= ~(((ext4_fsblk_t)EXTENT_STATUS_REFERENCED) << ES_SHIFT);
+}
+
+static inline int ext4_es_is_referenced(struct extent_status *es)
+{
+	return (ext4_es_status(es) & EXTENT_STATUS_REFERENCED) != 0;
 }
 
 static inline ext4_fsblk_t ext4_es_pblock(struct extent_status *es)
-- 
1.8.1.4



* Re: [PATCH 1/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  2014-11-14 10:45 ` [PATCH 1/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks() Jan Kara
@ 2014-11-23  5:43   ` Theodore Ts'o
  2014-11-24  9:34     ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2014-11-23  5:43 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Zheng Liu, Zheng Liu, Andreas Dilger

On Fri, Nov 14, 2014 at 11:45:52AM +0100, Jan Kara wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> Currently extent status tree doesn't cache extent hole when a write
> looks up in extent tree to make sure whether a block has been allocated
> or not.  In this case, we don't put extent hole in extent cache because
> later this extent might be removed and a new delayed extent might be
> added back.  But it will cause a defect when we do a lot of writes.  If
> we don't put extent hole in extent cache, the following writes also need
> to access extent tree to look at whether or not a block has been
> allocated.  It brings a cache miss.  This commit fixes this defect.
> Also if the inode doesn't have any extent, this extent hole will be
> cached as well.
> 
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

This patch (either when applied alone, or as part of the entire patch
series) is causing an xfstests regression test failure:

FSTYP         -- ext4
PLATFORM      -- Linux/i686 kvm-xfstests 3.18.0-rc3-00537-g932e3f9
MKFS_OPTIONS  -- -q /dev/vdc
MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc

ext4/003 4s ...	[00:14:39] [00:14:48] - output mismatch (see /results/results-4k/ext4/003.out.bad)
    --- tests/ext4/003.out	2014-10-31 10:13:03.000000000 -0400
    +++ /results/results-4k/ext4/003.out.bad	2014-11-23 00:14:48.935587897 -0500
    @@ -1,3 +1,3 @@
     QA output created by 003
    -wrote 268435456/268435456 bytes at offset 0
    +wrote 241762304/268435456 bytes at offset 0
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    ...
    (Run 'diff -u tests/ext4/003.out /results/results-4k/ext4/003.out.bad'  to see the entire diff)

				- Ted


* Re: [PATCH 2/6] ext4: Move LRU list handling into extent status code
  2014-11-14 10:45 ` [PATCH 2/6] ext4: Move LRU list handling into extent status code Jan Kara
@ 2014-11-23  5:48   ` Theodore Ts'o
  2014-11-24  9:32     ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2014-11-23  5:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Zheng Liu

On Fri, Nov 14, 2014 at 11:45:53AM +0100, Jan Kara wrote:
> Currently callers adding extents to extent status tree were responsible
> for adding the inode to LRU list. This is error prone and puts LRU list
> handling in unnecessarily many places.
> 
> Just add inode to LRU automatically when the first non-delay extent is
> added to the tree and remove inode from LRU when the last non-delay
> extent is removed.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

While trying to bisect the ext4/003 regression, I found that patches
one and two applied together cause the following deadlock (very
early in the boot sequence; before xfstests is started):

[   24.680699] 
[   24.681110] =============================================
[   24.682755] [ INFO: possible recursive locking detected ]
[   24.683344] 3.18.0-rc3-00538-gc12044b #2369 Not tainted
[   24.683344] ---------------------------------------------
[   24.683344] runtests.sh/2772 is trying to acquire lock:
[   24.683344]  (&(&sbi->s_es_lru_lock)->rlock){+.+...}, at: [<c02e870b>] ext4_es_free_extent+0x6f/0xc5
[   24.683344] 
[   24.683344] but task is already holding lock:
[   24.683344]  (&(&sbi->s_es_lru_lock)->rlock){+.+...}, at: [<c02e8857>] __ext4_es_shrink+0x3a/0x346
[   24.683344] 
[   24.683344] other info that might help us debug this:
[   24.683344]  Possible unsafe locking scenario:
[   24.683344] 
[   24.683344]        CPU0
[   24.683344]        ----
[   24.683344]   lock(&(&sbi->s_es_lru_lock)->rlock);
[   24.683344]   lock(&(&sbi->s_es_lru_lock)->rlock);
[   24.683344] 
[   24.683344]  *** DEADLOCK ***
[   24.683344] 
[   24.683344]  May be due to missing lock nesting notation
[   24.683344] 
[   24.683344] 4 locks held by runtests.sh/2772:
[   24.683344]  #0:  (sb_writers#5){.+.+.+}, at: [<c024a69d>] file_start_write+0x24/0x26
[   24.683344]  #1:  (shrinker_rwsem){++++..}, at: [<c021cca2>] shrink_slab+0x29/0xcd
[   24.683344]  #2:  (&(&sbi->s_es_lru_lock)->rlock){+.+...}, at: [<c02e8857>] __ext4_es_shrink+0x3a/0x346
[   24.683344]  #3:  (&ei->i_es_lock){++++..}, at: [<c02e8900>] __ext4_es_shrink+0xe3/0x346
[   24.683344] 
[   24.683344] stack backtrace:
[   24.683344] CPU: 1 PID: 2772 Comm: runtests.sh Not tainted 3.18.0-rc3-00538-gc12044b #2369
[   24.683344] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   24.683344]  00000000 00000000 f3f55d4c c091b746 f343e0d0 f3f55dc0 c019fa14 c0c0cc77
[   24.683344]  c0c0d6ea c0c0cb76 00000002 f343e694 c149d110 00003900 00000000 24b02008
[   24.683344]  4c422213 f343e694 00000000 00012581 00000004 f343e0d0 f343e6b8 f343e6bc
[   24.683344] Call Trace:
[   24.683344]  [<c091b746>] dump_stack+0x48/0x60
[   24.683344]  [<c019fa14>] __lock_acquire+0xb7d/0xcd9
[   24.683344]  [<c019f1f6>] ? __lock_acquire+0x35f/0xcd9
[   24.683344]  [<c019feb8>] lock_acquire+0xe7/0x15e
[   24.683344]  [<c02e870b>] ? ext4_es_free_extent+0x6f/0xc5
[   24.683344]  [<c0923cd1>] _raw_spin_lock+0x2a/0x5a
[   24.683344]  [<c02e870b>] ? ext4_es_free_extent+0x6f/0xc5
[   24.683344]  [<c02e870b>] ext4_es_free_extent+0x6f/0xc5
[   24.683344]  [<c02e87ff>] __es_try_to_reclaim_extents+0x9e/0xbc
[   24.683344]  [<c02e890e>] __ext4_es_shrink+0xf1/0x346
[   24.683344]  [<c02e8c38>] ext4_es_scan+0xd5/0x1fc
[   24.683344]  [<c021c8c5>] shrink_slab_node+0x196/0x321
[   24.683344]  [<c021ccd8>] shrink_slab+0x5f/0xcd
[   24.683344]  [<c0287306>] ? drop_pagecache_sb+0xbf/0xbf
[   24.683344]  [<c0287382>] drop_caches_sysctl_handler+0x7c/0xd2
[   24.683344]  [<c0296647>] proc_sys_call_handler+0x7b/0x9c
[   24.683344]  [<c0296668>] ? proc_sys_call_handler+0x9c/0x9c
[   24.683344]  [<c029667a>] proc_sys_write+0x12/0x14
[   24.683344]  [<c024ae85>] vfs_write+0x8c/0xf7
[   24.683344]  [<c024b21c>] SyS_write+0x4f/0x7c
[   24.683344]  [<c0924a2a>] syscall_call+0x7/0x7

We end up removing all of the LRU code in the next patch in the patch
series, so this is not a major problem, but it can trip people up when
they are doing bisects.  It may be simplest just to combine patches 2
and 3 in this series.

						- Ted


* Re: [PATCH 2/6] ext4: Move LRU list handling into extent status code
  2014-11-23  5:48   ` Theodore Ts'o
@ 2014-11-24  9:32     ` Jan Kara
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2014-11-24  9:32 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4, Zheng Liu

On Sun 23-11-14 00:48:02, Ted Tso wrote:
> On Fri, Nov 14, 2014 at 11:45:53AM +0100, Jan Kara wrote:
> > Currently callers adding extents to extent status tree were responsible
> > for adding the inode to LRU list. This is error prone and puts LRU list
> > handling in unnecessarily many places.
> > 
> > Just add inode to LRU automatically when the first non-delay extent is
> > added to the tree and remove inode from LRU when the last non-delay
> > extent is removed.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> While trying to bisect the ext4/003 regression, I found that patches
> one and two applied together cause the following deadlock (very
> early in the boot sequence; before xfstests is started):
> 
> [   24.680699] 
> [   24.681110] =============================================
> [   24.682755] [ INFO: possible recursive locking detected ]
> [   24.683344] 3.18.0-rc3-00538-gc12044b #2369 Not tainted
> [   24.683344] ---------------------------------------------
> [   24.683344] runtests.sh/2772 is trying to acquire lock:
> [   24.683344]  (&(&sbi->s_es_lru_lock)->rlock){+.+...}, at: [<c02e870b>] ext4_es_free_extent+0x6f/0xc5
> [   24.683344] 
> [   24.683344] but task is already holding lock:
> [   24.683344]  (&(&sbi->s_es_lru_lock)->rlock){+.+...}, at: [<c02e8857>] __ext4_es_shrink+0x3a/0x346
> [   24.683344] 
> [   24.683344] other info that might help us debug this:
> [   24.683344]  Possible unsafe locking scenario:
> [   24.683344] 
> [   24.683344]        CPU0
> [   24.683344]        ----
> [   24.683344]   lock(&(&sbi->s_es_lru_lock)->rlock);
> [   24.683344]   lock(&(&sbi->s_es_lru_lock)->rlock);
> [   24.683344] 
> [   24.683344]  *** DEADLOCK ***
> [   24.683344] 
> [   24.683344]  May be due to missing lock nesting notation
> [   24.683344] 
> [   24.683344] 4 locks held by runtests.sh/2772:
> [   24.683344]  #0:  (sb_writers#5){.+.+.+}, at: [<c024a69d>] file_start_write+0x24/0x26
> [   24.683344]  #1:  (shrinker_rwsem){++++..}, at: [<c021cca2>] shrink_slab+0x29/0xcd
> [   24.683344]  #2:  (&(&sbi->s_es_lru_lock)->rlock){+.+...}, at: [<c02e8857>] __ext4_es_shrink+0x3a/0x346
> [   24.683344]  #3:  (&ei->i_es_lock){++++..}, at: [<c02e8900>] __ext4_es_shrink+0xe3/0x346
> [   24.683344] 
> [   24.683344] stack backtrace:
> [   24.683344] CPU: 1 PID: 2772 Comm: runtests.sh Not tainted 3.18.0-rc3-00538-gc12044b #2369
> [   24.683344] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   24.683344]  00000000 00000000 f3f55d4c c091b746 f343e0d0 f3f55dc0 c019fa14 c0c0cc77
> [   24.683344]  c0c0d6ea c0c0cb76 00000002 f343e694 c149d110 00003900 00000000 24b02008
> [   24.683344]  4c422213 f343e694 00000000 00012581 00000004 f343e0d0 f343e6b8 f343e6bc
> [   24.683344] Call Trace:
> [   24.683344]  [<c091b746>] dump_stack+0x48/0x60
> [   24.683344]  [<c019fa14>] __lock_acquire+0xb7d/0xcd9
> [   24.683344]  [<c019f1f6>] ? __lock_acquire+0x35f/0xcd9
> [   24.683344]  [<c019feb8>] lock_acquire+0xe7/0x15e
> [   24.683344]  [<c02e870b>] ? ext4_es_free_extent+0x6f/0xc5
> [   24.683344]  [<c0923cd1>] _raw_spin_lock+0x2a/0x5a
> [   24.683344]  [<c02e870b>] ? ext4_es_free_extent+0x6f/0xc5
> [   24.683344]  [<c02e870b>] ext4_es_free_extent+0x6f/0xc5
> [   24.683344]  [<c02e87ff>] __es_try_to_reclaim_extents+0x9e/0xbc
> [   24.683344]  [<c02e890e>] __ext4_es_shrink+0xf1/0x346
> [   24.683344]  [<c02e8c38>] ext4_es_scan+0xd5/0x1fc
> [   24.683344]  [<c021c8c5>] shrink_slab_node+0x196/0x321
> [   24.683344]  [<c021ccd8>] shrink_slab+0x5f/0xcd
> [   24.683344]  [<c0287306>] ? drop_pagecache_sb+0xbf/0xbf
> [   24.683344]  [<c0287382>] drop_caches_sysctl_handler+0x7c/0xd2
> [   24.683344]  [<c0296647>] proc_sys_call_handler+0x7b/0x9c
> [   24.683344]  [<c0296668>] ? proc_sys_call_handler+0x9c/0x9c
> [   24.683344]  [<c029667a>] proc_sys_write+0x12/0x14
> [   24.683344]  [<c024ae85>] vfs_write+0x8c/0xf7
> [   24.683344]  [<c024b21c>] SyS_write+0x4f/0x7c
> [   24.683344]  [<c0924a2a>] syscall_call+0x7/0x7
> 
> We end up removing all of the LRU code in the next patch in the patch
> series, so this is not a major problem, but it can trip people up when
> they are doing bisects.  It may be simplest just to combine patches 2
> and 3 in this series.
  Thanks for spotting this. The changes are fairly different so I don't
like combining the two patches. But we can just swap them in the series
without too much trouble and that should resolve the problem as well.
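For readers decoding lockdep's notation: the report above shows a single task taking `s_es_lru_lock` while already holding it, which is a self-deadlock on a non-recursive lock. A minimal userspace sketch of that pattern (illustrative only — a pthread mutex standing in for the kernel spinlock, none of this is the ext4 code):

```c
#include <errno.h>
#include <pthread.h>

/* Re-acquiring a non-recursive lock in the same task: the pattern the
 * lockdep report above flags for s_es_lru_lock.  In the kernel a second
 * spin_lock() would spin forever; here trylock makes the self-deadlock
 * observable as an error code instead of a hang. */
int relock_same_thread(void)
{
	pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	int ret;

	pthread_mutex_lock(&lock);           /* first acquisition succeeds   */
	ret = pthread_mutex_trylock(&lock);  /* second one fails with EBUSY; */
	                                     /* a blocking lock() would hang */
	pthread_mutex_unlock(&lock);
	return ret;
}
```

This is exactly why reordering (or merging) the patches resolves the bisect hazard: no tree state should ever call back into a path that re-takes the lock it already holds.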

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  2014-11-23  5:43   ` Theodore Ts'o
@ 2014-11-24  9:34     ` Jan Kara
  2014-11-25 14:57       ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2014-11-24  9:34 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, linux-ext4, Zheng Liu, Zheng Liu, Andreas Dilger

On Sun 23-11-14 00:43:35, Ted Tso wrote:
> On Fri, Nov 14, 2014 at 11:45:52AM +0100, Jan Kara wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > Currently the extent status tree doesn't cache an extent hole when a
> > write looks up the extent tree to determine whether a block has been
> > allocated.  We don't put the extent hole in the extent cache because
> > the extent might later be removed and a new delayed extent added
> > back.  But this causes a defect when we do a lot of writes: without
> > the hole cached, every subsequent write must walk the extent tree
> > again to check whether a block has been allocated, resulting in cache
> > misses.  This commit fixes the defect by caching the hole.  If the
> > inode doesn't have any extents, the hole is cached as well.
> > 
> > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
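The negative-caching idea the commit message describes — remembering holes as well as allocated extents, so repeated writes into unallocated ranges don't re-walk the extent tree — can be sketched in userspace like this (all names here are hypothetical stand-ins, not the ext4 API):

```c
#include <stdbool.h>

/* Toy direct-mapped cache over an "extent tree" lookup.  The point of
 * the sketch: a negative result (a hole) is cached just like a positive
 * one, so repeat queries for the same block skip the expensive walk. */

enum { NCACHE = 16 };
static struct { int block; bool allocated; bool valid; } cache[NCACHE];
static int tree_lookups;                 /* counts expensive tree walks */

static bool tree_is_allocated(int block) /* stand-in for the tree walk  */
{
	tree_lookups++;
	return block % 2 == 0;           /* arbitrary toy block layout  */
}

static bool is_allocated(int block)
{
	int slot = block % NCACHE;

	if (cache[slot].valid && cache[slot].block == block)
		return cache[slot].allocated;  /* hit: holes count too  */

	cache[slot].block = block;
	cache[slot].allocated = tree_is_allocated(block);
	cache[slot].valid = true;
	return cache[slot].allocated;
}
```

Without the `valid` entry for holes, every write into an unallocated range would increment `tree_lookups` — the cache-miss cost the patch removes.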
> This patch (either when applied alone, or as part of the entire patch
> series) is causing an xfstests regression test failure:
  I saw this in my testing as well but when investigating I somehow decided
it wasn't my bug and then forgot about it. Anyway, I'll check again and
resolve it either way.

								Honza

> FSTYP         -- ext4
> PLATFORM      -- Linux/i686 kvm-xfstests 3.18.0-rc3-00537-g932e3f9
> MKFS_OPTIONS  -- -q /dev/vdc
> MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc
> 
> ext4/003 4s ...	[00:14:39] [00:14:48] - output mismatch (see /results/results-4k/ext4/003.out.bad)
>     --- tests/ext4/003.out	2014-10-31 10:13:03.000000000 -0400
>     +++ /results/results-4k/ext4/003.out.bad	2014-11-23 00:14:48.935587897 -0500
>     @@ -1,3 +1,3 @@
>      QA output created by 003
>     -wrote 268435456/268435456 bytes at offset 0
>     +wrote 241762304/268435456 bytes at offset 0
>      XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>     ...
>     (Run 'diff -u tests/ext4/003.out /results/results-4k/ext4/003.out.bad'  to see the entire diff)
> 
> 				- Ted
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  2014-11-24  9:34     ` Jan Kara
@ 2014-11-25 14:57       ` Jan Kara
  2014-11-25 16:58         ` Theodore Ts'o
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2014-11-25 14:57 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, linux-ext4, Zheng Liu, Zheng Liu, Andreas Dilger

On Mon 24-11-14 10:34:34, Jan Kara wrote:
> On Sun 23-11-14 00:43:35, Ted Tso wrote:
> > On Fri, Nov 14, 2014 at 11:45:52AM +0100, Jan Kara wrote:
> > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > 
> > > Currently the extent status tree doesn't cache an extent hole when a
> > > write looks up the extent tree to determine whether a block has been
> > > allocated.  We don't put the extent hole in the extent cache because
> > > the extent might later be removed and a new delayed extent added
> > > back.  But this causes a defect when we do a lot of writes: without
> > > the hole cached, every subsequent write must walk the extent tree
> > > again to check whether a block has been allocated, resulting in cache
> > > misses.  This commit fixes the defect by caching the hole.  If the
> > > inode doesn't have any extents, the hole is cached as well.
> > > 
> > > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > > Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> > > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > 
> > This patch (either when applied alone, or as part of the entire patch
> > series) is causing an xfstests regression test failure:
>   I saw this in my testing as well but when investigating I somehow decided
> it wasn't my bug and then forgot about it. Anyway, I'll check again and
> resolve it either way.
  So it really wasn't my bug but a problem in ext4_da_map_blocks() which
was made much more visible by this patch. Anyway, I've fixed it and
resubmitted the series.

								Honza

> > FSTYP         -- ext4
> > PLATFORM      -- Linux/i686 kvm-xfstests 3.18.0-rc3-00537-g932e3f9
> > MKFS_OPTIONS  -- -q /dev/vdc
> > MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc
> > 
> > ext4/003 4s ...	[00:14:39] [00:14:48] - output mismatch (see /results/results-4k/ext4/003.out.bad)
> >     --- tests/ext4/003.out	2014-10-31 10:13:03.000000000 -0400
> >     +++ /results/results-4k/ext4/003.out.bad	2014-11-23 00:14:48.935587897 -0500
> >     @@ -1,3 +1,3 @@
> >      QA output created by 003
> >     -wrote 268435456/268435456 bytes at offset 0
> >     +wrote 241762304/268435456 bytes at offset 0
> >      XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >     ...
> >     (Run 'diff -u tests/ext4/003.out /results/results-4k/ext4/003.out.bad'  to see the entire diff)
> > 
> > 				- Ted
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  2014-11-25 14:57       ` Jan Kara
@ 2014-11-25 16:58         ` Theodore Ts'o
  0 siblings, 0 replies; 13+ messages in thread
From: Theodore Ts'o @ 2014-11-25 16:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Zheng Liu, Zheng Liu, Andreas Dilger

On Tue, Nov 25, 2014 at 03:57:14PM +0100, Jan Kara wrote:
>   So it really wasn't my bug but a problem in ext4_da_map_blocks() which
> was made much more visible by this patch. Anyway, I've fixed it and
> resubmitted the series.

Many thanks!  I'm running your updated series through my tests
right now.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-11-25 16:58 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-14 10:45 [PATCH 0/6 v3] ext4: Extent status tree shrinker improvements Jan Kara
2014-11-14 10:45 ` [PATCH 1/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks() Jan Kara
2014-11-23  5:43   ` Theodore Ts'o
2014-11-24  9:34     ` Jan Kara
2014-11-25 14:57       ` Jan Kara
2014-11-25 16:58         ` Theodore Ts'o
2014-11-14 10:45 ` [PATCH 2/6] ext4: Move LRU list handling into extent status code Jan Kara
2014-11-23  5:48   ` Theodore Ts'o
2014-11-24  9:32     ` Jan Kara
2014-11-14 10:45 ` [PATCH 3/6] ext4: change lru to round-robin in extent status tree shrinker Jan Kara
2014-11-14 10:45 ` [PATCH 4/6] ext4: Limit number of scanned extents in " Jan Kara
2014-11-14 10:45 ` [PATCH 5/6] ext4: Cleanup flag definitions for extent status tree Jan Kara
2014-11-14 10:45 ` [PATCH 6/6] ext4: Introduce aging to " Jan Kara
