* [PATCH v3 0/6] ext4: extents status tree shrinker improvement
@ 2014-08-07  3:35 Zheng Liu
  2014-08-07  3:35 ` [PATCH v3 1/6] ext4: improve extents status tree trace point Zheng Liu
                   ` (6 more replies)
  0 siblings, 7 replies; 33+ messages in thread
From: Zheng Liu @ 2014-08-07  3:35 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o, Andreas Dilger, Jan Kara

Hi all,

Intro
=====

Here is the third version of the extent status tree shrinker
improvement.  Sorry for the very late work.  Thanks to Ted's and
Dave's suggestions, I have revised the series.  The extent status
tree shrinker has two issues that we should fix.  One is how to
reclaim objects from the tree as quickly as possible, and the other
is how to keep as many useful extent caches in the tree as possible.
The first issue can be further divided into two problems: a) how to
scan the list that tracks all inodes, and b) how to reclaim objects
from an inode.  This patch set tries to fix these issues.

In this patch set, the first two patches just do some cleanups and
add some statistics for measuring the improvements.

Patch 3 makes the extent status tree cache extent holes in the
delalloc code path to improve the cache hit ratio.  Currently we do
not cache an extent hole when we do a delalloc write because this
hole might soon be converted into a delayed extent.  But the drawback
is that we can miss some extent holes in the extent cache, which
means that subsequent writes must look up the extent tree again to
determine whether a block has been allocated or not.
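To illustrate the effect, here is a minimal userspace sketch (names
and structure are illustrative only, not the ext4 code): when holes
are not cached, every write to an unallocated block walks the extent
tree again; once holes are cached, the second lookup is a cache hit.

```c
#include <assert.h>

/* Illustrative userspace sketch only, not the ext4 code: per-block
 * cached status, where ES_NONE means "nothing cached for this block". */
enum { ES_NONE, ES_HOLE };

static int tree_lookups;	/* counts (expensive) extent tree walks */

/* Resolve one block: consult the status cache first; on a miss, walk
 * the extent tree and, when enabled, cache the hole that was found. */
static int map_block(int *cache, int blk, int cache_holes)
{
	if (cache[blk] != ES_NONE)
		return cache[blk];	/* cache hit: no tree walk */
	tree_lookups++;			/* cache miss: walk the tree */
	if (cache_holes)
		cache[blk] = ES_HOLE;	/* patch 3: remember holes too */
	return ES_HOLE;
}
```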

Patch 4 makes the shrinker replace the lru list with a plain list to
track all inodes, and scan this list with a round-robin algorithm.
The reason for discarding the lru list in favor of round-robin is
that we no longer need to spend time maintaining the lru ordering,
which saves scan time.  Meanwhile this commit shrinks the critical
section, making the locking more fine-grained so that other processes
do not need to wait on the lock for a long time.  The improvement can
be seen in test case (b).
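The round-robin idea can be sketched in userspace like this (an
illustrative sketch, not the kernel code): a saved cursor lets each
scan resume where the previous one stopped.

```c
#include <assert.h>

/* Illustrative userspace sketch, not the kernel code: a round-robin
 * cursor over a fixed array standing in for the per-sb inode list.
 * Each call resumes where the last scan stopped, so no single pass
 * needs to sort the list or hold a lock while walking all of it. */
struct scan_state {
	int cursor;		/* next inode to visit */
	int nr_inodes;
};

/* Visit up to 'budget' inodes from the saved cursor, recording their
 * indices in 'visited'; returns how many inodes were visited. */
static int rr_scan(struct scan_state *ss, int budget, int *visited)
{
	int n = 0;

	while (budget-- > 0 && ss->nr_inodes > 0) {
		visited[n++] = ss->cursor;
		ss->cursor = (ss->cursor + 1) % ss->nr_inodes;	/* wrap */
	}
	return n;
}
```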

Patch 5 uses a list to track all reclaimable objects in an inode to
speed up reclaim.  Currently, when the shrinker tries to reclaim some
objects, it must scan the inode's rb-tree, which also contains
non-reclaimable objects (delayed extents, which ext4 uses to
implement seeking data/hole, finding delayed ranges, etc.).  That
means the shrinker has to spend time skipping these non-reclaimable
objects during the scan.  After applying this patch the shrinker can
reclaim objects directly.  The improvement can be seen in test case
(a).
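A minimal userspace sketch of the idea (illustrative names, not the
kernel data structures): extents live in a per-inode tree, but only
non-delayed ones are also linked on a reclaim list, so the shrinker
pops reclaimable objects directly instead of skipping delayed ones.

```c
#include <assert.h>

/* Illustrative userspace sketch, not the kernel code. */
enum { ES_WRITTEN, ES_DELAYED };

struct es_tree {
	int status[16];		/* all extents (stands in for the rb-tree) */
	int nr;
	int rlist[16];		/* indices of reclaimable extents only */
	int rlist_nr;
};

static void es_insert(struct es_tree *t, int status)
{
	t->status[t->nr] = status;
	if (status != ES_DELAYED)	/* delayed extents stay off the list */
		t->rlist[t->rlist_nr++] = t->nr;
	t->nr++;
}

/* Reclaim up to 'nr_to_scan' objects by walking only the reclaim list. */
static int es_reclaim(struct es_tree *t, int nr_to_scan)
{
	int nr_shrunk = 0;

	while (nr_to_scan-- > 0 && t->rlist_nr > 0) {
		t->rlist_nr--;		/* O(1) per object, nothing skipped */
		nr_shrunk++;
	}
	return nr_shrunk;
}
```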

Patch 6 improves the list that tracks all reclaimable objects in
order to raise the cache hit ratio.  After patch 5 is applied, the
extent cache can be flushed by a streaming workload because we have
no way to recognize which entries should be kept.  This commit splits
the list into an active list and an inactive list, and defines a new
flag called '_ACCESSED'.  When an extent cache entry is accessed,
this flag is set.  When the shrinker encounters the flag while
scanning the list, the entry is moved to the tail of the active list.
Entries on the active list are reclaimed only when we are under high
memory pressure and the shrinker did not reclaim any object from the
inactive list in the first round.  The improvement can be seen in
test case (c).
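The promotion step is a classic second-chance scheme; here is a
minimal userspace sketch of it (illustrative only, not the kernel
code): lookups just set a flag, and the shrinker does the promotion,
which keeps the lookup fast path cheap.

```c
#include <assert.h>

/* Illustrative userspace sketch, not the kernel code. */
enum { INACTIVE, ACTIVE };

struct cache_entry {
	int accessed;		/* set on access, like the _ACCESSED flag */
	int list;		/* which list the entry currently sits on */
};

static void entry_touch(struct cache_entry *e)
{
	e->accessed = 1;
}

/* One pass over the inactive entries: accessed entries are promoted
 * to the active list instead of being reclaimed; cold ones go away. */
static int shrink_pass(struct cache_entry *ent, int n)
{
	int nr_shrunk = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (ent[i].list != INACTIVE)
			continue;
		if (ent[i].accessed) {
			ent[i].accessed = 0;
			ent[i].list = ACTIVE;	/* second chance */
		} else {
			nr_shrunk++;		/* cold entry: reclaim */
		}
	}
	return nr_shrunk;
}
```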

Environment
===========
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    4
CPU socket(s):         2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               2400.000
BogoMIPS:              4799.89
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-3,8-11
NUMA node1 CPU(s):     4-7,12-15

$ cat /proc/meminfo
MemTotal:       24677988 kB

$ df -ah
/dev/sdb1             453G   51G  380G  12% /mnt/sdb1 (HDD)

Test Case
=========

Test Case (a)
-------------
Test case (a) creates a huge number of extent cache entries in 500
inodes.  Running this test produces a very big rb-tree in every
inode.  The purpose is to measure the latency when the shrinker
reclaims objects from an inode.

Fio script:

[global]
ioengine=psync
bs=4k
directory=/mnt/sdb1
group_reporting
fallocate=0
direct=0
filesize=100000g
size=600000g
runtime=300
create_on_open=1
create_serialize=0
create_fsync=0
norandommap

[io]
rw=write
numjobs=100
nrfiles=5

Test Case (b)
-------------
Test case (b) creates a very long inode list in the super block so
that we can measure the latency when the shrinker scans the list.

Fio script:

[global]
ioengine=psync
bs=4k
directory=/mnt/sdb1
group_reporting
fallocate=0
direct=0
runtime=300
create_on_open=1
create_serialize=0
create_fsync=0
norandommap

[io]
filesize=100000g
size=600000g
rw=write
numjobs=100
nrfiles=5
openfiles=10000

[rand]
filesize=1000g
size=600000g
rw=write:4k
numjobs=20
nrfiles=20000

Test Case (c)
-------------
Test case (c) measures the cache hits/misses.

*NOTE*
I reduced the memory to 12g in order to make the shrinker work more
aggressively.

Fio script:

[global]
ioengine=psync
bs=4k
directory=/mnt/sdb1
group_reporting
direct=0
runtime=300

[read]
rw=read
numjobs=1
nrfiles=50
filesize=1g
size=600000g

[stream]
filesize=1000g
size=600000g
rw=write
numjobs=100
nrfiles=5
create_on_open=1
create_serialize=0
create_fsync=0

Note 1
------
*To get a very fragmented extent status tree, I used the following
patch to disallow merging of extent caches*

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index b6c3366..f8c693e 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -351,6 +351,7 @@ static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
 static int ext4_es_can_be_merged(struct extent_status *es1,
                                 struct extent_status *es2)
 {
+#if 0
        if (ext4_es_status(es1) != ext4_es_status(es2))
                return 0;

@@ -376,6 +377,7 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
        /* we need to check delayed extent is without unwritten status */
        if (ext4_es_is_delayed(es1) && !ext4_es_is_unwritten(es1))
                return 1;
+#endif

        return 0;
 }

Note 2
------
To cache as much data in memory as possible, two sysctl parameters
are changed:
 sudo sysctl vm.dirty_ratio=80
 sudo sysctl vm.dirty_background_ratio=60

Result
======
For every test case we collect 5 metrics:
 1. The shrinker avg. scan time
 2. The shrinker max scan time
 3. The extent cache hit/miss
 4. The userspace avg. latency
 5. The userspace max latency

Because fio is used to drive the tests, it is easy to collect the
userspace latency.

For review, the kernels under test are abbreviated as below:
 * baseline: ext4/dev branch + patch 1, 2
 * es-cache: baseline + patch 3
 * es-slist: es-cache + patch 4
 * es-ilist: es-slist + patch 5
 * es-gc:    es-ilist + patch 6

Every test case is run 3 times.

1. es avg. scan time

test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3          9402         11552          9633     10195.667      1180.284
+   3          7069         10358          9931     9119.3333     1788.4301
*   3          3754          6854          4480     5029.3333     1621.3653
%   3           190           243           204     212.33333     27.465129
#   3           269           513           290     357.33333     135.21957

test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3         23047         36274         30195     29838.667     6620.6958
+   3          2992         32867         14370         16743     15078.205
*   3           124          1281           173           526     654.30803
%   3           133           157           136           142     13.076697
#   3           124           315           140           193     105.95754

test case (c)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3           137           235           163     178.33333     50.767444
+   3            57          9333          6383     5257.6667     4739.2853
*   3            48            54            49     50.333333     3.2145503
%   3            52            57            53            54     2.6457513
#   3            49            88            69     68.666667     19.502137

2. es max scan time

test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3         17503         22339         19329     19723.667     2442.0371
+   3         26275         33930         31867     30690.667     3960.7545
*   3        170970        199678        188287     186311.67     14455.579
%   3           325           405           382     370.66667     41.186567
#   3           500           857           790     715.66667     189.75335

test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3        361903        384286        380877     375688.67       12059.8
+   3        277470        313231        293883     294861.33     17900.562
*   3        112727        131888        114761        119792     10524.695
%   3         59466         76178         67195         67613     8363.8376
#   3         38747         84505         45682     56311.333     24661.421

test case (c)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3           337           462           355     384.66667      67.57465
+   3         22136         31236         24031         25801     4801.2681
*   3         36105         56888         54150     49047.667     11291.972
%   3           152           191           158           167            21
#   3           115           154           139           136     19.672316

3. cache hit/miss

test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3          4673          4858          4832     4787.6667     100.15155
+   3       7911368       8364372       8263210       8179650     237781.12
*   3       6657959       6900137       6830922     6796339.3     124737.79
%   3       6525957       6691967       6535884     6584602.7     93112.627
#   3      10061008      10704728      10501548      10422428      329072.7

test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3       1645455       1648083       1646239     1646592.3     1349.1588
+   3       8531336       9081690       8570763     8727929.7     306999.03
*   3       6591288       6643048       6595983     6610106.3     28624.741
%   3       6501434       6556700       6537840     6531991.3     28093.378
#   3       9656351       9801225       9739132       9732236      72682.77

test case (c)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3         10873         18200         15614     14895.667     3715.9433
+   3       1781010       1925331       1888470       1864937      74983.26
*   3       1697486       1929501       1765950     1797645.7     119210.74
%   3       1703674       1870348       1743376       1772466     87061.626
#   3       1687157       1953796       1774275       1805076     135961.82

4. userspace avg. latency (usec)

test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3       2638.64       2798.02       2649.18       2695.28     89.131384
+   3       2641.12       2817.16       2648.18     2702.1533     99.661231
*   3       2637.54       2812.82       2642.68       2697.68     99.747279
%   3       2850.76       2859.45       2852.68     2854.2967     4.5650009
#   3       2643.07       2817.62       2653.54     2704.7433     97.894135

test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3        3561.9        3580.1       3563.47       3568.49     10.085152
+   3       3596.12       3631.57        3600.2     3609.2967     19.396846
*   3        3565.4       3605.12       3585.48     3585.3333     19.860406
%   3       3577.07       3597.76       3588.12       3587.65     10.353004
#   3       3593.63        3626.3       3602.66       3607.53     16.870682

test case (c)
-------------
read:
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3        165.46        200.33        166.31     177.36667     19.891371
+   3        165.11        218.36        165.16     182.87667     30.729478
*   3        165.14        199.36        165.36        176.62     19.693725
%   3        165.45        197.07        166.32        176.28     18.009922
#   3        165.13        228.35        165.71     186.39667      36.33381

write:
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3      13804.25      15932.47      15148.61     14961.777     1076.3411
+   3      13807.94      15922.01      15114.67     14948.207     1066.8203
*   3      13808.51      15936.08      15083.47     14942.687      1070.749
%   3      13807.08       15290.5      15093.27     14730.283     805.57632
#   3      13809.13      15972.27       15308.6         15030     1108.1548

5. userspace max latency (usec)

test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3        849941       2720800        941754       1504165     1054636.4
+   3       1080500       2199500       1090500     1456833.3     643187.63
*   3        863204       2663400        920104       1482236     1023313.6
%   3        879189       1156700        991861       1009250     139570.31
#   3        896181       1692100        984983       1191088     436155.04

test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3       1293500       4917500       3935800     3382266.7     1874338.1
+   3       1027900       3232200       1505500     1921866.7     1159635.9
*   3       1025500       1109300       1033500       1056100     46245.865
%   3        983719       1188300       1119400     1097139.7     104091.25
#   3        869236       1118400        992344     993326.67     124584.91

test case (c)
-------------
read:
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3        147107       3594200        566804       1436037     1880767.7
+   3        124889       3586600        402357       1371282     1923531.3
*   3        299597       3578300        327797       1401898     1884872.2
%   3        406603       3587600        729393       1574532     1750822.8
#   3        301480       3583600        997081       1627387       1729463

write:
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
    N           Min           Max        Median           Avg        Stddev
x   3      28892400      29715800      29536400      29381533     432994.98
+   3      28726600      29821200      29530000      29359267     566921.24
*   3      28859400      29681400      29528700      29356500      437219.2
%   3      28472300      29819700      28728200      29006733     715581.79
#   3      28807200      29681900      29516100      29335067     464601.79

Conclusion
==========
From the kernel's point of view, both the max scan time and the avg.
scan time are improved.  Meanwhile the cache hit ratio is also
increased.

From userspace's point of view, the max latency of test cases (a) and
(b) is improved.  The other results are not remarkable.

As always, any comments or feedback are welcome.

Thanks,
						- Zheng

Zheng Liu (6):
  ext4: improve extents status tree trace point
  ext4: track extent status tree shrinker delay statistics
  ext4: cache extent hole in extent status tree for
    ext4_da_map_blocks()
  ext4: change lru to round-robin in extent status tree shrinker
  ext4: use a list to track all reclaimable objects for extent status
    tree
  ext4: use a garbage collection algorithm to manage objects

 fs/ext4/ext4.h              |   18 +-
 fs/ext4/extents.c           |   27 ++-
 fs/ext4/extents_status.c    |  430 ++++++++++++++++++++++++++++++-------------
 fs/ext4/extents_status.h    |   47 ++++-
 fs/ext4/inode.c             |   10 +-
 fs/ext4/ioctl.c             |    4 +-
 fs/ext4/super.c             |   18 +-
 include/trace/events/ext4.h |   59 +++++-
 8 files changed, 419 insertions(+), 194 deletions(-)

-- 
1.7.9.7


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 1/6] ext4: improve extents status tree trace point
  2014-08-07  3:35 [PATCH v3 0/6] ext4: extents status tree shrinker improvement Zheng Liu
@ 2014-08-07  3:35 ` Zheng Liu
  2014-09-02  2:25   ` Theodore Ts'o
  2014-08-07  3:35 ` [PATCH v3 2/6] ext4: track extent status tree shrinker delay statistics Zheng Liu
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Zheng Liu @ 2014-08-07  3:35 UTC (permalink / raw)
  To: linux-ext4; +Cc: Theodore Ts'o, Andreas Dilger, Jan Kara, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

This commit improves the trace points of the extents status tree.  We
rename trace_ext4_es_shrink_enter in ext4_es_count() because it is
also used in ext4_es_scan(), and we cannot tell the two callers apart
in the trace output.

Further, this commit fixes a variable name in a trace point in order
to keep it consistent with the others.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents_status.c    |    6 +++---
 include/trace/events/ext4.h |   28 ++++++++++++++++++++--------
 2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 0b7e28e..47d60dc 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1021,7 +1021,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
 
 	sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
 	nr = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
-	trace_ext4_es_shrink_enter(sbi->s_sb, sc->nr_to_scan, nr);
+	trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
 	return nr;
 }
 
@@ -1034,14 +1034,14 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
 	int ret, nr_shrunk;
 
 	ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
-	trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan, ret);
+	trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);
 
 	if (!nr_to_scan)
 		return ret;
 
 	nr_shrunk = __ext4_es_shrink(sbi, nr_to_scan, NULL);
 
-	trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk, ret);
+	trace_ext4_es_shrink_scan_exit(sbi->s_sb, nr_shrunk, ret);
 	return nr_shrunk;
 }
 
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index d4f70a7..849aaba 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2369,7 +2369,7 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
 		  show_extent_status(__entry->found ? __entry->status : 0))
 );
 
-TRACE_EVENT(ext4_es_shrink_enter,
+DECLARE_EVENT_CLASS(ext4__es_shrink_enter,
 	TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
 
 	TP_ARGS(sb, nr_to_scan, cache_cnt),
@@ -2391,26 +2391,38 @@ TRACE_EVENT(ext4_es_shrink_enter,
 		  __entry->nr_to_scan, __entry->cache_cnt)
 );
 
-TRACE_EVENT(ext4_es_shrink_exit,
-	TP_PROTO(struct super_block *sb, int shrunk_nr, int cache_cnt),
+DEFINE_EVENT(ext4__es_shrink_enter, ext4_es_shrink_count,
+	TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
+
+	TP_ARGS(sb, nr_to_scan, cache_cnt)
+);
+
+DEFINE_EVENT(ext4__es_shrink_enter, ext4_es_shrink_scan_enter,
+	TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
+
+	TP_ARGS(sb, nr_to_scan, cache_cnt)
+);
+
+TRACE_EVENT(ext4_es_shrink_scan_exit,
+	TP_PROTO(struct super_block *sb, int nr_shrunk, int cache_cnt),
 
-	TP_ARGS(sb, shrunk_nr, cache_cnt),
+	TP_ARGS(sb, nr_shrunk, cache_cnt),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
-		__field(	int,	shrunk_nr		)
+		__field(	int,	nr_shrunk		)
 		__field(	int,	cache_cnt		)
 	),
 
 	TP_fast_assign(
 		__entry->dev		= sb->s_dev;
-		__entry->shrunk_nr	= shrunk_nr;
+		__entry->nr_shrunk	= nr_shrunk;
 		__entry->cache_cnt	= cache_cnt;
 	),
 
-	TP_printk("dev %d,%d shrunk_nr %d cache_cnt %d",
+	TP_printk("dev %d,%d nr_shrunk %d cache_cnt %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->shrunk_nr, __entry->cache_cnt)
+		  __entry->nr_shrunk, __entry->cache_cnt)
 );
 
 TRACE_EVENT(ext4_collapse_range,
-- 
1.7.9.7



* [PATCH v3 2/6] ext4: track extent status tree shrinker delay statistics
  2014-08-07  3:35 [PATCH v3 0/6] ext4: extents status tree shrinker improvement Zheng Liu
  2014-08-07  3:35 ` [PATCH v3 1/6] ext4: improve extents status tree trace point Zheng Liu
@ 2014-08-07  3:35 ` Zheng Liu
  2014-08-27 13:26   ` Jan Kara
  2014-08-07  3:35 ` [PATCH v3 3/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks() Zheng Liu
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Zheng Liu @ 2014-08-07  3:35 UTC (permalink / raw)
  To: linux-ext4; +Cc: Theodore Ts'o, Andreas Dilger, Jan Kara, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

This commit adds some statistics to the extent status tree shrinker.
The purpose of adding these is to collect more details when we
encounter a stall caused by the extent status tree shrinker.  Here we
count the following statistics:
  stats:
    the number of all objects on all extent status trees
    the number of reclaimable objects on the lru list
    cache hits/misses
    the last sorted interval
    the number of inodes on the lru list
  average:
    scan time for shrinking some objects
    the number of shrunk objects
  maximum:
    the inode that has the max nr. of objects on the lru list
    the maximum scan time for shrinking some objects

The output looks like the following:
  $ cat /proc/fs/ext4/sda1/es_shrinker_info
  stats:
    28228 objects
    6341 reclaimable objects
    5281/631 cache hits/misses
    586 ms last sorted interval
    250 inodes on lru list
  average:
    153 us scan time
    128 shrunk objects
  maximum:
    255 inode (255 objects, 198 reclaimable)
    125723 us max scan time

If the lru list has never been sorted, the following line will not be
printed:
    586 ms last sorted interval
If the lru list is empty, the following lines will not be printed
either:
    250 inodes on lru list
  ...
  maximum:
    255 inode (255 objects, 198 reclaimable)
    0 us max scan time
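For reference, the "average" lines are maintained as a weighted
moving average matching the update done in __ext4_es_shrink() in this
patch: the first sample seeds the average, and each later sample
contributes a 1/4 weight.  A minimal userspace sketch:

```c
#include <assert.h>

/* Moving-average update used for es_stats_scan_time and
 * es_stats_shrunk: the first sample seeds the average, and every
 * later sample contributes a 1/4 weight. */
static unsigned long long avg_update(unsigned long long avg,
				     unsigned long long sample)
{
	return avg ? (sample + avg * 3) / 4 : sample;
}
```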

Meanwhile this commit defines a new trace point to print some details
from __ext4_es_shrink().

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/ext4.h              |    4 +-
 fs/ext4/extents_status.c    |  186 ++++++++++++++++++++++++++++++++++++++++---
 fs/ext4/extents_status.h    |   13 ++-
 fs/ext4/super.c             |   11 +--
 include/trace/events/ext4.h |   31 ++++++++
 5 files changed, 224 insertions(+), 21 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5b19760..6f294d3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -890,6 +890,7 @@ struct ext4_inode_info {
 	struct ext4_es_tree i_es_tree;
 	rwlock_t i_es_lock;
 	struct list_head i_es_lru;
+	unsigned int i_es_all_nr;	/* protected by i_es_lock */
 	unsigned int i_es_lru_nr;	/* protected by i_es_lock */
 	unsigned long i_touch_when;	/* jiffies of last accessing */
 
@@ -1330,8 +1331,7 @@ struct ext4_sb_info {
 	/* Reclaim extents from extent status tree */
 	struct shrinker s_es_shrinker;
 	struct list_head s_es_lru;
-	unsigned long s_es_last_sorted;
-	struct percpu_counter s_extent_cache_cnt;
+	struct ext4_es_stats s_es_stats;
 	struct mb_cache *s_mb_cache;
 	spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
 
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 47d60dc..b6c3366 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -11,6 +11,8 @@
  */
 #include <linux/rbtree.h>
 #include <linux/list_sort.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
 #include "ext4.h"
 #include "extents_status.h"
 
@@ -313,19 +315,27 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
 	 */
 	if (!ext4_es_is_delayed(es)) {
 		EXT4_I(inode)->i_es_lru_nr++;
-		percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
+		percpu_counter_inc(&EXT4_SB(inode->i_sb)->
+					s_es_stats.es_stats_lru_cnt);
 	}
 
+	EXT4_I(inode)->i_es_all_nr++;
+	percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
+
 	return es;
 }
 
 static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
 {
+	EXT4_I(inode)->i_es_all_nr--;
+	percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
+
 	/* Decrease the lru counter when this es is not delayed */
 	if (!ext4_es_is_delayed(es)) {
 		BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
 		EXT4_I(inode)->i_es_lru_nr--;
-		percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
+		percpu_counter_dec(&EXT4_SB(inode->i_sb)->
+					s_es_stats.es_stats_lru_cnt);
 	}
 
 	kmem_cache_free(ext4_es_cachep, es);
@@ -731,6 +741,7 @@ int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
 			  struct extent_status *es)
 {
 	struct ext4_es_tree *tree;
+	struct ext4_es_stats *stats;
 	struct extent_status *es1 = NULL;
 	struct rb_node *node;
 	int found = 0;
@@ -767,11 +778,15 @@ int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
 	}
 
 out:
+	stats = &EXT4_SB(inode->i_sb)->s_es_stats;
 	if (found) {
 		BUG_ON(!es1);
 		es->es_lblk = es1->es_lblk;
 		es->es_len = es1->es_len;
 		es->es_pblk = es1->es_pblk;
+		stats->es_stats_cache_hits++;
+	} else {
+		stats->es_stats_cache_misses++;
 	}
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
@@ -933,11 +948,16 @@ static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
 			    struct ext4_inode_info *locked_ei)
 {
 	struct ext4_inode_info *ei;
+	struct ext4_es_stats *es_stats;
 	struct list_head *cur, *tmp;
 	LIST_HEAD(skipped);
+	ktime_t start_time;
+	u64 scan_time;
 	int nr_shrunk = 0;
 	int retried = 0, skip_precached = 1, nr_skipped = 0;
 
+	es_stats = &sbi->s_es_stats;
+	start_time = ktime_get();
 	spin_lock(&sbi->s_es_lru_lock);
 
 retry:
@@ -948,7 +968,8 @@ retry:
 		 * If we have already reclaimed all extents from extent
 		 * status tree, just stop the loop immediately.
 		 */
-		if (percpu_counter_read_positive(&sbi->s_extent_cache_cnt) == 0)
+		if (percpu_counter_read_positive(
+				&es_stats->es_stats_lru_cnt) == 0)
 			break;
 
 		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
@@ -958,7 +979,7 @@ retry:
 		 * time.  Normally we try hard to avoid shrinking
 		 * precached inodes, but we will as a last resort.
 		 */
-		if ((sbi->s_es_last_sorted < ei->i_touch_when) ||
+		if ((es_stats->es_stats_last_sorted < ei->i_touch_when) ||
 		    (skip_precached && ext4_test_inode_state(&ei->vfs_inode,
 						EXT4_STATE_EXT_PRECACHED))) {
 			nr_skipped++;
@@ -992,7 +1013,7 @@ retry:
 	if ((nr_shrunk == 0) && nr_skipped && !retried) {
 		retried++;
 		list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
-		sbi->s_es_last_sorted = jiffies;
+		es_stats->es_stats_last_sorted = jiffies;
 		ei = list_first_entry(&sbi->s_es_lru, struct ext4_inode_info,
 				      i_es_lru);
 		/*
@@ -1010,6 +1031,22 @@ retry:
 	if (locked_ei && nr_shrunk == 0)
 		nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
 
+	scan_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+	if (likely(es_stats->es_stats_scan_time))
+		es_stats->es_stats_scan_time = (scan_time +
+				es_stats->es_stats_scan_time*3) / 4;
+	else
+		es_stats->es_stats_scan_time = scan_time;
+	if (scan_time > es_stats->es_stats_max_scan_time)
+		es_stats->es_stats_max_scan_time = scan_time;
+	if (likely(es_stats->es_stats_shrunk))
+		es_stats->es_stats_shrunk = (nr_shrunk +
+				es_stats->es_stats_shrunk*3) / 4;
+	else
+		es_stats->es_stats_shrunk = nr_shrunk;
+
+	trace_ext4_es_shrink(sbi->s_sb, nr_shrunk, scan_time, skip_precached,
+			     nr_skipped, retried);
 	return nr_shrunk;
 }
 
@@ -1020,7 +1057,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
 	struct ext4_sb_info *sbi;
 
 	sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
-	nr = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
+	nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_lru_cnt);
 	trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
 	return nr;
 }
@@ -1033,7 +1070,7 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
 	int nr_to_scan = sc->nr_to_scan;
 	int ret, nr_shrunk;
 
-	ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
+	ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_lru_cnt);
 	trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);
 
 	if (!nr_to_scan)
@@ -1045,19 +1082,148 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
 	return nr_shrunk;
 }
 
-void ext4_es_register_shrinker(struct ext4_sb_info *sbi)
+static void *ext4_es_seq_shrinker_info_start(struct seq_file *seq, loff_t *pos)
+{
+	return *pos ? NULL : SEQ_START_TOKEN;
+}
+
+static void *
+ext4_es_seq_shrinker_info_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	return NULL;
+}
+
+static int ext4_es_seq_shrinker_info_show(struct seq_file *seq, void *v)
+{
+	struct ext4_sb_info *sbi = seq->private;
+	struct ext4_es_stats *es_stats = &sbi->s_es_stats;
+	struct ext4_inode_info *ei, *max = NULL;
+	unsigned int inode_cnt = 0;
+
+	if (v != SEQ_START_TOKEN)
+		return 0;
+
+	/* here we just find an inode that has the max nr. of objects */
+	spin_lock(&sbi->s_es_lru_lock);
+	list_for_each_entry(ei, &sbi->s_es_lru, i_es_lru) {
+		inode_cnt++;
+		if (max && max->i_es_all_nr < ei->i_es_all_nr)
+			max = ei;
+		else if (!max)
+			max = ei;
+	}
+	spin_unlock(&sbi->s_es_lru_lock);
+
+	seq_printf(seq, "stats:\n  %lld objects\n  %lld reclaimable objects\n",
+		   percpu_counter_sum_positive(&es_stats->es_stats_all_cnt),
+		   percpu_counter_sum_positive(&es_stats->es_stats_lru_cnt));
+	seq_printf(seq, "  %lu/%lu cache hits/misses\n",
+		   es_stats->es_stats_cache_hits,
+		   es_stats->es_stats_cache_misses);
+	if (es_stats->es_stats_last_sorted != 0)
+		seq_printf(seq, "  %u ms last sorted interval\n",
+			   jiffies_to_msecs(jiffies -
+					    es_stats->es_stats_last_sorted));
+	if (inode_cnt)
+		seq_printf(seq, "  %d inodes on lru list\n", inode_cnt);
+
+	seq_printf(seq, "average:\n  %llu us scan time\n",
+	    div_u64(es_stats->es_stats_scan_time, 1000));
+	seq_printf(seq, "  %lu shrunk objects\n", es_stats->es_stats_shrunk);
+	if (inode_cnt)
+		seq_printf(seq,
+		    "maximum:\n  %lu inode (%u objects, %u reclaimable)\n"
+		    "  %llu us max scan time\n",
+		    max->vfs_inode.i_ino, max->i_es_all_nr, max->i_es_lru_nr,
+		    div_u64(es_stats->es_stats_max_scan_time, 1000));
+
+	return 0;
+}
+
+static void ext4_es_seq_shrinker_info_stop(struct seq_file *seq, void *v)
+{
+}
+
+static const struct seq_operations ext4_es_seq_shrinker_info_ops = {
+	.start = ext4_es_seq_shrinker_info_start,
+	.next  = ext4_es_seq_shrinker_info_next,
+	.stop  = ext4_es_seq_shrinker_info_stop,
+	.show  = ext4_es_seq_shrinker_info_show,
+};
+
+static int
+ext4_es_seq_shrinker_info_open(struct inode *inode, struct file *file)
+{
+	int ret;
+
+	ret = seq_open(file, &ext4_es_seq_shrinker_info_ops);
+	if (!ret) {
+		struct seq_file *m = file->private_data;
+		m->private = PDE_DATA(inode);
+	}
+
+	return ret;
+}
+
+static int
+ext4_es_seq_shrinker_info_release(struct inode *inode, struct file *file)
+{
+	return seq_release(inode, file);
+}
+
+static const struct file_operations ext4_es_seq_shrinker_info_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ext4_es_seq_shrinker_info_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= ext4_es_seq_shrinker_info_release,
+};
+
+int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
 {
+	int err;
+
 	INIT_LIST_HEAD(&sbi->s_es_lru);
 	spin_lock_init(&sbi->s_es_lru_lock);
-	sbi->s_es_last_sorted = 0;
+	sbi->s_es_stats.es_stats_last_sorted = 0;
+	sbi->s_es_stats.es_stats_shrunk = 0;
+	sbi->s_es_stats.es_stats_cache_hits = 0;
+	sbi->s_es_stats.es_stats_cache_misses = 0;
+	sbi->s_es_stats.es_stats_scan_time = 0;
+	sbi->s_es_stats.es_stats_max_scan_time = 0;
+	err = percpu_counter_init(&sbi->s_es_stats.es_stats_all_cnt, 0);
+	if (err)
+		return err;
+	err = percpu_counter_init(&sbi->s_es_stats.es_stats_lru_cnt, 0);
+	if (err)
+		goto err1;
+
 	sbi->s_es_shrinker.scan_objects = ext4_es_scan;
 	sbi->s_es_shrinker.count_objects = ext4_es_count;
 	sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
-	register_shrinker(&sbi->s_es_shrinker);
+	err = register_shrinker(&sbi->s_es_shrinker);
+	if (err)
+		goto err2;
+
+	if (sbi->s_proc)
+		proc_create_data("es_shrinker_info", S_IRUGO, sbi->s_proc,
+				 &ext4_es_seq_shrinker_info_fops, sbi);
+
+	return 0;
+
+err2:
+	percpu_counter_destroy(&sbi->s_es_stats.es_stats_lru_cnt);
+err1:
+	percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
+	return err;
 }
 
 void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
 {
+	if (sbi->s_proc)
+		remove_proc_entry("es_shrinker_info", sbi->s_proc);
+	percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
+	percpu_counter_destroy(&sbi->s_es_stats.es_stats_lru_cnt);
 	unregister_shrinker(&sbi->s_es_shrinker);
 }
 
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index f1b62a4..efd5f97 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -64,6 +64,17 @@ struct ext4_es_tree {
 	struct extent_status *cache_es;	/* recently accessed extent */
 };
 
+struct ext4_es_stats {
+	unsigned long es_stats_last_sorted;
+	unsigned long es_stats_shrunk;
+	unsigned long es_stats_cache_hits;
+	unsigned long es_stats_cache_misses;
+	u64 es_stats_scan_time;
+	u64 es_stats_max_scan_time;
+	struct percpu_counter es_stats_all_cnt;
+	struct percpu_counter es_stats_lru_cnt;
+};
+
 extern int __init ext4_init_es(void);
 extern void ext4_exit_es(void);
 extern void ext4_es_init_tree(struct ext4_es_tree *tree);
@@ -138,7 +149,7 @@ static inline void ext4_es_store_pblock_status(struct extent_status *es,
 		       (pb & ~ES_MASK));
 }
 
-extern void ext4_es_register_shrinker(struct ext4_sb_info *sbi);
+extern int ext4_es_register_shrinker(struct ext4_sb_info *sbi);
 extern void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi);
 extern void ext4_es_lru_add(struct inode *inode);
 extern void ext4_es_lru_del(struct inode *inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 32b43ad..7a06eb4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -820,7 +820,6 @@ static void ext4_put_super(struct super_block *sb)
 	percpu_counter_destroy(&sbi->s_freeinodes_counter);
 	percpu_counter_destroy(&sbi->s_dirs_counter);
 	percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
-	percpu_counter_destroy(&sbi->s_extent_cache_cnt);
 	brelse(sbi->s_sbh);
 #ifdef CONFIG_QUOTA
 	for (i = 0; i < MAXQUOTAS; i++)
@@ -885,6 +884,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ext4_es_init_tree(&ei->i_es_tree);
 	rwlock_init(&ei->i_es_lock);
 	INIT_LIST_HEAD(&ei->i_es_lru);
+	ei->i_es_all_nr = 0;
 	ei->i_es_lru_nr = 0;
 	ei->i_touch_when = 0;
 	ei->i_reserved_data_blocks = 0;
@@ -3889,12 +3889,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	sbi->s_err_report.data = (unsigned long) sb;
 
 	/* Register extent status tree shrinker */
-	ext4_es_register_shrinker(sbi);
-
-	if ((err = percpu_counter_init(&sbi->s_extent_cache_cnt, 0)) != 0) {
-		ext4_msg(sb, KERN_ERR, "insufficient memory");
+	if (ext4_es_register_shrinker(sbi))
 		goto failed_mount3;
-	}
 
 	sbi->s_stripe = ext4_get_stripe_size(sbi);
 	sbi->s_extent_max_zeroout_kb = 32;
@@ -4224,10 +4220,9 @@ failed_mount_wq:
 		jbd2_journal_destroy(sbi->s_journal);
 		sbi->s_journal = NULL;
 	}
-failed_mount3:
 	ext4_es_unregister_shrinker(sbi);
+failed_mount3:
 	del_timer_sync(&sbi->s_err_report);
-	percpu_counter_destroy(&sbi->s_extent_cache_cnt);
 	if (sbi->s_mmp_tsk)
 		kthread_stop(sbi->s_mmp_tsk);
 failed_mount2:
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 849aaba..ff4bd1b 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2450,6 +2450,37 @@ TRACE_EVENT(ext4_collapse_range,
 		  __entry->offset, __entry->len)
 );
 
+TRACE_EVENT(ext4_es_shrink,
+	TP_PROTO(struct super_block *sb, int nr_shrunk, u64 scan_time,
+		 int skip_precached, int nr_skipped, int retried),
+
+	TP_ARGS(sb, nr_shrunk, scan_time, skip_precached, nr_skipped, retried),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,		dev		)
+		__field(	int,		nr_shrunk	)
+		__field(	unsigned long long, scan_time	)
+		__field(	int,		skip_precached	)
+		__field(	int,		nr_skipped	)
+		__field(	int,		retried		)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= sb->s_dev;
+		__entry->nr_shrunk	= nr_shrunk;
+		__entry->scan_time	= div_u64(scan_time, 1000);
+		__entry->skip_precached = skip_precached;
+		__entry->nr_skipped	= nr_skipped;
+		__entry->retried	= retried;
+	),
+
+	TP_printk("dev %d,%d nr_shrunk %d, scan_time %llu skip_precached %d "
+		  "nr_skipped %d retried %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->nr_shrunk,
+		  __entry->scan_time, __entry->skip_precached,
+		  __entry->nr_skipped, __entry->retried)
+);
+
 #endif /* _TRACE_EXT4_H */
 
 /* This part must be outside protection */
-- 
1.7.9.7



* [PATCH v3 3/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  2014-08-07  3:35 [PATCH v3 0/6] ext4: extents status tree shrinker improvement Zheng Liu
  2014-08-07  3:35 ` [PATCH v3 1/6] ext4: improve extents status tree trace point Zheng Liu
  2014-08-07  3:35 ` [PATCH v3 2/6] ext4: track extent status tree shrinker delay statistics Zheng Liu
@ 2014-08-07  3:35 ` Zheng Liu
  2014-08-27 13:55   ` Jan Kara
  2014-09-02  2:43   ` Theodore Ts'o
  2014-08-07  3:35 ` [PATCH v3 4/6] ext4: change lru to round-robin in extent status tree shrinker Zheng Liu
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 33+ messages in thread
From: Zheng Liu @ 2014-08-07  3:35 UTC (permalink / raw)
  To: linux-ext4; +Cc: Theodore Ts'o, Andreas Dilger, Jan Kara, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

Currently the extent status tree does not cache an extent hole when a
delalloc write looks up the extent tree to determine whether a block has
been allocated.  We skip caching the hole in that path because the hole
might soon be removed and a delayed extent added back in its place.  But
this hurts write-heavy workloads: because the hole is not in the extent
cache, every subsequent write must search the extent tree again to find
out whether the block has been allocated, causing repeated cache misses.
This commit fixes the problem by caching the hole.  In addition, if an
inode has no extents at all, the whole-file hole is now cached too.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/ext4.h              |    4 +---
 fs/ext4/extents.c           |   23 +++++++++--------------
 fs/ext4/inode.c             |    6 ++----
 include/trace/events/ext4.h |    3 +--
 4 files changed, 13 insertions(+), 23 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6f294d3..7df9220 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -565,10 +565,8 @@ enum {
 #define EXT4_GET_BLOCKS_KEEP_SIZE		0x0080
 	/* Do not take i_data_sem locking in ext4_map_blocks */
 #define EXT4_GET_BLOCKS_NO_LOCK			0x0100
-	/* Do not put hole in extent cache */
-#define EXT4_GET_BLOCKS_NO_PUT_HOLE		0x0200
 	/* Convert written extents to unwritten */
-#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN	0x0400
+#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN	0x0200
 
 /*
  * The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 76c2df3..6463d34 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2284,16 +2284,15 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
 				ext4_lblk_t block)
 {
 	int depth = ext_depth(inode);
-	unsigned long len = 0;
-	ext4_lblk_t lblock = 0;
+	unsigned long len;
+	ext4_lblk_t lblock;
 	struct ext4_extent *ex;
 
 	ex = path[depth].p_ext;
 	if (ex == NULL) {
-		/*
-		 * there is no extent yet, so gap is [0;-] and we
-		 * don't cache it
-		 */
+		/* there is no extent yet, so gap is [0;-] */
+		lblock = 0;
+		len = EXT_MAX_BLOCKS;
 		ext_debug("cache gap(whole file):");
 	} else if (block < le32_to_cpu(ex->ee_block)) {
 		lblock = block;
@@ -2302,9 +2301,6 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
 				block,
 				le32_to_cpu(ex->ee_block),
 				 ext4_ext_get_actual_len(ex));
-		if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
-			ext4_es_insert_extent(inode, lblock, len, ~0,
-					      EXTENT_STATUS_HOLE);
 	} else if (block >= le32_to_cpu(ex->ee_block)
 			+ ext4_ext_get_actual_len(ex)) {
 		ext4_lblk_t next;
@@ -2318,14 +2314,14 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
 				block);
 		BUG_ON(next == lblock);
 		len = next - lblock;
-		if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
-			ext4_es_insert_extent(inode, lblock, len, ~0,
-					      EXTENT_STATUS_HOLE);
 	} else {
 		BUG();
 	}
 
 	ext_debug(" -> %u:%lu\n", lblock, len);
+	if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
+		ext4_es_insert_extent(inode, lblock, len, ~0,
+				      EXTENT_STATUS_HOLE);
 }
 
 /*
@@ -4362,8 +4358,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		 * put just found gap into cache to speed up
 		 * subsequent requests
 		 */
-		if ((flags & EXT4_GET_BLOCKS_NO_PUT_HOLE) == 0)
-			ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
+		ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
 		goto out2;
 	}
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 367a60c..d1ad9f9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1485,11 +1485,9 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 			map->m_flags |= EXT4_MAP_FROM_CLUSTER;
 		retval = 0;
 	} else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
-		retval = ext4_ext_map_blocks(NULL, inode, map,
-					     EXT4_GET_BLOCKS_NO_PUT_HOLE);
+		retval = ext4_ext_map_blocks(NULL, inode, map, 0);
 	else
-		retval = ext4_ind_map_blocks(NULL, inode, map,
-					     EXT4_GET_BLOCKS_NO_PUT_HOLE);
+		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
 
 add_delayed:
 	if (retval == 0) {
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index ff4bd1b..9337d36 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -43,8 +43,7 @@ struct extent_status;
 	{ EXT4_GET_BLOCKS_METADATA_NOFAIL,	"METADATA_NOFAIL" },	\
 	{ EXT4_GET_BLOCKS_NO_NORMALIZE,		"NO_NORMALIZE" },	\
 	{ EXT4_GET_BLOCKS_KEEP_SIZE,		"KEEP_SIZE" },		\
-	{ EXT4_GET_BLOCKS_NO_LOCK,		"NO_LOCK" },		\
-	{ EXT4_GET_BLOCKS_NO_PUT_HOLE,		"NO_PUT_HOLE" })
+	{ EXT4_GET_BLOCKS_NO_LOCK,		"NO_LOCK" })
 
 #define show_mflags(flags) __print_flags(flags, "",	\
 	{ EXT4_MAP_NEW,		"N" },			\
-- 
1.7.9.7



* [PATCH v3 4/6] ext4: change lru to round-robin in extent status tree shrinker
  2014-08-07  3:35 [PATCH v3 0/6] ext4: extents status tree shrinker improvement Zheng Liu
                   ` (2 preceding siblings ...)
  2014-08-07  3:35 ` [PATCH v3 3/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks() Zheng Liu
@ 2014-08-07  3:35 ` Zheng Liu
  2014-08-27 15:01   ` Jan Kara
  2014-08-07  3:35 ` [PATCH v3 5/6] ext4: use a list to track all reclaimable objects for extent status tree Zheng Liu
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Zheng Liu @ 2014-08-07  3:35 UTC (permalink / raw)
  To: linux-ext4; +Cc: Theodore Ts'o, Andreas Dilger, Jan Kara, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

This commit discards the LRU algorithm because maintaining an LRU list
in the extent status tree shrinker is costly, and the shrinker then has
to spend a long time scanning that list in order to reclaim objects.

To reduce this latency, the commit does two things.  First, it replaces
the LRU with a round-robin scan, so we no longer need to maintain LRU
ordering; in particular, the list no longer has to be sorted when the
shrinker fails to reclaim anything on the first pass.  Second, it keeps
the list short.  With the round-robin algorithm the shrinker takes the
first inode on the list and processes it.  If that inode is skipped, it
is moved to the tail of the list; otherwise it is removed from the list
and only added back the next time it is touched.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/ext4.h              |   10 +-
 fs/ext4/extents.c           |    4 +-
 fs/ext4/extents_status.c    |  216 +++++++++++++++++++------------------------
 fs/ext4/extents_status.h    |    7 +-
 fs/ext4/inode.c             |    4 +-
 fs/ext4/ioctl.c             |    4 +-
 fs/ext4/super.c             |    7 +-
 include/trace/events/ext4.h |   11 +--
 8 files changed, 115 insertions(+), 148 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7df9220..559af1b0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -887,10 +887,9 @@ struct ext4_inode_info {
 	/* extents status tree */
 	struct ext4_es_tree i_es_tree;
 	rwlock_t i_es_lock;
-	struct list_head i_es_lru;
+	struct list_head i_es_list;
 	unsigned int i_es_all_nr;	/* protected by i_es_lock */
-	unsigned int i_es_lru_nr;	/* protected by i_es_lock */
-	unsigned long i_touch_when;	/* jiffies of last accessing */
+	unsigned int i_es_shk_nr;	/* protected by i_es_lock */
 
 	/* ialloc */
 	ext4_group_t	i_last_alloc_group;
@@ -1328,10 +1327,11 @@ struct ext4_sb_info {
 
 	/* Reclaim extents from extent status tree */
 	struct shrinker s_es_shrinker;
-	struct list_head s_es_lru;
+	struct list_head s_es_list;
+	long s_es_nr_inode;
 	struct ext4_es_stats s_es_stats;
 	struct mb_cache *s_mb_cache;
-	spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
+	spinlock_t s_es_lock ____cacheline_aligned_in_smp;
 
 	/* Ratelimit ext4 messages. */
 	struct ratelimit_state s_err_ratelimit_state;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6463d34..61455d1 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4621,7 +4621,7 @@ out2:
 
 	trace_ext4_ext_map_blocks_exit(inode, flags, map,
 				       err ? err : allocated);
-	ext4_es_lru_add(inode);
+	ext4_es_list_add(inode);
 	return err ? err : allocated;
 }
 
@@ -5173,7 +5173,7 @@ int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		error = ext4_fill_fiemap_extents(inode, start_blk,
 						 len_blks, fieinfo);
 	}
-	ext4_es_lru_add(inode);
+	ext4_es_list_add(inode);
 	return error;
 }
 
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index b6c3366..8d76fd5 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -149,8 +149,8 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			      ext4_lblk_t end);
 static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 				       int nr_to_scan);
-static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
-			    struct ext4_inode_info *locked_ei);
+static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
+		       struct ext4_inode_info *locked_ei);
 
 int __init ext4_init_es(void)
 {
@@ -314,9 +314,9 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
 	 * We don't count delayed extent because we never try to reclaim them
 	 */
 	if (!ext4_es_is_delayed(es)) {
-		EXT4_I(inode)->i_es_lru_nr++;
+		EXT4_I(inode)->i_es_shk_nr++;
 		percpu_counter_inc(&EXT4_SB(inode->i_sb)->
-					s_es_stats.es_stats_lru_cnt);
+					s_es_stats.es_stats_shk_cnt);
 	}
 
 	EXT4_I(inode)->i_es_all_nr++;
@@ -330,12 +330,12 @@ static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
 	EXT4_I(inode)->i_es_all_nr--;
 	percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
 
-	/* Decrease the lru counter when this es is not delayed */
+	/* Decrease the shrink counter when this es is not delayed */
 	if (!ext4_es_is_delayed(es)) {
-		BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
-		EXT4_I(inode)->i_es_lru_nr--;
+		BUG_ON(EXT4_I(inode)->i_es_shk_nr == 0);
+		EXT4_I(inode)->i_es_shk_nr--;
 		percpu_counter_dec(&EXT4_SB(inode->i_sb)->
-					s_es_stats.es_stats_lru_cnt);
+					s_es_stats.es_stats_shk_cnt);
 	}
 
 	kmem_cache_free(ext4_es_cachep, es);
@@ -685,8 +685,8 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 		goto error;
 retry:
 	err = __es_insert_extent(inode, &newes);
-	if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
-					       EXT4_I(inode)))
+	if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb),
+					  1, EXT4_I(inode)))
 		goto retry;
 	if (err == -ENOMEM && !ext4_es_is_delayed(&newes))
 		err = 0;
@@ -843,8 +843,8 @@ retry:
 				es->es_lblk = orig_es.es_lblk;
 				es->es_len = orig_es.es_len;
 				if ((err == -ENOMEM) &&
-				    __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
-						     EXT4_I(inode)))
+				    __es_shrink(EXT4_SB(inode->i_sb),
+							1, EXT4_I(inode)))
 					goto retry;
 				goto out;
 			}
@@ -923,114 +923,117 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	return err;
 }
 
-static int ext4_inode_touch_time_cmp(void *priv, struct list_head *a,
-				     struct list_head *b)
+static inline void __ext4_es_list_add(struct ext4_sb_info *sbi,
+				      struct ext4_inode_info *ei)
 {
-	struct ext4_inode_info *eia, *eib;
-	eia = list_entry(a, struct ext4_inode_info, i_es_lru);
-	eib = list_entry(b, struct ext4_inode_info, i_es_lru);
+	if (list_empty(&ei->i_es_list)) {
+		list_add_tail(&ei->i_es_list, &sbi->s_es_list);
+		sbi->s_es_nr_inode++;
+	}
+}
 
-	if (ext4_test_inode_state(&eia->vfs_inode, EXT4_STATE_EXT_PRECACHED) &&
-	    !ext4_test_inode_state(&eib->vfs_inode, EXT4_STATE_EXT_PRECACHED))
-		return 1;
-	if (!ext4_test_inode_state(&eia->vfs_inode, EXT4_STATE_EXT_PRECACHED) &&
-	    ext4_test_inode_state(&eib->vfs_inode, EXT4_STATE_EXT_PRECACHED))
-		return -1;
-	if (eia->i_touch_when == eib->i_touch_when)
-		return 0;
-	if (time_after(eia->i_touch_when, eib->i_touch_when))
-		return 1;
-	else
-		return -1;
+void ext4_es_list_add(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+	if (!list_empty(&ei->i_es_list))
+		return;
+
+	spin_lock(&sbi->s_es_lock);
+	__ext4_es_list_add(sbi, ei);
+	spin_unlock(&sbi->s_es_lock);
 }
 
-static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
-			    struct ext4_inode_info *locked_ei)
+void ext4_es_list_del(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+	spin_lock(&sbi->s_es_lock);
+	if (!list_empty(&ei->i_es_list)) {
+		list_del_init(&ei->i_es_list);
+		WARN_ON_ONCE(sbi->s_es_nr_inode-- < 0);
+	}
+	spin_unlock(&sbi->s_es_lock);
+}
+
+static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
+		       struct ext4_inode_info *locked_ei)
 {
 	struct ext4_inode_info *ei;
 	struct ext4_es_stats *es_stats;
-	struct list_head *cur, *tmp;
-	LIST_HEAD(skipped);
 	ktime_t start_time;
 	u64 scan_time;
+	int nr_to_walk;
 	int nr_shrunk = 0;
-	int retried = 0, skip_precached = 1, nr_skipped = 0;
+	int retried = 0, nr_skipped = 0;
 
 	es_stats = &sbi->s_es_stats;
 	start_time = ktime_get();
-	spin_lock(&sbi->s_es_lru_lock);
 
 retry:
-	list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
+	spin_lock(&sbi->s_es_lock);
+	nr_to_walk = sbi->s_es_nr_inode;
+	while (!list_empty(&sbi->s_es_list) && nr_to_walk-- > 0) {
 		int shrunk;
 
-		/*
-		 * If we have already reclaimed all extents from extent
-		 * status tree, just stop the loop immediately.
-		 */
-		if (percpu_counter_read_positive(
-				&es_stats->es_stats_lru_cnt) == 0)
-			break;
+		ei = list_first_entry(&sbi->s_es_list, struct ext4_inode_info,
+				      i_es_list);
 
-		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
+		list_del_init(&ei->i_es_list);
+		sbi->s_es_nr_inode--;
+		spin_unlock(&sbi->s_es_lock);
 
 		/*
-		 * Skip the inode that is newer than the last_sorted
-		 * time.  Normally we try hard to avoid shrinking
-		 * precached inodes, but we will as a last resort.
+		 * Normally we try hard to avoid shrinking precached inodes,
+		 * but we will as a last resort.
 		 */
-		if ((es_stats->es_stats_last_sorted < ei->i_touch_when) ||
-		    (skip_precached && ext4_test_inode_state(&ei->vfs_inode,
-						EXT4_STATE_EXT_PRECACHED))) {
+		if (!retried && ext4_test_inode_state(&ei->vfs_inode,
+						EXT4_STATE_EXT_PRECACHED)) {
 			nr_skipped++;
-			list_move_tail(cur, &skipped);
+			spin_lock(&sbi->s_es_lock);
+			__ext4_es_list_add(sbi, ei);
+			continue;
+		}
+
+		if (ei->i_es_shk_nr == 0) {
+			spin_lock(&sbi->s_es_lock);
 			continue;
 		}
 
-		if (ei->i_es_lru_nr == 0 || ei == locked_ei ||
-		    !write_trylock(&ei->i_es_lock))
+		if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
+			nr_skipped++;
+			spin_lock(&sbi->s_es_lock);
+			__ext4_es_list_add(sbi, ei);
 			continue;
+		}
 
 		shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
-		if (ei->i_es_lru_nr == 0)
-			list_del_init(&ei->i_es_lru);
 		write_unlock(&ei->i_es_lock);
 
 		nr_shrunk += shrunk;
 		nr_to_scan -= shrunk;
+
 		if (nr_to_scan == 0)
-			break;
+			goto out;
+		spin_lock(&sbi->s_es_lock);
 	}
-
-	/* Move the newer inodes into the tail of the LRU list. */
-	list_splice_tail(&skipped, &sbi->s_es_lru);
-	INIT_LIST_HEAD(&skipped);
+	spin_unlock(&sbi->s_es_lock);
 
 	/*
 	 * If we skipped any inodes, and we weren't able to make any
-	 * forward progress, sort the list and try again.
+	 * forward progress, try again to scan precached inodes.
 	 */
 	if ((nr_shrunk == 0) && nr_skipped && !retried) {
 		retried++;
-		list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
-		es_stats->es_stats_last_sorted = jiffies;
-		ei = list_first_entry(&sbi->s_es_lru, struct ext4_inode_info,
-				      i_es_lru);
-		/*
-		 * If there are no non-precached inodes left on the
-		 * list, start releasing precached extents.
-		 */
-		if (ext4_test_inode_state(&ei->vfs_inode,
-					  EXT4_STATE_EXT_PRECACHED))
-			skip_precached = 0;
 		goto retry;
 	}
 
-	spin_unlock(&sbi->s_es_lru_lock);
-
 	if (locked_ei && nr_shrunk == 0)
 		nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
 
+out:
 	scan_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
 	if (likely(es_stats->es_stats_scan_time))
 		es_stats->es_stats_scan_time = (scan_time +
@@ -1045,7 +1048,7 @@ retry:
 	else
 		es_stats->es_stats_shrunk = nr_shrunk;
 
-	trace_ext4_es_shrink(sbi->s_sb, nr_shrunk, scan_time, skip_precached,
+	trace_ext4_es_shrink(sbi->s_sb, nr_shrunk, scan_time,
 			     nr_skipped, retried);
 	return nr_shrunk;
 }
@@ -1057,7 +1060,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
 	struct ext4_sb_info *sbi;
 
 	sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
-	nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_lru_cnt);
+	nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
 	trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
 	return nr;
 }
@@ -1070,13 +1073,13 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
 	int nr_to_scan = sc->nr_to_scan;
 	int ret, nr_shrunk;
 
-	ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_lru_cnt);
+	ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
 	trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);
 
 	if (!nr_to_scan)
 		return ret;
 
-	nr_shrunk = __ext4_es_shrink(sbi, nr_to_scan, NULL);
+	nr_shrunk = __es_shrink(sbi, nr_to_scan, NULL);
 
 	trace_ext4_es_shrink_scan_exit(sbi->s_sb, nr_shrunk, ret);
 	return nr_shrunk;
@@ -1104,28 +1107,24 @@ static int ext4_es_seq_shrinker_info_show(struct seq_file *seq, void *v)
 		return 0;
 
 	/* here we just find an inode that has the max nr. of objects */
-	spin_lock(&sbi->s_es_lru_lock);
-	list_for_each_entry(ei, &sbi->s_es_lru, i_es_lru) {
+	spin_lock(&sbi->s_es_lock);
+	list_for_each_entry(ei, &sbi->s_es_list, i_es_list) {
 		inode_cnt++;
 		if (max && max->i_es_all_nr < ei->i_es_all_nr)
 			max = ei;
 		else if (!max)
 			max = ei;
 	}
-	spin_unlock(&sbi->s_es_lru_lock);
+	spin_unlock(&sbi->s_es_lock);
 
 	seq_printf(seq, "stats:\n  %lld objects\n  %lld reclaimable objects\n",
 		   percpu_counter_sum_positive(&es_stats->es_stats_all_cnt),
-		   percpu_counter_sum_positive(&es_stats->es_stats_lru_cnt));
+		   percpu_counter_sum_positive(&es_stats->es_stats_shk_cnt));
 	seq_printf(seq, "  %lu/%lu cache hits/misses\n",
 		   es_stats->es_stats_cache_hits,
 		   es_stats->es_stats_cache_misses);
-	if (es_stats->es_stats_last_sorted != 0)
-		seq_printf(seq, "  %u ms last sorted interval\n",
-			   jiffies_to_msecs(jiffies -
-					    es_stats->es_stats_last_sorted));
 	if (inode_cnt)
-		seq_printf(seq, "  %d inodes on lru list\n", inode_cnt);
+		seq_printf(seq, "  %d inodes on list\n", inode_cnt);
 
 	seq_printf(seq, "average:\n  %llu us scan time\n",
 	    div_u64(es_stats->es_stats_scan_time, 1000));
@@ -1134,7 +1133,7 @@ static int ext4_es_seq_shrinker_info_show(struct seq_file *seq, void *v)
 		seq_printf(seq,
 		    "maximum:\n  %lu inode (%u objects, %u reclaimable)\n"
 		    "  %llu us max scan time\n",
-		    max->vfs_inode.i_ino, max->i_es_all_nr, max->i_es_lru_nr,
+		    max->vfs_inode.i_ino, max->i_es_all_nr, max->i_es_shk_nr,
 		    div_u64(es_stats->es_stats_max_scan_time, 1000));
 
 	return 0;
@@ -1183,9 +1182,9 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
 {
 	int err;
 
-	INIT_LIST_HEAD(&sbi->s_es_lru);
-	spin_lock_init(&sbi->s_es_lru_lock);
-	sbi->s_es_stats.es_stats_last_sorted = 0;
+	INIT_LIST_HEAD(&sbi->s_es_list);
+	sbi->s_es_nr_inode = 0;
+	spin_lock_init(&sbi->s_es_lock);
 	sbi->s_es_stats.es_stats_shrunk = 0;
 	sbi->s_es_stats.es_stats_cache_hits = 0;
 	sbi->s_es_stats.es_stats_cache_misses = 0;
@@ -1194,7 +1193,7 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
 	err = percpu_counter_init(&sbi->s_es_stats.es_stats_all_cnt, 0);
 	if (err)
 		return err;
-	err = percpu_counter_init(&sbi->s_es_stats.es_stats_lru_cnt, 0);
+	err = percpu_counter_init(&sbi->s_es_stats.es_stats_shk_cnt, 0);
 	if (err)
 		goto err1;
 
@@ -1212,7 +1211,7 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
 	return 0;
 
 err2:
-	percpu_counter_destroy(&sbi->s_es_stats.es_stats_lru_cnt);
+	percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
 err1:
 	percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
 	return err;
@@ -1223,37 +1222,10 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
 	if (sbi->s_proc)
 		remove_proc_entry("es_shrinker_info", sbi->s_proc);
 	percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
-	percpu_counter_destroy(&sbi->s_es_stats.es_stats_lru_cnt);
+	percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
 	unregister_shrinker(&sbi->s_es_shrinker);
 }
 
-void ext4_es_lru_add(struct inode *inode)
-{
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-
-	ei->i_touch_when = jiffies;
-
-	if (!list_empty(&ei->i_es_lru))
-		return;
-
-	spin_lock(&sbi->s_es_lru_lock);
-	if (list_empty(&ei->i_es_lru))
-		list_add_tail(&ei->i_es_lru, &sbi->s_es_lru);
-	spin_unlock(&sbi->s_es_lru_lock);
-}
-
-void ext4_es_lru_del(struct inode *inode)
-{
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-
-	spin_lock(&sbi->s_es_lru_lock);
-	if (!list_empty(&ei->i_es_lru))
-		list_del_init(&ei->i_es_lru);
-	spin_unlock(&sbi->s_es_lru_lock);
-}
-
 static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 				       int nr_to_scan)
 {
@@ -1265,7 +1237,7 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
 				      DEFAULT_RATELIMIT_BURST);
 
-	if (ei->i_es_lru_nr == 0)
+	if (ei->i_es_shk_nr == 0)
 		return 0;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_EXT_PRECACHED) &&
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index efd5f97..0e6a33e 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -65,14 +65,13 @@ struct ext4_es_tree {
 };
 
 struct ext4_es_stats {
-	unsigned long es_stats_last_sorted;
 	unsigned long es_stats_shrunk;
 	unsigned long es_stats_cache_hits;
 	unsigned long es_stats_cache_misses;
 	u64 es_stats_scan_time;
 	u64 es_stats_max_scan_time;
 	struct percpu_counter es_stats_all_cnt;
-	struct percpu_counter es_stats_lru_cnt;
+	struct percpu_counter es_stats_shk_cnt;
 };
 
 extern int __init ext4_init_es(void);
@@ -151,7 +150,7 @@ static inline void ext4_es_store_pblock_status(struct extent_status *es,
 
 extern int ext4_es_register_shrinker(struct ext4_sb_info *sbi);
 extern void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi);
-extern void ext4_es_lru_add(struct inode *inode);
-extern void ext4_es_lru_del(struct inode *inode);
+extern void ext4_es_list_add(struct inode *inode);
+extern void ext4_es_list_del(struct inode *inode);
 
 #endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d1ad9f9..157ca44 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -494,7 +494,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 
 	/* Lookup extent status tree firstly */
 	if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
-		ext4_es_lru_add(inode);
+		ext4_es_list_add(inode);
 		if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
 			map->m_pblk = ext4_es_pblock(&es) +
 					map->m_lblk - es.es_lblk;
@@ -1431,7 +1431,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 
 	/* Lookup extent status tree firstly */
 	if (ext4_es_lookup_extent(inode, iblock, &es)) {
-		ext4_es_lru_add(inode);
+		ext4_es_list_add(inode);
 		if (ext4_es_is_hole(&es)) {
 			retval = 0;
 			down_read(&EXT4_I(inode)->i_data_sem);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 0f2252e..25c9ef0 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -78,8 +78,8 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2)
 	memswap(&ei1->i_disksize, &ei2->i_disksize, sizeof(ei1->i_disksize));
 	ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS);
 	ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS);
-	ext4_es_lru_del(inode1);
-	ext4_es_lru_del(inode2);
+	ext4_es_list_del(inode1);
+	ext4_es_list_del(inode2);
 
 	isize = i_size_read(inode1);
 	i_size_write(inode1, i_size_read(inode2));
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7a06eb4..b03ef81 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -883,10 +883,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	spin_lock_init(&ei->i_prealloc_lock);
 	ext4_es_init_tree(&ei->i_es_tree);
 	rwlock_init(&ei->i_es_lock);
-	INIT_LIST_HEAD(&ei->i_es_lru);
+	INIT_LIST_HEAD(&ei->i_es_list);
 	ei->i_es_all_nr = 0;
-	ei->i_es_lru_nr = 0;
-	ei->i_touch_when = 0;
+	ei->i_es_shk_nr = 0;
 	ei->i_reserved_data_blocks = 0;
 	ei->i_reserved_meta_blocks = 0;
 	ei->i_allocated_meta_blocks = 0;
@@ -975,7 +974,7 @@ void ext4_clear_inode(struct inode *inode)
 	dquot_drop(inode);
 	ext4_discard_preallocations(inode);
 	ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS);
-	ext4_es_lru_del(inode);
+	ext4_es_list_del(inode);
 	if (EXT4_I(inode)->jinode) {
 		jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode),
 					       EXT4_I(inode)->jinode);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 9337d36..27918ca 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2451,15 +2451,14 @@ TRACE_EVENT(ext4_collapse_range,
 
 TRACE_EVENT(ext4_es_shrink,
 	TP_PROTO(struct super_block *sb, int nr_shrunk, u64 scan_time,
-		 int skip_precached, int nr_skipped, int retried),
+		 int nr_skipped, int retried),
 
-	TP_ARGS(sb, nr_shrunk, scan_time, skip_precached, nr_skipped, retried),
+	TP_ARGS(sb, nr_shrunk, scan_time, nr_skipped, retried),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,		dev		)
 		__field(	int,		nr_shrunk	)
 		__field(	unsigned long long, scan_time	)
-		__field(	int,		skip_precached	)
 		__field(	int,		nr_skipped	)
 		__field(	int,		retried		)
 	),
@@ -2468,16 +2467,14 @@ TRACE_EVENT(ext4_es_shrink,
 		__entry->dev		= sb->s_dev;
 		__entry->nr_shrunk	= nr_shrunk;
 		__entry->scan_time	= div_u64(scan_time, 1000);
-		__entry->skip_precached = skip_precached;
 		__entry->nr_skipped	= nr_skipped;
 		__entry->retried	= retried;
 	),
 
-	TP_printk("dev %d,%d nr_shrunk %d, scan_time %llu skip_precached %d "
+	TP_printk("dev %d,%d nr_shrunk %d, scan_time %llu "
 		  "nr_skipped %d retried %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->nr_shrunk,
-		  __entry->scan_time, __entry->skip_precached,
-		  __entry->nr_skipped, __entry->retried)
+		  __entry->scan_time, __entry->nr_skipped, __entry->retried)
 );
 
 #endif /* _TRACE_EXT4_H */
-- 
1.7.9.7


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 5/6] ext4: use a list to track all reclaimable objects for extent status tree
  2014-08-07  3:35 [PATCH v3 0/6] ext4: extents status tree shrinker improvement Zheng Liu
                   ` (3 preceding siblings ...)
  2014-08-07  3:35 ` [PATCH v3 4/6] ext4: change lru to round-robin in extent status tree shrinker Zheng Liu
@ 2014-08-07  3:35 ` Zheng Liu
  2014-08-27 15:13   ` Jan Kara
  2014-08-07  3:35 ` [PATCH v3 6/6] ext4: use a garbage collection algorithm to manage object Zheng Liu
  2014-10-20 14:48 ` [PATCH v3 0/6] ext4: extents status tree shrinker improvement Theodore Ts'o
  6 siblings, 1 reply; 33+ messages in thread
From: Zheng Liu @ 2014-08-07  3:35 UTC (permalink / raw)
  To: linux-ext4; +Cc: Theodore Ts'o, Andreas Dilger, Jan Kara, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

Currently the shrinker can take a long time to reclaim objects from an
extent status tree when the tree contains a lot of delayed objects.
Delayed objects cannot be reclaimed because ext4 uses them for seeking
data/hole, finding delayed ranges, and other work.  If an rb-tree holds
a large number of delayed objects, the shrinker has to scan more objects
and the latency becomes high.  This commit uses a list to track all
reclaimable objects in order to reduce the latency when the shrinker
tries to reclaim objects from an extent status tree.
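
The idea can be sketched in plain C (a toy model with invented names,
not the actual ext4 code): once non-delayed entries live on their own
list, the shrinker's cost depends on how much it frees, not on how many
delayed entries the rb-tree holds.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy model of this patch: non-delayed entries additionally sit on a
 * reclaim list, so the shrinker walks only reclaimable entries instead
 * of stepping over every delayed entry in the rb-tree.
 */
struct toy_extent {
	int delayed;			/* delayed entries are never reclaimed */
	struct toy_extent *next;	/* reclaim-list linkage */
};

struct toy_tree {
	struct toy_extent *reclaim_list;	/* non-delayed entries only */
	int nr_all;
	int nr_reclaimable;
};

/* Mirrors the insert path: only non-delayed entries are tracked. */
void toy_insert(struct toy_tree *t, struct toy_extent *es)
{
	t->nr_all++;
	if (!es->delayed) {
		es->next = t->reclaim_list;
		t->reclaim_list = es;
		t->nr_reclaimable++;
	}
}

/* Cost is O(entries freed), independent of the number of delayed entries. */
int toy_shrink(struct toy_tree *t, int *nr_to_scan)
{
	int nr_shrunk = 0;

	while (t->reclaim_list && *nr_to_scan > 0) {
		struct toy_extent *es = t->reclaim_list;

		t->reclaim_list = es->next;
		t->nr_reclaimable--;
		t->nr_all--;
		free(es);
		nr_shrunk++;
		(*nr_to_scan)--;
	}
	return nr_shrunk;
}
```

With the old rb-tree walk, the same budget could be exhausted skipping
delayed entries without freeing anything.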

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents_status.c |   74 ++++++++++++++++++++++++++--------------------
 fs/ext4/extents_status.h |    2 ++
 2 files changed, 44 insertions(+), 32 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 8d76fd5..2f81d1e 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -148,7 +148,8 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			      ext4_lblk_t end);
 static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
-				       int nr_to_scan);
+				       struct list_head *freeable,
+				       int *nr_to_scan);
 static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
 		       struct ext4_inode_info *locked_ei);
 
@@ -171,6 +172,7 @@ void ext4_exit_es(void)
 void ext4_es_init_tree(struct ext4_es_tree *tree)
 {
 	tree->root = RB_ROOT;
+	INIT_LIST_HEAD(&tree->list);
 	tree->cache_es = NULL;
 }
 
@@ -302,6 +304,7 @@ static struct extent_status *
 ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
 		     ext4_fsblk_t pblk)
 {
+	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct extent_status *es;
 	es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
 	if (es == NULL)
@@ -314,12 +317,13 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
 	 * We don't count delayed extent because we never try to reclaim them
 	 */
 	if (!ext4_es_is_delayed(es)) {
-		EXT4_I(inode)->i_es_shk_nr++;
+		list_add_tail(&es->list, &ei->i_es_tree.list);
+		ei->i_es_shk_nr++;
 		percpu_counter_inc(&EXT4_SB(inode->i_sb)->
 					s_es_stats.es_stats_shk_cnt);
 	}
 
-	EXT4_I(inode)->i_es_all_nr++;
+	ei->i_es_all_nr++;
 	percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
 
 	return es;
@@ -327,13 +331,15 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
 
 static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
 {
-	EXT4_I(inode)->i_es_all_nr--;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	ei->i_es_all_nr--;
 	percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
 
 	/* Decrease the shrink counter when this es is not delayed */
 	if (!ext4_es_is_delayed(es)) {
-		BUG_ON(EXT4_I(inode)->i_es_shk_nr == 0);
-		EXT4_I(inode)->i_es_shk_nr--;
+		list_del(&es->list);
+		BUG_ON(ei->i_es_shk_nr-- == 0);
 		percpu_counter_dec(&EXT4_SB(inode->i_sb)->
 					s_es_stats.es_stats_shk_cnt);
 	}
@@ -958,11 +964,24 @@ void ext4_es_list_del(struct inode *inode)
 	spin_unlock(&sbi->s_es_lock);
 }
 
+static void dispose_list(struct inode *inode, struct list_head *head)
+{
+	while (!list_empty(head)) {
+		struct extent_status *es;
+
+		es = list_first_entry(head, struct extent_status, list);
+		list_del_init(&es->list);
+
+		ext4_es_free_extent(inode, es);
+	}
+}
+
 static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
 		       struct ext4_inode_info *locked_ei)
 {
 	struct ext4_inode_info *ei;
 	struct ext4_es_stats *es_stats;
+	LIST_HEAD(freeable);
 	ktime_t start_time;
 	u64 scan_time;
 	int nr_to_walk;
@@ -976,8 +995,6 @@ retry:
 	spin_lock(&sbi->s_es_lock);
 	nr_to_walk = sbi->s_es_nr_inode;
 	while (!list_empty(&sbi->s_es_list) && nr_to_walk-- > 0) {
-		int shrunk;
-
 		ei = list_first_entry(&sbi->s_es_list, struct ext4_inode_info,
 				      i_es_list);
 
@@ -1009,11 +1026,10 @@ retry:
 			continue;
 		}
 
-		shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
+		nr_shrunk += __es_try_to_reclaim_extents(ei, &freeable,
+							 &nr_to_scan);
 		write_unlock(&ei->i_es_lock);
-
-		nr_shrunk += shrunk;
-		nr_to_scan -= shrunk;
+		dispose_list(&ei->vfs_inode, &freeable);
 
 		if (nr_to_scan == 0)
 			goto out;
@@ -1030,8 +1046,11 @@ retry:
 		goto retry;
 	}
 
-	if (locked_ei && nr_shrunk == 0)
-		nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
+	if (locked_ei && nr_shrunk == 0) {
+		nr_shrunk = __es_try_to_reclaim_extents(locked_ei, &freeable,
+							&nr_to_scan);
+		dispose_list(&locked_ei->vfs_inode, &freeable);
+	}
 
 out:
 	scan_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
@@ -1227,12 +1246,12 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
 }
 
 static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
-				       int nr_to_scan)
+				       struct list_head *freeable,
+				       int *nr_to_scan)
 {
 	struct inode *inode = &ei->vfs_inode;
 	struct ext4_es_tree *tree = &ei->i_es_tree;
-	struct rb_node *node;
-	struct extent_status *es;
+	struct extent_status *es, *tmp;
 	unsigned long nr_shrunk = 0;
 	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
 				      DEFAULT_RATELIMIT_BURST);
@@ -1244,21 +1263,12 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 	    __ratelimit(&_rs))
 		ext4_warning(inode->i_sb, "forced shrink of precached extents");
 
-	node = rb_first(&tree->root);
-	while (node != NULL) {
-		es = rb_entry(node, struct extent_status, rb_node);
-		node = rb_next(&es->rb_node);
-		/*
-		 * We can't reclaim delayed extent from status tree because
-		 * fiemap, bigallic, and seek_data/hole need to use it.
-		 */
-		if (!ext4_es_is_delayed(es)) {
-			rb_erase(&es->rb_node, &tree->root);
-			ext4_es_free_extent(inode, es);
-			nr_shrunk++;
-			if (--nr_to_scan == 0)
-				break;
-		}
+	list_for_each_entry_safe(es, tmp, &tree->list, list) {
+		rb_erase(&es->rb_node, &tree->root);
+		list_move_tail(&es->list, freeable);
+		nr_shrunk++;
+		if (--*nr_to_scan == 0)
+			break;
 	}
 	tree->cache_es = NULL;
 	return nr_shrunk;
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 0e6a33e..a45c7fe 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -54,6 +54,7 @@ struct ext4_extent;
 
 struct extent_status {
 	struct rb_node rb_node;
+	struct list_head list;
 	ext4_lblk_t es_lblk;	/* first logical block extent covers */
 	ext4_lblk_t es_len;	/* length of extent in block */
 	ext4_fsblk_t es_pblk;	/* first physical block */
@@ -61,6 +62,7 @@ struct extent_status {
 
 struct ext4_es_tree {
 	struct rb_root root;
+	struct list_head list;
 	struct extent_status *cache_es;	/* recently accessed extent */
 };
 
-- 
1.7.9.7



* [PATCH v3 6/6] ext4: use a garbage collection algorithm to manage object
  2014-08-07  3:35 [PATCH v3 0/6] ext4: extents status tree shrinker improvement Zheng Liu
                   ` (4 preceding siblings ...)
  2014-08-07  3:35 ` [PATCH v3 5/6] ext4: use a list to track all reclaimable objects for extent status tree Zheng Liu
@ 2014-08-07  3:35 ` Zheng Liu
  2014-08-27 15:24   ` Jan Kara
  2014-10-20 14:48 ` [PATCH v3 0/6] ext4: extents status tree shrinker improvement Theodore Ts'o
  6 siblings, 1 reply; 33+ messages in thread
From: Zheng Liu @ 2014-08-07  3:35 UTC (permalink / raw)
  To: linux-ext4; +Cc: Theodore Ts'o, Andreas Dilger, Jan Kara, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

To keep useful extent caches in the tree, this commit uses a gc-like
algorithm to manage objects.  A new flag called '_ACCESSED' is defined
to track whether an extent cache has been touched.  When the shrinker
tries to reclaim some objects, an extent cache with this flag set is
moved from the inactive list to the tail of the active list instead of
being freed.  Objects on the active list are only reclaimed when we are
under high memory pressure.  As a result, as many aged extent caches as
possible are kept.
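
A minimal sketch of this two-list scheme (plain C, invented names; not
the ext4 implementation):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy model of the accessed-bit / two-list scheme: an entry starts on
 * the inactive list; a lookup sets its accessed bit; one shrink pass
 * gives accessed entries a second chance on the active list instead of
 * freeing them.
 */
struct gc_entry {
	int accessed;
	struct gc_entry *next;
};

struct gc_lists {
	struct gc_entry *inactive;
	struct gc_entry *active;
};

void gc_push(struct gc_entry **head, struct gc_entry *e)
{
	e->next = *head;
	*head = e;
}

/* Returns the number of entries freed from the inactive list. */
int gc_shrink(struct gc_lists *l, int *nr_to_scan)
{
	int nr_shrunk = 0;

	while (l->inactive && *nr_to_scan > 0) {
		struct gc_entry *e = l->inactive;

		l->inactive = e->next;
		(*nr_to_scan)--;
		if (e->accessed) {
			e->accessed = 0;	/* recently used: keep it */
			gc_push(&l->active, e);
		} else {
			free(e);
			nr_shrunk++;
		}
	}
	return nr_shrunk;
}
```

Under real memory pressure (the `force` case in the patch) a second pass
would walk the active list as well.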

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/extents_status.c |   42 +++++++++++++++++++++++++++++++++---------
 fs/ext4/extents_status.h |   31 ++++++++++++++++++++++++-------
 2 files changed, 57 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 2f81d1e..2f6bb538 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -149,7 +149,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 			      ext4_lblk_t end);
 static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 				       struct list_head *freeable,
-				       int *nr_to_scan);
+				       int *nr_to_scan, int force);
 static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
 		       struct ext4_inode_info *locked_ei);
 
@@ -172,7 +172,8 @@ void ext4_exit_es(void)
 void ext4_es_init_tree(struct ext4_es_tree *tree)
 {
 	tree->root = RB_ROOT;
-	INIT_LIST_HEAD(&tree->list);
+	INIT_LIST_HEAD(&tree->active_list);
+	INIT_LIST_HEAD(&tree->inactive_list);
 	tree->cache_es = NULL;
 }
 
@@ -317,7 +318,7 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
 	 * We don't count delayed extent because we never try to reclaim them
 	 */
 	if (!ext4_es_is_delayed(es)) {
-		list_add_tail(&es->list, &ei->i_es_tree.list);
+		list_add_tail(&es->list, &ei->i_es_tree.inactive_list);
 		ei->i_es_shk_nr++;
 		percpu_counter_inc(&EXT4_SB(inode->i_sb)->
 					s_es_stats.es_stats_shk_cnt);
@@ -787,6 +788,7 @@ out:
 	stats = &EXT4_SB(inode->i_sb)->s_es_stats;
 	if (found) {
 		BUG_ON(!es1);
+		ext4_es_mark_accessed(es1);
 		es->es_lblk = es1->es_lblk;
 		es->es_len = es1->es_len;
 		es->es_pblk = es1->es_pblk;
@@ -1027,7 +1029,7 @@ retry:
 		}
 
 		nr_shrunk += __es_try_to_reclaim_extents(ei, &freeable,
-							 &nr_to_scan);
+							 &nr_to_scan, retried);
 		write_unlock(&ei->i_es_lock);
 		dispose_list(&ei->vfs_inode, &freeable);
 
@@ -1048,7 +1050,7 @@ retry:
 
 	if (locked_ei && nr_shrunk == 0) {
 		nr_shrunk = __es_try_to_reclaim_extents(locked_ei, &freeable,
-							&nr_to_scan);
+							&nr_to_scan, 1);
 		dispose_list(&locked_ei->vfs_inode, &freeable);
 	}
 
@@ -1247,7 +1249,7 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
 
 static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 				       struct list_head *freeable,
-				       int *nr_to_scan)
+				       int *nr_to_scan, int force)
 {
 	struct inode *inode = &ei->vfs_inode;
 	struct ext4_es_tree *tree = &ei->i_es_tree;
@@ -1263,13 +1265,35 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
 	    __ratelimit(&_rs))
 		ext4_warning(inode->i_sb, "forced shrink of precached extents");
 
-	list_for_each_entry_safe(es, tmp, &tree->list, list) {
+	list_for_each_entry_safe(es, tmp, &tree->inactive_list, list) {
+		if (!*nr_to_scan)
+			goto done;
+		--*nr_to_scan;
+
+		if (ext4_es_is_accessed(es)) {
+			list_move_tail(&es->list, &tree->active_list);
+			continue;
+		} else {
+			rb_erase(&es->rb_node, &tree->root);
+			list_move_tail(&es->list, freeable);
+			nr_shrunk++;
+		}
+	}
+
+	if (!force)
+		goto done;
+
+	list_for_each_entry_safe(es, tmp, &tree->active_list, list) {
+		if (!*nr_to_scan)
+			goto done;
+		--*nr_to_scan;
+
 		rb_erase(&es->rb_node, &tree->root);
 		list_move_tail(&es->list, freeable);
 		nr_shrunk++;
-		if (--*nr_to_scan == 0)
-			break;
 	}
+
+done:
 	tree->cache_es = NULL;
 	return nr_shrunk;
 }
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index a45c7fe..213e056 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -29,12 +29,12 @@
 /*
  * These flags live in the high bits of extent_status.es_pblk
  */
-#define ES_SHIFT	60
+#define ES_SHIFT	59
 
-#define EXTENT_STATUS_WRITTEN	(1 << 3)
-#define EXTENT_STATUS_UNWRITTEN (1 << 2)
-#define EXTENT_STATUS_DELAYED	(1 << 1)
-#define EXTENT_STATUS_HOLE	(1 << 0)
+#define EXTENT_STATUS_WRITTEN	(1 << 4)
+#define EXTENT_STATUS_UNWRITTEN (1 << 3)
+#define EXTENT_STATUS_DELAYED	(1 << 2)
+#define EXTENT_STATUS_HOLE	(1 << 1)
 
 #define EXTENT_STATUS_FLAGS	(EXTENT_STATUS_WRITTEN | \
 				 EXTENT_STATUS_UNWRITTEN | \
@@ -45,9 +45,10 @@
 #define ES_UNWRITTEN		(1ULL << 62)
 #define ES_DELAYED		(1ULL << 61)
 #define ES_HOLE			(1ULL << 60)
+#define ES_ACCESSED		(1ULL << 59)
 
 #define ES_MASK			(ES_WRITTEN | ES_UNWRITTEN | \
-				 ES_DELAYED | ES_HOLE)
+				 ES_DELAYED | ES_HOLE | ES_ACCESSED)
 
 struct ext4_sb_info;
 struct ext4_extent;
@@ -62,7 +63,8 @@ struct extent_status {
 
 struct ext4_es_tree {
 	struct rb_root root;
-	struct list_head list;
+	struct list_head active_list;
+	struct list_head inactive_list;
 	struct extent_status *cache_es;	/* recently accessed extent */
 };
 
@@ -114,6 +116,21 @@ static inline int ext4_es_is_hole(struct extent_status *es)
 	return (es->es_pblk & ES_HOLE) != 0;
 }
 
+static inline int ext4_es_is_accessed(struct extent_status *es)
+{
+	return (es->es_pblk & ES_ACCESSED) != 0;
+}
+
+static inline void ext4_es_mark_accessed(struct extent_status *es)
+{
+	es->es_pblk |= ES_ACCESSED;
+}
+
+static inline void ext4_es_clear_accessed(struct extent_status *es)
+{
+	es->es_pblk &= ~ES_ACCESSED;
+}
+
 static inline unsigned int ext4_es_status(struct extent_status *es)
 {
 	return es->es_pblk >> ES_SHIFT;
-- 
1.7.9.7



* Re: [PATCH v3 2/6] ext4: track extent status tree shrinker delay statistics
  2014-08-07  3:35 ` [PATCH v3 2/6] ext4: track extent status tree shrinker delay statistics Zheng Liu
@ 2014-08-27 13:26   ` Jan Kara
  2014-09-04 12:10     ` Zheng Liu
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2014-08-27 13:26 UTC (permalink / raw)
  To: Zheng Liu
  Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Jan Kara, Zheng Liu

On Thu 07-08-14 11:35:49, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> This commit adds some statistics to the extent status tree shrinker.
> The purpose of adding these is to collect more details when we
> encounter a stall caused by the extent status tree shrinker.  Here we
> count the following statistics:
>   stats:
>     the number of all objects on all extent status trees
>     the number of reclaimable objects on lru list
>     cache hits/misses
>     the last sorted interval
>     the number of inodes on lru list
>   average:
>     scan time for shrinking some objects
>     the number of shrunk objects
>   maximum:
>     the inode that has max nr. of objects on lru list
>     the maximum scan time for shrinking some objects
> 
> The output looks like below:
>   $ cat /proc/fs/ext4/sda1/es_shrinker_info
>   stats:
>     28228 objects
>     6341 reclaimable objects
>     5281/631 cache hits/misses
>     586 ms last sorted interval
>     250 inodes on lru list
>   average:
>     153 us scan time
>     128 shrunk objects
>   maximum:
>     255 inode (255 objects, 198 reclaimable)
>     125723 us max scan time
  When reading through this again, I realized that we probably don't want
this file in /proc/fs/ext4 but rather in /sys/kernel/debug/ext4 because
it's really a debugging interface and we don't want any tools to start
using it.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH v3 3/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  2014-08-07  3:35 ` [PATCH v3 3/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks() Zheng Liu
@ 2014-08-27 13:55   ` Jan Kara
  2014-09-04 13:05     ` Zheng Liu
  2014-09-02  2:43   ` Theodore Ts'o
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Kara @ 2014-08-27 13:55 UTC (permalink / raw)
  To: Zheng Liu
  Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Jan Kara, Zheng Liu

On Thu 07-08-14 11:35:50, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> Currently the extent status tree doesn't cache an extent hole when a
> write looks up the extent tree to check whether a block has been
> allocated.  We don't put the extent hole in the extent cache because
> this extent might be removed later and a new delayed extent might be
> added back.  But this causes a defect when we do a lot of writes: if
> we don't put the extent hole in the extent cache, the following writes
> must access the extent tree again to check whether a block has been
> allocated, which means a cache miss.  This commit fixes this defect.
> Meanwhile, if an inode has no extents at all, that extent hole is
> cached as well.
> 
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> Cc: Jan Kara <jack@suse.cz>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
  So I agree with this change, it certainly doesn't make things worse. But
when looking into ext4_ext_put_gap_in_cache() I have one question: That
function uses ext4_find_delalloc_range() to check that the intended range
doesn't have any delalloc blocks in it. However it doesn't make any effort
to put a smaller hole in cache - for example if we have blocks allocated
like:

5-6 = delalloc
7-10 = allocated

and ext4_ext_put_gap_in_cache() is called for block 0, the function won't
put anything into cache although it could put range 0-4 in the cache as a
hole. Why is that?
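
As a toy illustration of the arithmetic (invented names, not the actual
ext4 helper):

```c
#include <assert.h>

/*
 * Given a lookup block and the first in-use (delalloc or allocated)
 * block at or after it, the range [block, first_used - 1] could still
 * be cached as a hole even though the larger candidate gap contains
 * delalloc blocks.
 */
unsigned int hole_len(unsigned int block, unsigned int first_used)
{
	return first_used > block ? first_used - block : 0;
}
```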

								Honza

> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 76c2df3..6463d34 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -2284,16 +2284,15 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
>  				ext4_lblk_t block)
>  {
>  	int depth = ext_depth(inode);
> -	unsigned long len = 0;
> -	ext4_lblk_t lblock = 0;
> +	unsigned long len;
> +	ext4_lblk_t lblock;
>  	struct ext4_extent *ex;
>  
>  	ex = path[depth].p_ext;
>  	if (ex == NULL) {
> -		/*
> -		 * there is no extent yet, so gap is [0;-] and we
> -		 * don't cache it
> -		 */
> +		/* there is no extent yet, so gap is [0;-] */
> +		lblock = 0;
> +		len = EXT_MAX_BLOCKS;
>  		ext_debug("cache gap(whole file):");
>  	} else if (block < le32_to_cpu(ex->ee_block)) {
>  		lblock = block;
> @@ -2302,9 +2301,6 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
>  				block,
>  				le32_to_cpu(ex->ee_block),
>  				 ext4_ext_get_actual_len(ex));
> -		if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
> -			ext4_es_insert_extent(inode, lblock, len, ~0,
> -					      EXTENT_STATUS_HOLE);
>  	} else if (block >= le32_to_cpu(ex->ee_block)
>  			+ ext4_ext_get_actual_len(ex)) {
>  		ext4_lblk_t next;
> @@ -2318,14 +2314,14 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
>  				block);
>  		BUG_ON(next == lblock);
>  		len = next - lblock;
> -		if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
> -			ext4_es_insert_extent(inode, lblock, len, ~0,
> -					      EXTENT_STATUS_HOLE);
>  	} else {
>  		BUG();
>  	}
>  
>  	ext_debug(" -> %u:%lu\n", lblock, len);
> +	if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
> +		ext4_es_insert_extent(inode, lblock, len, ~0,
> +				      EXTENT_STATUS_HOLE);
>  }
>  
>  /*
> @@ -4362,8 +4358,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
>  		 * put just found gap into cache to speed up
>  		 * subsequent requests
>  		 */
> -		if ((flags & EXT4_GET_BLOCKS_NO_PUT_HOLE) == 0)
> -			ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
> +		ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
>  		goto out2;
>  	}
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH v3 4/6] ext4: change lru to round-robin in extent status tree shrinker
  2014-08-07  3:35 ` [PATCH v3 4/6] ext4: change lru to round-robin in extent status tree shrinker Zheng Liu
@ 2014-08-27 15:01   ` Jan Kara
  2014-09-03  3:37     ` Theodore Ts'o
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2014-08-27 15:01 UTC (permalink / raw)
  To: Zheng Liu
  Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Jan Kara, Zheng Liu

On Thu 07-08-14 11:35:51, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> In this commit we discard the lru algorithm because maintaining an lru
> list in the extent status tree shrinker is expensive, and the shrinker
> can take a long time to scan this lru list in order to reclaim some
> objects.
> 
> To reduce the latency, this commit does two things.  The first is to
> replace the lru with round-robin, after which we no longer need to
> maintain an lru list.  That means the list does not have to be sorted
> when the shrinker cannot reclaim any objects in the first round.  The
> second is to shrink the length of the list.  With the round-robin
> algorithm, the shrinker takes the first inode in the list and handles
> it.  If this inode is skipped, it is moved to the tail of the list;
> otherwise it is added back when it is touched again.
  I like this change. Some comments are below.

...
> @@ -685,8 +685,8 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>  		goto error;
>  retry:
>  	err = __es_insert_extent(inode, &newes);
> -	if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
> -					       EXT4_I(inode)))
> +	if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb),
> +					  1, EXT4_I(inode)))
>  		goto retry;
>  	if (err == -ENOMEM && !ext4_es_is_delayed(&newes))
>  		err = 0;
  This comment is not directly related to this patch but looking into the
code made me think about it. It seems ugly to call __es_shrink() from the
internals of ext4_es_insert_extent(). Thinking about the locking
implications also makes me shudder a bit, and finally this may make the
pressure on the extent cache artificially bigger because the MM subsystem
is not aware of the shrinking you do here. I would prefer to leave
shrinking to the slab subsystem itself.

Now, the GFP_ATOMIC allocation we use for the extent cache makes things
hard for the slab subsystem, and actually we could fairly easily use
GFP_NOFS. We can just allocate the structure with GFP_NOFS before grabbing
i_es_lock, and in case we don't need the structure, we can simply free it
again. It may introduce some overhead from unnecessary alloc/free, but
things get simpler that way (no need for that locked_ei argument to
__es_shrink(), no need for internal calls to __es_shrink() from within
the filesystem).
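
A minimal sketch of that pattern (plain C; the names and the flag
standing in for i_es_lock are invented):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Allocate-before-locking: do a blocking allocation with no lock held,
 * then free the object again if it turns out to be unneeded once the
 * lock is taken.
 */
struct cache {
	int locked;	/* stand-in for a real rwlock */
	void *slot;
};

void cache_lock(struct cache *c)   { assert(!c->locked); c->locked = 1; }
void cache_unlock(struct cache *c) { assert(c->locked);  c->locked = 0; }

/* Returns 0 on success, -1 on allocation failure. */
int cache_fill(struct cache *c)
{
	void *obj = malloc(64);	/* GFP_NOFS analogue: may block, no lock held */

	if (!obj)
		return -1;
	cache_lock(c);
	if (c->slot) {		/* allocation turned out to be unneeded */
		cache_unlock(c);
		free(obj);
		return 0;
	}
	c->slot = obj;
	cache_unlock(c);
	return 0;
}
```

The occasional wasted alloc/free is the price for never allocating
atomically under the lock.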

...
> +static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
> +		       struct ext4_inode_info *locked_ei)
>  {
>  	struct ext4_inode_info *ei;
>  	struct ext4_es_stats *es_stats;
> -	struct list_head *cur, *tmp;
> -	LIST_HEAD(skipped);
>  	ktime_t start_time;
>  	u64 scan_time;
> +	int nr_to_walk;
>  	int nr_shrunk = 0;
> -	int retried = 0, skip_precached = 1, nr_skipped = 0;
> +	int retried = 0, nr_skipped = 0;
>  
>  	es_stats = &sbi->s_es_stats;
>  	start_time = ktime_get();
> -	spin_lock(&sbi->s_es_lru_lock);
>  
>  retry:
> -	list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
> +	spin_lock(&sbi->s_es_lock);
> +	nr_to_walk = sbi->s_es_nr_inode;
> +	while (!list_empty(&sbi->s_es_list) && nr_to_walk-- > 0) {
>  		int shrunk;
>  
> -		/*
> -		 * If we have already reclaimed all extents from extent
> -		 * status tree, just stop the loop immediately.
> -		 */
> -		if (percpu_counter_read_positive(
> -				&es_stats->es_stats_lru_cnt) == 0)
> -			break;
> +		ei = list_first_entry(&sbi->s_es_list, struct ext4_inode_info,
> +				      i_es_list);
>  
> -		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
> +		list_del_init(&ei->i_es_list);
> +		sbi->s_es_nr_inode--;
> +		spin_unlock(&sbi->s_es_lock);
  Nothing seems to prevent reclaim from freeing the inode after we drop
s_es_lock. So we could use freed memory. I don't think we want to pin the
inode here by grabbing a refcount since we don't want to deal with iput()
in the shrinker (that could mean having to delete the inode from shrinker
context). But what we could do is grab ei->i_es_lock before dropping
s_es_lock. Since ext4_es_remove_extent() called from ext4_clear_inode()
always grabs i_es_lock, we are protected from inode being freed while we
hold that lock. But please add comments about this both to the
__es_shrink() and ext4_es_remove_extent().

>  
>  		/*
> -		 * Skip the inode that is newer than the last_sorted
> -		 * time.  Normally we try hard to avoid shrinking
> -		 * precached inodes, but we will as a last resort.
> +		 * Normally we try hard to avoid shrinking precached inodes,
> +		 * but we will as a last resort.
>  		 */
> -		if ((es_stats->es_stats_last_sorted < ei->i_touch_when) ||
> -		    (skip_precached && ext4_test_inode_state(&ei->vfs_inode,
> -						EXT4_STATE_EXT_PRECACHED))) {
> +		if (!retried && ext4_test_inode_state(&ei->vfs_inode,
> +						EXT4_STATE_EXT_PRECACHED)) {
>  			nr_skipped++;
> -			list_move_tail(cur, &skipped);
> +			spin_lock(&sbi->s_es_lock);
> +			__ext4_es_list_add(sbi, ei);
> +			continue;
> +		}
> +
> +		if (ei->i_es_shk_nr == 0) {
> +			spin_lock(&sbi->s_es_lock);
>  			continue;
>  		}
>  
> -		if (ei->i_es_lru_nr == 0 || ei == locked_ei ||
> -		    !write_trylock(&ei->i_es_lock))
> +		if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
> +			nr_skipped++;
> +			spin_lock(&sbi->s_es_lock);
> +			__ext4_es_list_add(sbi, ei);
>  			continue;
> +		}
  These tests will need some reorg when we rely on ei->i_es_lock but it
should be easily doable.

>  
>  		shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
> -		if (ei->i_es_lru_nr == 0)
> -			list_del_init(&ei->i_es_lru);
>  		write_unlock(&ei->i_es_lock);
>  
>  		nr_shrunk += shrunk;
>  		nr_to_scan -= shrunk;
> +
>  		if (nr_to_scan == 0)
> -			break;
> +			goto out;
> +		spin_lock(&sbi->s_es_lock);
>  	}
> -
> -	/* Move the newer inodes into the tail of the LRU list. */
> -	list_splice_tail(&skipped, &sbi->s_es_lru);
> -	INIT_LIST_HEAD(&skipped);
> +	spin_unlock(&sbi->s_es_lock);
>  
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH v3 5/6] ext4: use a list to track all reclaimable objects for extent status tree
  2014-08-07  3:35 ` [PATCH v3 5/6] ext4: use a list to track all reclaimable objects for extent status tree Zheng Liu
@ 2014-08-27 15:13   ` Jan Kara
  2014-09-03  3:44     ` Theodore Ts'o
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2014-08-27 15:13 UTC (permalink / raw)
  To: Zheng Liu
  Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Jan Kara, Zheng Liu

On Thu 07-08-14 11:35:52, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> Currently the shrinker can take a long time to reclaim objects from an
> extent status tree when the tree contains a lot of delayed objects.
> Delayed objects cannot be reclaimed because ext4 uses them for seeking
> data/hole, finding delayed ranges, and other work.  If an rb-tree holds
> a large number of delayed objects, the shrinker has to scan more
> objects and the latency becomes high.  This commit uses a list to
> track all reclaimable objects in order to reduce the latency when the
> shrinker tries to reclaim objects from an extent status tree.
> 
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> Cc: Jan Kara <jack@suse.cz>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
  Hmm, this will grow the structure for caching a single extent from 5 to 7
longs. That is quite noticeable. Since lots of delalloc extents within an
inode isn't really a common case, I would prefer to avoid that.

What we could do to limit the latency caused by scanning unreclaimable
extents is to change the shrinker to really stop after inspecting
nr_to_scan extents, regardless of how many extents we actually reclaimed -
this is actually how slab shrinkers are designed to work.  We would also
have to record the logical block where the shrinker stopped scanning the
inode, and the next shrinker invocation would start scanning at that
offset - it is enough to store this offset in the superblock; we just
have to zero it out when we remove the first inode from the list of
inodes with cached extents.

What do people think about this?

								Honza

> ---
>  fs/ext4/extents_status.c |   74 ++++++++++++++++++++++++++--------------------
>  fs/ext4/extents_status.h |    2 ++
>  2 files changed, 44 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index 8d76fd5..2f81d1e 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -148,7 +148,8 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
>  static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  			      ext4_lblk_t end);
>  static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> -				       int nr_to_scan);
> +				       struct list_head *freeable,
> +				       int *nr_to_scan);
>  static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
>  		       struct ext4_inode_info *locked_ei);
>  
> @@ -171,6 +172,7 @@ void ext4_exit_es(void)
>  void ext4_es_init_tree(struct ext4_es_tree *tree)
>  {
>  	tree->root = RB_ROOT;
> +	INIT_LIST_HEAD(&tree->list);
>  	tree->cache_es = NULL;
>  }
>  
> @@ -302,6 +304,7 @@ static struct extent_status *
>  ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
>  		     ext4_fsblk_t pblk)
>  {
> +	struct ext4_inode_info *ei = EXT4_I(inode);
>  	struct extent_status *es;
>  	es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
>  	if (es == NULL)
> @@ -314,12 +317,13 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
>  	 * We don't count delayed extent because we never try to reclaim them
>  	 */
>  	if (!ext4_es_is_delayed(es)) {
> -		EXT4_I(inode)->i_es_shk_nr++;
> +		list_add_tail(&es->list, &ei->i_es_tree.list);
> +		ei->i_es_shk_nr++;
>  		percpu_counter_inc(&EXT4_SB(inode->i_sb)->
>  					s_es_stats.es_stats_shk_cnt);
>  	}
>  
> -	EXT4_I(inode)->i_es_all_nr++;
> +	ei->i_es_all_nr++;
>  	percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
>  
>  	return es;
> @@ -327,13 +331,15 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
>  
>  static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
>  {
> -	EXT4_I(inode)->i_es_all_nr--;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	ei->i_es_all_nr--;
>  	percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
>  
>  	/* Decrease the shrink counter when this es is not delayed */
>  	if (!ext4_es_is_delayed(es)) {
> -		BUG_ON(EXT4_I(inode)->i_es_shk_nr == 0);
> -		EXT4_I(inode)->i_es_shk_nr--;
> +		list_del(&es->list);
> +		BUG_ON(ei->i_es_shk_nr-- == 0);
>  		percpu_counter_dec(&EXT4_SB(inode->i_sb)->
>  					s_es_stats.es_stats_shk_cnt);
>  	}
> @@ -958,11 +964,24 @@ void ext4_es_list_del(struct inode *inode)
>  	spin_unlock(&sbi->s_es_lock);
>  }
>  
> +static void dispose_list(struct inode *inode, struct list_head *head)
> +{
> +	while (!list_empty(head)) {
> +		struct extent_status *es;
> +
> +		es = list_first_entry(head, struct extent_status, list);
> +		list_del_init(&es->list);
> +
> +		ext4_es_free_extent(inode, es);
> +	}
> +}
> +
>  static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
>  		       struct ext4_inode_info *locked_ei)
>  {
>  	struct ext4_inode_info *ei;
>  	struct ext4_es_stats *es_stats;
> +	LIST_HEAD(freeable);
>  	ktime_t start_time;
>  	u64 scan_time;
>  	int nr_to_walk;
> @@ -976,8 +995,6 @@ retry:
>  	spin_lock(&sbi->s_es_lock);
>  	nr_to_walk = sbi->s_es_nr_inode;
>  	while (!list_empty(&sbi->s_es_list) && nr_to_walk-- > 0) {
> -		int shrunk;
> -
>  		ei = list_first_entry(&sbi->s_es_list, struct ext4_inode_info,
>  				      i_es_list);
>  
> @@ -1009,11 +1026,10 @@ retry:
>  			continue;
>  		}
>  
> -		shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
> +		nr_shrunk += __es_try_to_reclaim_extents(ei, &freeable,
> +							 &nr_to_scan);
>  		write_unlock(&ei->i_es_lock);
> -
> -		nr_shrunk += shrunk;
> -		nr_to_scan -= shrunk;
> +		dispose_list(&ei->vfs_inode, &freeable);
>  
>  		if (nr_to_scan == 0)
>  			goto out;
> @@ -1030,8 +1046,11 @@ retry:
>  		goto retry;
>  	}
>  
> -	if (locked_ei && nr_shrunk == 0)
> -		nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
> +	if (locked_ei && nr_shrunk == 0) {
> +		nr_shrunk = __es_try_to_reclaim_extents(locked_ei, &freeable,
> +							&nr_to_scan);
> +		dispose_list(&locked_ei->vfs_inode, &freeable);
> +	}
>  
>  out:
>  	scan_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
> @@ -1227,12 +1246,12 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
>  }
>  
>  static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> -				       int nr_to_scan)
> +				       struct list_head *freeable,
> +				       int *nr_to_scan)
>  {
>  	struct inode *inode = &ei->vfs_inode;
>  	struct ext4_es_tree *tree = &ei->i_es_tree;
> -	struct rb_node *node;
> -	struct extent_status *es;
> +	struct extent_status *es, *tmp;
>  	unsigned long nr_shrunk = 0;
>  	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
>  				      DEFAULT_RATELIMIT_BURST);
> @@ -1244,21 +1263,12 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
>  	    __ratelimit(&_rs))
>  		ext4_warning(inode->i_sb, "forced shrink of precached extents");
>  
> -	node = rb_first(&tree->root);
> -	while (node != NULL) {
> -		es = rb_entry(node, struct extent_status, rb_node);
> -		node = rb_next(&es->rb_node);
> -		/*
> -		 * We can't reclaim delayed extent from status tree because
> -		 * fiemap, bigallic, and seek_data/hole need to use it.
> -		 */
> -		if (!ext4_es_is_delayed(es)) {
> -			rb_erase(&es->rb_node, &tree->root);
> -			ext4_es_free_extent(inode, es);
> -			nr_shrunk++;
> -			if (--nr_to_scan == 0)
> -				break;
> -		}
> +	list_for_each_entry_safe(es, tmp, &tree->list, list) {
> +		rb_erase(&es->rb_node, &tree->root);
> +		list_move_tail(&es->list, freeable);
> +		nr_shrunk++;
> +		if (--*nr_to_scan == 0)
> +			break;
>  	}
>  	tree->cache_es = NULL;
>  	return nr_shrunk;
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index 0e6a33e..a45c7fe 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -54,6 +54,7 @@ struct ext4_extent;
>  
>  struct extent_status {
>  	struct rb_node rb_node;
> +	struct list_head list;
>  	ext4_lblk_t es_lblk;	/* first logical block extent covers */
>  	ext4_lblk_t es_len;	/* length of extent in block */
>  	ext4_fsblk_t es_pblk;	/* first physical block */
> @@ -61,6 +62,7 @@ struct extent_status {
>  
>  struct ext4_es_tree {
>  	struct rb_root root;
> +	struct list_head list;
>  	struct extent_status *cache_es;	/* recently accessed extent */
>  };
>  
> -- 
> 1.7.9.7
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 6/6] ext4: use a garbage collection algorithm to manage object
  2014-08-07  3:35 ` [PATCH v3 6/6] ext4: use a garbage collection algorithm to manage object Zheng Liu
@ 2014-08-27 15:24   ` Jan Kara
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Kara @ 2014-08-27 15:24 UTC (permalink / raw)
  To: Zheng Liu
  Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Jan Kara, Zheng Liu

On Thu 07-08-14 11:35:53, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> To keep useful extent caches in the tree, this commit uses a gc-like
> algorithm to manage objects.  A new flag called '_ACCESSED' is defined to
> track whether an extent cache has been touched or not.  When the shrinker
> tries to reclaim some objects, an extent cache is moved from the inactive
> list to the tail of the active list if this flag is set.  Objects on the
> active list are reclaimed only when we are under high memory pressure.
> That way, aged extent caches are kept in the tree as long as possible.
  So I like the idea of basic aging logic.  However, the active / inactive
lists make it necessary to have a list_head in the extent, and I'd like to
avoid that.  It also seems like overkill for a relatively simple structure
like the extent cache.  What we could do is have only the ACCESSED bit - it
gets set when a cache entry is used.  When the shrinker sees a cache entry
with the ACCESSED bit, it clears the bit and skips the entry.  An entry
without the ACCESSED bit is reclaimed.  This way frequently accessed entries
have a higher chance of being kept in the cache.  Again, the latency of
scanning is limited by the fact that we stop scanning after inspecting
nr_to_scan entries.

								Honza
> 
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> Cc: Jan Kara <jack@suse.cz>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> ---
>  fs/ext4/extents_status.c |   42 +++++++++++++++++++++++++++++++++---------
>  fs/ext4/extents_status.h |   31 ++++++++++++++++++++++++-------
>  2 files changed, 57 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index 2f81d1e..2f6bb538 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -149,7 +149,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  			      ext4_lblk_t end);
>  static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
>  				       struct list_head *freeable,
> -				       int *nr_to_scan);
> +				       int *nr_to_scan, int force);
>  static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
>  		       struct ext4_inode_info *locked_ei);
>  
> @@ -172,7 +172,8 @@ void ext4_exit_es(void)
>  void ext4_es_init_tree(struct ext4_es_tree *tree)
>  {
>  	tree->root = RB_ROOT;
> -	INIT_LIST_HEAD(&tree->list);
> +	INIT_LIST_HEAD(&tree->active_list);
> +	INIT_LIST_HEAD(&tree->inactive_list);
>  	tree->cache_es = NULL;
>  }
>  
> @@ -317,7 +318,7 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
>  	 * We don't count delayed extent because we never try to reclaim them
>  	 */
>  	if (!ext4_es_is_delayed(es)) {
> -		list_add_tail(&es->list, &ei->i_es_tree.list);
> +		list_add_tail(&es->list, &ei->i_es_tree.inactive_list);
>  		ei->i_es_shk_nr++;
>  		percpu_counter_inc(&EXT4_SB(inode->i_sb)->
>  					s_es_stats.es_stats_shk_cnt);
> @@ -787,6 +788,7 @@ out:
>  	stats = &EXT4_SB(inode->i_sb)->s_es_stats;
>  	if (found) {
>  		BUG_ON(!es1);
> +		ext4_es_mark_accessed(es1);
>  		es->es_lblk = es1->es_lblk;
>  		es->es_len = es1->es_len;
>  		es->es_pblk = es1->es_pblk;
> @@ -1027,7 +1029,7 @@ retry:
>  		}
>  
>  		nr_shrunk += __es_try_to_reclaim_extents(ei, &freeable,
> -							 &nr_to_scan);
> +							 &nr_to_scan, retried);
>  		write_unlock(&ei->i_es_lock);
>  		dispose_list(&ei->vfs_inode, &freeable);
>  
> @@ -1048,7 +1050,7 @@ retry:
>  
>  	if (locked_ei && nr_shrunk == 0) {
>  		nr_shrunk = __es_try_to_reclaim_extents(locked_ei, &freeable,
> -							&nr_to_scan);
> +							&nr_to_scan, 1);
>  		dispose_list(&locked_ei->vfs_inode, &freeable);
>  	}
>  
> @@ -1247,7 +1249,7 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
>  
>  static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
>  				       struct list_head *freeable,
> -				       int *nr_to_scan)
> +				       int *nr_to_scan, int force)
>  {
>  	struct inode *inode = &ei->vfs_inode;
>  	struct ext4_es_tree *tree = &ei->i_es_tree;
> @@ -1263,13 +1265,35 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
>  	    __ratelimit(&_rs))
>  		ext4_warning(inode->i_sb, "forced shrink of precached extents");
>  
> -	list_for_each_entry_safe(es, tmp, &tree->list, list) {
> +	list_for_each_entry_safe(es, tmp, &tree->inactive_list, list) {
> +		if (!*nr_to_scan)
> +			goto done;
> +		--*nr_to_scan;
> +
> +		if (ext4_es_is_accessed(es)) {
> +			list_move_tail(&es->list, &tree->active_list);
> +			continue;
> +		} else {
> +			rb_erase(&es->rb_node, &tree->root);
> +			list_move_tail(&es->list, freeable);
> +			nr_shrunk++;
> +		}
> +	}
> +
> +	if (!force)
> +		goto done;
> +
> +	list_for_each_entry_safe(es, tmp, &tree->active_list, list) {
> +		if (!*nr_to_scan)
> +			goto done;
> +		--*nr_to_scan;
> +
>  		rb_erase(&es->rb_node, &tree->root);
>  		list_move_tail(&es->list, freeable);
>  		nr_shrunk++;
> -		if (--*nr_to_scan == 0)
> -			break;
>  	}
> +
> +done:
>  	tree->cache_es = NULL;
>  	return nr_shrunk;
>  }
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index a45c7fe..213e056 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -29,12 +29,12 @@
>  /*
>   * These flags live in the high bits of extent_status.es_pblk
>   */
> -#define ES_SHIFT	60
> +#define ES_SHIFT	59
>  
> -#define EXTENT_STATUS_WRITTEN	(1 << 3)
> -#define EXTENT_STATUS_UNWRITTEN (1 << 2)
> -#define EXTENT_STATUS_DELAYED	(1 << 1)
> -#define EXTENT_STATUS_HOLE	(1 << 0)
> +#define EXTENT_STATUS_WRITTEN	(1 << 4)
> +#define EXTENT_STATUS_UNWRITTEN (1 << 3)
> +#define EXTENT_STATUS_DELAYED	(1 << 2)
> +#define EXTENT_STATUS_HOLE	(1 << 1)
>  
>  #define EXTENT_STATUS_FLAGS	(EXTENT_STATUS_WRITTEN | \
>  				 EXTENT_STATUS_UNWRITTEN | \
> @@ -45,9 +45,10 @@
>  #define ES_UNWRITTEN		(1ULL << 62)
>  #define ES_DELAYED		(1ULL << 61)
>  #define ES_HOLE			(1ULL << 60)
> +#define ES_ACCESSED		(1ULL << 59)
>  
>  #define ES_MASK			(ES_WRITTEN | ES_UNWRITTEN | \
> -				 ES_DELAYED | ES_HOLE)
> +				 ES_DELAYED | ES_HOLE | ES_ACCESSED)
>  
>  struct ext4_sb_info;
>  struct ext4_extent;
> @@ -62,7 +63,8 @@ struct extent_status {
>  
>  struct ext4_es_tree {
>  	struct rb_root root;
> -	struct list_head list;
> +	struct list_head active_list;
> +	struct list_head inactive_list;
>  	struct extent_status *cache_es;	/* recently accessed extent */
>  };
>  
> @@ -114,6 +116,21 @@ static inline int ext4_es_is_hole(struct extent_status *es)
>  	return (es->es_pblk & ES_HOLE) != 0;
>  }
>  
> +static inline int ext4_es_is_accessed(struct extent_status *es)
> +{
> +	return (es->es_pblk & ES_ACCESSED) != 0;
> +}
> +
> +static inline void ext4_es_mark_accessed(struct extent_status *es)
> +{
> +	es->es_pblk |= ES_ACCESSED;
> +}
> +
> +static inline void ext4_es_clear_accessed(struct extent_status *es)
> +{
> +	es->es_pblk &= ~ES_ACCESSED;
> +}
> +
>  static inline unsigned int ext4_es_status(struct extent_status *es)
>  {
>  	return es->es_pblk >> ES_SHIFT;
> -- 
> 1.7.9.7
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 1/6] ext4: improve extents status tree trace point
  2014-08-07  3:35 ` [PATCH v3 1/6] ext4: improve extents status tree trace point Zheng Liu
@ 2014-09-02  2:25   ` Theodore Ts'o
  0 siblings, 0 replies; 33+ messages in thread
From: Theodore Ts'o @ 2014-09-02  2:25 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Andreas Dilger, Jan Kara, Zheng Liu

On Thu, Aug 07, 2014 at 11:35:48AM +0800, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> This commit improves the trace points of the extents status tree.  We
> rename trace_ext4_es_shrink_enter in ext4_es_count() because it is also
> used in ext4_es_scan() and we cannot tell them apart in the trace output.
> 
> Further, this commit fixes a variable name in a trace point in order to
> keep it consistent with the others.
> 
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> Cc: Jan Kara <jack@suse.cz>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>

Applied to the ext4 tree, thanks.

					- Ted

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 3/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  2014-08-07  3:35 ` [PATCH v3 3/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks() Zheng Liu
  2014-08-27 13:55   ` Jan Kara
@ 2014-09-02  2:43   ` Theodore Ts'o
  2014-09-04 13:04     ` Zheng Liu
  1 sibling, 1 reply; 33+ messages in thread
From: Theodore Ts'o @ 2014-09-02  2:43 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Andreas Dilger, Jan Kara, Zheng Liu

On Thu, Aug 07, 2014 at 11:35:50AM +0800, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> Currently the extent status tree doesn't cache an extent hole when a write
> looks up the extent tree to check whether a block has been allocated or
> not.  In this case, we don't put the extent hole in the extent cache
> because later this extent might be removed and a new delayed extent might
> be added back.  But this causes a defect when we do a lot of writes: if we
> don't put the extent hole in the extent cache, the following writes also
> need to access the extent tree to see whether or not a block has been
> allocated.  That brings cache misses.  This commit fixes this defect.
> Meanwhile, if an inode has no extents at all, this extent hole will be
> cached as well.

Hi Zheng,

I thought the reason why we have the EXT4_GET_BLOCKS_NO_PUT_HOLE flag
is because in ext4_da_map_blocks(), if there is a hole, we will be
immediately following it up with a call to ext4_es_insert_extent() to
fill in the hole with the EXTENT_STATUS_DELAYED flag.  The only time
we don't is if we run into an ENOSPC error.

Am I missing something?

						- Ted

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 4/6] ext4: change lru to round-robin in extent status tree shrinker
  2014-08-27 15:01   ` Jan Kara
@ 2014-09-03  3:37     ` Theodore Ts'o
  2014-09-03 15:31       ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: Theodore Ts'o @ 2014-09-03  3:37 UTC (permalink / raw)
  To: Jan Kara; +Cc: Zheng Liu, linux-ext4, Andreas Dilger, Zheng Liu

On Wed, Aug 27, 2014 at 05:01:21PM +0200, Jan Kara wrote:
> On Thu 07-08-14 11:35:51, Zheng Liu wrote:
>   This comment is not directly related to this patch, but looking into the
> code made me think about it.  It seems ugly to call __es_shrink() from the
> internals of ext4_es_insert_extent().  Also, thinking about the locking
> implications makes me shudder a bit, and finally this may make the pressure
> on the extent cache artificially bigger because the MM subsystem is not
> aware of the shrinking you do here.  I would prefer to leave the shrinking
> to the slab subsystem itself.

If the allocation fails, we only try to free at most one extent, so
I don't think it's going to make the slab subsystem that confused; it's
the equivalent of freeing an entry and then allocating it again.

> Now, the GFP_ATOMIC allocation we use for the extent cache makes things
> hard for the slab subsystem, and actually we could fairly easily use
> GFP_NOFS.  We can just allocate the structure before grabbing i_es_lock
> with a GFP_NOFS allocation and, in case we don't need the structure, just
> free it again.  It may introduce some overhead from unnecessary
> alloc/free, but things get simpler that way (no need for that locked_ei
> argument for __es_shrink(), no need for internal calls to __es_shrink()
> from within the filesystem).

The tricky bit is that even __es_remove_extent() can require a memory
allocation, and in the worst case, it's possible that
ext4_es_insert_extent() can require *two* allocations.  For example,
if you start with a single large extent, and then need to insert a
subregion with a different set of flags into the already existing
extent, thus resulting in three extents where you started with one.

And in some cases, no allocation is required at all....

One thing that can help is that so long as we haven't done something
critical, such as erasing a delalloc region, we always release the write
lock, retry the allocation with GFP_NOFS, and then try the operation
again.

So we may need to think a bit about the best way to improve this,
although it is a separate topic from making the shrinker less
heavyweight.

>   Nothing seems to prevent reclaim from freeing the inode after we drop
> s_es_lock. So we could use freed memory. I don't think we want to pin the
> inode here by grabbing a refcount since we don't want to deal with iput()
> in the shrinker (that could mean having to delete the inode from shrinker
> context). But what we could do is to grab ei->i_es_lock before dropping
> s_es_lock. Since ext4_es_remove_extent() called from ext4_clear_inode()
> always grabs i_es_lock, we are protected from inode being freed while we
> hold that lock. But please add comments about this both to the
> __es_shrink() and ext4_es_remove_extent().

Something like this should work, yes?

						- Ted

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 25da1bf..4768f7f 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -981,32 +981,27 @@ retry:
 
 		list_del_init(&ei->i_es_list);
 		sbi->s_es_nr_inode--;
-		spin_unlock(&sbi->s_es_lock);
+		if (ei->i_es_shk_nr == 0)
+			continue;
 
 		/*
 		 * Normally we try hard to avoid shrinking precached inodes,
 		 * but we will as a last resort.
 		 */
-		if (!retried && ext4_test_inode_state(&ei->vfs_inode,
-						EXT4_STATE_EXT_PRECACHED)) {
+		if ((!retried && ext4_test_inode_state(&ei->vfs_inode,
+				       EXT4_STATE_EXT_PRECACHED)) ||
+		    ei == locked_ei ||
+		    !write_trylock(&ei->i_es_lock)) {
 			nr_skipped++;
-			spin_lock(&sbi->s_es_lock);
-			__ext4_es_list_add(sbi, ei);
-			continue;
-		}
-
-		if (ei->i_es_shk_nr == 0) {
-			spin_lock(&sbi->s_es_lock);
-			continue;
-		}
-
-		if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
-			nr_skipped++;
-			spin_lock(&sbi->s_es_lock);
 			__ext4_es_list_add(sbi, ei);
+			if (spin_is_contended(&sbi->s_es_lock)) {
+				spin_unlock(&sbi->s_es_lock);
+				spin_lock(&sbi->s_es_lock);
+			}
 			continue;
 		}

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 5/6] ext4: use a list to track all reclaimable objects for extent status tree
  2014-08-27 15:13   ` Jan Kara
@ 2014-09-03  3:44     ` Theodore Ts'o
  0 siblings, 0 replies; 33+ messages in thread
From: Theodore Ts'o @ 2014-09-03  3:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: Zheng Liu, linux-ext4, Andreas Dilger, Zheng Liu

On Wed, Aug 27, 2014 at 05:13:54PM +0200, Jan Kara wrote:
> 
> What we could do to limit the latency caused by scanning unreclaimable
> extents is to change the shrinker to really stop after inspecting
> nr_to_scan extents, regardless of how many extents we actually reclaimed -
> this is actually how slab shrinkers are designed to work.  We would also
> have to record the logical block where the shrinker stopped scanning the
> inode, and the next shrinker invocation will start scanning at that offset -
> it is enough to store this offset in the superblock; we just have to zero
> it out when we remove the first inode from the list of inodes with cached
> extents.
> 
> What do people think about this?

Yes, I agree this would be a better approach.

					- Ted

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 4/6] ext4: change lru to round-robin in extent status tree shrinker
  2014-09-03  3:37     ` Theodore Ts'o
@ 2014-09-03 15:31       ` Jan Kara
  2014-09-03 20:00         ` Theodore Ts'o
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2014-09-03 15:31 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, Zheng Liu, linux-ext4, Andreas Dilger, Zheng Liu

On Tue 02-09-14 23:37:38, Ted Tso wrote:
> On Wed, Aug 27, 2014 at 05:01:21PM +0200, Jan Kara wrote:
> > On Thu 07-08-14 11:35:51, Zheng Liu wrote:
> >   This comment is not directly related to this patch, but looking into the
> > code made me think about it.  It seems ugly to call __es_shrink() from the
> > internals of ext4_es_insert_extent().  Also, thinking about the locking
> > implications makes me shudder a bit, and finally this may make the pressure
> > on the extent cache artificially bigger because the MM subsystem is not
> > aware of the shrinking you do here.  I would prefer to leave the shrinking
> > to the slab subsystem itself.
> 
> If the allocation fails, we only try to free at most one extent, so
> I don't think it's going to make the slab subsystem that confused; it's
> the equivalent of freeing an entry and then allocating it again.
> 
> > Now, the GFP_ATOMIC allocation we use for the extent cache makes things
> > hard for the slab subsystem, and actually we could fairly easily use
> > GFP_NOFS.  We can just allocate the structure before grabbing i_es_lock
> > with a GFP_NOFS allocation and, in case we don't need the structure, just
> > free it again.  It may introduce some overhead from unnecessary
> > alloc/free, but things get simpler that way (no need for that locked_ei
> > argument for __es_shrink(), no need for internal calls to __es_shrink()
> > from within the filesystem).
> 
> The tricky bit is that even __es_remove_extent() can require a memory
> allocation, and in the worst case, it's possible that
> ext4_es_insert_extent() can require *two* allocations.  For example,
> if you start with a single large extent, and then need to insert a
> subregion with a different set of flags into the already existing
> extent, thus resulting in three extents where you started with one.
  Right, I didn't realize that.

> And in some cases, no allocation is required at all....
> 
> One thing that can help is that so long as we haven't done something
> critical, such as erase a delalloc region, we always release the write
> lock, retry the allocation with GFP_NOFS, and then try the operation
> again.
  Yeah, maybe we could use mempools for this. It should make the code less
clumsy.

> So we may need to think a bit about the best way to improve
> this, although it is a separate topic from making the shrinker less
> heavyweight.
  Agreed, it's a separate topic.

> >   Nothing seems to prevent reclaim from freeing the inode after we drop
> > s_es_lock. So we could use freed memory. I don't think we want to pin the
> > inode here by grabbing a refcount since we don't want to deal with iput()
> > in the shrinker (that could mean having to delete the inode from shrinker
> > context). But what we could do is to grab ei->i_es_lock before dropping
> > s_es_lock. Since ext4_es_remove_extent() called from ext4_clear_inode()
> > always grabs i_es_lock, we are protected from inode being freed while we
> > hold that lock. But please add comments about this both to the
> > __es_shrink() and ext4_es_remove_extent().
> 
> Something like this should work, yes?
  Yes, this should work.  I would just add a comment to
ext4_es_remove_extent() about the fact that ext4_clear_inode() requires
grabbing i_es_lock, so that we don't do some clever optimization in the
future and break these lifetime rules...

Also one question:

> -		if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
> -			nr_skipped++;
> -			spin_lock(&sbi->s_es_lock);
>  			__ext4_es_list_add(sbi, ei);
> +			if (spin_is_contended(&sbi->s_es_lock)) {
> +				spin_unlock(&sbi->s_es_lock);
> +				spin_lock(&sbi->s_es_lock);
> +			}
  Why not cond_resched_lock(&sbi->s_es_lock)?


								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 4/6] ext4: change lru to round-robin in extent status tree shrinker
  2014-09-03 15:31       ` Jan Kara
@ 2014-09-03 20:00         ` Theodore Ts'o
  2014-09-03 22:14           ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: Theodore Ts'o @ 2014-09-03 20:00 UTC (permalink / raw)
  To: Jan Kara; +Cc: Zheng Liu, linux-ext4, Andreas Dilger, Zheng Liu

On Wed, Sep 03, 2014 at 05:31:22PM +0200, Jan Kara wrote:
> Also one question:
> 
> > -		if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
> > -			nr_skipped++;
> > -			spin_lock(&sbi->s_es_lock);
> >  			__ext4_es_list_add(sbi, ei);
> > +			if (spin_is_contended(&sbi->s_es_lock)) {
> > +				spin_unlock(&sbi->s_es_lock);
> > +				spin_lock(&sbi->s_es_lock);
> > +			}
>   Why not cond_resched_lock(&sbi->s_es_lock)?

I didn't think we were allowed to reschedule or sleep while in
shrinker context?

					- Ted

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 4/6] ext4: change lru to round-robin in extent status tree shrinker
  2014-09-03 20:00         ` Theodore Ts'o
@ 2014-09-03 22:14           ` Jan Kara
  2014-09-03 22:38             ` Theodore Ts'o
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2014-09-03 22:14 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, Zheng Liu, linux-ext4, Andreas Dilger, Zheng Liu

On Wed 03-09-14 16:00:39, Ted Tso wrote:
> On Wed, Sep 03, 2014 at 05:31:22PM +0200, Jan Kara wrote:
> > Also one question:
> > 
> > > -		if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
> > > -			nr_skipped++;
> > > -			spin_lock(&sbi->s_es_lock);
> > >  			__ext4_es_list_add(sbi, ei);
> > > +			if (spin_is_contended(&sbi->s_es_lock)) {
> > > +				spin_unlock(&sbi->s_es_lock);
> > > +				spin_lock(&sbi->s_es_lock);
> > > +			}
> >   Why not cond_resched_lock(&sbi->s_es_lock)?
> 
> I didn't think we were allowed to reschedule or sleep while in
> shrinker context?
  I believe we are allowed to sleep in the shrinker if the appropriate gfp
flags are set (__GFP_WAIT), and we enter the extent cache shrinker only if
__GFP_FS is set, which guarantees __GFP_WAIT.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 4/6] ext4: change lru to round-robin in extent status tree shrinker
  2014-09-03 22:14           ` Jan Kara
@ 2014-09-03 22:38             ` Theodore Ts'o
       [not found]               ` <20140904071553.GA26930@quack.suse.cz>
  0 siblings, 1 reply; 33+ messages in thread
From: Theodore Ts'o @ 2014-09-03 22:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: Zheng Liu, linux-ext4, Andreas Dilger, Zheng Liu

On Thu, Sep 04, 2014 at 12:14:02AM +0200, Jan Kara wrote:
> > I didn't think we were allowed to reschedule or sleep while in
> > shrinker context?
>   I believe we are allowed to sleep in the shrinker if appropriate gfp
> flags are set (__GFP_WAIT) and we enter extent cache shrinker only if
> __GFP_FS is set which guarantees __GFP_WAIT.

I must be missing something.  How is this guaranteed?

I can see how we can determine what gfp_mask was used in the
allocation which triggered the shrinker callback, by checking
shrink->gfp_mask, but I don't see anything that guarantees that the
extent cache shrinker is only entered if __GFP_FS is set.

I guess we could add something like:

	if ((shrink->gfp_mask & __GFP_WAIT) == 0)
		return 0;

to the beginning of ext4_es_scan(), but we're not doing that at the
moment.

       	     	      	      	      	 - Ted

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 2/6] ext4: track extent status tree shrinker delay statistics
  2014-08-27 13:26   ` Jan Kara
@ 2014-09-04 12:10     ` Zheng Liu
  2014-09-04 15:49       ` Theodore Ts'o
  0 siblings, 1 reply; 33+ messages in thread
From: Zheng Liu @ 2014-09-04 12:10 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Zheng Liu

On Wed, Aug 27, 2014 at 03:26:07PM +0200, Jan Kara wrote:
> On Thu 07-08-14 11:35:49, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > This commit adds some statistics to the extent status tree shrinker.  The
> > purpose of adding these is that we want to collect more details when we
> > encounter a stall caused by the extent status tree shrinker.  Here we
> > count the following statistics:
> >   stats:
> >     the number of all objects on all extent status trees
> >     the number of reclaimable objects on lru list
> >     cache hits/misses
> >     the last sorted interval
> >     the number of inodes on lru list
> >   average:
> >     scan time for shrinking some objects
> >     the number of shrunk objects
> >   maximum:
> >     the inode that has max nr. of objects on lru list
> >     the maximum scan time for shrinking some objects
> > 
> > The output looks like below:
> >   $ cat /proc/fs/ext4/sda1/es_shrinker_info
> >   stats:
> >     28228 objects
> >     6341 reclaimable objects
> >     5281/631 cache hits/misses
> >     586 ms last sorted interval
> >     250 inodes on lru list
> >   average:
> >     153 us scan time
> >     128 shrunk objects
> >   maximum:
> >     255 inode (255 objects, 198 reclaimable)
> >     125723 us max scan time
>   When reading through this again, I realized that we probably don't want
> this file in /proc/fs/ext4 but rather in /sys/kernel/debug/ext4 because
> it's really a debugging interface and we don't want any tools to start
> using it.

Fair enough.  I will fix it in next version.

Thanks,
                                                - Zheng


* Re: [PATCH v3 3/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  2014-09-02  2:43   ` Theodore Ts'o
@ 2014-09-04 13:04     ` Zheng Liu
  2014-09-04 15:54       ` Theodore Ts'o
  0 siblings, 1 reply; 33+ messages in thread
From: Zheng Liu @ 2014-09-04 13:04 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Andreas Dilger, Jan Kara, Zheng Liu

On Mon, Sep 01, 2014 at 10:43:50PM -0400, Theodore Ts'o wrote:
> On Thu, Aug 07, 2014 at 11:35:50AM +0800, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > Currently the extent status tree doesn't cache an extent hole when a
> > write looks up the extent tree to check whether a block has been
> > allocated.  In this case, we don't put the extent hole in the extent
> > cache because later this extent might be removed and a new delayed
> > extent might be added back.  But this causes a defect when we do a
> > lot of writes: if the extent hole is not cached, the following writes
> > also need to access the extent tree to see whether a block has been
> > allocated, which results in cache misses.  This commit fixes this
> > defect.  Meanwhile, if an inode has no extents at all, this extent
> > hole will also be cached.
> 
> Hi Zheng,
> 
> I thought the reason why we have the EXT4_GET_BLOCKS_NO_PUT_HOLE flag
> is because in ext4_da_map_blocks(), if there is a hole, we will be
> immediately following it up with a call to ext4_es_insert_extent() to
> fill in the hole with the EXTENT_STATUS_DELAYED flag.  The only time
> we don't is if we run into an ENOSPC error.
> 
> Am I missing something?

Yes, you are right.  The flag's purpose is to do the work you describe
above.  But as the commit log describes, this flag brings a huge number
of cache misses when an empty file is written.  So it might be worth
putting a hole in the cache when delalloc is enabled.

Regards,
                                                - Zheng


* Re: [PATCH v3 3/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  2014-08-27 13:55   ` Jan Kara
@ 2014-09-04 13:05     ` Zheng Liu
  0 siblings, 0 replies; 33+ messages in thread
From: Zheng Liu @ 2014-09-04 13:05 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Zheng Liu

On Wed, Aug 27, 2014 at 03:55:16PM +0200, Jan Kara wrote:
> On Thu 07-08-14 11:35:50, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > Currently the extent status tree doesn't cache an extent hole when a
> > write looks up the extent tree to check whether a block has been
> > allocated.  In this case, we don't put the extent hole in the extent
> > cache because later this extent might be removed and a new delayed
> > extent might be added back.  But this causes a defect when we do a
> > lot of writes: if the extent hole is not cached, the following writes
> > also need to access the extent tree to see whether a block has been
> > allocated, which results in cache misses.  This commit fixes this
> > defect.  Meanwhile, if an inode has no extents at all, this extent
> > hole will also be cached.
> > 
> > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> > Cc: Jan Kara <jack@suse.cz>
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
>   So I agree with this change, it certainly doesn't make things worse. But
> when looking into ext4_ext_put_gap_in_cache() I have one question: That
> function uses ext4_find_delalloc_range() to check that the intended range
> doesn't have any delalloc blocks in it. However it doesn't make any effort
> to put a smaller hole in cache - for example if we have blocks allocated
> like:
> 
> 5-6 = delalloc
> 7-10 = allocated
> 
> and ext4_ext_put_gap_in_cache() is called for block 0, the function won't
> put anything into cache although it could put range 0-4 in the cache as a
> hole. Why is that?

Oops, it should put range 0-4 in the cache.  Thanks for pointing it out.
I will fix it.

Regards,
                                                - Zheng


* Re: [PATCH v3 4/6] ext4: change lru to round-robin in extent status tree shrinker
       [not found]               ` <20140904071553.GA26930@quack.suse.cz>
@ 2014-09-04 15:44                 ` Theodore Ts'o
  2014-09-08 15:47                   ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: Theodore Ts'o @ 2014-09-04 15:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: Zheng Liu, linux-ext4, Andreas Dilger, Zheng Liu

On Thu, Sep 04, 2014 at 09:15:53AM +0200, Jan Kara wrote:
>   Ah, sorry. I was mistaken and thought we do check for __GFP_FS in
> ext4_es_scan() but we don't and we don't need to. But thinking about it
> again - if we're going to always scan at most nr_to_scan cache entries,
> there's probably no need to reduce s_es_lock latency by playing with
> spinlock_contended(), right?

I'm more generally worried about contention on s_es_lock, since it's a file
system-wide spinlock that is grabbed whenever we need to add or remove
an inode from the es_list.  So if someone were to try to run AIM7
benchmark on a large core count machine on an ext4 file system mounted
on a ramdisk, this lock would likely show up.

Now, this might not be a realistic scenario, but it's a common way to
test for fs scalability without having a super-expensive RAID array, so
it shows up often in FAST papers over the last couple of years, for
example.

So my thinking was that if we do run into contention, the shrinker
thread should always yield, since if it gets slowed down slightly,
there's no harm done.  Hmmm.... OTOH, the extra cache line bounce
could potentially be worse, so maybe it would be better to let the
shrinker thread do its thing and then get out of there.
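The yield-on-contention idea being weighed here looks roughly like the following. This is kernel-style pseudocode, not the actual patch: only `s_es_lock` is taken from the discussion, and the list and field names are assumptions.

```c
/* Pseudocode sketch: walk the per-superblock inode list under
 * s_es_lock, but back off whenever another CPU is spinning on it. */
retry:
	spin_lock(&sbi->s_es_lock);
	list_for_each_entry(ei, &sbi->s_es_list, i_es_list) {
		/* ... reclaim some extents from this inode ... */
		if (spin_is_contended(&sbi->s_es_lock)) {
			spin_unlock(&sbi->s_es_lock);
			cond_resched();		/* let the other CPU in */
			goto retry;
		}
	}
	spin_unlock(&sbi->s_es_lock);
```

A real implementation would also need to remember its position (the round-robin cursor from patch 4) rather than restart from the head, and as noted above each unlock/relock bounces the lock's cache line between CPUs, which may cost more than simply holding the lock for a short scan.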

					- Ted


* Re: [PATCH v3 2/6] ext4: track extent status tree shrinker delay statistics
  2014-09-04 12:10     ` Zheng Liu
@ 2014-09-04 15:49       ` Theodore Ts'o
  0 siblings, 0 replies; 33+ messages in thread
From: Theodore Ts'o @ 2014-09-04 15:49 UTC (permalink / raw)
  To: Jan Kara, linux-ext4, Andreas Dilger, Zheng Liu

On Thu, Sep 04, 2014 at 08:10:48PM +0800, Zheng Liu wrote:
> >   When reading through this again, I realized that we probably don't want
> > this file in /proc/fs/ext4 but rather in /sys/kernel/debug/ext4 because
> > it's really a debugging interface and we don't want any tools to start
> > using it.
> 
> Fair enough.  I will fix it in next version.

I was actually OK with leaving it in /proc, since we have a number of
other debugging files in /proc/fs/ext4/<dev> already, and I'm not sure
spreading ext4's footprint so that we have some files in /proc, some
in /sys, and some in debugfs is worth decreasing the possibility that
some tool would start depending on the contents of that file --- I
can't really imagine why non-fs developers would ever care about such
statistics.

I probably *would* be in favor of creating a new config option which
disables /proc/fs/ext4/<dev>/mballoc_groups and
/proc/fs/ext4/<dev>/es_shrinker_info, though, to help people who care
about shrinking the total footprint of the Linux kernel for embedded
applications (kernel tinification).

						- Ted


* Re: [PATCH v3 3/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  2014-09-04 13:04     ` Zheng Liu
@ 2014-09-04 15:54       ` Theodore Ts'o
  0 siblings, 0 replies; 33+ messages in thread
From: Theodore Ts'o @ 2014-09-04 15:54 UTC (permalink / raw)
  To: linux-ext4, Andreas Dilger, Jan Kara, Zheng Liu

On Thu, Sep 04, 2014 at 09:04:48PM +0800, Zheng Liu wrote:
> > I thought the reason why we have the EXT4_GET_BLOCKS_NO_PUT_HOLE flag
> > is because in ext4_da_map_blocks(), if there is a hole, we will be
> > immediately following it up with a call to ext4_es_insert_extent() to
> > fill in the hole with the EXTENT_STATUS_DELAYED flag.  The only time
> > we don't is if we run into an ENOSPC error.
> > 
> > Am I missing something?
> 
> Yes, you are right.  The flag's purpose is to do the work you describe
> above.  But as the commit log describes, this flag brings a huge number
> of cache misses when an empty file is written.

Right, and I was trying to figure out why that was happening, because
it looked like in the normal case we would immediately fill in the
hole afterwards.  I was going to try doing some tracing to understand
why this was resulting in a huge number of cache misses, but I haven't
had time to do that investigation yet.  Can you sketch out the
scenario that was leading to so many cache misses?

Thanks,

					- Ted


* Re: [PATCH v3 4/6] ext4: change lru to round-robin in extent status tree shrinker
  2014-09-04 15:44                 ` Theodore Ts'o
@ 2014-09-08 15:47                   ` Jan Kara
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Kara @ 2014-09-08 15:47 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, Zheng Liu, linux-ext4, Andreas Dilger, Zheng Liu

On Thu 04-09-14 11:44:59, Ted Tso wrote:
> On Thu, Sep 04, 2014 at 09:15:53AM +0200, Jan Kara wrote:
> >   Ah, sorry. I was mistaken and thought we do check for __GFP_FS in
> > ext4_es_scan() but we don't and we don't need to. But thinking about it
> > again - if we're going to always scan at most nr_to_scan cache entries,
> > there's probably no need to reduce s_es_lock latency by playing with
> > spinlock_contended(), right?
> 
> I'm more generally worried about contention on s_es_lock, since it's a file
> system-wide spinlock that is grabbed whenever we need to add or remove
> an inode from the es_list.  So if someone were to try to run AIM7
> benchmark on a large core count machine on an ext4 file system mounted
> on a ramdisk, this lock would likely show up.
> 
> Now, this might not be a realistic scenario, but it's a common way to
> test for fs scalability without having a super-expensive RAID array, so
> it shows up often in FAST papers over the last couple of years, for
> example.
> 
> So my thinking was that if we do run into contention, the shrinker
> thread should always yield, since if it gets slowed down slightly,
> there's no harm done.  Hmmm.... OTOH, the extra cache line bounce
> could potentially be worse, so maybe it would be better to let the
> shrinker thread do its thing and then get out of there.
  Yeah. I think cache bouncing limits scalability in a similar way spinlock
itself does so there's no big win in shortening the lock hold times. If
someone is concerned about scalability of our extent cache LRU, we could
use some more fancy LRU implementation like the one implemented in
mm/list_lru.c and used for other fs objects. But I would see that as a
separate step and only once someone can show a benefit...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH v3 0/6] ext4: extents status tree shrinker improvement
  2014-08-07  3:35 [PATCH v3 0/6] ext4: extents status tree shrinker improvement Zheng Liu
                   ` (5 preceding siblings ...)
  2014-08-07  3:35 ` [PATCH v3 6/6] ext4: use a garbage collection algorithm to manage object Zheng Liu
@ 2014-10-20 14:48 ` Theodore Ts'o
  2014-10-21 10:22   ` Jan Kara
  6 siblings, 1 reply; 33+ messages in thread
From: Theodore Ts'o @ 2014-10-20 14:48 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu, Andreas Dilger, Jan Kara

Hi Zheng,

It's been a while since you've posted a revised version of this patch
series.  I believe Jan had suggested some changes which you said you
would fix in the next set of patches.  Do you know when that might be
done?

Note that the first two patches in these series are already the
patches that have been queued for Linus for this merge window.

Thanks,

					- Ted


* Re: [PATCH v3 0/6] ext4: extents status tree shrinker improvement
  2014-10-20 14:48 ` [PATCH v3 0/6] ext4: extents status tree shrinker improvement Theodore Ts'o
@ 2014-10-21 10:22   ` Jan Kara
  2014-10-21 15:58     ` 刘峥(文卿)
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2014-10-21 10:22 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Zheng Liu, linux-ext4, Zheng Liu, Andreas Dilger, Jan Kara

  Hello,

On Mon 20-10-14 10:48:49, Ted Tso wrote:
> It's been a while since you've posted a revised version of this patch
> series.  I believe Jan had suggested some changes which you said you
> would fix in the next set of patches.  Do you know when that might be
> done?
  Yeah, actually if you don't have time just say so. I can push it to
completion somehow (there wasn't that much work left). I need the thing
completed reasonably quickly since I already have some user reports for
openSUSE and I'm somewhat afraid reports for SLES will come pretty soon and
I cannot leave those unresolved for long...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH v3 0/6] ext4: extents status tree shrinker improvement
  2014-10-21 10:22   ` Jan Kara
@ 2014-10-21 15:58     ` 刘峥(文卿)
  2014-11-03 16:10       ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: 刘峥(文卿) @ 2014-10-21 15:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, Zheng Liu, linux-ext4,
	刘峥(文卿),
	Andreas Dilger

Hi Ted, Jan and others,

I am deeply sorry for the delay in my work.  I don't have any objection
to Jan's suggestions.  There is still some other work keeping me busy,
and I can see that I won't have time to finish this for the merge
window.  I'm ashamed of that!

Jan, I would really appreciate it if you were willing to push this
patch set to completion.  Thanks!!!

Regards,
						- Zheng

> On Oct 21, 2014, at 6:22 PM, Jan Kara <jack@suse.cz> wrote:
> 
>  Hello,
> 
> On Mon 20-10-14 10:48:49, Ted Tso wrote:
>> It's been a while since you've posted a revised version of this patch
>> series.  I believe Jan had suggested some changes which you said you
>> would fix in the next set of patches.  Do you know when that might be
>> done?
>  Yeah, actually if you don't have time just say so. I can push it to
> completion somehow (there wasn't that much work left). I need the thing
> completed reasonably quickly since I already have some user reports for
> openSUSE and I'm somewhat afraid reports for SLES will come pretty soon and
> I cannot leave those unresolved for long...
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v3 0/6] ext4: extents status tree shrinker improvement
  2014-10-21 15:58     ` 刘峥(文卿)
@ 2014-11-03 16:10       ` Jan Kara
  2014-11-07  2:38         ` Zheng Liu
  2014-11-13 23:40         ` Theodore Ts'o
  0 siblings, 2 replies; 33+ messages in thread
From: Jan Kara @ 2014-11-03 16:10 UTC (permalink / raw)
  To: 刘峥(文卿)
  Cc: Jan Kara, Theodore Ts'o, Zheng Liu, linux-ext4,
	刘峥(文卿),
	Andreas Dilger

  Hello,

On Tue 21-10-14 23:58:10, 刘峥(文卿) wrote:
> I am deeply sorry for the delay in my work.  I don't have any objection
> to Jan's suggestions.  There is still some other work keeping me busy,
> and I can see that I won't have time to finish this for the merge
> window.  I'm ashamed of that!
> 
> Jan, I would really appreciate it if you were willing to push this
> patch set to completion.  Thanks!!!
  OK, I have updated the patches according to the review Ted and I did.
It survives a basic fsstress run.  How were you testing your patches?  I
should probably also gather some statistics etc...

								Honza

> > On Oct 21, 2014, at 6:22 PM, Jan Kara <jack@suse.cz> wrote:
> > 
> >  Hello,
> > 
> > On Mon 20-10-14 10:48:49, Ted Tso wrote:
> >> It's been a while since you've posted a revised version of this patch
> >> series.  I believe Jan had suggested some changes which you said you
> >> would fix in the next set of patches.  Do you know when that might be
> >> done?
> >  Yeah, actually if you don't have time just say so. I can push it to
> > completion somehow (there wasn't that much work left). I need the thing
> > completed reasonably quickly since I already have some user reports for
> > openSUSE and I'm somewhat afraid reports for SLES will come pretty soon and
> > I cannot leave those unresolved for long...
> > 
> > 								Honza
> > -- 
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH v3 0/6] ext4: extents status tree shrinker improvement
  2014-11-03 16:10       ` Jan Kara
@ 2014-11-07  2:38         ` Zheng Liu
  2014-11-13 23:40         ` Theodore Ts'o
  1 sibling, 0 replies; 33+ messages in thread
From: Zheng Liu @ 2014-11-07  2:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: 刘峥(文卿),
	Theodore Ts'o, linux-ext4, 刘峥(文卿),
	Andreas Dilger

On Mon, Nov 03, 2014 at 05:10:46PM +0100, Jan Kara wrote:
>   Hello,
> 
> On Tue 21-10-14 23:58:10, 刘峥(文卿) wrote:
> > I am deeply sorry for the delay in my work.  I don't have any objection
> > to Jan's suggestions.  There is still some other work keeping me busy,
> > and I can see that I won't have time to finish this for the merge
> > window.  I'm ashamed of that!
> > 
> > Jan, I would really appreciate it if you were willing to push this
> > patch set to completion.  Thanks!!!
>   OK, I have updated the patches according to the review Ted and I did.
> It survives a basic fsstress run.  How were you testing your patches?  I
> should probably also gather some statistics etc...

Thanks!!

Here are my test cases for performance.

case 1:
  [global]
  ioengine=psync
  bs=4k
  directory=/mnt/sda1
  thread
  group_reporting
  fallocate=0
  direct=0
  filesize=10g
  size=20g
  runtime=300
  
  [io]
  rw=randwrite:32
  rw_sequencer=sequential
  numjobs=25
  nrfiles=10

case 2:
  [global]
  ioengine=psync
  bs=4k
  directory=/mnt/sda1
  group_reporting
  fallocate=0
  direct=0
  filesize=10g
  size=20g
  runtime=300
  
  [io]
  rw=write:4k
  numjobs=15
  nrfiles=20000

To get a really fragmented extent status tree, I disable extent status
tree merging when I run these test cases.  The patch looks like the
following:

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 09fd576..0946f50 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -351,6 +351,7 @@ static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
 static int ext4_es_can_be_merged(struct extent_status *es1,
 				 struct extent_status *es2)
 {
+#if 0
 	if (ext4_es_status(es1) != ext4_es_status(es2))
 		return 0;
 
@@ -376,6 +377,7 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
 	/* we need to check delayed extent is without unwritten status */
 	if (ext4_es_is_delayed(es1) && !ext4_es_is_unwritten(es1))
 		return 1;
+#endif
 
 	return 0;
 }


Meanwhile, the following sysctl parameters are adjusted to keep as much
dirty data in memory as possible.

sudo sysctl vm.dirty_ratio=80
sudo sysctl vm.dirty_background_ratio=60

Thanks,
                                                - Zheng


* Re: [PATCH v3 0/6] ext4: extents status tree shrinker improvement
  2014-11-03 16:10       ` Jan Kara
  2014-11-07  2:38         ` Zheng Liu
@ 2014-11-13 23:40         ` Theodore Ts'o
  1 sibling, 0 replies; 33+ messages in thread
From: Theodore Ts'o @ 2014-11-13 23:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: 刘峥(文卿),
	Zheng Liu, linux-ext4, 刘峥(文卿),
	Andreas Dilger

On Mon, Nov 03, 2014 at 05:10:46PM +0100, Jan Kara wrote:
>   Hello,
> 
> On Tue 21-10-14 23:58:10, 刘峥(文卿) wrote:
> > I am deeply sorry for the delay in my work.  I don't have any objection
> > to Jan's suggestions.  There is still some other work keeping me busy,
> > and I can see that I won't have time to finish this for the merge
> > window.  I'm ashamed of that!
> > 
> > Jan, I would really appreciate it if you were willing to push this
> > patch set to completion.  Thanks!!!
>   OK, I have updated the patches according to the review Ted and I did.
> It survives a basic fsstress run.  How were you testing your patches?  I
> should probably also gather some statistics etc...

Hi Jan,

I was wondering how your updated extent status tree shrinker patches
are coming along.  Are you comfortable sending them out for
review/merging?  Sorry for nagging, but we're already at 3.18-rc4, and
it would be great if we could get these merged for this merge window.

Thanks!

					- Ted


end of thread, other threads:[~2014-11-13 23:40 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-07  3:35 [PATCH v3 0/6] ext4: extents status tree shrinker improvement Zheng Liu
2014-08-07  3:35 ` [PATCH v3 1/6] ext4: improve extents status tree trace point Zheng Liu
2014-09-02  2:25   ` Theodore Ts'o
2014-08-07  3:35 ` [PATCH v3 2/6] ext4: track extent status tree shrinker delay statistics Zheng Liu
2014-08-27 13:26   ` Jan Kara
2014-09-04 12:10     ` Zheng Liu
2014-09-04 15:49       ` Theodore Ts'o
2014-08-07  3:35 ` [PATCH v3 3/6] ext4: cache extent hole in extent status tree for ext4_da_map_blocks() Zheng Liu
2014-08-27 13:55   ` Jan Kara
2014-09-04 13:05     ` Zheng Liu
2014-09-02  2:43   ` Theodore Ts'o
2014-09-04 13:04     ` Zheng Liu
2014-09-04 15:54       ` Theodore Ts'o
2014-08-07  3:35 ` [PATCH v3 4/6] ext4: change lru to round-robin in extent status tree shrinker Zheng Liu
2014-08-27 15:01   ` Jan Kara
2014-09-03  3:37     ` Theodore Ts'o
2014-09-03 15:31       ` Jan Kara
2014-09-03 20:00         ` Theodore Ts'o
2014-09-03 22:14           ` Jan Kara
2014-09-03 22:38             ` Theodore Ts'o
     [not found]               ` <20140904071553.GA26930@quack.suse.cz>
2014-09-04 15:44                 ` Theodore Ts'o
2014-09-08 15:47                   ` Jan Kara
2014-08-07  3:35 ` [PATCH v3 5/6] ext4: use a list to track all reclaimable objects for extent status tree Zheng Liu
2014-08-27 15:13   ` Jan Kara
2014-09-03  3:44     ` Theodore Ts'o
2014-08-07  3:35 ` [PATCH v3 6/6] ext4: use a garbage collection algorithm to manage object Zheng Liu
2014-08-27 15:24   ` Jan Kara
2014-10-20 14:48 ` [PATCH v3 0/6] ext4: extents status tree shrinker improvement Theodore Ts'o
2014-10-21 10:22   ` Jan Kara
2014-10-21 15:58     ` 刘峥(文卿)
2014-11-03 16:10       ` Jan Kara
2014-11-07  2:38         ` Zheng Liu
2014-11-13 23:40         ` Theodore Ts'o
