* [PATCH] bcache: consider the fragmentation when update the writeback rate
@ 2021-01-05 3:06 Dongdong Tao
[not found] ` <CAJS8hVK-ZCxJt=E3hwR0hmqPYL1T07_WC_nerb-dZodO+DqtDA@mail.gmail.com>
2021-01-17 7:11 ` kernel test robot
0 siblings, 2 replies; 24+ messages in thread
From: Dongdong Tao @ 2021-01-05 3:06 UTC (permalink / raw)
To: colyli; +Cc: dongdong tao, Kent Overstreet, linux-bcache, linux-kernel
From: dongdong tao <dongdong.tao@canonical.com>
The current way of calculating the writeback rate only considers the
dirty sectors. This usually works fine when the fragmentation
is not high, but it gives us an unreasonably small rate when
very few dirty sectors have consumed a lot of dirty buckets. In some
cases the dirty buckets can reach CUTOFF_WRITEBACK_SYNC while the
dirty data (sectors) has not even reached the writeback_percent; the
writeback rate will then still be the minimum value (4k), causing all
writes to get stuck in non-writeback mode because of the slow writeback.
We accelerate the rate in 3 stages with different aggressiveness:
the first stage starts when the dirty bucket percentage rises above
BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), the third at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default
the first stage tries to write back the amount of dirty data
in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) seconds,
the second stage tries to write back the amount of dirty data in one bucket
in (1 / (dirty_buckets_percent - 57)) * 200 milliseconds, and the third
stage tries to write back the amount of dirty data in one bucket in
(1 / (dirty_buckets_percent - 64)) * 20 milliseconds.
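To illustrate the staged calculation above, here is a minimal standalone
sketch (illustrative only, not the patch code; the helper name
fragment_rate() and the sample numbers are made up, while the thresholds
and the fp_term defaults 1, 5 and 50 are the ones described above):

#include <stdint.h>
#include <stdio.h>

#define THRESHOLD_LOW	50
#define THRESHOLD_MID	57
#define THRESHOLD_HIGH	64

/*
 * Hypothetical helper: returns the fragmentation-based proportional term
 * (in sectors per second) for a given dirty-bucket percentage, total
 * dirty sectors and number of dirty buckets.
 */
static int64_t fragment_rate(int64_t in_use_percent, int64_t dirty_sectors,
			     int64_t dirty_buckets)
{
	int64_t fp_term;

	if (in_use_percent <= THRESHOLD_LOW || dirty_sectors <= 0 ||
	    dirty_buckets <= 0)
		return 0;

	if (in_use_percent <= THRESHOLD_MID)
		/* one bucket's dirty data per 1/(pct - 50) seconds */
		fp_term = 1 * (in_use_percent - THRESHOLD_LOW);
	else if (in_use_percent <= THRESHOLD_HIGH)
		/* one bucket's dirty data per 200/(pct - 57) milliseconds */
		fp_term = 5 * (in_use_percent - THRESHOLD_MID);
	else
		/* one bucket's dirty data per 20/(pct - 64) milliseconds */
		fp_term = 50 * (in_use_percent - THRESHOLD_HIGH);

	/*
	 * dirty_sectors / dirty_buckets is the average dirty data per
	 * bucket; a rate of avg * fp_term sectors/sec means one bucket's
	 * average dirty data is retired every 1/fp_term seconds.
	 */
	return (dirty_sectors / dirty_buckets) * fp_term;
}

int main(void)
{
	/* e.g. 60% buckets in use, 100 dirty sectors per bucket on average */
	printf("%lld sectors/sec\n",
	       (long long)fragment_rate(60, 100 * 200000, 200000));
	return 0;
}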
Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
---
drivers/md/bcache/bcache.h | 3 +++
drivers/md/bcache/sysfs.c | 18 ++++++++++++++++
drivers/md/bcache/writeback.c | 39 +++++++++++++++++++++++++++++++++++
drivers/md/bcache/writeback.h | 4 ++++
4 files changed, 64 insertions(+)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d57f48307e6..f8e892208bae 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -385,6 +385,9 @@ struct cached_dev {
unsigned int writeback_rate_update_seconds;
unsigned int writeback_rate_i_term_inverse;
unsigned int writeback_rate_p_term_inverse;
+ unsigned int writeback_rate_fp_term_low;
+ unsigned int writeback_rate_fp_term_mid;
+ unsigned int writeback_rate_fp_term_high;
unsigned int writeback_rate_minimum;
enum stop_on_failure stop_when_cache_set_failed;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 554e3afc9b68..130df9406171 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -121,6 +121,9 @@ rw_attribute(writeback_rate);
rw_attribute(writeback_rate_update_seconds);
rw_attribute(writeback_rate_i_term_inverse);
rw_attribute(writeback_rate_p_term_inverse);
+rw_attribute(writeback_rate_fp_term_low);
+rw_attribute(writeback_rate_fp_term_mid);
+rw_attribute(writeback_rate_fp_term_high);
rw_attribute(writeback_rate_minimum);
read_attribute(writeback_rate_debug);
@@ -205,6 +208,9 @@ SHOW(__bch_cached_dev)
var_print(writeback_rate_update_seconds);
var_print(writeback_rate_i_term_inverse);
var_print(writeback_rate_p_term_inverse);
+ var_print(writeback_rate_fp_term_low);
+ var_print(writeback_rate_fp_term_mid);
+ var_print(writeback_rate_fp_term_high);
var_print(writeback_rate_minimum);
if (attr == &sysfs_writeback_rate_debug) {
@@ -331,6 +337,15 @@ STORE(__cached_dev)
sysfs_strtoul_clamp(writeback_rate_p_term_inverse,
dc->writeback_rate_p_term_inverse,
1, UINT_MAX);
+ sysfs_strtoul_clamp(writeback_rate_fp_term_low,
+ dc->writeback_rate_fp_term_low,
+ 1, UINT_MAX);
+ sysfs_strtoul_clamp(writeback_rate_fp_term_mid,
+ dc->writeback_rate_fp_term_mid,
+ 1, UINT_MAX);
+ sysfs_strtoul_clamp(writeback_rate_fp_term_high,
+ dc->writeback_rate_fp_term_high,
+ 1, UINT_MAX);
sysfs_strtoul_clamp(writeback_rate_minimum,
dc->writeback_rate_minimum,
1, UINT_MAX);
@@ -502,6 +517,9 @@ static struct attribute *bch_cached_dev_files[] = {
&sysfs_writeback_rate_update_seconds,
&sysfs_writeback_rate_i_term_inverse,
&sysfs_writeback_rate_p_term_inverse,
+ &sysfs_writeback_rate_fp_term_low,
+ &sysfs_writeback_rate_fp_term_mid,
+ &sysfs_writeback_rate_fp_term_high,
&sysfs_writeback_rate_minimum,
&sysfs_writeback_rate_debug,
&sysfs_io_errors,
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index a129e4d2707c..a21485448e8e 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -88,6 +88,42 @@ static void __update_writeback_rate(struct cached_dev *dc)
int64_t integral_scaled;
uint32_t new_rate;
+ /*
+ * We need to consider the number of dirty buckets as well
+ * when calculating the proportional_scaled, Otherwise we might
+ * have an unreasonable small writeback rate at a highly fragmented situation
+ * when very few dirty sectors consumed a lot dirty buckets, the
+ * worst case is when dirty_data reached writeback_percent and
+ * dirty buckets reached to cutoff_writeback_sync, but the rate
+ * still will be at the minimum value, which will cause the write
+ * stuck at a non-writeback mode.
+ */
+ struct cache_set *c = dc->disk.c;
+
+ int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets;
+
+ if (c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) {
+ int64_t fragment = (dirty_buckets * c->cache->sb.bucket_size) / dirty;
+ int64_t fp_term;
+ int64_t fps;
+
+ if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
+ fp_term = dc->writeback_rate_fp_term_low *
+ (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
+ } else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
+ fp_term = dc->writeback_rate_fp_term_mid *
+ (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
+ } else {
+ fp_term = dc->writeback_rate_fp_term_high *
+ (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
+ }
+ fps = (dirty / dirty_buckets) * fp_term;
+ if (fragment > 3 && fps > proportional_scaled) {
+ //Only overrite the p when fragment > 3
+ proportional_scaled = fps;
+ }
+ }
+
if ((error < 0 && dc->writeback_rate_integral > 0) ||
(error > 0 && time_before64(local_clock(),
dc->writeback_rate.next + NSEC_PER_MSEC))) {
@@ -984,6 +1020,9 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
dc->writeback_rate_p_term_inverse = 40;
+ dc->writeback_rate_fp_term_low = 1;
+ dc->writeback_rate_fp_term_mid = 5;
+ dc->writeback_rate_fp_term_high = 50;
dc->writeback_rate_i_term_inverse = 10000;
WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 3f1230e22de0..02b2f9df73f6 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -16,6 +16,10 @@
#define BCH_AUTO_GC_DIRTY_THRESHOLD 50
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
+
#define BCH_DIRTY_INIT_THRD_MAX 64
/*
* 14 (16384ths) is chosen here as something that each backing device
--
2.17.1
[parent not found: <CAJS8hVK-ZCxJt=E3hwR0hmqPYL1T07_WC_nerb-dZodO+DqtDA@mail.gmail.com>]
* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
[not found] ` <CAJS8hVK-ZCxJt=E3hwR0hmqPYL1T07_WC_nerb-dZodO+DqtDA@mail.gmail.com>
@ 2021-01-05 4:33 ` Coly Li
[not found] ` <CAJS8hVL2B=RZr8H4jFbz=bX9k_E9ur7kTeue6BJwzm4pwv1+qQ@mail.gmail.com>
0 siblings, 1 reply; 24+ messages in thread
From: Coly Li @ 2021-01-05 4:33 UTC (permalink / raw)
To: Dongdong Tao
Cc: Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
open list, Gavin Guo, Gerald Yang, Trent Lloyd,
Dominique Poulain, Dongsheng Yang
On 1/5/21 11:44 AM, Dongdong Tao wrote:
> Hey Coly,
>
> This is the second version of the patch, please allow me to explain a
> bit for this patch:
>
> We accelerate the rate in 3 stages with different aggressiveness, the
> first stage starts when dirty buckets percent reach above
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW(50), the second is
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID(57) and the third is
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH(64). By default the first stage
> tries to writeback the amount of dirty data in one bucket (on average)
> in (1 / (dirty_buckets_percent - 50)) second, the second stage tries to
> writeback the amount of dirty data in one bucket in (1 /
> (dirty_buckets_percent - 57)) * 200 millisecond. The third stage tries
> to writeback the amount of dirty data in one bucket in (1 /
> (dirty_buckets_percent - 64)) * 20 millisecond.
>
> As we can see, there are two writeback aggressiveness increasing
> strategies, one strategy is with the increasing of the stage, the first
> stage is the easy-going phase whose initial rate is trying to write back
> dirty data of one bucket in 1 second, the second stage is a bit more
> aggressive, the initial rate tries to writeback the dirty data of one
> bucket in 200 ms, the last stage is even more, whose initial rate tries
> to writeback the dirty data of one bucket in 20 ms. This makes sense,
> one reason is that if the preceding stage couldn’t get the fragmentation
> to a fine stage, then the next stage should increase the aggressiveness
> properly, also it is because the later stage is closer to the
> bch_cutoff_writeback_sync. Another aggressiveness increasing strategy is
> with the increasing of dirty bucket percent within each stage, the first
> strategy controls the initial writeback rate of each stage, while this
> one increases the rate based on the initial rate, which is initial_rate
> * (dirty bucket percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
>
> The initial rate can be controlled by 3 parameters
> writeback_rate_fp_term_low, writeback_rate_fp_term_mid,
> writeback_rate_fp_term_high, they are default 1, 5, 50, users can adjust
> them based on their needs.
>
> The reason that I choose 50, 57, 64 as the threshold value is because
> the GC must be triggered at least once during each stage due to the
> “sectors_to_gc” being set to 1/16 (6.25 %) of the total cache size. So,
> the hope is that the first and second stage can get us back to good
> shape in most situations by smoothly writing back the dirty data without
> giving too much stress to the backing devices, but it might still enter
> the third stage if the bucket consumption is very aggressive.
>
> This patch uses (dirty / dirty_buckets) * fp_term to calculate the rate,
> this formula means that we want to writeback (dirty / dirty_buckets) in
> 1/fp_term second, fp_term is calculated by above aggressiveness
> controller, “dirty” is the current dirty sectors, “dirty_buckets” is the
> current dirty buckets, so (dirty / dirty_buckets) means the average
> dirty sectors in one bucket, the value is between 0 and 1024 for the
> default setting, so this formula basically gives a hint to reclaim
> one bucket in 1/fp_term second. By using this semantic, we can have a
> lower writeback rate when the amount of dirty data is decreasing and
> overcome the fact that dirty buckets number is always increasing unless
> GC happens.
>
> Compared to the first patch:
> The first patch tries to write back all the data in 40 seconds,
> this will result in a very high writeback rate when the amount of dirty
> data is big, this is mostly true for the large cache devices. The basic
> problem is that the semantic of this patch is not ideal, because we
> don’t really need to writeback all dirty data in order to solve this
> issue, and the instant large increase of the rate is something I feel we
> should better avoid (I like things to be smoothly changed unless no
> choice: )).
>
> Before I get to this new patch(which I believe should be optimal for me
> atm), there have been many tuning/testing iterations, eg. I’ve tried to
> tune the algorithm to writeback ⅓ of the dirty data in a certain amount
> of seconds, writeback 1/fragment of the dirty data in a certain amount
> of seconds, writeback all the dirty data only in those error_buckets
> (error buckets = dirty buckets - 50% of the total buckets) in a certain
> amount of time. However, those all turn out not to be ideal, only the
> semantic of the patch makes much sense for me and allows me to control
> the rate in a more precise way.
>
> Testing data:
> I'll provide the visualized testing data in the next couple of days
> with 1TB NVME devices cache but with HDD as backing device since it's
> what we mostly used in production env.
> I have the data for 400GB NVME, let me prepare it and take it for you to
> review.
[snipped]
Hi Dongdong,
Thanks for the update and continuous effort on this idea.
Please keep in mind the writeback rate is just an advisory rate for the
writeback throughput; in a real workload, changing the writeback rate
number does not obviously change the writeback throughput.
Currently I feel this is an interesting and promising idea for your
patch, but I am not able to say whether it will take effect in a real
workload, so we do need convincing performance data on a real workload
and configuration.
Of course I may also help with the benchmark, but my to-do list is long
enough that it may take a very long time.
Thanks.
Coly Li
* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
2021-01-05 3:06 [PATCH] bcache: consider the fragmentation when update the writeback rate Dongdong Tao
[not found] ` <CAJS8hVK-ZCxJt=E3hwR0hmqPYL1T07_WC_nerb-dZodO+DqtDA@mail.gmail.com>
@ 2021-01-17 7:11 ` kernel test robot
1 sibling, 0 replies; 24+ messages in thread
From: kernel test robot @ 2021-01-17 7:11 UTC (permalink / raw)
To: Dongdong Tao, colyli
Cc: kbuild-all, dongdong tao, Kent Overstreet, linux-bcache, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 6040 bytes --]
Hi Dongdong,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linus/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Dongdong-Tao/bcache-consider-the-fragmentation-when-update-the-writeback-rate/20210105-110903
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git e71ba9452f0b5b2e8dc8aa5445198cd9214a6a62
config: i386-randconfig-a002-20200806 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
# https://github.com/0day-ci/linux/commit/7777fef68d1401235db42dd0d59c5c3dba3d42d3
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Dongdong-Tao/bcache-consider-the-fragmentation-when-update-the-writeback-rate/20210105-110903
git checkout 7777fef68d1401235db42dd0d59c5c3dba3d42d3
# save the attached .config to linux build tree
make W=1 ARCH=i386
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
ld: drivers/md/bcache/writeback.o: in function `__update_writeback_rate':
>> drivers/md/bcache/writeback.c:106: undefined reference to `__divdi3'
>> ld: drivers/md/bcache/writeback.c:120: undefined reference to `__divdi3'
vim +106 drivers/md/bcache/writeback.c
60
61 static void __update_writeback_rate(struct cached_dev *dc)
62 {
63 /*
64 * PI controller:
65 * Figures out the amount that should be written per second.
66 *
67 * First, the error (number of sectors that are dirty beyond our
68 * target) is calculated. The error is accumulated (numerically
69 * integrated).
70 *
71 * Then, the proportional value and integral value are scaled
72 * based on configured values. These are stored as inverses to
73 * avoid fixed point math and to make configuration easy-- e.g.
74 * the default value of 40 for writeback_rate_p_term_inverse
75 * attempts to write at a rate that would retire all the dirty
76 * blocks in 40 seconds.
77 *
78 * The writeback_rate_i_inverse value of 10000 means that 1/10000th
79 * of the error is accumulated in the integral term per second.
80 * This acts as a slow, long-term average that is not subject to
81 * variations in usage like the p term.
82 */
83 int64_t target = __calc_target_rate(dc);
84 int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
85 int64_t error = dirty - target;
86 int64_t proportional_scaled =
87 div_s64(error, dc->writeback_rate_p_term_inverse);
88 int64_t integral_scaled;
89 uint32_t new_rate;
90
91 /*
92 * We need to consider the number of dirty buckets as well
93 * when calculating the proportional_scaled, Otherwise we might
94 * have an unreasonable small writeback rate at a highly fragmented situation
95 * when very few dirty sectors consumed a lot dirty buckets, the
96 * worst case is when dirty_data reached writeback_percent and
97 * dirty buckets reached to cutoff_writeback_sync, but the rate
98 * still will be at the minimum value, which will cause the write
99 * stuck at a non-writeback mode.
100 */
101 struct cache_set *c = dc->disk.c;
102
103 int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets;
104
105 if (c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) {
> 106 int64_t fragment = (dirty_buckets * c->cache->sb.bucket_size) / dirty;
107 int64_t fp_term;
108 int64_t fps;
109
110 if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
111 fp_term = dc->writeback_rate_fp_term_low *
112 (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
113 } else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
114 fp_term = dc->writeback_rate_fp_term_mid *
115 (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
116 } else {
117 fp_term = dc->writeback_rate_fp_term_high *
118 (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
119 }
> 120 fps = (dirty / dirty_buckets) * fp_term;
121 if (fragment > 3 && fps > proportional_scaled) {
122 //Only overrite the p when fragment > 3
123 proportional_scaled = fps;
124 }
125 }
126
127 if ((error < 0 && dc->writeback_rate_integral > 0) ||
128 (error > 0 && time_before64(local_clock(),
129 dc->writeback_rate.next + NSEC_PER_MSEC))) {
130 /*
131 * Only decrease the integral term if it's more than
132 * zero. Only increase the integral term if the device
133 * is keeping up. (Don't wind up the integral
134 * ineffectively in either case).
135 *
136 * It's necessary to scale this by
137 * writeback_rate_update_seconds to keep the integral
138 * term dimensioned properly.
139 */
140 dc->writeback_rate_integral += error *
141 dc->writeback_rate_update_seconds;
142 }
143
144 integral_scaled = div_s64(dc->writeback_rate_integral,
145 dc->writeback_rate_i_term_inverse);
146
147 new_rate = clamp_t(int32_t, (proportional_scaled + integral_scaled),
148 dc->writeback_rate_minimum, NSEC_PER_SEC);
149
150 dc->writeback_rate_proportional = proportional_scaled;
151 dc->writeback_rate_integral_scaled = integral_scaled;
152 dc->writeback_rate_change = new_rate -
153 atomic_long_read(&dc->writeback_rate.rate);
154 atomic_long_set(&dc->writeback_rate.rate, new_rate);
155 dc->writeback_rate_target = target;
156 }
157
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30207 bytes --]
* [PATCH] bcache: consider the fragmentation when update the writeback rate
@ 2021-01-19 12:56 Dongdong Tao
2021-01-19 14:06 ` Coly Li
0 siblings, 1 reply; 24+ messages in thread
From: Dongdong Tao @ 2021-01-19 12:56 UTC (permalink / raw)
To: colyli; +Cc: dongdong tao, Kent Overstreet, linux-bcache, linux-kernel
From: dongdong tao <dongdong.tao@canonical.com>
The current way of calculating the writeback rate only considers the
dirty sectors. This usually works fine when the fragmentation
is not high, but it gives us an unreasonably small rate when
very few dirty sectors have consumed a lot of dirty buckets. In some
cases the dirty buckets can reach CUTOFF_WRITEBACK_SYNC while the
dirty data (sectors) has not even reached the writeback_percent; the
writeback rate will then still be the minimum value (4k), causing all
writes to get stuck in non-writeback mode because of the slow writeback.
We accelerate the rate in 3 stages with different aggressiveness:
the first stage starts when the dirty bucket percentage rises above
BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), the third at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default
the first stage tries to write back the amount of dirty data
in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) seconds,
the second stage tries to write back the amount of dirty data in one bucket
in (1 / (dirty_buckets_percent - 57)) * 100 milliseconds, and the third
stage tries to write back the amount of dirty data in one bucket in
(1 / (dirty_buckets_percent - 64)) milliseconds. It is ok to be very
aggressive in the last stage, as it is our last chance to pull things back.
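For clarity, the per-stage intervals above follow from the proportional
term computed below, fps = (dirty / dirty_buckets) * fp_term, with the
rate expressed in sectors per second (this working is only a sanity
check of the default fp_term values 1, 10 and 1000, not part of the
patch code):

  seconds to write back one bucket's average dirty data
    = (dirty / dirty_buckets) / fps = 1 / fp_term

  stage 1: fp_term = 1    * (pct - 50)  ->  1 / (pct - 50) seconds
  stage 2: fp_term = 10   * (pct - 57)  ->  100 / (pct - 57) milliseconds
  stage 3: fp_term = 1000 * (pct - 64)  ->  1 / (pct - 64) milliseconds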
A new option, writeback_consider_fragment, controls whether this
feature is on or off; it is on by default.
Lastly, below is the performance data for all the testing results,
including the data from a production environment:
https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
---
drivers/md/bcache/bcache.h | 4 ++++
drivers/md/bcache/sysfs.c | 22 ++++++++++++++++++
drivers/md/bcache/writeback.c | 42 +++++++++++++++++++++++++++++++++++
drivers/md/bcache/writeback.h | 4 ++++
4 files changed, 72 insertions(+)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d57f48307e6..d7a84327b7f1 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -373,6 +373,7 @@ struct cached_dev {
unsigned int partial_stripes_expensive:1;
unsigned int writeback_metadata:1;
unsigned int writeback_running:1;
+ unsigned int writeback_consider_fragment:1;
unsigned char writeback_percent;
unsigned int writeback_delay;
@@ -385,6 +386,9 @@ struct cached_dev {
unsigned int writeback_rate_update_seconds;
unsigned int writeback_rate_i_term_inverse;
unsigned int writeback_rate_p_term_inverse;
+ unsigned int writeback_rate_fp_term_low;
+ unsigned int writeback_rate_fp_term_mid;
+ unsigned int writeback_rate_fp_term_high;
unsigned int writeback_rate_minimum;
enum stop_on_failure stop_when_cache_set_failed;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 00a520c03f41..136899beadba 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -117,10 +117,14 @@ rw_attribute(writeback_running);
rw_attribute(writeback_percent);
rw_attribute(writeback_delay);
rw_attribute(writeback_rate);
+rw_attribute(writeback_consider_fragment);
rw_attribute(writeback_rate_update_seconds);
rw_attribute(writeback_rate_i_term_inverse);
rw_attribute(writeback_rate_p_term_inverse);
+rw_attribute(writeback_rate_fp_term_low);
+rw_attribute(writeback_rate_fp_term_mid);
+rw_attribute(writeback_rate_fp_term_high);
rw_attribute(writeback_rate_minimum);
read_attribute(writeback_rate_debug);
@@ -195,6 +199,7 @@ SHOW(__bch_cached_dev)
var_printf(bypass_torture_test, "%i");
var_printf(writeback_metadata, "%i");
var_printf(writeback_running, "%i");
+ var_printf(writeback_consider_fragment, "%i");
var_print(writeback_delay);
var_print(writeback_percent);
sysfs_hprint(writeback_rate,
@@ -205,6 +210,9 @@ SHOW(__bch_cached_dev)
var_print(writeback_rate_update_seconds);
var_print(writeback_rate_i_term_inverse);
var_print(writeback_rate_p_term_inverse);
+ var_print(writeback_rate_fp_term_low);
+ var_print(writeback_rate_fp_term_mid);
+ var_print(writeback_rate_fp_term_high);
var_print(writeback_rate_minimum);
if (attr == &sysfs_writeback_rate_debug) {
@@ -303,6 +311,7 @@ STORE(__cached_dev)
sysfs_strtoul_bool(bypass_torture_test, dc->bypass_torture_test);
sysfs_strtoul_bool(writeback_metadata, dc->writeback_metadata);
sysfs_strtoul_bool(writeback_running, dc->writeback_running);
+ sysfs_strtoul_bool(writeback_consider_fragment, dc->writeback_consider_fragment);
sysfs_strtoul_clamp(writeback_delay, dc->writeback_delay, 0, UINT_MAX);
sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
@@ -331,6 +340,15 @@ STORE(__cached_dev)
sysfs_strtoul_clamp(writeback_rate_p_term_inverse,
dc->writeback_rate_p_term_inverse,
1, UINT_MAX);
+ sysfs_strtoul_clamp(writeback_rate_fp_term_low,
+ dc->writeback_rate_fp_term_low,
+ 1, UINT_MAX);
+ sysfs_strtoul_clamp(writeback_rate_fp_term_mid,
+ dc->writeback_rate_fp_term_mid,
+ 1, UINT_MAX);
+ sysfs_strtoul_clamp(writeback_rate_fp_term_high,
+ dc->writeback_rate_fp_term_high,
+ 1, UINT_MAX);
sysfs_strtoul_clamp(writeback_rate_minimum,
dc->writeback_rate_minimum,
1, UINT_MAX);
@@ -499,9 +517,13 @@ static struct attribute *bch_cached_dev_files[] = {
&sysfs_writeback_delay,
&sysfs_writeback_percent,
&sysfs_writeback_rate,
+ &sysfs_writeback_consider_fragment,
&sysfs_writeback_rate_update_seconds,
&sysfs_writeback_rate_i_term_inverse,
&sysfs_writeback_rate_p_term_inverse,
+ &sysfs_writeback_rate_fp_term_low,
+ &sysfs_writeback_rate_fp_term_mid,
+ &sysfs_writeback_rate_fp_term_high,
&sysfs_writeback_rate_minimum,
&sysfs_writeback_rate_debug,
&sysfs_io_errors,
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index a129e4d2707c..9d440166c6bf 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -88,6 +88,44 @@ static void __update_writeback_rate(struct cached_dev *dc)
int64_t integral_scaled;
uint32_t new_rate;
+ /*
+ * We need to consider the number of dirty buckets as well
+ * when calculating the proportional_scaled, Otherwise we might
+ * have an unreasonable small writeback rate at a highly fragmented situation
+ * when very few dirty sectors consumed a lot dirty buckets, the
+ * worst case is when dirty buckets reached cutoff_writeback_sync and
+ * dirty data is still not even reached to writeback percent, so the rate
+ * still will be at the minimum value, which will cause the write
+ * stuck at a non-writeback mode.
+ */
+ struct cache_set *c = dc->disk.c;
+
+ int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets;
+
+ if (dc->writeback_consider_fragment &&
+ c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) {
+ int64_t fragment =
+ div_s64((dirty_buckets * c->cache->sb.bucket_size), dirty);
+ int64_t fp_term;
+ int64_t fps;
+
+ if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
+ fp_term = dc->writeback_rate_fp_term_low *
+ (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
+ } else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
+ fp_term = dc->writeback_rate_fp_term_mid *
+ (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
+ } else {
+ fp_term = dc->writeback_rate_fp_term_high *
+ (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
+ }
+ fps = div_s64(dirty, dirty_buckets) * fp_term;
+ if (fragment > 3 && fps > proportional_scaled) {
+ //Only overrite the p when fragment > 3
+ proportional_scaled = fps;
+ }
+ }
+
if ((error < 0 && dc->writeback_rate_integral > 0) ||
(error > 0 && time_before64(local_clock(),
dc->writeback_rate.next + NSEC_PER_MSEC))) {
@@ -977,6 +1015,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_metadata = true;
dc->writeback_running = false;
+ dc->writeback_consider_fragment = true;
dc->writeback_percent = 10;
dc->writeback_delay = 30;
atomic_long_set(&dc->writeback_rate.rate, 1024);
@@ -984,6 +1023,9 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
dc->writeback_rate_p_term_inverse = 40;
+ dc->writeback_rate_fp_term_low = 1;
+ dc->writeback_rate_fp_term_mid = 10;
+ dc->writeback_rate_fp_term_high = 1000;
dc->writeback_rate_i_term_inverse = 10000;
WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 3f1230e22de0..02b2f9df73f6 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -16,6 +16,10 @@
#define BCH_AUTO_GC_DIRTY_THRESHOLD 50
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
+
#define BCH_DIRTY_INIT_THRD_MAX 64
/*
* 14 (16384ths) is chosen here as something that each backing device
--
2.17.1
* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
2021-01-19 12:56 Dongdong Tao
@ 2021-01-19 14:06 ` Coly Li
0 siblings, 0 replies; 24+ messages in thread
From: Coly Li @ 2021-01-19 14:06 UTC (permalink / raw)
To: Dongdong Tao; +Cc: dongdong tao, Kent Overstreet, linux-bcache, linux-kernel
On 1/19/21 8:56 PM, Dongdong Tao wrote:
> From: dongdong tao <dongdong.tao@canonical.com>
>
> The current way of calculating the writeback rate only considers the
> dirty sectors. This usually works fine when the fragmentation
> is not high, but it gives us an unreasonably small rate when
> very few dirty sectors have consumed a lot of dirty buckets. In some
> cases the dirty buckets can reach CUTOFF_WRITEBACK_SYNC while the
> dirty data (sectors) has not even reached the writeback_percent; the
> writeback rate will then still be the minimum value (4k), causing all
> writes to get stuck in non-writeback mode because of the slow writeback.
>
> We accelerate the rate in 3 stages with different aggressiveness:
> the first stage starts when the dirty bucket percentage rises above
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), the third at
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default
> the first stage tries to write back the amount of dirty data
> in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) seconds,
> the second stage tries to write back the amount of dirty data in one bucket
> in (1 / (dirty_buckets_percent - 57)) * 100 milliseconds, and the third
> stage tries to write back the amount of dirty data in one bucket in
> (1 / (dirty_buckets_percent - 64)) milliseconds. It is ok to be very
> aggressive in the last stage, as it is our last chance to pull things back.
>
> A new option, writeback_consider_fragment, controls whether this
> feature is on or off; it is on by default.
>
> Lastly, below is the performance data for all the testing results,
> including the data from a production environment:
> https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
>
> Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
Hi Dongdong,
Overall the patch now looks OK to me; just a few minor comments, please
check them inline.
> ---
> drivers/md/bcache/bcache.h | 4 ++++
> drivers/md/bcache/sysfs.c | 22 ++++++++++++++++++
> drivers/md/bcache/writeback.c | 42 +++++++++++++++++++++++++++++++++++
> drivers/md/bcache/writeback.h | 4 ++++
> 4 files changed, 72 insertions(+)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 1d57f48307e6..d7a84327b7f1 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -373,6 +373,7 @@ struct cached_dev {
> unsigned int partial_stripes_expensive:1;
> unsigned int writeback_metadata:1;
> unsigned int writeback_running:1;
> + unsigned int writeback_consider_fragment:1;
> unsigned char writeback_percent;
> unsigned int writeback_delay;
>
> @@ -385,6 +386,9 @@ struct cached_dev {
> unsigned int writeback_rate_update_seconds;
> unsigned int writeback_rate_i_term_inverse;
> unsigned int writeback_rate_p_term_inverse;
> + unsigned int writeback_rate_fp_term_low;
> + unsigned int writeback_rate_fp_term_mid;
> + unsigned int writeback_rate_fp_term_high;
> unsigned int writeback_rate_minimum;
>
> enum stop_on_failure stop_when_cache_set_failed;
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index 00a520c03f41..136899beadba 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -117,10 +117,14 @@ rw_attribute(writeback_running);
> rw_attribute(writeback_percent);
> rw_attribute(writeback_delay);
> rw_attribute(writeback_rate);
> +rw_attribute(writeback_consider_fragment);
>
> rw_attribute(writeback_rate_update_seconds);
> rw_attribute(writeback_rate_i_term_inverse);
> rw_attribute(writeback_rate_p_term_inverse);
> +rw_attribute(writeback_rate_fp_term_low);
> +rw_attribute(writeback_rate_fp_term_mid);
> +rw_attribute(writeback_rate_fp_term_high);
> rw_attribute(writeback_rate_minimum);
> read_attribute(writeback_rate_debug);
>
> @@ -195,6 +199,7 @@ SHOW(__bch_cached_dev)
> var_printf(bypass_torture_test, "%i");
> var_printf(writeback_metadata, "%i");
> var_printf(writeback_running, "%i");
> + var_printf(writeback_consider_fragment, "%i");
> var_print(writeback_delay);
> var_print(writeback_percent);
> sysfs_hprint(writeback_rate,
> @@ -205,6 +210,9 @@ SHOW(__bch_cached_dev)
> var_print(writeback_rate_update_seconds);
> var_print(writeback_rate_i_term_inverse);
> var_print(writeback_rate_p_term_inverse);
> + var_print(writeback_rate_fp_term_low);
> + var_print(writeback_rate_fp_term_mid);
> + var_print(writeback_rate_fp_term_high);
> var_print(writeback_rate_minimum);
>
> if (attr == &sysfs_writeback_rate_debug) {
> @@ -303,6 +311,7 @@ STORE(__cached_dev)
> sysfs_strtoul_bool(bypass_torture_test, dc->bypass_torture_test);
> sysfs_strtoul_bool(writeback_metadata, dc->writeback_metadata);
> sysfs_strtoul_bool(writeback_running, dc->writeback_running);
> + sysfs_strtoul_bool(writeback_consider_fragment, dc->writeback_consider_fragment);
> sysfs_strtoul_clamp(writeback_delay, dc->writeback_delay, 0, UINT_MAX);
>
> sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
> @@ -331,6 +340,15 @@ STORE(__cached_dev)
> sysfs_strtoul_clamp(writeback_rate_p_term_inverse,
> dc->writeback_rate_p_term_inverse,
> 1, UINT_MAX);
> + sysfs_strtoul_clamp(writeback_rate_fp_term_low,
> + dc->writeback_rate_fp_term_low,
> + 1, UINT_MAX);
> + sysfs_strtoul_clamp(writeback_rate_fp_term_mid,
> + dc->writeback_rate_fp_term_mid,
> + 1, UINT_MAX);
> + sysfs_strtoul_clamp(writeback_rate_fp_term_high,
> + dc->writeback_rate_fp_term_high,
> + 1, UINT_MAX);
Here you should make sure the input values are always in order,
low < mid < high.
Otherwise the calculation might be messed up.
> sysfs_strtoul_clamp(writeback_rate_minimum,
> dc->writeback_rate_minimum,
> 1, UINT_MAX);
> @@ -499,9 +517,13 @@ static struct attribute *bch_cached_dev_files[] = {
> &sysfs_writeback_delay,
> &sysfs_writeback_percent,
> &sysfs_writeback_rate,
> + &sysfs_writeback_consider_fragment,
> &sysfs_writeback_rate_update_seconds,
> &sysfs_writeback_rate_i_term_inverse,
> &sysfs_writeback_rate_p_term_inverse,
> + &sysfs_writeback_rate_fp_term_low,
> + &sysfs_writeback_rate_fp_term_mid,
> + &sysfs_writeback_rate_fp_term_high,
> &sysfs_writeback_rate_minimum,
> &sysfs_writeback_rate_debug,
> &sysfs_io_errors,
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index a129e4d2707c..9d440166c6bf 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -88,6 +88,44 @@ static void __update_writeback_rate(struct cached_dev *dc)
> int64_t integral_scaled;
> uint32_t new_rate;
>
> + /*
> + * We need to consider the number of dirty buckets as well
> + * when calculating the proportional_scaled, Otherwise we might
> + * have an unreasonable small writeback rate at a highly fragmented situation
> + * when very few dirty sectors consumed a lot dirty buckets, the
> + * worst case is when dirty buckets reached cutoff_writeback_sync and
> + * dirty data is still not even reached to writeback percent, so the rate
> + * still will be at the minimum value, which will cause the write
> + * stuck at a non-writeback mode.
> + */
> + struct cache_set *c = dc->disk.c;
> +
> + int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets;
> +
> + if (dc->writeback_consider_fragment &&
> + c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) {
> + int64_t fragment =
> + div_s64((dirty_buckets * c->cache->sb.bucket_size), dirty);
> + int64_t fp_term;
> + int64_t fps;
> +
> + if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
> + fp_term = dc->writeback_rate_fp_term_low *
> + (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
> + } else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
> + fp_term = dc->writeback_rate_fp_term_mid *
> + (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
> + } else {
> + fp_term = dc->writeback_rate_fp_term_high *
> + (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
> + }
> + fps = div_s64(dirty, dirty_buckets) * fp_term;
> + if (fragment > 3 && fps > proportional_scaled) {
> + //Only overrite the p when fragment > 3
The above "//" style is not the recommended bcache coding style.
> + proportional_scaled = fps;
> + }
> + }
> +
> if ((error < 0 && dc->writeback_rate_integral > 0) ||
> (error > 0 && time_before64(local_clock(),
> dc->writeback_rate.next + NSEC_PER_MSEC))) {
> @@ -977,6 +1015,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>
> dc->writeback_metadata = true;
> dc->writeback_running = false;
> + dc->writeback_consider_fragment = true;
> dc->writeback_percent = 10;
> dc->writeback_delay = 30;
> atomic_long_set(&dc->writeback_rate.rate, 1024);
> @@ -984,6 +1023,9 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>
> dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
> dc->writeback_rate_p_term_inverse = 40;
> + dc->writeback_rate_fp_term_low = 1;
> + dc->writeback_rate_fp_term_mid = 10;
> + dc->writeback_rate_fp_term_high = 1000;
Could you please explain a bit how the above 3 numbers were decided?
> dc->writeback_rate_i_term_inverse = 10000;
>
> WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
> index 3f1230e22de0..02b2f9df73f6 100644
> --- a/drivers/md/bcache/writeback.h
> +++ b/drivers/md/bcache/writeback.h
> @@ -16,6 +16,10 @@
>
> #define BCH_AUTO_GC_DIRTY_THRESHOLD 50
>
> +#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50
> +#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57
> +#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
> +
> #define BCH_DIRTY_INIT_THRD_MAX 64
> /*
> * 14 (16384ths) is chosen here as something that each backing device
>
Thanks in advance.
Coly Li
* [PATCH] bcache: consider the fragmentation when update the writeback rate
@ 2020-11-03 12:42 Dongdong Tao
2020-11-04 5:06 ` kernel test robot
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Dongdong Tao @ 2020-11-03 12:42 UTC (permalink / raw)
To: colyli
Cc: gavin.guo, gerald.yang, trent.lloyd, dongdong tao,
Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
open list
From: dongdong tao <dongdong.tao@canonical.com>
The current way of calculating the writeback rate only considers the
dirty sectors. This usually works fine when the fragmentation
is not high, but it gives us an unreasonably small rate when
very few dirty sectors have consumed a lot of dirty buckets. In some
cases the dirty buckets can reach CUTOFF_WRITEBACK_SYNC while the
dirty data (sectors) has not even reached the writeback_percent; the
writeback rate will then still be the minimum value (4k), causing all
writes to get stuck in non-writeback mode because of the slow writeback.
This patch tries to accelerate the writeback rate when the
fragmentation is high. It calculates the proportional_scaled value
as follows:
(dirty_sectors / writeback_rate_p_term_inverse) * fragment
As we can see, a higher fragmentation results in a larger
proportional_scaled value, and thus a larger writeback rate.
The fragment value is calculated as follows:
(dirty_buckets * bucket_size) / dirty_sectors
If you think about it, the value of fragment will always be
inside [1, bucket_size].
This patch only considers the fragmentation when the number of
dirty_buckets has reached a dirty threshold (configurable by
writeback_fragment_percent, default is 50), so bcache will
retain the original behaviour until the dirty buckets reach
the threshold.
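As a rough worked example with made-up numbers: on a 1GB cache with
512KB buckets (2048 buckets of 1024 sectors each), suppose 60% of the
buckets (about 1229) hold dirty data but the dirty data itself is only
100MB (204800 sectors). Then fragment = (1229 * 1024) / 204800 is about
6, and proportional_scaled becomes (204800 / 40) * 6 = 30720 sectors/s
(roughly 15MB/s) instead of the unscaled 5120 sectors/s (roughly
2.5MB/s).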
Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
---
drivers/md/bcache/bcache.h | 1 +
drivers/md/bcache/sysfs.c | 6 ++++++
drivers/md/bcache/writeback.c | 21 +++++++++++++++++++++
3 files changed, 28 insertions(+)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d57f48307e6..87632f7032b6 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -374,6 +374,7 @@ struct cached_dev {
unsigned int writeback_metadata:1;
unsigned int writeback_running:1;
unsigned char writeback_percent;
+ unsigned char writeback_fragment_percent;
unsigned int writeback_delay;
uint64_t writeback_rate_target;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 554e3afc9b68..69499113aef8 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -115,6 +115,7 @@ rw_attribute(stop_when_cache_set_failed);
rw_attribute(writeback_metadata);
rw_attribute(writeback_running);
rw_attribute(writeback_percent);
+rw_attribute(writeback_fragment_percent);
rw_attribute(writeback_delay);
rw_attribute(writeback_rate);
@@ -197,6 +198,7 @@ SHOW(__bch_cached_dev)
var_printf(writeback_running, "%i");
var_print(writeback_delay);
var_print(writeback_percent);
+ var_print(writeback_fragment_percent);
sysfs_hprint(writeback_rate,
wb ? atomic_long_read(&dc->writeback_rate.rate) << 9 : 0);
sysfs_printf(io_errors, "%i", atomic_read(&dc->io_errors));
@@ -308,6 +310,9 @@ STORE(__cached_dev)
sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
0, bch_cutoff_writeback);
+ sysfs_strtoul_clamp(writeback_fragment_percent, dc->writeback_fragment_percent,
+ 0, bch_cutoff_writeback_sync);
+
if (attr == &sysfs_writeback_rate) {
ssize_t ret;
long int v = atomic_long_read(&dc->writeback_rate.rate);
@@ -498,6 +503,7 @@ static struct attribute *bch_cached_dev_files[] = {
&sysfs_writeback_running,
&sysfs_writeback_delay,
&sysfs_writeback_percent,
+ &sysfs_writeback_fragment_percent,
&sysfs_writeback_rate,
&sysfs_writeback_rate_update_seconds,
&sysfs_writeback_rate_i_term_inverse,
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 3c74996978da..34babc89fdf3 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -88,6 +88,26 @@ static void __update_writeback_rate(struct cached_dev *dc)
int64_t integral_scaled;
uint32_t new_rate;
+ /*
+ * We need to consider the number of dirty buckets as well
+ * when calculating the proportional_scaled, Otherwise we might
+ * have an unreasonable small writeback rate at a highly fragmented situation
+ * when very few dirty sectors consumed a lot dirty buckets, the
+ * worst case is when dirty_data reached writeback_percent and
+ * dirty buckets reached to cutoff_writeback_sync, but the rate
+ * still will be at the minimum value, which will cause the write
+ * stuck at a non-writeback mode.
+ */
+ struct cache_set *c = dc->disk.c;
+
+ if (c->gc_stats.in_use > dc->writeback_fragment_percent && dirty > 0) {
+ int64_t dirty_buckets = (c->gc_stats.in_use * c->nbuckets) / 100;
+ int64_t fragment = (dirty_buckets * c->cache->sb.bucket_size) / dirty;
+
+ proportional_scaled =
+ div_s64(dirty, dc->writeback_rate_p_term_inverse) * (fragment);
+ }
+
if ((error < 0 && dc->writeback_rate_integral > 0) ||
(error > 0 && time_before64(local_clock(),
dc->writeback_rate.next + NSEC_PER_MSEC))) {
@@ -969,6 +989,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_metadata = true;
dc->writeback_running = false;
dc->writeback_percent = 10;
+ dc->writeback_fragment_percent = 50;
dc->writeback_delay = 30;
atomic_long_set(&dc->writeback_rate.rate, 1024);
dc->writeback_rate_minimum = 8;
--
2.17.1
* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
2020-11-03 12:42 Dongdong Tao
@ 2020-11-04 5:06 ` kernel test robot
2020-11-05 16:32 ` Coly Li
2020-12-09 2:27 ` Dongsheng Yang
2 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2020-11-04 5:06 UTC (permalink / raw)
To: Dongdong Tao, colyli; +Cc: linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2529 bytes --]
Hi Dongdong,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v5.10-rc2 next-20201103]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Dongdong-Tao/bcache-consider-the-fragmentation-when-update-the-writeback-rate/20201103-204517
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b7cbaf59f62f8ab8f157698f9e31642bff525bd0
config: nds32-randconfig-p002-20201103 (attached as .config)
compiler: nds32le-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/4e57aac94bb0d0822473657d7ef0f3a0455e81f6
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Dongdong-Tao/bcache-consider-the-fragmentation-when-update-the-writeback-rate/20201103-204517
git checkout 4e57aac94bb0d0822473657d7ef0f3a0455e81f6
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=nds32
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
nds32le-linux-ld: arch/nds32/kernel/ex-entry.o: in function `common_exception_handler':
(.text+0xfe): undefined reference to `__trace_hardirqs_off'
(.text+0xfe): relocation truncated to fit: R_NDS32_25_PCREL_RELA against undefined symbol `__trace_hardirqs_off'
nds32le-linux-ld: arch/nds32/kernel/ex-exit.o: in function `no_work_pending':
(.text+0xce): undefined reference to `__trace_hardirqs_off'
nds32le-linux-ld: (.text+0xd2): undefined reference to `__trace_hardirqs_off'
nds32le-linux-ld: (.text+0xd6): undefined reference to `__trace_hardirqs_on'
nds32le-linux-ld: (.text+0xda): undefined reference to `__trace_hardirqs_on'
nds32le-linux-ld: drivers/md/bcache/writeback.o: in function `update_writeback_rate':
>> writeback.c:(.text+0x1ffa): undefined reference to `__divdi3'
>> nds32le-linux-ld: writeback.c:(.text+0x1ffe): undefined reference to `__divdi3'
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34232 bytes --]
* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
2020-11-03 12:42 Dongdong Tao
2020-11-04 5:06 ` kernel test robot
@ 2020-11-05 16:32 ` Coly Li
2020-11-10 4:19 ` Dongdong Tao
2020-12-09 2:27 ` Dongsheng Yang
2 siblings, 1 reply; 24+ messages in thread
From: Coly Li @ 2020-11-05 16:32 UTC (permalink / raw)
To: Dongdong Tao
Cc: gavin.guo, gerald.yang, trent.lloyd, dongdong tao,
Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
open list
On 2020/11/3 20:42, Dongdong Tao wrote:
> From: dongdong tao <dongdong.tao@canonical.com>
>
> The current way of calculating the writeback rate only considers the
> dirty sectors. This usually works fine when the fragmentation
> is not high, but it gives us an unreasonably small rate when
> very few dirty sectors have consumed a lot of dirty buckets. In some
> cases the dirty buckets can reach CUTOFF_WRITEBACK_SYNC while the
> dirty data (sectors) has not even reached the writeback_percent; the
> writeback rate will then still be the minimum value (4k), causing all
> writes to get stuck in non-writeback mode because of the slow writeback.
>
> This patch tries to accelerate the writeback rate when the
> fragmentation is high. It calculates the proportional_scaled value
> as follows:
> (dirty_sectors / writeback_rate_p_term_inverse) * fragment
> As we can see, a higher fragmentation results in a larger
> proportional_scaled value, and thus a larger writeback rate.
> The fragment value is calculated as follows:
> (dirty_buckets * bucket_size) / dirty_sectors
> If you think about it, the value of fragment will always be
> inside [1, bucket_size].
>
> This patch only considers the fragmentation when the number of
> dirty_buckets has reached a dirty threshold (configurable by
> writeback_fragment_percent, default is 50), so bcache will
> retain the original behaviour until the dirty buckets reach
> the threshold.
>
> Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
Hi Dongdong,
Changing the writeback rate does not affect the real throughput;
your change is just increasing the upper limit hint of the writeback
throughput, and the bottleneck is the spinning drive for random I/O.
A good direction would be the moving gc. If the moving gc can work
faster, the situation you mentioned above could be relaxed a lot.
I will NACK this patch unless you have an observable and reproducible
performance number.
Thanks.
Coly Li
> ---
> drivers/md/bcache/bcache.h | 1 +
> drivers/md/bcache/sysfs.c | 6 ++++++
> drivers/md/bcache/writeback.c | 21 +++++++++++++++++++++
> 3 files changed, 28 insertions(+)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 1d57f48307e6..87632f7032b6 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -374,6 +374,7 @@ struct cached_dev {
> unsigned int writeback_metadata:1;
> unsigned int writeback_running:1;
> unsigned char writeback_percent;
> + unsigned char writeback_fragment_percent;
> unsigned int writeback_delay;
>
> uint64_t writeback_rate_target;
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index 554e3afc9b68..69499113aef8 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -115,6 +115,7 @@ rw_attribute(stop_when_cache_set_failed);
> rw_attribute(writeback_metadata);
> rw_attribute(writeback_running);
> rw_attribute(writeback_percent);
> +rw_attribute(writeback_fragment_percent);
> rw_attribute(writeback_delay);
> rw_attribute(writeback_rate);
>
> @@ -197,6 +198,7 @@ SHOW(__bch_cached_dev)
> var_printf(writeback_running, "%i");
> var_print(writeback_delay);
> var_print(writeback_percent);
> + var_print(writeback_fragment_percent);
> sysfs_hprint(writeback_rate,
> wb ? atomic_long_read(&dc->writeback_rate.rate) << 9 : 0);
> sysfs_printf(io_errors, "%i", atomic_read(&dc->io_errors));
> @@ -308,6 +310,9 @@ STORE(__cached_dev)
> sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
> 0, bch_cutoff_writeback);
>
> + sysfs_strtoul_clamp(writeback_fragment_percent, dc->writeback_fragment_percent,
> + 0, bch_cutoff_writeback_sync);
> +
> if (attr == &sysfs_writeback_rate) {
> ssize_t ret;
> long int v = atomic_long_read(&dc->writeback_rate.rate);
> @@ -498,6 +503,7 @@ static struct attribute *bch_cached_dev_files[] = {
> &sysfs_writeback_running,
> &sysfs_writeback_delay,
> &sysfs_writeback_percent,
> + &sysfs_writeback_fragment_percent,
> &sysfs_writeback_rate,
> &sysfs_writeback_rate_update_seconds,
> &sysfs_writeback_rate_i_term_inverse,
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 3c74996978da..34babc89fdf3 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -88,6 +88,26 @@ static void __update_writeback_rate(struct cached_dev *dc)
> int64_t integral_scaled;
> uint32_t new_rate;
>
> + /*
> + * We need to consider the number of dirty buckets as well
> + * when calculating the proportional_scaled, Otherwise we might
> + * have an unreasonable small writeback rate at a highly fragmented situation
> + * when very few dirty sectors consumed a lot dirty buckets, the
> + * worst case is when dirty_data reached writeback_percent and
> + * dirty buckets reached to cutoff_writeback_sync, but the rate
> + * still will be at the minimum value, which will cause the write
> + * stuck at a non-writeback mode.
> + */
> + struct cache_set *c = dc->disk.c;
> +
> + if (c->gc_stats.in_use > dc->writeback_fragment_percent && dirty > 0) {
> + int64_t dirty_buckets = (c->gc_stats.in_use * c->nbuckets) / 100;
> + int64_t fragment = (dirty_buckets * c->cache->sb.bucket_size) / dirty;
> +
> + proportional_scaled =
> + div_s64(dirty, dc->writeback_rate_p_term_inverse) * (fragment);
> + }
> +
> if ((error < 0 && dc->writeback_rate_integral > 0) ||
> (error > 0 && time_before64(local_clock(),
> dc->writeback_rate.next + NSEC_PER_MSEC))) {
> @@ -969,6 +989,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
> dc->writeback_metadata = true;
> dc->writeback_running = false;
> dc->writeback_percent = 10;
> + dc->writeback_fragment_percent = 50;
> dc->writeback_delay = 30;
> atomic_long_set(&dc->writeback_rate.rate, 1024);
> dc->writeback_rate_minimum = 8;
>
* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
2020-11-05 16:32 ` Coly Li
@ 2020-11-10 4:19 ` Dongdong Tao
2020-11-11 8:33 ` Coly Li
0 siblings, 1 reply; 24+ messages in thread
From: Dongdong Tao @ 2020-11-10 4:19 UTC (permalink / raw)
To: Coly Li
Cc: Dongdong Tao, open list:BCACHE (BLOCK LAYER CACHE),
open list, Kent Overstreet
[Sorry again for the SPAM detection]
Thank you for the reply, Coly!
I agree that this patch is not a final solution for fixing the
fragmentation issue, but more like a workaround to alleviate this
problem.
So, part of my intention is to find out how upstream would like to fix
this issue.
I've looked into the moving_gc code and did some debugging/testing;
unfortunately, I don't think it is the solution for this issue either.
moving_gc does not just move the dirty cache but also the clean cache,
so it seems the purpose of moving_gc is to move the data (dirty and
clean) from relatively empty buckets to new buckets, so as to reclaim
the original buckets.
For that purpose, I guess moving gc was more useful back when we
usually didn't have large NVMe devices.
Let's get back to the problem I have. The problem I'm trying to fix is
that you might have lots of buckets (say 70 percent) that are all
fully consumed while those buckets contain only very little dirty data
(say 10 percent). Since GC can't reclaim a bucket that contains any
dirty data, the worst situation is that cache_available_percent can
drop under 30 percent, which prevents all writes from being performed
in writeback mode and thus kills write performance.
So, unlike the moving_gc, I only want to move dirty data around (as
you've suggested :)), but I don't think it's a good idea to change the
behavior of moving_gc.
My current idea is to implement a compaction thread that triggers the
dirty data compaction only under certain circumstances (like when the
fragmentation and dirty buckets are both high), and this thread can be
turned on/off via an extra option, so that people can keep the
original behavior if they want.
This is a rough idea for now; please let me know whether the above
makes sense to you, and any other suggestions would be appreciated!
I also understand the hardest part is making sure the general bcache
performance and functionality still look sane, so it might require
much more time, and it's more likely a feature at the moment.
How to reproduce and observe this issue:
This issue is very easy to reproduce by running the below fio command
against a writeback-mode bcache device:
fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio \
    --iodepth=4 --rw=randrw --rate_iops=95,5 --bs=4k --direct=1 \
    --numjobs=4
Note that the key option for reproducing this issue is
"rate_iops=95,5", which gives 95 percent reads and only 5 percent
writes; this ensures that each bucket contains only a small amount of
dirty data.
Also, it's faster to reproduce with a small cache device. I use a 1GB
cache, but the same happens with a bigger cache device; it's just a
matter of time.
We can observe this issue by monitoring the bcache stats "dirty_data"
and "cache_available_percent". After cache_available_percent has
dropped to 30 percent, we can see that write performance is heavily
degraded using the bpftrace script below:
---
#!/usr/bin/env bpftrace

#include <linux/bio.h>

kprobe:cached_dev_make_request
{
	@start[arg1] = nsecs;
}

kprobe:bio_endio /@start[arg0]/
{
	if (((struct bio *)arg0)->bi_opf & 1) {
		@write = hist(nsecs - @start[arg0]);
		delete(@start[arg0]);
	} else {
		@read = hist(nsecs - @start[arg0]);
		delete(@start[arg0]);
	}
}
---
To run this script:
Save the above bpftrace program as bcache_io_lat.bt, then run it with
"chmod +x bcache_io_lat.bt && ./bcache_io_lat.bt".
By the way, we mainly hit this issue with Ceph; the fio reproducer is
just an easy way to trigger it.
Regards,
Dongdong
On Fri, Nov 6, 2020 at 12:32 AM Coly Li <colyli@suse.de> wrote:
>
> On 2020/11/3 20:42, Dongdong Tao wrote:
> > From: dongdong tao <dongdong.tao@canonical.com>
> >
> > Current way to calculate the writeback rate only considered the
> > dirty sectors, this usually works fine when the fragmentation
> > is not high, but it will give us unreasonable small rate when
> > we are under a situation that very few dirty sectors consumed
> > a lot dirty buckets. In some case, the dirty bucekts can reached
> > to CUTOFF_WRITEBACK_SYNC while the dirty data(sectors) noteven
> > reached the writeback_percent, the writeback rate will still
> > be the minimum value (4k), thus it will cause all the writes to be
> > stucked in a non-writeback mode because of the slow writeback.
> >
> > This patch will try to accelerate the writeback rate when the
> > fragmentation is high. It calculate the propotional_scaled value
> > based on below:
> > (dirty_sectors / writeback_rate_p_term_inverse) * fragment
> > As we can see, the higher fragmentation will result a larger
> > proportional_scaled value, thus cause a larger writeback rate.
> > The fragment value is calculated based on below:
> > (dirty_buckets * bucket_size) / dirty_sectors
> > If you think about it, the value of fragment will be always
> > inside [1, bucket_size].
> >
> > This patch only considers the fragmentation when the number of
> > dirty_buckets reached to a dirty threshold(configurable by
> > writeback_fragment_percent, default is 50), so bcache will
> > remain the original behaviour before the dirty buckets reached
> > the threshold.
> >
> > Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
>
> Hi Dongdong,
>
> Change the writeback rate does not effect the real throughput indeed,
> your change is just increasing the upper limit hint of the writeback
> throughput, the bottle neck is spinning drive for random I/O.
>
> A good direction should be the moving gc. If the moving gc may work
> faster, the situation you mentioned above could be relaxed a lot.
>
> I will NACK this patch unless you may have a observable and reproducible
> performance number.
>
> Thanks.
>
> Coly Li
>
>
> > ---
> > drivers/md/bcache/bcache.h | 1 +
> > drivers/md/bcache/sysfs.c | 6 ++++++
> > drivers/md/bcache/writeback.c | 21 +++++++++++++++++++++
> > 3 files changed, 28 insertions(+)
> >
> > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> > index 1d57f48307e6..87632f7032b6 100644
> > --- a/drivers/md/bcache/bcache.h
> > +++ b/drivers/md/bcache/bcache.h
> > @@ -374,6 +374,7 @@ struct cached_dev {
> > unsigned int writeback_metadata:1;
> > unsigned int writeback_running:1;
> > unsigned char writeback_percent;
> > + unsigned char writeback_fragment_percent;
> > unsigned int writeback_delay;
> >
> > uint64_t writeback_rate_target;
> > diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> > index 554e3afc9b68..69499113aef8 100644
> > --- a/drivers/md/bcache/sysfs.c
> > +++ b/drivers/md/bcache/sysfs.c
> > @@ -115,6 +115,7 @@ rw_attribute(stop_when_cache_set_failed);
> > rw_attribute(writeback_metadata);
> > rw_attribute(writeback_running);
> > rw_attribute(writeback_percent);
> > +rw_attribute(writeback_fragment_percent);
> > rw_attribute(writeback_delay);
> > rw_attribute(writeback_rate);
> >
> > @@ -197,6 +198,7 @@ SHOW(__bch_cached_dev)
> > var_printf(writeback_running, "%i");
> > var_print(writeback_delay);
> > var_print(writeback_percent);
> > + var_print(writeback_fragment_percent);
> > sysfs_hprint(writeback_rate,
> > wb ? atomic_long_read(&dc->writeback_rate.rate) << 9 : 0);
> > sysfs_printf(io_errors, "%i", atomic_read(&dc->io_errors));
> > @@ -308,6 +310,9 @@ STORE(__cached_dev)
> > sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
> > 0, bch_cutoff_writeback);
> >
> > + sysfs_strtoul_clamp(writeback_fragment_percent, dc->writeback_fragment_percent,
> > + 0, bch_cutoff_writeback_sync);
> > +
> > if (attr == &sysfs_writeback_rate) {
> > ssize_t ret;
> > long int v = atomic_long_read(&dc->writeback_rate.rate);
> > @@ -498,6 +503,7 @@ static struct attribute *bch_cached_dev_files[] = {
> > &sysfs_writeback_running,
> > &sysfs_writeback_delay,
> > &sysfs_writeback_percent,
> > + &sysfs_writeback_fragment_percent,
> > &sysfs_writeback_rate,
> > &sysfs_writeback_rate_update_seconds,
> > &sysfs_writeback_rate_i_term_inverse,
> > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> > index 3c74996978da..34babc89fdf3 100644
> > --- a/drivers/md/bcache/writeback.c
> > +++ b/drivers/md/bcache/writeback.c
> > @@ -88,6 +88,26 @@ static void __update_writeback_rate(struct cached_dev *dc)
> > int64_t integral_scaled;
> > uint32_t new_rate;
> >
> > + /*
> > + * We need to consider the number of dirty buckets as well
> > + * when calculating the proportional_scaled, Otherwise we might
> > + * have an unreasonable small writeback rate at a highly fragmented situation
> > + * when very few dirty sectors consumed a lot dirty buckets, the
> > + * worst case is when dirty_data reached writeback_percent and
> > + * dirty buckets reached to cutoff_writeback_sync, but the rate
> > + * still will be at the minimum value, which will cause the write
> > + * stuck at a non-writeback mode.
> > + */
> > + struct cache_set *c = dc->disk.c;
> > +
> > + if (c->gc_stats.in_use > dc->writeback_fragment_percent && dirty > 0) {
> > + int64_t dirty_buckets = (c->gc_stats.in_use * c->nbuckets) / 100;
> > + int64_t fragment = (dirty_buckets * c->cache->sb.bucket_size) / dirty;
> > +
> > + proportional_scaled =
> > + div_s64(dirty, dc->writeback_rate_p_term_inverse) * (fragment);
> > + }
> > +
> > if ((error < 0 && dc->writeback_rate_integral > 0) ||
> > (error > 0 && time_before64(local_clock(),
> > dc->writeback_rate.next + NSEC_PER_MSEC))) {
> > @@ -969,6 +989,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
> > dc->writeback_metadata = true;
> > dc->writeback_running = false;
> > dc->writeback_percent = 10;
> > + dc->writeback_fragment_percent = 50;
> > dc->writeback_delay = 30;
> > atomic_long_set(&dc->writeback_rate.rate, 1024);
> > dc->writeback_rate_minimum = 8;
> >
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
2020-11-10 4:19 ` Dongdong Tao
@ 2020-11-11 8:33 ` Coly Li
0 siblings, 0 replies; 24+ messages in thread
From: Coly Li @ 2020-11-11 8:33 UTC (permalink / raw)
To: Dongdong Tao
Cc: Dongdong Tao, open list:BCACHE (BLOCK LAYER CACHE),
open list, Kent Overstreet
On 2020/11/10 12:19, Dongdong Tao wrote:
> [Sorry again for the SPAM detection]
>
> Thank you the reply Coly!
>
> I agree that this patch is not a final solution for fixing the
> fragmentation issue, but more like a workaround to alleviate this
> problem.
> So, part of my intention is to look for how upstream would like to fix
> this issue.
>
> I've looked into the code of moving_gc part, as well as did some
> debug/test, unfortunately, I think it's not the solution for this
> issue also.
> Because movnig_gc will not just move the dirty cache, but also the
> clean cache, so seems the purpose of moving_gc
> is trying to move the data (dirty and clean) from those relatively
> empty buckets to some new buckets, so that to reclaim the original
> buckets.
> For this purpose, I guess moving gc was more useful at the time when
> we usually don't have large nvme devices.
>
> Let's get back to the problem I have, the problem that I'm trying to
> fix is that you might have lots of buckets (Say 70 percent) that are
> all being fully consumed,
> while those buckets only contain very few dirty data (Say 10 percent
> ), since gc can't reclaim a bucket which contains any dirty data, so
> the worst situation
> is that the cache_availability_percent can drop under 30 percent which
> will make all the write op can't perform in a writeback mode, thus
> kill the performance of writes.
>
> So, unlike the moving_gc, I only want to move dirty data around (as
> you've suggested :)), but I don't think it's a good idea to change the
> behavior of moving_gc.
> My current idea is to implement a compaction thread that triggers the
> dirty data compaction only under some certain circumstances (like when
> the fragmentaion and dirty buckets are both high), and this thread can
> be turned on/off based on an extra option, so that people can keep the
> original behavior if they want.
>
> This is a rough idea now, could you please let me know if the above
> thought makes sense to you or any other suggestions will be
> appreciated!
> I also understand the hardest part is making sure the general bcache
> performance and functionality still look sane,
> so it might require much more time to do it and it's more likely a
> feature atm.
>
> How to reproduce and observe this issue:
> This issue is very easy to repreduce by running below fio command
> against a writeback mode bcache deivce:
>
> fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
> --iodepth=4 --rw=randrw --rate_iops=95,5 --bs=4k --direct=1
> --numjobs=4
>
> Note that the key option to reproduce this issue here is
> "rate_iops=95,5", so that you will have 95 percent read and only 5
> percent write, this is to make sure
> one bucket only contains very few dirty data.
> Also, it's faster to reproduce this with a small cache device, I use
> 1GB cache, but it's same for bigger cache device, just a matter of
> time.
>
> We can observe this issue by monitoring bcache stats "data_dirty" and
> "cache_available_percent", after the cache_available_percent dropped
> to 30 percent,
> we can observe the write performance is hugely degraded by below
> bpftrace script:
> ---
> #!/usr/bin/env bpftrace
>
> #include <linux/bio.h>
>
> kprobe:cached_dev_make_request
> {
> @start[arg1] = nsecs;
> }
>
> kprobe:bio_endio /@start[arg0]/
> {
> if(((struct bio *)arg0)->bi_opf & 1) {
> @write = hist(nsecs - @start[arg0]); delete(@start[arg0]);
> }
> else {
> @read = hist(nsecs - @start[arg0]); delete(@start[arg0]);
> }
> }
> ---
>
> To run this script:
> Save above bpftrace file to bcache_io_lat.bt, then run it with chmod
> +x bcache_io_lat.bt & ./bcache_io_lat.bt
>
> By the way, we mainly hit this issue on ceph, the fio reproducer is
> just an easy way to reproduce it.
>
Hi Dongdong,
I know this situation; it is not the first time it has been mentioned.
What performance number does your patch gain? I wanted to see an
"observable and reproducible performance number", especially the
latency and IOPS for regular I/O requests.
Thanks.
Coly Li
> On Fri, Nov 6, 2020 at 12:32 AM Coly Li <colyli@suse.de> wrote:
>>
>> On 2020/11/3 20:42, Dongdong Tao wrote:
>>> From: dongdong tao <dongdong.tao@canonical.com>
>>>
>>> Current way to calculate the writeback rate only considered the
>>> dirty sectors, this usually works fine when the fragmentation
>>> is not high, but it will give us unreasonable small rate when
>>> we are under a situation that very few dirty sectors consumed
>>> a lot dirty buckets. In some case, the dirty bucekts can reached
>>> to CUTOFF_WRITEBACK_SYNC while the dirty data(sectors) noteven
>>> reached the writeback_percent, the writeback rate will still
>>> be the minimum value (4k), thus it will cause all the writes to be
>>> stucked in a non-writeback mode because of the slow writeback.
>>>
>>> This patch will try to accelerate the writeback rate when the
>>> fragmentation is high. It calculate the propotional_scaled value
>>> based on below:
>>> (dirty_sectors / writeback_rate_p_term_inverse) * fragment
>>> As we can see, the higher fragmentation will result a larger
>>> proportional_scaled value, thus cause a larger writeback rate.
>>> The fragment value is calculated based on below:
>>> (dirty_buckets * bucket_size) / dirty_sectors
>>> If you think about it, the value of fragment will be always
>>> inside [1, bucket_size].
>>>
>>> This patch only considers the fragmentation when the number of
>>> dirty_buckets reached to a dirty threshold(configurable by
>>> writeback_fragment_percent, default is 50), so bcache will
>>> remain the original behaviour before the dirty buckets reached
>>> the threshold.
>>>
>>> Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
>>
>> Hi Dongdong,
>>
>> Change the writeback rate does not effect the real throughput indeed,
>> your change is just increasing the upper limit hint of the writeback
>> throughput, the bottle neck is spinning drive for random I/O.
>>
>> A good direction should be the moving gc. If the moving gc may work
>> faster, the situation you mentioned above could be relaxed a lot.
>>
>> I will NACK this patch unless you may have a observable and reproducible
>> performance number.
>>
>> Thanks.
>>
>> Coly Li
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
2020-11-03 12:42 Dongdong Tao
2020-11-04 5:06 ` kernel test robot
2020-11-05 16:32 ` Coly Li
@ 2020-12-09 2:27 ` Dongsheng Yang
2020-12-09 4:48 ` Dongdong Tao
2 siblings, 1 reply; 24+ messages in thread
From: Dongsheng Yang @ 2020-12-09 2:27 UTC (permalink / raw)
To: Dongdong Tao, colyli
Cc: gavin.guo, gerald.yang, trent.lloyd, dongdong tao,
Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
open list
On 2020/11/3 (Tuesday) at 8:42 PM, Dongdong Tao wrote:
> From: dongdong tao <dongdong.tao@canonical.com>
>
> Current way to calculate the writeback rate only considered the
> dirty sectors, this usually works fine when the fragmentation
> is not high, but it will give us unreasonable small rate when
> we are under a situation that very few dirty sectors consumed
> a lot dirty buckets. In some case, the dirty bucekts can reached
> to CUTOFF_WRITEBACK_SYNC while the dirty data(sectors) noteven
> reached the writeback_percent, the writeback rate will still
> be the minimum value (4k), thus it will cause all the writes to be
> stucked in a non-writeback mode because of the slow writeback.
>
> This patch will try to accelerate the writeback rate when the
> fragmentation is high. It calculate the propotional_scaled value
> based on below:
> (dirty_sectors / writeback_rate_p_term_inverse) * fragment
> As we can see, the higher fragmentation will result a larger
> proportional_scaled value, thus cause a larger writeback rate.
> The fragment value is calculated based on below:
> (dirty_buckets * bucket_size) / dirty_sectors
> If you think about it, the value of fragment will be always
> inside [1, bucket_size].
>
> This patch only considers the fragmentation when the number of
> dirty_buckets reached to a dirty threshold(configurable by
> writeback_fragment_percent, default is 50), so bcache will
> remain the original behaviour before the dirty buckets reached
> the threshold.
>
> Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
> ---
> drivers/md/bcache/bcache.h | 1 +
> drivers/md/bcache/sysfs.c | 6 ++++++
> drivers/md/bcache/writeback.c | 21 +++++++++++++++++++++
> 3 files changed, 28 insertions(+)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 1d57f48307e6..87632f7032b6 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -374,6 +374,7 @@ struct cached_dev {
> unsigned int writeback_metadata:1;
> unsigned int writeback_running:1;
> unsigned char writeback_percent;
> + unsigned char writeback_fragment_percent;
> unsigned int writeback_delay;
>
> uint64_t writeback_rate_target;
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index 554e3afc9b68..69499113aef8 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -115,6 +115,7 @@ rw_attribute(stop_when_cache_set_failed);
> rw_attribute(writeback_metadata);
> rw_attribute(writeback_running);
> rw_attribute(writeback_percent);
> +rw_attribute(writeback_fragment_percent);
Hi Dongdong and Coly,
What is the status of this patch? In my opinion, this is a problem we
need to solve, but can we just reuse the writeback_percent parameter
rather than introduce a new writeback_fragment_percent?
That would mean the semantics of writeback_percent apply to both the
dirty data percent and the dirty bucket percent: when there are more
dirty buckets than (c->nbuckets * writeback_percent), start the
writeback.
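Purely as a hypothetical illustration of that single-knob semantics
(this is not existing bcache code, just a user-space sketch of the
condition described above):
---
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* one knob gates writeback on either dirty data or dirty buckets */
static bool writeback_should_speed_up(int64_t dirty_sectors, int64_t cache_sectors,
				      int64_t dirty_buckets, int64_t nbuckets,
				      unsigned int writeback_percent)
{
	bool dirty_data_high    = dirty_sectors * 100 >= cache_sectors * writeback_percent;
	bool dirty_buckets_high = dirty_buckets * 100 >= nbuckets * writeback_percent;

	return dirty_data_high || dirty_buckets_high;
}

int main(void)
{
	/* 1GB cache: little dirty data, but many dirty buckets */
	printf("%d\n", writeback_should_speed_up(143360, 2097152, 1433, 2048, 10));
	return 0;
}
---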
Thanx
Yang
> rw_attribute(writeback_delay);
> rw_attribute(writeback_rate);
>
> @@ -197,6 +198,7 @@ SHOW(__bch_cached_dev)
> var_printf(writeback_running, "%i");
> var_print(writeback_delay);
> var_print(writeback_percent);
> + var_print(writeback_fragment_percent);
> sysfs_hprint(writeback_rate,
> wb ? atomic_long_read(&dc->writeback_rate.rate) << 9 : 0);
> sysfs_printf(io_errors, "%i", atomic_read(&dc->io_errors));
> @@ -308,6 +310,9 @@ STORE(__cached_dev)
> sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
> 0, bch_cutoff_writeback);
>
> + sysfs_strtoul_clamp(writeback_fragment_percent, dc->writeback_fragment_percent,
> + 0, bch_cutoff_writeback_sync);
> +
> if (attr == &sysfs_writeback_rate) {
> ssize_t ret;
> long int v = atomic_long_read(&dc->writeback_rate.rate);
> @@ -498,6 +503,7 @@ static struct attribute *bch_cached_dev_files[] = {
> &sysfs_writeback_running,
> &sysfs_writeback_delay,
> &sysfs_writeback_percent,
> + &sysfs_writeback_fragment_percent,
> &sysfs_writeback_rate,
> &sysfs_writeback_rate_update_seconds,
> &sysfs_writeback_rate_i_term_inverse,
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 3c74996978da..34babc89fdf3 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -88,6 +88,26 @@ static void __update_writeback_rate(struct cached_dev *dc)
> int64_t integral_scaled;
> uint32_t new_rate;
>
> + /*
> + * We need to consider the number of dirty buckets as well
> + * when calculating the proportional_scaled, Otherwise we might
> + * have an unreasonable small writeback rate at a highly fragmented situation
> + * when very few dirty sectors consumed a lot dirty buckets, the
> + * worst case is when dirty_data reached writeback_percent and
> + * dirty buckets reached to cutoff_writeback_sync, but the rate
> + * still will be at the minimum value, which will cause the write
> + * stuck at a non-writeback mode.
> + */
> + struct cache_set *c = dc->disk.c;
> +
> + if (c->gc_stats.in_use > dc->writeback_fragment_percent && dirty > 0) {
> + int64_t dirty_buckets = (c->gc_stats.in_use * c->nbuckets) / 100;
> + int64_t fragment = (dirty_buckets * c->cache->sb.bucket_size) / dirty;
> +
> + proportional_scaled =
> + div_s64(dirty, dc->writeback_rate_p_term_inverse) * (fragment);
> + }
> +
> if ((error < 0 && dc->writeback_rate_integral > 0) ||
> (error > 0 && time_before64(local_clock(),
> dc->writeback_rate.next + NSEC_PER_MSEC))) {
> @@ -969,6 +989,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
> dc->writeback_metadata = true;
> dc->writeback_running = false;
> dc->writeback_percent = 10;
> + dc->writeback_fragment_percent = 50;
> dc->writeback_delay = 30;
> atomic_long_set(&dc->writeback_rate.rate, 1024);
> dc->writeback_rate_minimum = 8;
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
2020-12-09 2:27 ` Dongsheng Yang
@ 2020-12-09 4:48 ` Dongdong Tao
2020-12-09 7:49 ` Dongsheng Yang
0 siblings, 1 reply; 24+ messages in thread
From: Dongdong Tao @ 2020-12-09 4:48 UTC (permalink / raw)
To: Dongsheng Yang
Cc: Dongdong Tao, Coly Li, Gavin Guo, Gerald Yang, Trent Lloyd,
Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
open list
Hi Dongsheng,
I'm working on it; the next step is to gather some testing data and
upload it (very sorry for the delay...).
Thanks for the comment.
One of the main concerns when alleviating this issue through the
writeback process is that we need to minimize the impact on client IO
performance.
writeback_percent is 10 by default, and starting the accelerated
writeback when dirty buckets reach 10 percent might be a bit too
aggressive, as the writeback cutoff sync is 70 percent.
So I chose to start it when dirty buckets reach 50 percent, so that
this patch only takes effect once the dirty bucket percentage is above
that threshold.
Thanks,
Dongdong
On Wed, Dec 9, 2020 at 10:27 AM Dongsheng Yang
<dongsheng.yang@easystack.cn> wrote:
>
>
> On 2020/11/3 (Tuesday) at 8:42 PM, Dongdong Tao wrote:
> > From: dongdong tao <dongdong.tao@canonical.com>
> >
> > Current way to calculate the writeback rate only considered the
> > dirty sectors, this usually works fine when the fragmentation
> > is not high, but it will give us unreasonable small rate when
> > we are under a situation that very few dirty sectors consumed
> > a lot dirty buckets. In some case, the dirty bucekts can reached
> > to CUTOFF_WRITEBACK_SYNC while the dirty data(sectors) noteven
> > reached the writeback_percent, the writeback rate will still
> > be the minimum value (4k), thus it will cause all the writes to be
> > stucked in a non-writeback mode because of the slow writeback.
> >
> > This patch will try to accelerate the writeback rate when the
> > fragmentation is high. It calculate the propotional_scaled value
> > based on below:
> > (dirty_sectors / writeback_rate_p_term_inverse) * fragment
> > As we can see, the higher fragmentation will result a larger
> > proportional_scaled value, thus cause a larger writeback rate.
> > The fragment value is calculated based on below:
> > (dirty_buckets * bucket_size) / dirty_sectors
> > If you think about it, the value of fragment will be always
> > inside [1, bucket_size].
> >
> > This patch only considers the fragmentation when the number of
> > dirty_buckets reached to a dirty threshold(configurable by
> > writeback_fragment_percent, default is 50), so bcache will
> > remain the original behaviour before the dirty buckets reached
> > the threshold.
> >
> > Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
> > ---
> > drivers/md/bcache/bcache.h | 1 +
> > drivers/md/bcache/sysfs.c | 6 ++++++
> > drivers/md/bcache/writeback.c | 21 +++++++++++++++++++++
> > 3 files changed, 28 insertions(+)
> >
> > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> > index 1d57f48307e6..87632f7032b6 100644
> > --- a/drivers/md/bcache/bcache.h
> > +++ b/drivers/md/bcache/bcache.h
> > @@ -374,6 +374,7 @@ struct cached_dev {
> > unsigned int writeback_metadata:1;
> > unsigned int writeback_running:1;
> > unsigned char writeback_percent;
> > + unsigned char writeback_fragment_percent;
> > unsigned int writeback_delay;
> >
> > uint64_t writeback_rate_target;
> > diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> > index 554e3afc9b68..69499113aef8 100644
> > --- a/drivers/md/bcache/sysfs.c
> > +++ b/drivers/md/bcache/sysfs.c
> > @@ -115,6 +115,7 @@ rw_attribute(stop_when_cache_set_failed);
> > rw_attribute(writeback_metadata);
> > rw_attribute(writeback_running);
> > rw_attribute(writeback_percent);
> > +rw_attribute(writeback_fragment_percent);
>
>
> Hi Dongdong and Coly,
>
> What is the status about this patch? In my opinion, it is a problem
> we need to solve,
>
> but can we just reuse the parameter of writeback_percent, rather than
> introduce a new writeback_fragment_percent?
>
> That means the semantic of writeback_percent will act on dirty data
> percent and dirty bucket percent.
>
> When we found there are dirty buckets more than (c->nbuckets *
> writeback_percent), start the writeback.
>
>
> Thanx
>
> Yang
>
> > rw_attribute(writeback_delay);
> > rw_attribute(writeback_rate);
> >
> > @@ -197,6 +198,7 @@ SHOW(__bch_cached_dev)
> > var_printf(writeback_running, "%i");
> > var_print(writeback_delay);
> > var_print(writeback_percent);
> > + var_print(writeback_fragment_percent);
> > sysfs_hprint(writeback_rate,
> > wb ? atomic_long_read(&dc->writeback_rate.rate) << 9 : 0);
> > sysfs_printf(io_errors, "%i", atomic_read(&dc->io_errors));
> > @@ -308,6 +310,9 @@ STORE(__cached_dev)
> > sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
> > 0, bch_cutoff_writeback);
> >
> > + sysfs_strtoul_clamp(writeback_fragment_percent, dc->writeback_fragment_percent,
> > + 0, bch_cutoff_writeback_sync);
> > +
> > if (attr == &sysfs_writeback_rate) {
> > ssize_t ret;
> > long int v = atomic_long_read(&dc->writeback_rate.rate);
> > @@ -498,6 +503,7 @@ static struct attribute *bch_cached_dev_files[] = {
> > &sysfs_writeback_running,
> > &sysfs_writeback_delay,
> > &sysfs_writeback_percent,
> > + &sysfs_writeback_fragment_percent,
> > &sysfs_writeback_rate,
> > &sysfs_writeback_rate_update_seconds,
> > &sysfs_writeback_rate_i_term_inverse,
> > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> > index 3c74996978da..34babc89fdf3 100644
> > --- a/drivers/md/bcache/writeback.c
> > +++ b/drivers/md/bcache/writeback.c
> > @@ -88,6 +88,26 @@ static void __update_writeback_rate(struct cached_dev *dc)
> > int64_t integral_scaled;
> > uint32_t new_rate;
> >
> > + /*
> > + * We need to consider the number of dirty buckets as well
> > + * when calculating the proportional_scaled, Otherwise we might
> > + * have an unreasonable small writeback rate at a highly fragmented situation
> > + * when very few dirty sectors consumed a lot dirty buckets, the
> > + * worst case is when dirty_data reached writeback_percent and
> > + * dirty buckets reached to cutoff_writeback_sync, but the rate
> > + * still will be at the minimum value, which will cause the write
> > + * stuck at a non-writeback mode.
> > + */
> > + struct cache_set *c = dc->disk.c;
> > +
> > + if (c->gc_stats.in_use > dc->writeback_fragment_percent && dirty > 0) {
> > + int64_t dirty_buckets = (c->gc_stats.in_use * c->nbuckets) / 100;
> > + int64_t fragment = (dirty_buckets * c->cache->sb.bucket_size) / dirty;
> > +
> > + proportional_scaled =
> > + div_s64(dirty, dc->writeback_rate_p_term_inverse) * (fragment);
> > + }
> > +
> > if ((error < 0 && dc->writeback_rate_integral > 0) ||
> > (error > 0 && time_before64(local_clock(),
> > dc->writeback_rate.next + NSEC_PER_MSEC))) {
> > @@ -969,6 +989,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
> > dc->writeback_metadata = true;
> > dc->writeback_running = false;
> > dc->writeback_percent = 10;
> > + dc->writeback_fragment_percent = 50;
> > dc->writeback_delay = 30;
> > atomic_long_set(&dc->writeback_rate.rate, 1024);
> > dc->writeback_rate_minimum = 8;
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
2020-12-09 4:48 ` Dongdong Tao
@ 2020-12-09 7:49 ` Dongsheng Yang
[not found] ` <CAJS8hVLMUS1mdrwC8ovzvMO+HWf4xtXRCNJEghtbtW0g93Kh_g@mail.gmail.com>
0 siblings, 1 reply; 24+ messages in thread
From: Dongsheng Yang @ 2020-12-09 7:49 UTC (permalink / raw)
To: Dongdong Tao
Cc: Dongdong Tao, Coly Li, Gavin Guo, Gerald Yang, Trent Lloyd,
Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
open list
On 2020/12/9 (Wednesday) at 12:48 PM, Dongdong Tao wrote:
> Hi Dongsheng,
>
> I'm working on it, next step I'm gathering some testing data and
> upload (very sorry for the delay...)
> Thanks for the comment.
> One of the main concerns to alleviate this issue with the writeback
> process is that we need to minimize the impact on the client IO
> performance.
> writeback_percent by default is 10, start writeback when dirty buckets
> reached 10 percent might be a bit too aggressive, as the
> writeback_cutoff_sync is 70 percent.
> So i chose to start the writeback when dirty buckets reached 50
> percent so that this patch will only take effect after dirty buckets
> percent is above that
Agreed, reusing writeback_percent would be too aggressive, and it is
also less flexible.
Okay, let's wait for your testing results.
Thanx
>
> Thanks,
> Dongdong
>
>
>
>
> On Wed, Dec 9, 2020 at 10:27 AM Dongsheng Yang
> <dongsheng.yang@easystack.cn> wrote:
>>
>> On 2020/11/3 (Tuesday) at 8:42 PM, Dongdong Tao wrote:
>>> From: dongdong tao <dongdong.tao@canonical.com>
>>>
>>> Current way to calculate the writeback rate only considered the
>>> dirty sectors, this usually works fine when the fragmentation
>>> is not high, but it will give us unreasonable small rate when
>>> we are under a situation that very few dirty sectors consumed
>>> a lot dirty buckets. In some case, the dirty bucekts can reached
>>> to CUTOFF_WRITEBACK_SYNC while the dirty data(sectors) noteven
>>> reached the writeback_percent, the writeback rate will still
>>> be the minimum value (4k), thus it will cause all the writes to be
>>> stucked in a non-writeback mode because of the slow writeback.
>>>
>>> This patch will try to accelerate the writeback rate when the
>>> fragmentation is high. It calculate the propotional_scaled value
>>> based on below:
>>> (dirty_sectors / writeback_rate_p_term_inverse) * fragment
>>> As we can see, the higher fragmentation will result a larger
>>> proportional_scaled value, thus cause a larger writeback rate.
>>> The fragment value is calculated based on below:
>>> (dirty_buckets * bucket_size) / dirty_sectors
>>> If you think about it, the value of fragment will be always
>>> inside [1, bucket_size].
>>>
>>> This patch only considers the fragmentation when the number of
>>> dirty_buckets reached to a dirty threshold(configurable by
>>> writeback_fragment_percent, default is 50), so bcache will
>>> remain the original behaviour before the dirty buckets reached
>>> the threshold.
>>>
>>> Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
>>> ---
>>> drivers/md/bcache/bcache.h | 1 +
>>> drivers/md/bcache/sysfs.c | 6 ++++++
>>> drivers/md/bcache/writeback.c | 21 +++++++++++++++++++++
>>> 3 files changed, 28 insertions(+)
>>>
>>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>>> index 1d57f48307e6..87632f7032b6 100644
>>> --- a/drivers/md/bcache/bcache.h
>>> +++ b/drivers/md/bcache/bcache.h
>>> @@ -374,6 +374,7 @@ struct cached_dev {
>>> unsigned int writeback_metadata:1;
>>> unsigned int writeback_running:1;
>>> unsigned char writeback_percent;
>>> + unsigned char writeback_fragment_percent;
>>> unsigned int writeback_delay;
>>>
>>> uint64_t writeback_rate_target;
>>> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
>>> index 554e3afc9b68..69499113aef8 100644
>>> --- a/drivers/md/bcache/sysfs.c
>>> +++ b/drivers/md/bcache/sysfs.c
>>> @@ -115,6 +115,7 @@ rw_attribute(stop_when_cache_set_failed);
>>> rw_attribute(writeback_metadata);
>>> rw_attribute(writeback_running);
>>> rw_attribute(writeback_percent);
>>> +rw_attribute(writeback_fragment_percent);
>>
>> Hi Dongdong and Coly,
>>
>> What is the status about this patch? In my opinion, it is a problem
>> we need to solve,
>>
>> but can we just reuse the parameter of writeback_percent, rather than
>> introduce a new writeback_fragment_percent?
>>
>> That means the semantic of writeback_percent will act on dirty data
>> percent and dirty bucket percent.
>>
>> When we found there are dirty buckets more than (c->nbuckets *
>> writeback_percent), start the writeback.
>>
>>
>> Thanx
>>
>> Yang
>>
>>> rw_attribute(writeback_delay);
>>> rw_attribute(writeback_rate);
>>>
>>> @@ -197,6 +198,7 @@ SHOW(__bch_cached_dev)
>>> var_printf(writeback_running, "%i");
>>> var_print(writeback_delay);
>>> var_print(writeback_percent);
>>> + var_print(writeback_fragment_percent);
>>> sysfs_hprint(writeback_rate,
>>> wb ? atomic_long_read(&dc->writeback_rate.rate) << 9 : 0);
>>> sysfs_printf(io_errors, "%i", atomic_read(&dc->io_errors));
>>> @@ -308,6 +310,9 @@ STORE(__cached_dev)
>>> sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
>>> 0, bch_cutoff_writeback);
>>>
>>> + sysfs_strtoul_clamp(writeback_fragment_percent, dc->writeback_fragment_percent,
>>> + 0, bch_cutoff_writeback_sync);
>>> +
>>> if (attr == &sysfs_writeback_rate) {
>>> ssize_t ret;
>>> long int v = atomic_long_read(&dc->writeback_rate.rate);
>>> @@ -498,6 +503,7 @@ static struct attribute *bch_cached_dev_files[] = {
>>> &sysfs_writeback_running,
>>> &sysfs_writeback_delay,
>>> &sysfs_writeback_percent,
>>> + &sysfs_writeback_fragment_percent,
>>> &sysfs_writeback_rate,
>>> &sysfs_writeback_rate_update_seconds,
>>> &sysfs_writeback_rate_i_term_inverse,
>>> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>>> index 3c74996978da..34babc89fdf3 100644
>>> --- a/drivers/md/bcache/writeback.c
>>> +++ b/drivers/md/bcache/writeback.c
>>> @@ -88,6 +88,26 @@ static void __update_writeback_rate(struct cached_dev *dc)
>>> int64_t integral_scaled;
>>> uint32_t new_rate;
>>>
>>> + /*
>>> + * We need to consider the number of dirty buckets as well
>>> + * when calculating the proportional_scaled, Otherwise we might
>>> + * have an unreasonable small writeback rate at a highly fragmented situation
>>> + * when very few dirty sectors consumed a lot dirty buckets, the
>>> + * worst case is when dirty_data reached writeback_percent and
>>> + * dirty buckets reached to cutoff_writeback_sync, but the rate
>>> + * still will be at the minimum value, which will cause the write
>>> + * stuck at a non-writeback mode.
>>> + */
>>> + struct cache_set *c = dc->disk.c;
>>> +
>>> + if (c->gc_stats.in_use > dc->writeback_fragment_percent && dirty > 0) {
>>> + int64_t dirty_buckets = (c->gc_stats.in_use * c->nbuckets) / 100;
>>> + int64_t fragment = (dirty_buckets * c->cache->sb.bucket_size) / dirty;
>>> +
>>> + proportional_scaled =
>>> + div_s64(dirty, dc->writeback_rate_p_term_inverse) * (fragment);
>>> + }
>>> +
>>> if ((error < 0 && dc->writeback_rate_integral > 0) ||
>>> (error > 0 && time_before64(local_clock(),
>>> dc->writeback_rate.next + NSEC_PER_MSEC))) {
>>> @@ -969,6 +989,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>>> dc->writeback_metadata = true;
>>> dc->writeback_running = false;
>>> dc->writeback_percent = 10;
>>> + dc->writeback_fragment_percent = 50;
>>> dc->writeback_delay = 30;
>>> atomic_long_set(&dc->writeback_rate.rate, 1024);
>>> dc->writeback_rate_minimum = 8;
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2021-01-19 23:40 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-05 3:06 [PATCH] bcache: consider the fragmentation when update the writeback rate Dongdong Tao
[not found] ` <CAJS8hVK-ZCxJt=E3hwR0hmqPYL1T07_WC_nerb-dZodO+DqtDA@mail.gmail.com>
2021-01-05 4:33 ` Coly Li
[not found] ` <CAJS8hVL2B=RZr8H4jFbz=bX9k_E9ur7kTeue6BJwzm4pwv1+qQ@mail.gmail.com>
2021-01-08 4:05 ` Coly Li
2021-01-08 8:30 ` Dongdong Tao
2021-01-08 8:39 ` Coly Li
2021-01-08 8:47 ` Dongdong Tao
2021-01-14 4:55 ` Dongdong Tao
[not found] ` <CAJS8hVJDaREvpvG4iO+Xs-KQXQKFi7=k29TrG=NXqjyiPpUCZA@mail.gmail.com>
2021-01-14 10:05 ` Coly Li
2021-01-14 12:22 ` Dongdong Tao
2021-01-14 13:31 ` Coly Li
2021-01-14 15:35 ` Dongdong Tao
2021-01-17 7:11 ` kernel test robot
-- strict thread matches above, loose matches on Subject: below --
2021-01-19 12:56 Dongdong Tao
2021-01-19 14:06 ` Coly Li
2020-11-03 12:42 Dongdong Tao
2020-11-04 5:06 ` kernel test robot
2020-11-05 16:32 ` Coly Li
2020-11-10 4:19 ` Dongdong Tao
2020-11-11 8:33 ` Coly Li
2020-12-09 2:27 ` Dongsheng Yang
2020-12-09 4:48 ` Dongdong Tao
2020-12-09 7:49 ` Dongsheng Yang
[not found] ` <CAJS8hVLMUS1mdrwC8ovzvMO+HWf4xtXRCNJEghtbtW0g93Kh_g@mail.gmail.com>
2020-12-14 17:07 ` Coly Li
[not found] ` <CAJS8hVKMjec1cpe_zoeZAJrfY0Pq9bJ51eO6E+g8pgN9jV3Nmw@mail.gmail.com>
2020-12-21 8:08 ` Coly Li