linux-kernel.vger.kernel.org archive mirror
* [PATCH] bcache: consider the fragmentation when update the writeback rate
@ 2020-11-03 12:42 Dongdong Tao
  2020-11-04  5:06 ` kernel test robot
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Dongdong Tao @ 2020-11-03 12:42 UTC (permalink / raw)
  To: colyli
  Cc: gavin.guo, gerald.yang, trent.lloyd, dongdong tao,
	Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list

From: dongdong tao <dongdong.tao@canonical.com>

The current way of calculating the writeback rate only considers
the dirty sectors. This usually works fine when fragmentation is
not high, but it yields an unreasonably small rate when very few
dirty sectors are spread across a large number of dirty buckets.
In some cases the dirty buckets can reach CUTOFF_WRITEBACK_SYNC
while the dirty data (sectors) has not even reached
writeback_percent; the writeback rate will still be the minimum
value (4k), so all writes get stuck in a non-writeback mode
because of the slow writeback.

This patch tries to accelerate the writeback rate when the
fragmentation is high. It calculates the proportional_scaled
value as follows:
(dirty_sectors / writeback_rate_p_term_inverse) * fragment
Higher fragmentation results in a larger proportional_scaled
value, and thus a larger writeback rate. The fragment value is
calculated as:
(dirty_buckets * bucket_size) / dirty_sectors
Note that the value of fragment always falls inside
[1, bucket_size].
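
As an illustration (numbers invented for clarity, not measured):
with bucket_size = 1024 sectors, 500 dirty buckets holding only
4000 dirty sectors give
fragment = (500 * 1024) / 4000 = 128
so proportional_scaled, and with it the writeback rate, is scaled
up 128x compared to the dirty-sectors-only calculation.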

This patch only considers the fragmentation once the number of
dirty buckets reaches a threshold (configurable via
writeback_fragment_percent, default 50), so bcache keeps its
original behaviour until the dirty buckets reach that threshold.

Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
---
 drivers/md/bcache/bcache.h    |  1 +
 drivers/md/bcache/sysfs.c     |  6 ++++++
 drivers/md/bcache/writeback.c | 21 +++++++++++++++++++++
 3 files changed, 28 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d57f48307e6..87632f7032b6 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -374,6 +374,7 @@ struct cached_dev {
 	unsigned int		writeback_metadata:1;
 	unsigned int		writeback_running:1;
 	unsigned char		writeback_percent;
+	unsigned char		writeback_fragment_percent;
 	unsigned int		writeback_delay;
 
 	uint64_t		writeback_rate_target;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 554e3afc9b68..69499113aef8 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -115,6 +115,7 @@ rw_attribute(stop_when_cache_set_failed);
 rw_attribute(writeback_metadata);
 rw_attribute(writeback_running);
 rw_attribute(writeback_percent);
+rw_attribute(writeback_fragment_percent);
 rw_attribute(writeback_delay);
 rw_attribute(writeback_rate);
 
@@ -197,6 +198,7 @@ SHOW(__bch_cached_dev)
 	var_printf(writeback_running,	"%i");
 	var_print(writeback_delay);
 	var_print(writeback_percent);
+	var_print(writeback_fragment_percent);
 	sysfs_hprint(writeback_rate,
 		     wb ? atomic_long_read(&dc->writeback_rate.rate) << 9 : 0);
 	sysfs_printf(io_errors,		"%i", atomic_read(&dc->io_errors));
@@ -308,6 +310,9 @@ STORE(__cached_dev)
 	sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
 			    0, bch_cutoff_writeback);
 
+	sysfs_strtoul_clamp(writeback_fragment_percent, dc->writeback_fragment_percent,
+			    0, bch_cutoff_writeback_sync);
+
 	if (attr == &sysfs_writeback_rate) {
 		ssize_t ret;
 		long int v = atomic_long_read(&dc->writeback_rate.rate);
@@ -498,6 +503,7 @@ static struct attribute *bch_cached_dev_files[] = {
 	&sysfs_writeback_running,
 	&sysfs_writeback_delay,
 	&sysfs_writeback_percent,
+	&sysfs_writeback_fragment_percent,
 	&sysfs_writeback_rate,
 	&sysfs_writeback_rate_update_seconds,
 	&sysfs_writeback_rate_i_term_inverse,
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 3c74996978da..34babc89fdf3 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -88,6 +88,26 @@ static void __update_writeback_rate(struct cached_dev *dc)
 	int64_t integral_scaled;
 	uint32_t new_rate;
 
+	/*
+	 * We need to consider the number of dirty buckets as well
+	 * when calculating proportional_scaled; otherwise we might
+	 * get an unreasonably small writeback rate in a highly
+	 * fragmented situation where very few dirty sectors consume
+	 * a lot of dirty buckets. Worst case: dirty buckets reach
+	 * cutoff_writeback_sync while dirty_data has not even
+	 * reached writeback_percent, yet the rate still sits at the
+	 * minimum, leaving all writes stuck in non-writeback mode.
+	 */
+	struct cache_set *c = dc->disk.c;
+
+	if (c->gc_stats.in_use > dc->writeback_fragment_percent && dirty > 0) {
+		int64_t dirty_buckets = (c->gc_stats.in_use * c->nbuckets) / 100;
+		int64_t fragment = (dirty_buckets * c->cache->sb.bucket_size) / dirty;
+
+		proportional_scaled =
+			div_s64(dirty, dc->writeback_rate_p_term_inverse) * fragment;
+	}
+
 	if ((error < 0 && dc->writeback_rate_integral > 0) ||
 	    (error > 0 && time_before64(local_clock(),
 			 dc->writeback_rate.next + NSEC_PER_MSEC))) {
@@ -969,6 +989,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
 	dc->writeback_metadata		= true;
 	dc->writeback_running		= false;
 	dc->writeback_percent		= 10;
+	dc->writeback_fragment_percent	= 50;
 	dc->writeback_delay		= 30;
 	atomic_long_set(&dc->writeback_rate.rate, 1024);
 	dc->writeback_rate_minimum	= 8;
-- 
2.17.1



* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2020-11-03 12:42 [PATCH] bcache: consider the fragmentation when update the writeback rate Dongdong Tao
@ 2020-11-04  5:06 ` kernel test robot
  2020-11-05 16:32 ` Coly Li
  2020-12-09  2:27 ` Dongsheng Yang
  2 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2020-11-04  5:06 UTC (permalink / raw)
  To: Dongdong Tao, colyli; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2529 bytes --]

Hi Dongdong,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.10-rc2 next-20201103]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Dongdong-Tao/bcache-consider-the-fragmentation-when-update-the-writeback-rate/20201103-204517
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b7cbaf59f62f8ab8f157698f9e31642bff525bd0
config: nds32-randconfig-p002-20201103 (attached as .config)
compiler: nds32le-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/4e57aac94bb0d0822473657d7ef0f3a0455e81f6
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Dongdong-Tao/bcache-consider-the-fragmentation-when-update-the-writeback-rate/20201103-204517
        git checkout 4e57aac94bb0d0822473657d7ef0f3a0455e81f6
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=nds32 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   nds32le-linux-ld: arch/nds32/kernel/ex-entry.o: in function `common_exception_handler':
   (.text+0xfe): undefined reference to `__trace_hardirqs_off'
   (.text+0xfe): relocation truncated to fit: R_NDS32_25_PCREL_RELA against undefined symbol `__trace_hardirqs_off'
   nds32le-linux-ld: arch/nds32/kernel/ex-exit.o: in function `no_work_pending':
   (.text+0xce): undefined reference to `__trace_hardirqs_off'
   nds32le-linux-ld: (.text+0xd2): undefined reference to `__trace_hardirqs_off'
   nds32le-linux-ld: (.text+0xd6): undefined reference to `__trace_hardirqs_on'
   nds32le-linux-ld: (.text+0xda): undefined reference to `__trace_hardirqs_on'
   nds32le-linux-ld: drivers/md/bcache/writeback.o: in function `update_writeback_rate':
>> writeback.c:(.text+0x1ffa): undefined reference to `__divdi3'
>> nds32le-linux-ld: writeback.c:(.text+0x1ffe): undefined reference to `__divdi3'
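
The two undefined __divdi3 references point at the plain '/' on
64-bit operands introduced by the patch: on 32-bit targets such as
nds32, GCC lowers such a division to the libgcc helper __divdi3,
which the kernel does not provide. A minimal sketch of the likely
fix, using the div_s64()/div64_s64() helpers from <linux/math64.h>
(an editor's illustration, not the code the robot tested):

	int64_t dirty_buckets =
		div_s64((int64_t)c->gc_stats.in_use * c->nbuckets, 100);
	int64_t fragment =
		div64_s64(dirty_buckets * c->cache->sb.bucket_size, dirty);

div_s64() takes a 32-bit divisor (fine for the constant 100), while
div64_s64() handles the fully 64-bit division by 'dirty'.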

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34232 bytes --]


* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2020-11-03 12:42 [PATCH] bcache: consider the fragmentation when update the writeback rate Dongdong Tao
  2020-11-04  5:06 ` kernel test robot
@ 2020-11-05 16:32 ` Coly Li
  2020-11-10  4:19   ` Dongdong Tao
  2020-12-09  2:27 ` Dongsheng Yang
  2 siblings, 1 reply; 24+ messages in thread
From: Coly Li @ 2020-11-05 16:32 UTC (permalink / raw)
  To: Dongdong Tao
  Cc: gavin.guo, gerald.yang, trent.lloyd, dongdong tao,
	Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list

On 2020/11/3 20:42, Dongdong Tao wrote:
> From: dongdong tao <dongdong.tao@canonical.com>
> 
> The current way of calculating the writeback rate only considers
> the dirty sectors. This usually works fine when fragmentation is
> not high, but it yields an unreasonably small rate when very few
> dirty sectors are spread across a large number of dirty buckets.
> In some cases the dirty buckets can reach CUTOFF_WRITEBACK_SYNC
> while the dirty data (sectors) has not even reached
> writeback_percent; the writeback rate will still be the minimum
> value (4k), so all writes get stuck in a non-writeback mode
> because of the slow writeback.
> 
> This patch tries to accelerate the writeback rate when the
> fragmentation is high. It calculates the proportional_scaled
> value as follows:
> (dirty_sectors / writeback_rate_p_term_inverse) * fragment
> Higher fragmentation results in a larger proportional_scaled
> value, and thus a larger writeback rate. The fragment value is
> calculated as:
> (dirty_buckets * bucket_size) / dirty_sectors
> Note that the value of fragment always falls inside
> [1, bucket_size].
> 
> This patch only considers the fragmentation once the number of
> dirty buckets reaches a threshold (configurable via
> writeback_fragment_percent, default 50), so bcache keeps its
> original behaviour until the dirty buckets reach that threshold.
> 
> Signed-off-by: dongdong tao <dongdong.tao@canonical.com>

Hi Dongdong,

Changing the writeback rate does not affect the real throughput;
your change just raises the upper-limit hint of the writeback
throughput, while the bottleneck is the spinning drive doing
random I/O.

A better direction would be the moving gc. If the moving gc worked
faster, the situation you mention above could be relaxed a lot.

I will NACK this patch unless you have observable and reproducible
performance numbers.

Thanks.

Coly Li


> [snipped]



* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2020-11-05 16:32 ` Coly Li
@ 2020-11-10  4:19   ` Dongdong Tao
  2020-11-11  8:33     ` Coly Li
  0 siblings, 1 reply; 24+ messages in thread
From: Dongdong Tao @ 2020-11-10  4:19 UTC (permalink / raw)
  To: Coly Li
  Cc: Dongdong Tao, open list:BCACHE (BLOCK LAYER CACHE),
	open list, Kent Overstreet

[Sorry again for the SPAM detection]

Thank you for the reply, Coly!

I agree that this patch is not a final solution for the
fragmentation issue, but more of a workaround to alleviate the
problem. So part of my intention is to find out how upstream would
like to fix this issue.

I've looked into the moving_gc code and done some debugging and
testing; unfortunately, I don't think it is the solution for this
issue either. moving_gc moves not just the dirty cache but also the
clean cache: its purpose seems to be to move the data (dirty and
clean) out of relatively empty buckets into new buckets, so that the
original buckets can be reclaimed.
For that purpose, I guess moving gc was more useful back when large
nvme devices were uncommon.

Back to the problem at hand: you might have lots of buckets (say 70
percent) that are all fully consumed while containing very little
dirty data (say 10 percent). Since gc can't reclaim a bucket that
contains any dirty data, the worst case is that
cache_available_percent drops under 30 percent, at which point no
write op can be performed in writeback mode, killing write
performance.

So, unlike moving_gc, I only want to move dirty data around (as
you've suggested :)), but I don't think it's a good idea to change
the behavior of moving_gc itself.
My current idea is to implement a compaction thread that triggers
dirty-data compaction only under certain circumstances (e.g. when
fragmentation and dirty buckets are both high); the thread could be
turned on/off via an extra option, so people can keep the original
behavior if they want. A rough skeleton is sketched below.
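
A very rough skeleton of what such a thread could look like (an
editor's sketch; every name below is hypothetical and the trigger
predicate is a placeholder):

#include <linux/kthread.h>
#include <linux/sched.h>

/* Hypothetical compaction thread: wake up periodically and compact
 * dirty buckets when fragmentation is high. Started with, e.g.,
 * kthread_run(bch_dirty_compact_thread, c, "bcache_compact"). */
static int bch_dirty_compact_thread(void *arg)
{
	struct cache_set *c = arg;

	while (!kthread_should_stop()) {
		if (dirty_fragmentation_high(c))  /* hypothetical check */
			compact_dirty_buckets(c); /* hypothetical worker */

		/* re-scan roughly once a second */
		schedule_timeout_interruptible(HZ);
	}
	return 0;
}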

This is a rough idea for now; please let me know if it makes sense
to you, and any other suggestions are appreciated!
I also understand the hardest part is making sure general bcache
performance and functionality remain sane, so it may take
considerably more time and is more likely a feature at this point.

How to reproduce and observe this issue:
This issue is very easy to reproduce by running the fio command
below against a writeback-mode bcache device:

fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
--iodepth=4 --rw=randrw --rate_iops=95,5 --bs=4k --direct=1
--numjobs=4

Note that the key option here is "rate_iops=95,5": 95 percent reads
and only 5 percent writes, which makes sure each bucket contains
very little dirty data.
Also, it's faster to reproduce with a small cache device; I use a
1GB cache, but a bigger cache device behaves the same, it's just a
matter of time.
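
The two stats referenced in the next paragraph, dirty_data and
cache_available_percent, can be polled from sysfs while the fio job
runs; a hedged example (paths as documented for bcache, with
<cset-uuid> a placeholder for the cache set UUID):

watch -n1 'cat /sys/block/bcache0/bcache/dirty_data \
	/sys/fs/bcache/<cset-uuid>/cache_available_percent'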

We can observe this issue by monitoring the bcache stats
"dirty_data" and "cache_available_percent": after
cache_available_percent drops to 30 percent, the write-performance
degradation can be observed with the bpftrace script below:
---
#!/usr/bin/env bpftrace

#include <linux/bio.h>

// Record the submission time of each bio entering the cached device
// (arg1 is the struct bio * of cached_dev_make_request(q, bio)).
kprobe:cached_dev_make_request
{
   @start[arg1] = nsecs;
}

// On completion, put the latency into a read or write histogram;
// the low bit of bi_opf distinguishes writes (REQ_OP_WRITE == 1).
kprobe:bio_endio /@start[arg0]/
{
 if(((struct bio *)arg0)->bi_opf & 1) {
    @write = hist(nsecs - @start[arg0]); delete(@start[arg0]);
 }
 else {
    @read = hist(nsecs - @start[arg0]); delete(@start[arg0]);
 }
}
---

To run this script:
Save the above to bcache_io_lat.bt, then run:
chmod +x bcache_io_lat.bt && ./bcache_io_lat.bt

By the way, we mainly hit this issue with ceph; the fio job above
is just an easy way to reproduce it.

Regards,
Dongdong


On Fri, Nov 6, 2020 at 12:32 AM Coly Li <colyli@suse.de> wrote:
>
> On 2020/11/3 20:42, Dongdong Tao wrote:
> > [snipped]
>
> Hi Dongdong,
>
> Changing the writeback rate does not affect the real throughput;
> your change just raises the upper-limit hint of the writeback
> throughput, while the bottleneck is the spinning drive doing
> random I/O.
>
> A better direction would be the moving gc. If the moving gc worked
> faster, the situation you mention above could be relaxed a lot.
>
> I will NACK this patch unless you have observable and reproducible
> performance numbers.
>
> Thanks.
>
> Coly Li
>
>
> > [snipped]
>


* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2020-11-10  4:19   ` Dongdong Tao
@ 2020-11-11  8:33     ` Coly Li
  0 siblings, 0 replies; 24+ messages in thread
From: Coly Li @ 2020-11-11  8:33 UTC (permalink / raw)
  To: Dongdong Tao
  Cc: Dongdong Tao, open list:BCACHE (BLOCK LAYER CACHE),
	open list, Kent Overstreet

On 2020/11/10 12:19, Dongdong Tao wrote:
> [Sorry again for the SPAM detection]
> 
> Thank you for the reply, Coly!
> 
> [snipped]

Hi Dongdong,

I know this situation; it is not the first time it has been
mentioned.

What performance gain does your patch achieve? I want to see
"observable and reproducible performance numbers", especially the
latency and IOPS of regular I/O requests.

Thanks.

Coly Li


> On Fri, Nov 6, 2020 at 12:32 AM Coly Li <colyli@suse.de> wrote:
>> [snipped]


* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2020-11-03 12:42 [PATCH] bcache: consider the fragmentation when update the writeback rate Dongdong Tao
  2020-11-04  5:06 ` kernel test robot
  2020-11-05 16:32 ` Coly Li
@ 2020-12-09  2:27 ` Dongsheng Yang
  2020-12-09  4:48   ` Dongdong Tao
  2 siblings, 1 reply; 24+ messages in thread
From: Dongsheng Yang @ 2020-12-09  2:27 UTC (permalink / raw)
  To: Dongdong Tao, colyli
  Cc: gavin.guo, gerald.yang, trent.lloyd, dongdong tao,
	Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list


On Tue, 2020/11/3 at 8:42 PM, Dongdong Tao wrote:
> From: dongdong tao <dongdong.tao@canonical.com>
>
> The current way of calculating the writeback rate only considers
> the dirty sectors. This usually works fine when fragmentation is
> not high, but it yields an unreasonably small rate when very few
> dirty sectors are spread across a large number of dirty buckets.
> In some cases the dirty buckets can reach CUTOFF_WRITEBACK_SYNC
> while the dirty data (sectors) has not even reached
> writeback_percent; the writeback rate will still be the minimum
> value (4k), so all writes get stuck in a non-writeback mode
> because of the slow writeback.
>
> This patch tries to accelerate the writeback rate when the
> fragmentation is high. It calculates the proportional_scaled
> value as follows:
> (dirty_sectors / writeback_rate_p_term_inverse) * fragment
> Higher fragmentation results in a larger proportional_scaled
> value, and thus a larger writeback rate. The fragment value is
> calculated as:
> (dirty_buckets * bucket_size) / dirty_sectors
> Note that the value of fragment always falls inside
> [1, bucket_size].
>
> This patch only considers the fragmentation once the number of
> dirty buckets reaches a threshold (configurable via
> writeback_fragment_percent, default 50), so bcache keeps its
> original behaviour until the dirty buckets reach that threshold.
>
> Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
> ---
>   drivers/md/bcache/bcache.h    |  1 +
>   drivers/md/bcache/sysfs.c     |  6 ++++++
>   drivers/md/bcache/writeback.c | 21 +++++++++++++++++++++
>   3 files changed, 28 insertions(+)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 1d57f48307e6..87632f7032b6 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -374,6 +374,7 @@ struct cached_dev {
>   	unsigned int		writeback_metadata:1;
>   	unsigned int		writeback_running:1;
>   	unsigned char		writeback_percent;
> +	unsigned char		writeback_fragment_percent;
>   	unsigned int		writeback_delay;
>   
>   	uint64_t		writeback_rate_target;
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index 554e3afc9b68..69499113aef8 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -115,6 +115,7 @@ rw_attribute(stop_when_cache_set_failed);
>   rw_attribute(writeback_metadata);
>   rw_attribute(writeback_running);
>   rw_attribute(writeback_percent);
> +rw_attribute(writeback_fragment_percent);


Hi Dongdong and Coly,

What is the status of this patch? In my opinion, it is a problem
we need to solve,

but can we just reuse the writeback_percent parameter rather than
introduce a new writeback_fragment_percent?

That would mean the semantics of writeback_percent apply to both
the dirty-data percentage and the dirty-bucket percentage:

when we find more dirty buckets than (c->nbuckets *
writeback_percent / 100), start the writeback.


Thanx

Yang
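
A minimal sketch of that suggested check (an editor's illustration,
not tested code; it reuses the gc_stats.in_use percentage the patch
already compares against):

	/* Reuse writeback_percent for dirty buckets too: apply the
	 * fragment scaling once the dirty-bucket percentage crosses
	 * the same threshold used for dirty data. */
	if (c->gc_stats.in_use > dc->writeback_percent) {
		/* ... fragment scaling from the patch above ... */
	}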

> [snipped]


* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2020-12-09  2:27 ` Dongsheng Yang
@ 2020-12-09  4:48   ` Dongdong Tao
  2020-12-09  7:49     ` Dongsheng Yang
  0 siblings, 1 reply; 24+ messages in thread
From: Dongdong Tao @ 2020-12-09  4:48 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Dongdong Tao, Coly Li, Gavin Guo, Gerald Yang, Trent Lloyd,
	Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list

Hi Dongsheng,

I'm working on it; the next step is gathering some testing data to
upload (very sorry for the delay...).
Thanks for the comment.
One of the main concerns in alleviating this issue via the
writeback path is that we need to minimize the impact on client IO
performance.
writeback_percent defaults to 10, and starting accelerated
writeback when dirty buckets reach 10 percent might be a bit too
aggressive, given that bch_cutoff_writeback_sync is 70 percent.
So I chose to start when dirty buckets reach 50 percent: the patch
only takes effect above that threshold, which still leaves a
20-point window (50 to 70 percent) to bring fragmentation down
before writes start bypassing the cache.

Thanks,
Dongdong




On Wed, Dec 9, 2020 at 10:27 AM Dongsheng Yang
<dongsheng.yang@easystack.cn> wrote:
>
>
> On Tue, 2020/11/3 at 8:42 PM, Dongdong Tao wrote:
> > [snipped]
>
>
> Hi Dongdong and Coly,
>
> What is the status of this patch? In my opinion, it is a problem
> we need to solve,
>
> but can we just reuse the writeback_percent parameter rather than
> introduce a new writeback_fragment_percent?
>
> That would mean the semantics of writeback_percent apply to both
> the dirty-data percentage and the dirty-bucket percentage:
>
> when we find more dirty buckets than (c->nbuckets *
> writeback_percent / 100), start the writeback.
>
> Thanx
>
> Yang
>
> > [snipped]


* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2020-12-09  4:48   ` Dongdong Tao
@ 2020-12-09  7:49     ` Dongsheng Yang
       [not found]       ` <CAJS8hVLMUS1mdrwC8ovzvMO+HWf4xtXRCNJEghtbtW0g93Kh_g@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Dongsheng Yang @ 2020-12-09  7:49 UTC (permalink / raw)
  To: Dongdong Tao
  Cc: Dongdong Tao, Coly Li, Gavin Guo, Gerald Yang, Trent Lloyd,
	Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list


On Wed, 2020/12/9 at 12:48 PM, Dongdong Tao wrote:
> Hi Dongsheng,
>
> I'm working on it; the next step is gathering some testing data to
> upload (very sorry for the delay...).
> Thanks for the comment.
> One of the main concerns in alleviating this issue via the
> writeback path is that we need to minimize the impact on client IO
> performance.
> writeback_percent defaults to 10, and starting accelerated
> writeback when dirty buckets reach 10 percent might be a bit too
> aggressive, given that bch_cutoff_writeback_sync is 70 percent.
> So I chose to start when dirty buckets reach 50 percent: the patch
> only takes effect above that threshold, which still leaves a
> 20-point window (50 to 70 percent) to bring fragmentation down
> before writes start bypassing the cache.

Agreed that reusing writeback_percent would be too aggressive, and
it is less flexible.

Okay, let's wait for your testing results.


Thanx

>
> Thanks,
> Dongdong
>
>
>
>
> On Wed, Dec 9, 2020 at 10:27 AM Dongsheng Yang
> <dongsheng.yang@easystack.cn> wrote:
>> [snipped]


* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
       [not found]       ` <CAJS8hVLMUS1mdrwC8ovzvMO+HWf4xtXRCNJEghtbtW0g93Kh_g@mail.gmail.com>
@ 2020-12-14 17:07         ` Coly Li
       [not found]           ` <CAJS8hVKMjec1cpe_zoeZAJrfY0Pq9bJ51eO6E+g8pgN9jV3Nmw@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Coly Li @ 2020-12-14 17:07 UTC (permalink / raw)
  To: Dongdong Tao
  Cc: Gavin Guo, Gerald Yang, Trent Lloyd, Kent Overstreet,
	open list:BCACHE (BLOCK LAYER CACHE),
	open list, Dominique Poulain, Dongsheng Yang

On 12/14/20 11:30 PM, Dongdong Tao wrote:
> Hi Coly and Dongsheng,
> 
> I've got the testing results and confirmed that they are
> reproducible across many repeated runs.
> I ran fio to get the write-latency log, parsed the log, and then
> generated the latency graphs below with a visualization tool.
> 

Hi Dongdong,

Thank you so much for the performance numbers!

[snipped]
> So, my code accelerates the writeback process when the dirty
> buckets exceed 50% (tunable); as we can see,
> cache_available_percent does increase once we hit 50, so we won't
> hit the writeback cutoff issue.
> 
> Below are the steps that I used to do the experiment:
> 1. make-bcache -B <hdd> -C <nvme> --writeback -- I limited the nvme
> size to 1G, so the issue can be reproduced faster
> 
> 2. sudo fio --name=random-writers --filename=/dev/bcache0
> --ioengine=libaio --iodepth=1 --rw=randrw --bs=16k --direct=1
> --rate_iops=90,10 --numjobs=1 --write_lat_log=16k
> 
> 3. For the 1 G nvme, running for about 20 minutes is enough to get the data.

A 1GB cache and 20 minutes are quite limited for performance evaluation.
Could you please do similar testing with a 1TB SSD and 1 hour for each
run of the benchmark?

> 
> Using randrw with rate_iops=90,10 is just one way to reproduce this
> easily; it can be reproduced as long as we can create a fragmented
> situation where very little dirty data consumes a lot of dirty buckets,
> thus killing the write performance.
> 

Yes this is a good method to generate dirty data segments.

> This bug is becoming very critical nowadays, as ceph is hitting it; ceph
> mostly submits random small IO.
> Please let me know what you need in order to move forward in this
> direction; I'm sure this patch can be improved as well.

The performance numbers are quite convincing and the idea in your patch is
promising.

I will provide my comments on your patch after we see the performance
numbers for a larger cache device and a longer run time.

Thanks again for the detailed performance numbers, which are really
desired for performance optimization changes :-)

Coly Li

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
       [not found]           ` <CAJS8hVKMjec1cpe_zoeZAJrfY0Pq9bJ51eO6E+g8pgN9jV3Nmw@mail.gmail.com>
@ 2020-12-21  8:08             ` Coly Li
  0 siblings, 0 replies; 24+ messages in thread
From: Coly Li @ 2020-12-21  8:08 UTC (permalink / raw)
  To: Dongdong Tao
  Cc: Gavin Guo, Gerald Yang, Trent Lloyd, Kent Overstreet,
	open list:BCACHE (BLOCK LAYER CACHE),
	open list, Dominique Poulain, Dongsheng Yang

On 12/21/20 12:06 PM, Dongdong Tao wrote:
> Hi Coly,
> 
> Thank you so much for your prompt reply!
> 
> So, I've performed the same fio testing with a 1TB NVME and a 10 TB HDD
> disk as the backing device.
> I've run them both for about 4 hours; since it's a 1TB nvme device, it
> would roughly take about 10 days to consume 50 percent dirty buckets.
> I did increase the iops to 500,100, but the dirty buckets only increased
> to about 30 after a 2-day run, and because the reads are much limited by
> the backing hdd device, the actual maximum read iops I can constantly
> get is only about 200.

HI Dongdong,

There are two methods to make the buckets fill faster:
1) use a larger non-spinning backing device (e.g. an md raid0 with multiple
SSDs);
2) specify a larger read block size and a small write block size in fio.

Or you may combine them together to fill the cache device faster.


> 
> Though the 4-hour run on the 1TB nvme bcache didn't make us hit the 50
> percent dirty bucket threshold, I think it's still valuable to prove
> that
> bcache with my patch behaves the same way as expected in terms of
> latency when the dirty buckets are under 50 percent. I guess this might be
> what you wanted to confirm with this run.
> 

The previous testing on the tiny SSD looked much better than this time; I
cannot provide my opinion before a non-optimized-configuration test is
finished.

If the latency distribution shows no (recognizable) difference, it would
mean your patch does not improve I/O latency. I believe this is only
because the testing is not finished yet.

The idea of heuristically estimating the bucket fragmentation condition is
cool IMHO, but we need solid performance numbers to prove this
optimization. Please continue to finish the benchmark with a real hardware
configuration, and I do hope we can see a recognizable positive result for
your goal (improving I/O latency and throughput under high bucket dirty
fragmentation).


Thanks.


Coly Li


> Here is the result:
> Master:
> 
> fio-master.png
> 
> Master + My patch:
> fio-patch.png
> As we can see, the latency distributions for both the outliers and the
> majority of points are the same between these two runs.
> Let's combine the two together to make the comparison clearer:
> fio-full.png
> The test steps are exactly the same for those two runs:
> 
> 
> 1. make-bcache -B <hdd> -C <nvme> --writeback
> 
> 2. sudo fio --name=random-writers --filename=/dev/bcache0
> --ioengine=libaio --iodepth=1 --rw=randrw --bs=16k --direct=1
> --rate_iops=90,10 --numjobs=1 --write_lat_log=16k --runtime=14000
> 
> Thank you so much!
> Regards,
> Dongdong


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2021-01-19 12:56 Dongdong Tao
@ 2021-01-19 14:06 ` Coly Li
  0 siblings, 0 replies; 24+ messages in thread
From: Coly Li @ 2021-01-19 14:06 UTC (permalink / raw)
  To: Dongdong Tao; +Cc: dongdong tao, Kent Overstreet, linux-bcache, linux-kernel

On 1/19/21 8:56 PM, Dongdong Tao wrote:
> From: dongdong tao <dongdong.tao@canonical.com>
> 
> The current way to calculate the writeback rate only considers the
> dirty sectors. This usually works fine when the fragmentation
> is not high, but it will give us an unreasonably small rate when
> we are in a situation where very few dirty sectors have consumed
> a lot of dirty buckets. In some cases, the dirty buckets can reach
> CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not even
> reached writeback_percent; the writeback rate will still
> be the minimum value (4k), which causes all writes to get
> stuck in a non-writeback mode because of the slow writeback.
> 
> We accelerate the rate in 3 stages with different aggressiveness;
> the first stage starts when the dirty bucket percent reaches above
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second is
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), the third is
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default
> the first stage tries to writeback the amount of dirty data
> in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) second,
> the second stage tries to writeback the amount of dirty data in one bucket
> in (1 / (dirty_buckets_percent - 57)) * 100 millisecond, the third
> stage tries to writeback the amount of dirty data in one bucket in
> (1 / (dirty_buckets_percent - 64)) millisecond. It is ok to be very
> aggressive in the last stage, as it is our last chance to pull it back.
> 
> The option writeback_consider_fragment controls whether we want
> this feature to be on or off; it's on by default.
> 
> Lastly, below is the performance data for all the testing results,
> including the data from the production env:
> https://docs.google.com/document/d/
> 1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
> 
> Signed-off-by: dongdong tao <dongdong.tao@canonical.com>

Hi Dongdong,

Now overall the patch is OK for me, just minor comments; please
check them inline.


> ---
>  drivers/md/bcache/bcache.h    |  4 ++++
>  drivers/md/bcache/sysfs.c     | 22 ++++++++++++++++++
>  drivers/md/bcache/writeback.c | 42 +++++++++++++++++++++++++++++++++++
>  drivers/md/bcache/writeback.h |  4 ++++
>  4 files changed, 72 insertions(+)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 1d57f48307e6..d7a84327b7f1 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -373,6 +373,7 @@ struct cached_dev {
>  	unsigned int		partial_stripes_expensive:1;
>  	unsigned int		writeback_metadata:1;
>  	unsigned int		writeback_running:1;
> +	unsigned int		writeback_consider_fragment:1;
>  	unsigned char		writeback_percent;
>  	unsigned int		writeback_delay;
>  
> @@ -385,6 +386,9 @@ struct cached_dev {
>  	unsigned int		writeback_rate_update_seconds;
>  	unsigned int		writeback_rate_i_term_inverse;
>  	unsigned int		writeback_rate_p_term_inverse;
> +	unsigned int		writeback_rate_fp_term_low;
> +	unsigned int		writeback_rate_fp_term_mid;
> +	unsigned int		writeback_rate_fp_term_high;
>  	unsigned int		writeback_rate_minimum;
>  
>  	enum stop_on_failure	stop_when_cache_set_failed;
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index 00a520c03f41..136899beadba 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -117,10 +117,14 @@ rw_attribute(writeback_running);
>  rw_attribute(writeback_percent);
>  rw_attribute(writeback_delay);
>  rw_attribute(writeback_rate);
> +rw_attribute(writeback_consider_fragment);
>  
>  rw_attribute(writeback_rate_update_seconds);
>  rw_attribute(writeback_rate_i_term_inverse);
>  rw_attribute(writeback_rate_p_term_inverse);
> +rw_attribute(writeback_rate_fp_term_low);
> +rw_attribute(writeback_rate_fp_term_mid);
> +rw_attribute(writeback_rate_fp_term_high);
>  rw_attribute(writeback_rate_minimum);
>  read_attribute(writeback_rate_debug);
>  
> @@ -195,6 +199,7 @@ SHOW(__bch_cached_dev)
>  	var_printf(bypass_torture_test,	"%i");
>  	var_printf(writeback_metadata,	"%i");
>  	var_printf(writeback_running,	"%i");
> +	var_printf(writeback_consider_fragment,	"%i");
>  	var_print(writeback_delay);
>  	var_print(writeback_percent);
>  	sysfs_hprint(writeback_rate,
> @@ -205,6 +210,9 @@ SHOW(__bch_cached_dev)
>  	var_print(writeback_rate_update_seconds);
>  	var_print(writeback_rate_i_term_inverse);
>  	var_print(writeback_rate_p_term_inverse);
> +	var_print(writeback_rate_fp_term_low);
> +	var_print(writeback_rate_fp_term_mid);
> +	var_print(writeback_rate_fp_term_high);
>  	var_print(writeback_rate_minimum);
>  
>  	if (attr == &sysfs_writeback_rate_debug) {
> @@ -303,6 +311,7 @@ STORE(__cached_dev)
>  	sysfs_strtoul_bool(bypass_torture_test, dc->bypass_torture_test);
>  	sysfs_strtoul_bool(writeback_metadata, dc->writeback_metadata);
>  	sysfs_strtoul_bool(writeback_running, dc->writeback_running);
> +	sysfs_strtoul_bool(writeback_consider_fragment, dc->writeback_consider_fragment);
>  	sysfs_strtoul_clamp(writeback_delay, dc->writeback_delay, 0, UINT_MAX);
>  
>  	sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
> @@ -331,6 +340,15 @@ STORE(__cached_dev)
>  	sysfs_strtoul_clamp(writeback_rate_p_term_inverse,
>  			    dc->writeback_rate_p_term_inverse,
>  			    1, UINT_MAX);
> +	sysfs_strtoul_clamp(writeback_rate_fp_term_low,
> +			    dc->writeback_rate_fp_term_low,
> +			    1, UINT_MAX);
> +	sysfs_strtoul_clamp(writeback_rate_fp_term_mid,
> +			    dc->writeback_rate_fp_term_mid,
> +			    1, UINT_MAX);
> +	sysfs_strtoul_clamp(writeback_rate_fp_term_high,
> +			    dc->writeback_rate_fp_term_high,
> +			    1, UINT_MAX);


Here you should make sure the input values are always in order,
  low < mid < high

Otherwise the calculation might be messed up.
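
One way to guarantee that ordering (a sketch only; the exact bounds are an
assumption, not code from this thread) is to clamp each term against its
neighbours instead of against UINT_MAX:

	sysfs_strtoul_clamp(writeback_rate_fp_term_low,
			    dc->writeback_rate_fp_term_low,
			    1, dc->writeback_rate_fp_term_mid - 1);
	sysfs_strtoul_clamp(writeback_rate_fp_term_mid,
			    dc->writeback_rate_fp_term_mid,
			    dc->writeback_rate_fp_term_low + 1,
			    dc->writeback_rate_fp_term_high - 1);
	sysfs_strtoul_clamp(writeback_rate_fp_term_high,
			    dc->writeback_rate_fp_term_high,
			    dc->writeback_rate_fp_term_mid + 1,
			    UINT_MAX);

With such bounds a store can never leave low >= mid or mid >= high.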


>  	sysfs_strtoul_clamp(writeback_rate_minimum,
>  			    dc->writeback_rate_minimum,
>  			    1, UINT_MAX);
> @@ -499,9 +517,13 @@ static struct attribute *bch_cached_dev_files[] = {
>  	&sysfs_writeback_delay,
>  	&sysfs_writeback_percent,
>  	&sysfs_writeback_rate,
> +	&sysfs_writeback_consider_fragment,
>  	&sysfs_writeback_rate_update_seconds,
>  	&sysfs_writeback_rate_i_term_inverse,
>  	&sysfs_writeback_rate_p_term_inverse,
> +	&sysfs_writeback_rate_fp_term_low,
> +	&sysfs_writeback_rate_fp_term_mid,
> +	&sysfs_writeback_rate_fp_term_high,
>  	&sysfs_writeback_rate_minimum,
>  	&sysfs_writeback_rate_debug,
>  	&sysfs_io_errors,
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index a129e4d2707c..9d440166c6bf 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -88,6 +88,44 @@ static void __update_writeback_rate(struct cached_dev *dc)
>  	int64_t integral_scaled;
>  	uint32_t new_rate;
>  
> +	/*
> +	 * We need to consider the number of dirty buckets as well
> +	 * when calculating the proportional_scaled value. Otherwise we
> +	 * might have an unreasonably small writeback rate in a highly
> +	 * fragmented situation where very few dirty sectors have consumed
> +	 * a lot of dirty buckets. The worst case is when the dirty buckets
> +	 * have reached cutoff_writeback_sync while the dirty data has not
> +	 * even reached writeback_percent; the rate will still be at the
> +	 * minimum value, leaving writes stuck in a non-writeback mode.
> +	 */
> +	struct cache_set *c = dc->disk.c;
> +
> +	int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets;
> +
> +	if (dc->writeback_consider_fragment &&
> +		c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) {
> +		int64_t fragment =
> +			div_s64((dirty_buckets *  c->cache->sb.bucket_size), dirty);
> +		int64_t fp_term;
> +		int64_t fps;
> +
> +		if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
> +			fp_term = dc->writeback_rate_fp_term_low *
> +			(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
> +		} else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
> +			fp_term = dc->writeback_rate_fp_term_mid *
> +			(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
> +		} else {
> +			fp_term = dc->writeback_rate_fp_term_high *
> +			(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
> +		}
> +		fps = div_s64(dirty, dirty_buckets) * fp_term;
> +		if (fragment > 3 && fps > proportional_scaled) {
> +			//Only overwrite the p when fragment > 3

The above "//" style is not the recommended bcache coding style.


> +			proportional_scaled = fps;
> +		}
> +	}
> +
>  	if ((error < 0 && dc->writeback_rate_integral > 0) ||
>  	    (error > 0 && time_before64(local_clock(),
>  			 dc->writeback_rate.next + NSEC_PER_MSEC))) {
> @@ -977,6 +1015,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>  
>  	dc->writeback_metadata		= true;
>  	dc->writeback_running		= false;
> +	dc->writeback_consider_fragment = true;
>  	dc->writeback_percent		= 10;
>  	dc->writeback_delay		= 30;
>  	atomic_long_set(&dc->writeback_rate.rate, 1024);
> @@ -984,6 +1023,9 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>  
>  	dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
>  	dc->writeback_rate_p_term_inverse = 40;
> +	dc->writeback_rate_fp_term_low = 1;
> +	dc->writeback_rate_fp_term_mid = 10;
> +	dc->writeback_rate_fp_term_high = 1000;

Could you please explain a bit how the above 3 numbers were decided?



>  	dc->writeback_rate_i_term_inverse = 10000;
>  
>  	WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
> index 3f1230e22de0..02b2f9df73f6 100644
> --- a/drivers/md/bcache/writeback.h
> +++ b/drivers/md/bcache/writeback.h
> @@ -16,6 +16,10 @@
>  
>  #define BCH_AUTO_GC_DIRTY_THRESHOLD	50
>  
> +#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50
> +#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57
> +#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
> +
>  #define BCH_DIRTY_INIT_THRD_MAX	64
>  /*
>   * 14 (16384ths) is chosen here as something that each backing device
> 

Thanks in advance.

Coly Li

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] bcache: consider the fragmentation when update the writeback rate
@ 2021-01-19 12:56 Dongdong Tao
  2021-01-19 14:06 ` Coly Li
  0 siblings, 1 reply; 24+ messages in thread
From: Dongdong Tao @ 2021-01-19 12:56 UTC (permalink / raw)
  To: colyli; +Cc: dongdong tao, Kent Overstreet, linux-bcache, linux-kernel

From: dongdong tao <dongdong.tao@canonical.com>

The current way to calculate the writeback rate only considers the
dirty sectors. This usually works fine when the fragmentation
is not high, but it will give us an unreasonably small rate when
we are in a situation where very few dirty sectors have consumed
a lot of dirty buckets. In some cases, the dirty buckets can reach
CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not even
reached writeback_percent; the writeback rate will still
be the minimum value (4k), which causes all writes to get
stuck in a non-writeback mode because of the slow writeback.

We accelerate the rate in 3 stages with different aggressiveness;
the first stage starts when the dirty bucket percent reaches above
BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second is
BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), the third is
BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default
the first stage tries to writeback the amount of dirty data
in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) second,
the second stage tries to writeback the amount of dirty data in one bucket
in (1 / (dirty_buckets_percent - 57)) * 100 millisecond, the third
stage tries to writeback the amount of dirty data in one bucket in
(1 / (dirty_buckets_percent - 64)) millisecond. It is ok to be very
aggressive in the last stage, as it is our last chance to pull it back.
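
As a worked example with the default terms from this patch: at
dirty_buckets_percent = 60 we are in the second stage, so
fp_term = 10 * (60 - 57) = 30; with an (assumed) average of 100 dirty
sectors per dirty bucket, the fragment term contributes
100 * 30 = 3000 sectors/s to the rate, i.e. one bucket's worth of dirty
data is retired in 1/30 second (about 33 ms), matching the formula above.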

The option writeback_consider_fragment controls whether we want
this feature to be on or off; it's on by default.

Lastly, below is the performance data for all the testing results,
including the data from the production env:
https://docs.google.com/document/d/
1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing

Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
---
 drivers/md/bcache/bcache.h    |  4 ++++
 drivers/md/bcache/sysfs.c     | 22 ++++++++++++++++++
 drivers/md/bcache/writeback.c | 42 +++++++++++++++++++++++++++++++++++
 drivers/md/bcache/writeback.h |  4 ++++
 4 files changed, 72 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d57f48307e6..d7a84327b7f1 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -373,6 +373,7 @@ struct cached_dev {
 	unsigned int		partial_stripes_expensive:1;
 	unsigned int		writeback_metadata:1;
 	unsigned int		writeback_running:1;
+	unsigned int		writeback_consider_fragment:1;
 	unsigned char		writeback_percent;
 	unsigned int		writeback_delay;
 
@@ -385,6 +386,9 @@ struct cached_dev {
 	unsigned int		writeback_rate_update_seconds;
 	unsigned int		writeback_rate_i_term_inverse;
 	unsigned int		writeback_rate_p_term_inverse;
+	unsigned int		writeback_rate_fp_term_low;
+	unsigned int		writeback_rate_fp_term_mid;
+	unsigned int		writeback_rate_fp_term_high;
 	unsigned int		writeback_rate_minimum;
 
 	enum stop_on_failure	stop_when_cache_set_failed;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 00a520c03f41..136899beadba 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -117,10 +117,14 @@ rw_attribute(writeback_running);
 rw_attribute(writeback_percent);
 rw_attribute(writeback_delay);
 rw_attribute(writeback_rate);
+rw_attribute(writeback_consider_fragment);
 
 rw_attribute(writeback_rate_update_seconds);
 rw_attribute(writeback_rate_i_term_inverse);
 rw_attribute(writeback_rate_p_term_inverse);
+rw_attribute(writeback_rate_fp_term_low);
+rw_attribute(writeback_rate_fp_term_mid);
+rw_attribute(writeback_rate_fp_term_high);
 rw_attribute(writeback_rate_minimum);
 read_attribute(writeback_rate_debug);
 
@@ -195,6 +199,7 @@ SHOW(__bch_cached_dev)
 	var_printf(bypass_torture_test,	"%i");
 	var_printf(writeback_metadata,	"%i");
 	var_printf(writeback_running,	"%i");
+	var_printf(writeback_consider_fragment,	"%i");
 	var_print(writeback_delay);
 	var_print(writeback_percent);
 	sysfs_hprint(writeback_rate,
@@ -205,6 +210,9 @@ SHOW(__bch_cached_dev)
 	var_print(writeback_rate_update_seconds);
 	var_print(writeback_rate_i_term_inverse);
 	var_print(writeback_rate_p_term_inverse);
+	var_print(writeback_rate_fp_term_low);
+	var_print(writeback_rate_fp_term_mid);
+	var_print(writeback_rate_fp_term_high);
 	var_print(writeback_rate_minimum);
 
 	if (attr == &sysfs_writeback_rate_debug) {
@@ -303,6 +311,7 @@ STORE(__cached_dev)
 	sysfs_strtoul_bool(bypass_torture_test, dc->bypass_torture_test);
 	sysfs_strtoul_bool(writeback_metadata, dc->writeback_metadata);
 	sysfs_strtoul_bool(writeback_running, dc->writeback_running);
+	sysfs_strtoul_bool(writeback_consider_fragment, dc->writeback_consider_fragment);
 	sysfs_strtoul_clamp(writeback_delay, dc->writeback_delay, 0, UINT_MAX);
 
 	sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
@@ -331,6 +340,15 @@ STORE(__cached_dev)
 	sysfs_strtoul_clamp(writeback_rate_p_term_inverse,
 			    dc->writeback_rate_p_term_inverse,
 			    1, UINT_MAX);
+	sysfs_strtoul_clamp(writeback_rate_fp_term_low,
+			    dc->writeback_rate_fp_term_low,
+			    1, UINT_MAX);
+	sysfs_strtoul_clamp(writeback_rate_fp_term_mid,
+			    dc->writeback_rate_fp_term_mid,
+			    1, UINT_MAX);
+	sysfs_strtoul_clamp(writeback_rate_fp_term_high,
+			    dc->writeback_rate_fp_term_high,
+			    1, UINT_MAX);
 	sysfs_strtoul_clamp(writeback_rate_minimum,
 			    dc->writeback_rate_minimum,
 			    1, UINT_MAX);
@@ -499,9 +517,13 @@ static struct attribute *bch_cached_dev_files[] = {
 	&sysfs_writeback_delay,
 	&sysfs_writeback_percent,
 	&sysfs_writeback_rate,
+	&sysfs_writeback_consider_fragment,
 	&sysfs_writeback_rate_update_seconds,
 	&sysfs_writeback_rate_i_term_inverse,
 	&sysfs_writeback_rate_p_term_inverse,
+	&sysfs_writeback_rate_fp_term_low,
+	&sysfs_writeback_rate_fp_term_mid,
+	&sysfs_writeback_rate_fp_term_high,
 	&sysfs_writeback_rate_minimum,
 	&sysfs_writeback_rate_debug,
 	&sysfs_io_errors,
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index a129e4d2707c..9d440166c6bf 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -88,6 +88,44 @@ static void __update_writeback_rate(struct cached_dev *dc)
 	int64_t integral_scaled;
 	uint32_t new_rate;
 
+	/*
+	 * We need to consider the number of dirty buckets as well
+	 * when calculating the proportional_scaled value. Otherwise we
+	 * might have an unreasonably small writeback rate in a highly
+	 * fragmented situation where very few dirty sectors have consumed
+	 * a lot of dirty buckets. The worst case is when the dirty buckets
+	 * have reached cutoff_writeback_sync while the dirty data has not
+	 * even reached writeback_percent; the rate will still be at the
+	 * minimum value, leaving writes stuck in a non-writeback mode.
+	 */
+	struct cache_set *c = dc->disk.c;
+
+	int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets;
+
+	if (dc->writeback_consider_fragment &&
+		c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) {
+		int64_t fragment =
+			div_s64((dirty_buckets *  c->cache->sb.bucket_size), dirty);
+		int64_t fp_term;
+		int64_t fps;
+
+		if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
+			fp_term = dc->writeback_rate_fp_term_low *
+			(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
+		} else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
+			fp_term = dc->writeback_rate_fp_term_mid *
+			(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
+		} else {
+			fp_term = dc->writeback_rate_fp_term_high *
+			(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
+		}
+		fps = div_s64(dirty, dirty_buckets) * fp_term;
+		if (fragment > 3 && fps > proportional_scaled) {
+			//Only overwrite the p when fragment > 3
+			proportional_scaled = fps;
+		}
+	}
+
 	if ((error < 0 && dc->writeback_rate_integral > 0) ||
 	    (error > 0 && time_before64(local_clock(),
 			 dc->writeback_rate.next + NSEC_PER_MSEC))) {
@@ -977,6 +1015,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
 
 	dc->writeback_metadata		= true;
 	dc->writeback_running		= false;
+	dc->writeback_consider_fragment = true;
 	dc->writeback_percent		= 10;
 	dc->writeback_delay		= 30;
 	atomic_long_set(&dc->writeback_rate.rate, 1024);
@@ -984,6 +1023,9 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
 
 	dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
 	dc->writeback_rate_p_term_inverse = 40;
+	dc->writeback_rate_fp_term_low = 1;
+	dc->writeback_rate_fp_term_mid = 10;
+	dc->writeback_rate_fp_term_high = 1000;
 	dc->writeback_rate_i_term_inverse = 10000;
 
 	WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 3f1230e22de0..02b2f9df73f6 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -16,6 +16,10 @@
 
 #define BCH_AUTO_GC_DIRTY_THRESHOLD	50
 
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
+
 #define BCH_DIRTY_INIT_THRD_MAX	64
 /*
  * 14 (16384ths) is chosen here as something that each backing device
-- 
2.17.1
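
For readers who want to experiment with the staged term above, it can be
modelled in plain user-space C. This is only a sketch under assumed inputs
(and it omits the patch's additional "fragment > 3" and
"fps > proportional_scaled" checks):

	#include <stdio.h>
	#include <stdint.h>

	/* Model of the staged fragment term, using the patch's default
	 * fp terms (1, 10, 1000) and thresholds (50, 57, 64).
	 */
	static int64_t fp_scaled(int64_t in_use, int64_t dirty,
				 int64_t dirty_buckets)
	{
		int64_t fp_term;

		if (in_use <= 50)
			return 0;	/* the patch only kicks in above 50% */

		if (in_use <= 57)			/* low stage */
			fp_term = 1 * (in_use - 50);
		else if (in_use <= 64)			/* mid stage */
			fp_term = 10 * (in_use - 57);
		else					/* high stage */
			fp_term = 1000 * (in_use - 64);

		/* average dirty sectors per dirty bucket, times fp_term */
		return (dirty / dirty_buckets) * fp_term;
	}

	int main(void)
	{
		/* assumed: 60% buckets in use, 100000 dirty sectors
		 * spread over 1000 dirty buckets
		 */
		printf("%lld sectors/s\n",
		       (long long)fp_scaled(60, 100000, 1000));
		return 0;
	}

This prints 3000 sectors/s: with 100 dirty sectors per bucket on average,
one bucket's worth of dirty data would be retired in 1/30 second.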


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2021-01-05  3:06 Dongdong Tao
       [not found] ` <CAJS8hVK-ZCxJt=E3hwR0hmqPYL1T07_WC_nerb-dZodO+DqtDA@mail.gmail.com>
@ 2021-01-17  7:11 ` kernel test robot
  1 sibling, 0 replies; 24+ messages in thread
From: kernel test robot @ 2021-01-17  7:11 UTC (permalink / raw)
  To: Dongdong Tao, colyli
  Cc: kbuild-all, dongdong tao, Kent Overstreet, linux-bcache, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6040 bytes --]

Hi Dongdong,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Dongdong-Tao/bcache-consider-the-fragmentation-when-update-the-writeback-rate/20210105-110903
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git e71ba9452f0b5b2e8dc8aa5445198cd9214a6a62
config: i386-randconfig-a002-20200806 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/7777fef68d1401235db42dd0d59c5c3dba3d42d3
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Dongdong-Tao/bcache-consider-the-fragmentation-when-update-the-writeback-rate/20210105-110903
        git checkout 7777fef68d1401235db42dd0d59c5c3dba3d42d3
        # save the attached .config to linux build tree
        make W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: drivers/md/bcache/writeback.o: in function `__update_writeback_rate':
>> drivers/md/bcache/writeback.c:106: undefined reference to `__divdi3'
>> ld: drivers/md/bcache/writeback.c:120: undefined reference to `__divdi3'


vim +106 drivers/md/bcache/writeback.c

    60	
    61	static void __update_writeback_rate(struct cached_dev *dc)
    62	{
    63		/*
    64		 * PI controller:
    65		 * Figures out the amount that should be written per second.
    66		 *
    67		 * First, the error (number of sectors that are dirty beyond our
    68		 * target) is calculated.  The error is accumulated (numerically
    69		 * integrated).
    70		 *
    71		 * Then, the proportional value and integral value are scaled
    72		 * based on configured values.  These are stored as inverses to
    73		 * avoid fixed point math and to make configuration easy-- e.g.
    74		 * the default value of 40 for writeback_rate_p_term_inverse
    75		 * attempts to write at a rate that would retire all the dirty
    76		 * blocks in 40 seconds.
    77		 *
    78		 * The writeback_rate_i_inverse value of 10000 means that 1/10000th
    79		 * of the error is accumulated in the integral term per second.
    80		 * This acts as a slow, long-term average that is not subject to
    81		 * variations in usage like the p term.
    82		 */
    83		int64_t target = __calc_target_rate(dc);
    84		int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
    85		int64_t error = dirty - target;
    86		int64_t proportional_scaled =
    87			div_s64(error, dc->writeback_rate_p_term_inverse);
    88		int64_t integral_scaled;
    89		uint32_t new_rate;
    90	
    91		/*
    92		 * We need to consider the number of dirty buckets as well
    93		 * when calculating the proportional_scaled, Otherwise we might
    94		 * have an unreasonable small writeback rate at a highly fragmented situation
    95		 * when very few dirty sectors consumed a lot dirty buckets, the
    96		 * worst case is when dirty_data reached writeback_percent and
    97		 * dirty buckets reached to cutoff_writeback_sync, but the rate
    98		 * still will be at the minimum value, which will cause the write
    99		 * stuck at a non-writeback mode.
   100		 */
   101		struct cache_set *c = dc->disk.c;
   102	
   103		int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets;
   104	
   105		if (c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) {
 > 106			int64_t fragment = (dirty_buckets *  c->cache->sb.bucket_size) / dirty;
   107			int64_t fp_term;
   108			int64_t fps;
   109	
   110			if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
   111				fp_term = dc->writeback_rate_fp_term_low *
   112				(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
   113			} else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
   114				fp_term = dc->writeback_rate_fp_term_mid *
   115				(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
   116			} else {
   117				fp_term = dc->writeback_rate_fp_term_high *
   118				(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
   119			}
 > 120			fps = (dirty / dirty_buckets) * fp_term;
   121			if (fragment > 3 && fps > proportional_scaled) {
   122				//Only overrite the p when fragment > 3
   123				proportional_scaled = fps;
   124			}
   125		}
   126	
   127		if ((error < 0 && dc->writeback_rate_integral > 0) ||
   128		    (error > 0 && time_before64(local_clock(),
   129				 dc->writeback_rate.next + NSEC_PER_MSEC))) {
   130			/*
   131			 * Only decrease the integral term if it's more than
   132			 * zero.  Only increase the integral term if the device
   133			 * is keeping up.  (Don't wind up the integral
   134			 * ineffectively in either case).
   135			 *
   136			 * It's necessary to scale this by
   137			 * writeback_rate_update_seconds to keep the integral
   138			 * term dimensioned properly.
   139			 */
   140			dc->writeback_rate_integral += error *
   141				dc->writeback_rate_update_seconds;
   142		}
   143	
   144		integral_scaled = div_s64(dc->writeback_rate_integral,
   145				dc->writeback_rate_i_term_inverse);
   146	
   147		new_rate = clamp_t(int32_t, (proportional_scaled + integral_scaled),
   148				dc->writeback_rate_minimum, NSEC_PER_SEC);
   149	
   150		dc->writeback_rate_proportional = proportional_scaled;
   151		dc->writeback_rate_integral_scaled = integral_scaled;
   152		dc->writeback_rate_change = new_rate -
   153				atomic_long_read(&dc->writeback_rate.rate);
   154		atomic_long_set(&dc->writeback_rate.rate, new_rate);
   155		dc->writeback_rate_target = target;
   156	}
   157	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30207 bytes --]
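
The failure is a classic 32-bit build issue: the two flagged lines divide
64-bit values with the plain '/' operator, which gcc lowers to a call to
the libgcc helper __divdi3 on i386, and the kernel does not provide that
helper. The usual fix, and what the final version of the patch in this
thread does, is to use the helpers from <linux/math64.h>. A sketch of the
corrected lines, reusing the variables from the listing above:

	/* was: (dirty_buckets * c->cache->sb.bucket_size) / dirty */
	int64_t fragment =
		div_s64(dirty_buckets * c->cache->sb.bucket_size, dirty);

	/* was: (dirty / dirty_buckets) * fp_term */
	fps = div_s64(dirty, dirty_buckets) * fp_term;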

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2021-01-14 13:31                     ` Coly Li
@ 2021-01-14 15:35                       ` Dongdong Tao
  0 siblings, 0 replies; 24+ messages in thread
From: Dongdong Tao @ 2021-01-14 15:35 UTC (permalink / raw)
  To: Coly Li
  Cc: Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list, Gavin Guo, Gerald Yang, Trent Lloyd,
	Dominique Poulain, Dongsheng Yang, Benjamin Allot

Hi Coly,

Apologies for any confusion that I might have caused, and thanks a lot
for your patience and your help!

On Thu, Jan 14, 2021 at 9:31 PM Coly Li <colyli@suse.de> wrote:
>
> On 1/14/21 8:22 PM, Dongdong Tao wrote:
> > Hi Coly,
> >
> > Why did you limit the iodepth to 8 and the iops to 150 on the cache device?
> > For the cache device the limitation is small. 150 iops with a 4KB block size
> > means writing (150*4*60*60=2160000KB=) about 2GB every hour. For 35
> > hours it is only 70GB.
> >
> >
> > What if the iodepth is 128 or 64, with no iops rate limitation?
> > -> There are two reasons why I limit the iodepth and the iops rate.
> > 1. If I don't limit them, the dirty cache will be filled up very
> > quickly, within 20 minutes.
> >      Writes run at almost NVME speed before reaching the 70
> > cutoff_writeback_sync; there is no way for any kind of writeback to
> > stop the cache from
> >      filling up, due to the huge gap between NVME and HDD in terms of
> > throughput.
> >      I don't think there is anything we can do about it, and it should
> > only happen in a benchmark world, not in production.
> >      The improvement I'm trying to make here is just for normal
> > production workloads, not for this benchmark scenario really.
> >      I currently can't see any necessity to test this scenario; please
> > kindly let me know if I'm wrong about this.
> >
> > 2. The reason that I set the iodepth to 8 and the iops to 150 is based
> > on the experience that I observed from the production env, mostly ceph:
> >     ceph-osd has fewer than 10 threads (default setting) that will send
> > io to bcache in parallel. But I'm not sure about other applications.
> >     I agree that we can increase the iodepth to 64 or 128, and it's
> > doable. But we have to limit the iops; 150 IOPS is a reasonable
> > workload.
> >     The busiest ceph-osd that I've seen does about 1000 IOPS, but the
> > average is still only about 600.
> >     I can set the IOPS to a higher value like 600 and the iodepth to
> > 128 to perform the later test, if that makes sense to you.
> >
>
> OK, now I understand the reason, with this extra information. Since the
> cache device is filled up within 20 minutes, it is unnecessary to do the
> faster testing on your side. Let me do it later on my hardware.
>
>
> > Lastly, please allow me to clarify more about the production issue
> > that this patch is trying to address:
> >
> > In the production env that hit this issue, it usually takes a very
> > long time (many take days) for the cache_available_percent to drop to
> > 30, and the dirty data mostly stays at a very low level (around
> > 10 percent), which means that the bcache isn't being stressed very
> > hard most of the time.
> >  There is no intention to save the cutoff_writeback_sync when the
> > bcache is being stressed without limitation; hope the above makes sense :)
> >
>
> Yes, you explained it clearly previously. What I worried about was whether
> a faster writeback may interfere with the throughput and latency of
> regular I/Os.
>
> From your current testing data it looks fine to me.
>
>
> > By the way, my colleague and I are trying to gather some production
> > bcache stats; I hope we can give you the performance numbers before and
> > after applying the patch.
>
> Yes that will be great.
>
> And could you please gather all the current data charts into a single email,
> and reference it in your patch via lore? Then people who don't
> subscribe to the linux-bcache mailing list may find all the posted
> performance data from your patch.
>

Sounds good, I'll update the patch comment with reference data.
But it seems like the linux mailing list doesn't accept charts
(they always get detected as SPAM)?
But I can't be sure; I'll try to send them again, and if that fails I'll
put all the data into a google doc.

> In general your testing data is convincing IMHO, and I will add your
> updated patch for the 5.12 merge window.
>
Thank you Coly, that's great !!!

>
> Thanks.
>
> Coly Li

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2021-01-14 12:22                   ` Dongdong Tao
@ 2021-01-14 13:31                     ` Coly Li
  2021-01-14 15:35                       ` Dongdong Tao
  0 siblings, 1 reply; 24+ messages in thread
From: Coly Li @ 2021-01-14 13:31 UTC (permalink / raw)
  To: Dongdong Tao
  Cc: Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list, Gavin Guo, Gerald Yang, Trent Lloyd,
	Dominique Poulain, Dongsheng Yang, Benjamin Allot

On 1/14/21 8:22 PM, Dongdong Tao wrote:
> Hi Coly,
> 
> Why did you limit the iodepth to 8 and the iops to 150 on the cache device?
> For the cache device the limitation is small. 150 iops with a 4KB block size
> means writing (150*4*60*60=2160000KB=) about 2GB every hour. For 35
> hours it is only 70GB.
> 
> 
> What if the iodepth is 128 or 64, with no iops rate limitation?
> -> There are two reasons why I limit the iodepth and the iops rate.
> 1. If I don't limit them, the dirty cache will be filled up very
> quickly, within 20 minutes.
>      Writes run at almost NVME speed before reaching the 70
> cutoff_writeback_sync; there is no way for any kind of writeback to
> stop the cache from
>      filling up, due to the huge gap between NVME and HDD in terms of
> throughput.
>      I don't think there is anything we can do about it, and it should
> only happen in a benchmark world, not in production.
>      The improvement I'm trying to make here is just for normal
> production workloads, not for this benchmark scenario really.
>      I currently can't see any necessity to test this scenario; please
> kindly let me know if I'm wrong about this.
>
> 2. The reason that I set the iodepth to 8 and the iops to 150 is based
> on the experience that I observed from the production env, mostly ceph:
>     ceph-osd has fewer than 10 threads (default setting) that will send
> io to bcache in parallel. But I'm not sure about other applications.
>     I agree that we can increase the iodepth to 64 or 128, and it's
> doable. But we have to limit the iops; 150 IOPS is a reasonable
> workload.
>     The busiest ceph-osd that I've seen does about 1000 IOPS, but the
> average is still only about 600.
>     I can set the IOPS to a higher value like 600 and the iodepth to
> 128 to perform the later test, if that makes sense to you.
> 

OK, now I understand the reason, with this extra information. Since the
cache device is filled up within 20 minutes, it is unnecessary to do the
faster testing on your side. Let me do it later on my hardware.


> Lastly, please allow me to clarify more about the production issue
> that this patch is trying to address:
> 
> In the production env that hit this issue, it usually takes a very
> long time (many take days) for the cache_available_percent to drop to
> 30, and the dirty data mostly stays at a very low level (around
> 10 percent), which means that the bcache isn't being stressed very
> hard most of the time.
>  There is no intention to save the cutoff_writeback_sync when the
> bcache is being stressed without limitation; hope the above makes sense :)
> 

Yes, you explained it clearly previously. What I worried about was whether
a faster writeback may interfere with the throughput and latency of
regular I/Os.

From your current testing data it looks fine to me.


> By the way, my colleague and I are trying to gather some production
> bcache stats; I hope we can give you the performance numbers before and
> after applying the patch.

Yes that will be great.

And could you please gather all the current data charts into a single email,
and reference it in your patch via lore? Then people who don't
subscribe to the linux-bcache mailing list may find all the posted
performance data from your patch.

In general your testing data is convincing IMHO, and I will add your
updated patch for the 5.12 merge window.


Thanks.

Coly Li




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2021-01-14 10:05                 ` Coly Li
@ 2021-01-14 12:22                   ` Dongdong Tao
  2021-01-14 13:31                     ` Coly Li
  0 siblings, 1 reply; 24+ messages in thread
From: Dongdong Tao @ 2021-01-14 12:22 UTC (permalink / raw)
  To: Coly Li
  Cc: Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list, Gavin Guo, Gerald Yang, Trent Lloyd,
	Dominique Poulain, Dongsheng Yang, Benjamin Allot

Hi Coly,

Why did you limit the iodepth to 8 and the iops to 150 on the cache device?
For the cache device the limitation is small. 150 iops with a 4KB block size
means writing (150*4*60*60=2160000KB=) about 2GB every hour. For 35
hours it is only 70GB.


What if the iodepth is 128 or 64, with no iops rate limitation?
-> There are two reasons why I limit the iodepth and the iops rate.
1. If I don't limit them, the dirty cache will be filled up very
quickly, within 20 minutes.
     Writes run at almost NVME speed before reaching the 70
cutoff_writeback_sync; there is no way for any kind of writeback to
stop the cache from
     filling up, due to the huge gap between NVME and HDD in terms of
throughput.
     I don't think there is anything we can do about it, and it should
only happen in a benchmark world, not in production.
     The improvement I'm trying to make here is just for normal
production workloads, not for this benchmark scenario really.
     I currently can't see any necessity to test this scenario; please
kindly let me know if I'm wrong about this.

2. The reason that I set the iodepth to 8 and the iops to 150 is based
on the experience that I observed from the production env, mostly ceph:
    ceph-osd has fewer than 10 threads (default setting) that will send
io to bcache in parallel. But I'm not sure about other applications.
    I agree that we can increase the iodepth to 64 or 128, and it's
doable. But we have to limit the iops; 150 IOPS is a reasonable
workload.
    The busiest ceph-osd that I've seen does about 1000 IOPS, but the
average is still only about 600.
    I can set the IOPS to a higher value like 600 and the iodepth to
128 to perform the later test, if that makes sense to you.

Lastly, please allow me to clarify more about the production issue
that this patch is trying to address:

In the production env that hit this issue, it usually takes a very
long time (many take days) for the cache_available_percent to drop to
30, and the dirty data mostly stays at a very low level (around
10 percent), which means that the bcache isn't being stressed very
hard most of the time.
 There is no intention to save the cutoff_writeback_sync when the
bcache is being stressed without limitation; hope the above makes sense :)

By the way, my colleague and I are trying to gather some production
bcache stats; I hope we can give you the performance numbers before and
after applying the patch.

Thanks,
Dongdong






^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
       [not found]               ` <CAJS8hVJDaREvpvG4iO+Xs-KQXQKFi7=k29TrG=NXqjyiPpUCZA@mail.gmail.com>
@ 2021-01-14 10:05                 ` Coly Li
  2021-01-14 12:22                   ` Dongdong Tao
  0 siblings, 1 reply; 24+ messages in thread
From: Coly Li @ 2021-01-14 10:05 UTC (permalink / raw)
  To: Dongdong Tao
  Cc: Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list, Gavin Guo, Gerald Yang, Trent Lloyd,
	Dominique Poulain, Dongsheng Yang

On 1/14/21 12:45 PM, Dongdong Tao wrote:
> Hi Coly,
> 
> I've got the testing data for multiple threads with larger IO depth.
> 

Hi Dongdong,

Thanks for the testing numbers.

> Here are the testing steps:
> 1. make-bcache -B <> -C <> --writeback
> 
> 2. Open two tabs and start a different fio task in each at the same time.
> In Tab1, run the below fio command:
> sudo fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
> --iodepth=32 --rw=randrw --blocksize=64k,8k  --direct=1 --runtime=24000
> 
> In Tab2, run the below fio command:
> sudo fio --name=random-writers2 --filename=/dev/bcache0
> --ioengine=libaio --iodepth=8 --rw=randwrite --bs=4k --rate_iops=150
> --direct=1 --write_lat_log=rw --log_avg_msec=20
> 


Why did you limit the iodepth to 8 and the iops to 150 on the cache device?
For the cache device the limitation is small. 150 iops with a 4KB block size
means writing (150*4*60*60=2160000KB=) about 2GB every hour. For 35
hours it is only 70GB.


What if the iodepth is 128 or 64, with no iops rate limitation?


> Note
> - Tab1 fio will run for 24000 seconds; it is the one that causes the
> fragmentation and makes the cache_available_percent drop to under 40.
> - Tab2 fio is the one whose latency I'm capturing, and I have let it
> run for about 35 hours, which is long enough to allow the
> cache_available_percent to drop under 30.
> - This testing method utilizes the fio benchmark with a larger read block
> size/small write block size to cause the high fragmentation. However, in
> a real production env, there could be
>    various reasons, or a combination of various reasons, that cause the high
> fragmentation, but I believe it should be ok to use any method to cause
> the fragmentation to verify if
>    bcache with this patch responds better than the master in this
> situation.
> 
> *Below is the testing result:*
> 
> The total run time is about 35 hours; there are 1.5 million latency
> points in the charts for each run.
> 
> Master:
> fio-lat-mater.png
> 
> Master + patch:
> fio-lat-patch.png
> Combine them together:
> fio-lat-mix.png
> 
> Now we can see the master is even worse when we increase the iodepth,
> which makes sense since the backing HDD is being stressed harder.
> 
> *Below are the cache stats changing during the run:*
> Master:
> bcache-stats-master.png
> 
> Master + the patch:
> bcache-stats-patch.png
> 
> That's all the testing done with the 400GB NVME with 512B block size.
> 
> Coly, do you want me to continue the same testing on a 1TB NVME with a
> different block size? Or is it ok to skip the 1TB testing and continue
> the test with the 400GB NVME but with different block sizes?
> Feel free to let me know any other test scenarios that we should cover
> here.

Yes please, more testing is desired for a performance improvement. So far
I don't see performance numbers for a real heavy workload yet.

Thanks.

Coly Li


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2021-01-08  8:47             ` Dongdong Tao
@ 2021-01-14  4:55               ` Dongdong Tao
       [not found]               ` <CAJS8hVJDaREvpvG4iO+Xs-KQXQKFi7=k29TrG=NXqjyiPpUCZA@mail.gmail.com>
  1 sibling, 0 replies; 24+ messages in thread
From: Dongdong Tao @ 2021-01-14  4:55 UTC (permalink / raw)
  To: open list:BCACHE (BLOCK LAYER CACHE); +Cc: open list

[Share the google doc here to avoid SPAM detection]

Here is the new testing result for the multi-threaded fio testing:

https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing


On Fri, Jan 8, 2021 at 4:47 PM Dongdong Tao <dongdong.tao@canonical.com> wrote:
>
> Yeap, I will scale the testing for multiple threads with larger IO
> depth, thanks for the suggestion!
>
> On Fri, Jan 8, 2021 at 4:40 PM Coly Li <colyli@suse.de> wrote:
> >
> > On 1/8/21 4:30 PM, Dongdong Tao wrote:
> > > Hi Coly,
> > >
> > > They are captured over the same time span, but the meaning of the
> > > timestamp and the time unit on the x-axis are different.
> > > (Sorry, I should have clarified this right after the chart.)
> > >
> > > For the latency chart:
> > > The timestamp is the relative time since the beginning of the
> > > benchmark, so the start timestamp is 0 and the unit is milliseconds.
> > >
> > > For the dirty data and cache available percent chart:
> > > The timestamp is the UNIX timestamp and the time unit is seconds.
> > > I capture the stats every 5 seconds with the below script:
> > > ---
> > > #!/bin/sh
> > > while true; do echo "`date +%s`, `cat /sys/block/bcache0/bcache/dirty_data`, `cat /sys/block/bcache0/bcache/cache/cache_available_percent`, `cat /sys/block/bcache0/bcache/writeback_rate`" >> $1; sleep 5; done
> > > ---
> > >
> > > Unfortunately, I can't easily make them use the same timestamp, but
> > > I guess I can try to convert the UNIX timestamp to relative time
> > > like in the first one.
> > > But if we ignore the values on the X-axis, we can still roughly
> > > compare them by the length of the X-axis since they cover the
> > > same time span,
> > > and we can see that the master's writes started hitting the backing
> > > device when the cache_available_percent dropped to around 30.
> >
> > Copied, thanks for the explanation. The chart for a single thread with
> > io depth 1 is convincing IMHO :-)
> >
> > One more question: the benchmark is for a single I/O thread with io
> > depth 1, which is not a typical condition for a real workload. Do you
> > have a plan to test the latency and IOPS for multiple threads with a
> > larger I/O depth?
> >
> >
> > Thanks.
> >
> >
> > Coly Li
> >
> >
> > >
> > > On Fri, Jan 8, 2021 at 12:06 PM Coly Li <colyli@suse.de> wrote:
> > >>
> > >> On 1/7/21 10:55 PM, Dongdong Tao wrote:
> > >>> Hi Coly,
> > >>>
> > >>>
> > >>> Thanks for the reminder. I understand that the rate is only a hint of
> > >>> the throughput; it's a value used to calculate the sleep time between
> > >>> each round of key writeback. The higher the rate, the shorter the
> > >>> sleep time, and most of the time this means more dirty keys can be
> > >>> written back in a certain amount of time before the hard disk runs
> > >>> out of speed.
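
[For reference, the rate-to-sleep relationship described above can be
sketched as below. This is a simplified standalone illustration of the
idea, not the kernel's actual bch_next_delay() implementation, which
works with jiffies and applies clamping.]

---
#include <stdint.h>

/*
 * After writing back `done` sectors, sleep long enough that the
 * average throughput stays near `rate` sectors per second; a higher
 * rate therefore means a shorter sleep between writeback rounds.
 */
static uint64_t writeback_delay_us(uint64_t done, uint64_t rate)
{
	if (rate == 0)
		return 0;
	/* the time this batch "should" take at the target rate */
	return done * 1000000ULL / rate;
}
---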
> > >>>
> > >>>
> > >>> Here is the testing data from a run on a 400GB NVME (cache) + 1TB HDD (backing):
> > >>>
> > >>
> > >> Hi Dongdong,
> > >>
> > >> Nice charts :-)
> > >>
> > >>> Steps:
> > >>>
> > >>> 1. make-bcache -B <HDD> -C <NVME> --writeback
> > >>>
> > >>> 2. sudo fio --name=random-writers --filename=/dev/bcache0
> > >>>    --ioengine=libaio --iodepth=1 --rw=randrw --blocksize=64k,8k
> > >>>    --direct=1 --numjobs=1 --write_lat_log=mix --log_avg_msec=10
> > >>>
> > >>> The fio benchmark commands ran for about 20 hours.
> > >>>
> > >>
> > >> The time lengths of the first 3 charts are 7.000e+7; the rest are
> > >> 1.60930e+9. I guess the time length of the I/O latency chart is
> > >> 1/100 of the rest.
> > >>
> > >> Can you also post the latency charts for 1.60930e+9 seconds? Then I
> > >> can compare the latency with the dirty data and available cache charts.
> > >>
> > >>
> > >> Thanks.
> > >>
> > >>
> > >> Coly Li
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>>
> > >>> Let’s have a look at the write latency first:
> > >>>
> > >>> Master:
> > >>>
> > >>>
> > >>>
> > >>> Master+the patch:
> > >>>
> > >>> Combine them together:
> > >>>
> > >>> Again, the latency (y-axis) is in nanoseconds and the x-axis is the
> > >>> timestamp in milliseconds. As we can see, the master's latency is
> > >>> obviously much higher than the one with my patch once the master bcache
> > >>> hits the cutoff writeback sync, and the master isn't going to get out
> > >>> of this cutoff writeback sync situation. This graph shows it had
> > >>> already been stuck at the cutoff writeback sync for about 4 hours
> > >>> before I finished the testing; it may still need to stay stuck for
> > >>> days before it can get out of this situation by itself.
> > >>>
> > >>>
> > >>> Note that there are 1 million points for each run; red represents
> > >>> master, green represents master+my patch. Most of them overlap with
> > >>> each other, so it may look like this graph has more red points than
> > >>> green after it hits the cutoff, but that is simply because the latency
> > >>> has scaled to a bigger range, which represents the HDD latency.
> > >>>
> > >>>
> > >>>
> > >>> Let’s also have a look at the bcache’s cache available percent and dirty
> > >>> data percent.
> > >>>
> > >>> Master:
> > >>>
> > >>> Master+this patch:
> > >>>
> > >>> As you can see, this patch can avoid it hitting the cutoff writeback sync.
> > >>>
> > >>>
> > >>> As for the improvement of this patch over the first one, let's
> > >>> take a look at how the writeback rate changes during the run.
> > >>>
> > >>> patch V1:
> > >>>
> > >>>
> > >>>
> > >>> Patch V2:
> > >>>
> > >>>
> > >>> The Y-axis is the value of the rate. V1 is very aggressive, jumping
> > >>> instantly from the minimum 8 to around 10 million, while patch V2 can
> > >>> keep the rate under 5000 during the run, and after the first round of
> > >>> writeback it can stay even under 2500. This proves we don't need to be
> > >>> as aggressive as V1 to get out of the high-fragmentation situation that
> > >>> eventually causes all writes to hit the backing device. This looks very
> > >>> reasonable to me now.
> > >>>
> > >>> Note that the fio command I used consumes buckets quite aggressively,
> > >>> so it had to hit the third stage, which has the highest aggressiveness,
> > >>> but I believe this is not true in a real production env; a real
> > >>> production env won't consume buckets that aggressively, so I expect
> > >>> stage 3 will rarely need to be hit.
> > >>>
> > >>>
> > >>> As discussed, I'll run the multiple block size testing on at least a
> > >>> 1TB NVME device later, but it might take some time.
> > >>>
> > >>>
> > >>> Regards,
> > >>> Dongdong
> > >>>
> > >>> On Tue, Jan 5, 2021 at 12:33 PM Coly Li <colyli@suse.de> wrote:
> > >>>
> > >>>     On 1/5/21 11:44 AM, Dongdong Tao wrote:
> > >>>     > Hey Coly,
> > >>>     >
> > >>>     > This is the second version of the patch, please allow me to explain a
> > >>>     > bit for this patch:
> > >>>     >
> > >>>     > We accelerate the rate in 3 stages with different aggressiveness, the
> > >>>     > first stage starts when the dirty buckets percent reaches above
> > >>>     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW(50), the second is
> > >>>     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID(57) and the third is
> > >>>     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH(64). By default the first stage
> > >>>     > tries to writeback the amount of dirty data in one bucket (on average)
> > >>>     > in (1 / (dirty_buckets_percent - 50)) second, the second stage
> > >>>     tries to
> > >>>     > writeback the amount of dirty data in one bucket in (1 /
> > >>>     > (dirty_buckets_percent - 57)) * 200 millisecond. The third stage tries
> > >>>     > to writeback the amount of dirty data in one bucket in (1 /
> > >>>     > (dirty_buckets_percent - 64)) * 20 millisecond.
> > >>>     >
> > >>>     > As we can see, there are two writeback aggressiveness increasing
> > >>>     > strategies, one strategy is with the increasing of the stage, the
> > >>>     first
> > >>>     > stage is the easy-going phase whose initial rate is trying to
> > >>>     write back
> > >>>     > dirty data of one bucket in 1 second, the second stage is a bit more
> > >>>     > aggressive, the initial rate tries to writeback the dirty data of one
> > >>>     > bucket in 200 ms, and the last stage is even more aggressive, whose initial rate
> > >>>     tries
> > >>>     > to writeback the dirty data of one bucket in 20 ms. This makes sense,
> > >>>     > one reason is that if the preceding stage couldn’t get the
> > >>>     fragmentation
> > >>>     > to a fine stage, then the next stage should increase the
> > >>>     aggressiveness
> > >>>     > properly, also it is because the later stage is closer to the
> > >>>     > bch_cutoff_writeback_sync. Another aggressiveness increasing
> > >>>     strategy is
> > >>>     > with the increasing of dirty bucket percent within each stage, the
> > >>>     first
> > >>>     > strategy controls the initial writeback rate of each stage, while this
> > >>>     > one increases the rate based on the initial rate, which is
> > >>>     initial_rate
> > >>>     > * (dirty bucket percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
> > >>>     >
> > >>>     > The initial rate can be controlled by 3 parameters
> > >>>     > writeback_rate_fp_term_low, writeback_rate_fp_term_mid,
> > >>>     > writeback_rate_fp_term_high, they are default 1, 5, 50, users can
> > >>>     adjust
> > >>>     > them based on their needs.
> > >>>     >
> > >>>     > The reason that I choose 50, 57, 64 as the threshold value is because
> > >>>     > the GC must be triggered at least once during each stage due to the
> > >>>     > “sectors_to_gc” being set to 1/16 (6.25 %) of the total cache
> > >>>     size. So,
> > >>>     > the hope is that the first and second stage can get us back to good
> > >>>     > shape in most situations by smoothly writing back the dirty data
> > >>>     without
> > >>>     > giving too much stress to the backing devices, but it might still
> > >>>     enter
> > >>>     > the third stage if the bucket consumption is very aggressive.
> > >>>     >
> > >>>     > This patch uses (dirty / dirty_buckets) * fp_term to calculate
> > >>>     > the rate. This formula means that we want to write back
> > >>>     > (dirty / dirty_buckets) in 1/fp_term second; fp_term is
> > >>>     > calculated by the above aggressiveness controller, “dirty” is
> > >>>     > the current dirty sectors, and “dirty_buckets” is the current
> > >>>     > dirty buckets, so (dirty / dirty_buckets) is the average dirty
> > >>>     > sectors in one bucket. The value is between 0 and 1024 for the
> > >>>     > default setting, so this formula basically gives a hint to
> > >>>     > reclaim one bucket in 1/fp_term second. By using this semantic,
> > >>>     > we can have a lower writeback rate when the amount of dirty
> > >>>     > data is decreasing and overcome the fact that the dirty buckets
> > >>>     > number is always increasing unless GC happens.
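
[For reference, the staged controller described in the quoted paragraphs
above can be sketched in plain C as below. It follows the names and
default constants quoted in this thread; fragment_boost() is a
hypothetical helper name for illustration, not the patch itself.]

---
#define FRAG_THRESHOLD_LOW  50
#define FRAG_THRESHOLD_MID  57
#define FRAG_THRESHOLD_HIGH 64

/*
 * Extra proportional term once the dirty bucket percent crosses the
 * first threshold: (dirty / dirty_buckets) is the average dirty
 * sectors per bucket, and fp_term grows both per stage (term_low <
 * term_mid < term_high) and with the distance into the current stage.
 */
static long fragment_boost(long dirty_sectors, long dirty_buckets,
			   long dirty_buckets_percent,
			   long term_low, long term_mid, long term_high)
{
	long fp_term;

	if (dirty_buckets_percent <= FRAG_THRESHOLD_LOW || dirty_buckets <= 0)
		return 0;	/* below the first stage: leave the rate alone */

	if (dirty_buckets_percent <= FRAG_THRESHOLD_MID)
		fp_term = term_low * (dirty_buckets_percent - FRAG_THRESHOLD_LOW);
	else if (dirty_buckets_percent <= FRAG_THRESHOLD_HIGH)
		fp_term = term_mid * (dirty_buckets_percent - FRAG_THRESHOLD_MID);
	else
		fp_term = term_high * (dirty_buckets_percent - FRAG_THRESHOLD_HIGH);

	/* i.e. reclaim roughly one bucket every 1/fp_term second */
	return (dirty_sectors / dirty_buckets) * fp_term;
}
---

With the default terms 1, 5 and 50, the boost stays 0 until half the
buckets are dirty and then grows smoothly within each stage, which matches
the gradual rate curves described for patch V2.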
> > >>>     >
> > >>>     > *Compared to the first patch:*
> > >>>     > The first patch is trying to write back all the data in 40 seconds;
> > >>>     > this will result in a very high writeback rate when the amount of
> > >>>     dirty
> > >>>     > data is big, this is mostly true for the large cache devices. The
> > >>>     basic
> > >>>     > problem is that the semantic of this patch is not ideal, because we
> > >>>     > don’t really need to writeback all dirty data in order to solve this
> > >>>     > issue, and the instant large increase of the rate is something I
> > >>>     feel we
> > >>>     > should better avoid (I like things to be smoothly changed unless no
> > >>>     > choice: )).
> > >>>     >
> > >>>     > Before I get to this new patch (which I believe should be optimal
> > >>>     for me
> > >>>     > atm), there have been many tuning/testing iterations, eg. I’ve
> > >>>     tried to
> > >>>     > tune the algorithm to writeback ⅓ of the dirty data in a certain
> > >>>     amount
> > >>>     > of seconds, writeback 1/fragment of the dirty data in a certain amount
> > >>>     > of seconds, writeback all the dirty data only in those error_buckets
> > >>>     > (error buckets = dirty buckets - 50% of the total buckets) in a
> > >>>     certain
> > >>>     > amount of time. However, those all turn out not to be ideal, only the
> > >>>     > semantic of the patch makes much sense for me and allows me to control
> > >>>     > the rate in a more precise way.
> > >>>     >
> > >>>     > *Testing data:*
> > >>>     > I'll provide the visualized testing data in the next couple of days
> > >>>     > with 1TB NVME devices cache but with HDD as backing device since it's
> > >>>     > what we mostly used in production env.
> > >>>     > I have the data for 400GB NVME, let me prepare it and take it for
> > >>>     you to
> > >>>     > review.
> > >>>     [snipped]
> > >>>
> > >>>     Hi Dongdong,
> > >>>
> > >>>     Thanks for the update and continuous effort on this idea.
> > >>>
> > >>>     Please keep in mind the writeback rate is just an advisory rate
> > >>>     for the writeback throughput; in a real workload, changing the
> > >>>     writeback rate number does not obviously change the writeback
> > >>>     throughput.
> > >>>
> > >>>     Currently I feel this is an interesting and promising idea in
> > >>>     your patch, but I am not able to say whether it will take effect
> > >>>     in a real workload, so we do need convincing performance data on
> > >>>     a real workload and configuration.
> > >>>
> > >>>     Of course I may also help with the benchmark, but my to-do list
> > >>>     is long enough and it may take a very long time.
> > >>>
> > >>>     Thanks.
> > >>>
> > >>>     Coly Li
> > >>>
> > >>
> >

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2021-01-08  8:39           ` Coly Li
@ 2021-01-08  8:47             ` Dongdong Tao
  2021-01-14  4:55               ` Dongdong Tao
       [not found]               ` <CAJS8hVJDaREvpvG4iO+Xs-KQXQKFi7=k29TrG=NXqjyiPpUCZA@mail.gmail.com>
  0 siblings, 2 replies; 24+ messages in thread
From: Dongdong Tao @ 2021-01-08  8:47 UTC (permalink / raw)
  To: Coly Li
  Cc: Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list, Gavin Guo, Gerald Yang, Trent Lloyd,
	Dominique Poulain, Dongsheng Yang

Yeap, I will scale the testing for multiple threads with larger IO
depth, thanks for the suggestion!

On Fri, Jan 8, 2021 at 4:40 PM Coly Li <colyli@suse.de> wrote:
>
> On 1/8/21 4:30 PM, Dongdong Tao wrote:
> > Hi Coly,
> >
> > They are captured over the same time span, but the meaning of the
> > timestamp and the time unit on the x-axis are different.
> > (Sorry, I should have clarified this right after the chart.)
> >
> > For the latency chart:
> > The timestamp is the relative time since the beginning of the
> > benchmark, so the start timestamp is 0 and the unit is milliseconds.
> >
> > For the dirty data and cache available percent chart:
> > The timestamp is the UNIX timestamp and the time unit is seconds.
> > I capture the stats every 5 seconds with the below script:
> > ---
> > #!/bin/sh
> > while true; do echo "`date +%s`, `cat /sys/block/bcache0/bcache/dirty_data`, `cat /sys/block/bcache0/bcache/cache/cache_available_percent`, `cat /sys/block/bcache0/bcache/writeback_rate`" >> $1; sleep 5; done
> > ---
> >
> > Unfortunately, I can't easily make them use the same timestamp, but
> > I guess I can try to convert the UNIX timestamp to relative time
> > like in the first one.
> > But if we ignore the values on the X-axis, we can still roughly
> > compare them by the length of the X-axis since they cover the
> > same time span,
> > and we can see that the master's writes started hitting the backing
> > device when the cache_available_percent dropped to around 30.
>
> Copied, thanks for the explanation. The chart for a single thread with
> io depth 1 is convincing IMHO :-)
>
> One more question: the benchmark is for a single I/O thread with io
> depth 1, which is not a typical condition for a real workload. Do you
> have a plan to test the latency and IOPS for multiple threads with a
> larger I/O depth?
>
>
> Thanks.
>
>
> Coly Li
>
>
> >
> > On Fri, Jan 8, 2021 at 12:06 PM Coly Li <colyli@suse.de> wrote:
> >>
> >> On 1/7/21 10:55 PM, Dongdong Tao wrote:
> >>> Hi Coly,
> >>>
> >>>
> >>> Thanks for the reminder. I understand that the rate is only a hint of
> >>> the throughput; it's a value used to calculate the sleep time between
> >>> each round of key writeback. The higher the rate, the shorter the
> >>> sleep time, and most of the time this means more dirty keys can be
> >>> written back in a certain amount of time before the hard disk runs
> >>> out of speed.
> >>>
> >>>
> >>> Here is the testing data from a run on a 400GB NVME (cache) + 1TB HDD (backing):
> >>>
> >>
> >> Hi Dongdong,
> >>
> >> Nice charts :-)
> >>
> >>> Steps:
> >>>
> >>> 1. make-bcache -B <HDD> -C <NVME> --writeback
> >>>
> >>> 2. sudo fio --name=random-writers --filename=/dev/bcache0
> >>>    --ioengine=libaio --iodepth=1 --rw=randrw --blocksize=64k,8k
> >>>    --direct=1 --numjobs=1 --write_lat_log=mix --log_avg_msec=10
> >>>
> >>> The fio benchmark commands ran for about 20 hours.
> >>>
> >>
> >> The time lengths of the first 3 charts are 7.000e+7; the rest are
> >> 1.60930e+9. I guess the time length of the I/O latency chart is
> >> 1/100 of the rest.
> >>
> >> Can you also post the latency charts for 1.60930e+9 seconds? Then I
> >> can compare the latency with the dirty data and available cache charts.
> >>
> >>
> >> Thanks.
> >>
> >>
> >> Coly Li
> >>
> >>
> >>
> >>
> >>
> >>>
> >>> Let’s have a look at the write latency first:
> >>>
> >>> Master:
> >>>
> >>>
> >>>
> >>> Master+the patch:
> >>>
> >>> Combine them together:
> >>>
> >>> Again, the latency (y-axis) is in nanoseconds and the x-axis is the
> >>> timestamp in milliseconds. As we can see, the master's latency is
> >>> obviously much higher than the one with my patch once the master bcache
> >>> hits the cutoff writeback sync, and the master isn't going to get out
> >>> of this cutoff writeback sync situation. This graph shows it had
> >>> already been stuck at the cutoff writeback sync for about 4 hours
> >>> before I finished the testing; it may still need to stay stuck for
> >>> days before it can get out of this situation by itself.
> >>>
> >>>
> >>> Note that there are 1 million points for each run; red represents
> >>> master, green represents master+my patch. Most of them overlap with
> >>> each other, so it may look like this graph has more red points than
> >>> green after it hits the cutoff, but that is simply because the latency
> >>> has scaled to a bigger range, which represents the HDD latency.
> >>>
> >>>
> >>>
> >>> Let’s also have a look at the bcache’s cache available percent and dirty
> >>> data percent.
> >>>
> >>> Master:
> >>>
> >>> Master+this patch:
> >>>
> >>> As you can see, this patch can avoid it hitting the cutoff writeback sync.
> >>>
> >>>
> >>> As for the improvement of this patch over the first one, let's
> >>> take a look at how the writeback rate changes during the run.
> >>>
> >>> patch V1:
> >>>
> >>>
> >>>
> >>> Patch V2:
> >>>
> >>>
> >>> The Y-axis is the value of the rate. V1 is very aggressive, jumping
> >>> instantly from the minimum 8 to around 10 million, while patch V2 can
> >>> keep the rate under 5000 during the run, and after the first round of
> >>> writeback it can stay even under 2500. This proves we don't need to be
> >>> as aggressive as V1 to get out of the high-fragmentation situation that
> >>> eventually causes all writes to hit the backing device. This looks very
> >>> reasonable to me now.
> >>>
> >>> Note that the fio command I used consumes buckets quite aggressively,
> >>> so it had to hit the third stage, which has the highest aggressiveness,
> >>> but I believe this is not true in a real production env; a real
> >>> production env won't consume buckets that aggressively, so I expect
> >>> stage 3 will rarely need to be hit.
> >>>
> >>>
> >>> As discussed, I'll run the multiple block size testing on at least a
> >>> 1TB NVME device later, but it might take some time.
> >>>
> >>>
> >>> Regards,
> >>> Dongdong
> >>>
> >>> On Tue, Jan 5, 2021 at 12:33 PM Coly Li <colyli@suse.de> wrote:
> >>>
> >>>     On 1/5/21 11:44 AM, Dongdong Tao wrote:
> >>>     > Hey Coly,
> >>>     >
> >>>     > This is the second version of the patch, please allow me to explain a
> >>>     > bit for this patch:
> >>>     >
> >>>     > We accelerate the rate in 3 stages with different aggressiveness, the
> >>>     > first stage starts when the dirty buckets percent reaches above
> >>>     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW(50), the second is
> >>>     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID(57) and the third is
> >>>     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH(64). By default the first stage
> >>>     > tries to writeback the amount of dirty data in one bucket (on average)
> >>>     > in (1 / (dirty_buckets_percent - 50)) second, the second stage
> >>>     tries to
> >>>     > writeback the amount of dirty data in one bucket in (1 /
> >>>     > (dirty_buckets_percent - 57)) * 200 millisecond. The third stage tries
> >>>     > to writeback the amount of dirty data in one bucket in (1 /
> >>>     > (dirty_buckets_percent - 64)) * 20 millisecond.
> >>>     >
> >>>     > As we can see, there are two writeback aggressiveness increasing
> >>>     > strategies, one strategy is with the increasing of the stage, the
> >>>     first
> >>>     > stage is the easy-going phase whose initial rate is trying to
> >>>     write back
> >>>     > dirty data of one bucket in 1 second, the second stage is a bit more
> >>>     > aggressive, the initial rate tries to writeback the dirty data of one
> >>>     > bucket in 200 ms, and the last stage is even more aggressive, whose initial rate
> >>>     tries
> >>>     > to writeback the dirty data of one bucket in 20 ms. This makes sense,
> >>>     > one reason is that if the preceding stage couldn’t get the
> >>>     fragmentation
> >>>     > to a fine stage, then the next stage should increase the
> >>>     aggressiveness
> >>>     > properly, also it is because the later stage is closer to the
> >>>     > bch_cutoff_writeback_sync. Another aggressiveness increasing
> >>>     strategy is
> >>>     > with the increasing of dirty bucket percent within each stage, the
> >>>     first
> >>>     > strategy controls the initial writeback rate of each stage, while this
> >>>     > one increases the rate based on the initial rate, which is
> >>>     initial_rate
> >>>     > * (dirty bucket percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
> >>>     >
> >>>     > The initial rate can be controlled by 3 parameters
> >>>     > writeback_rate_fp_term_low, writeback_rate_fp_term_mid,
> >>>     > writeback_rate_fp_term_high, they are default 1, 5, 50, users can
> >>>     adjust
> >>>     > them based on their needs.
> >>>     >
> >>>     > The reason that I choose 50, 57, 64 as the threshold value is because
> >>>     > the GC must be triggered at least once during each stage due to the
> >>>     > “sectors_to_gc” being set to 1/16 (6.25 %) of the total cache
> >>>     size. So,
> >>>     > the hope is that the first and second stage can get us back to good
> >>>     > shape in most situations by smoothly writing back the dirty data
> >>>     without
> >>>     > giving too much stress to the backing devices, but it might still
> >>>     enter
> >>>     > the third stage if the bucket consumption is very aggressive.
> >>>     >
> >>>     > This patch uses (dirty / dirty_buckets) * fp_term to calculate
> >>>     > the rate. This formula means that we want to write back
> >>>     > (dirty / dirty_buckets) in 1/fp_term second; fp_term is
> >>>     > calculated by the above aggressiveness controller, “dirty” is
> >>>     > the current dirty sectors, and “dirty_buckets” is the current
> >>>     > dirty buckets, so (dirty / dirty_buckets) is the average dirty
> >>>     > sectors in one bucket. The value is between 0 and 1024 for the
> >>>     > default setting, so this formula basically gives a hint to
> >>>     > reclaim one bucket in 1/fp_term second. By using this semantic,
> >>>     > we can have a lower writeback rate when the amount of dirty
> >>>     > data is decreasing and overcome the fact that the dirty buckets
> >>>     > number is always increasing unless GC happens.
> >>>     >
> >>>     > *Compared to the first patch:*
> >>>     > The first patch is trying to write back all the data in 40 seconds;
> >>>     > this will result in a very high writeback rate when the amount of
> >>>     dirty
> >>>     > data is big, this is mostly true for the large cache devices. The
> >>>     basic
> >>>     > problem is that the semantic of this patch is not ideal, because we
> >>>     > don’t really need to writeback all dirty data in order to solve this
> >>>     > issue, and the instant large increase of the rate is something I
> >>>     feel we
> >>>     > should better avoid (I like things to be smoothly changed unless no
> >>>     > choice: )).
> >>>     >
> >>>     > Before I get to this new patch (which I believe should be optimal
> >>>     for me
> >>>     > atm), there have been many tuning/testing iterations, eg. I’ve
> >>>     tried to
> >>>     > tune the algorithm to writeback ⅓ of the dirty data in a certain
> >>>     amount
> >>>     > of seconds, writeback 1/fragment of the dirty data in a certain amount
> >>>     > of seconds, writeback all the dirty data only in those error_buckets
> >>>     > (error buckets = dirty buckets - 50% of the total buckets) in a
> >>>     certain
> >>>     > amount of time. However, those all turn out not to be ideal, only the
> >>>     > semantic of the patch makes much sense for me and allows me to control
> >>>     > the rate in a more precise way.
> >>>     >
> >>>     > *Testing data:*
> >>>     > I'll provide the visualized testing data in the next couple of days
> >>>     > with 1TB NVME devices cache but with HDD as backing device since it's
> >>>     > what we mostly used in production env.
> >>>     > I have the data for 400GB NVME, let me prepare it and take it for
> >>>     you to
> >>>     > review.
> >>>     [snipped]
> >>>
> >>>     Hi Dongdong,
> >>>
> >>>     Thanks for the update and continuous effort on this idea.
> >>>
> >>>     Please keep in mind the writeback rate is just an advisory rate
> >>>     for the writeback throughput; in a real workload, changing the
> >>>     writeback rate number does not obviously change the writeback
> >>>     throughput.
> >>>
> >>>     Currently I feel this is an interesting and promising idea in
> >>>     your patch, but I am not able to say whether it will take effect
> >>>     in a real workload, so we do need convincing performance data on
> >>>     a real workload and configuration.
> >>>
> >>>     Of course I may also help with the benchmark, but my to-do list
> >>>     is long enough and it may take a very long time.
> >>>
> >>>     Thanks.
> >>>
> >>>     Coly Li
> >>>
> >>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2021-01-08  8:30         ` Dongdong Tao
@ 2021-01-08  8:39           ` Coly Li
  2021-01-08  8:47             ` Dongdong Tao
  0 siblings, 1 reply; 24+ messages in thread
From: Coly Li @ 2021-01-08  8:39 UTC (permalink / raw)
  To: Dongdong Tao
  Cc: Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list, Gavin Guo, Gerald Yang, Trent Lloyd,
	Dominique Poulain, Dongsheng Yang

On 1/8/21 4:30 PM, Dongdong Tao wrote:
> Hi Coly,
> 
> They are captured over the same time span, but the meaning of the
> timestamp and the time unit on the x-axis are different.
> (Sorry, I should have clarified this right after the chart.)
> 
> For the latency chart:
> The timestamp is the relative time since the beginning of the
> benchmark, so the start timestamp is 0 and the unit is milliseconds.
> 
> For the dirty data and cache available percent chart:
> The timestamp is the UNIX timestamp and the time unit is seconds.
> I capture the stats every 5 seconds with the below script:
> ---
> #!/bin/sh
> while true; do echo "`date +%s`, `cat /sys/block/bcache0/bcache/dirty_data`, `cat /sys/block/bcache0/bcache/cache/cache_available_percent`, `cat /sys/block/bcache0/bcache/writeback_rate`" >> $1; sleep 5; done
> ---
> 
> Unfortunately, I can't easily make them use the same timestamp, but
> I guess I can try to convert the UNIX timestamp to relative time
> like in the first one.
> But if we ignore the values on the X-axis, we can still roughly
> compare them by the length of the X-axis since they cover the
> same time span,
> and we can see that the master's writes started hitting the backing
> device when the cache_available_percent dropped to around 30.

Copied, thanks for the explanation. The chart for a single thread with
io depth 1 is convincing IMHO :-)

One more question: the benchmark is for a single I/O thread with io
depth 1, which is not a typical condition for a real workload. Do you
have a plan to test the latency and IOPS for multiple threads with a
larger I/O depth?


Thanks.


Coly Li


> 
> On Fri, Jan 8, 2021 at 12:06 PM Coly Li <colyli@suse.de> wrote:
>>
>> On 1/7/21 10:55 PM, Dongdong Tao wrote:
>>> Hi Coly,
>>>
>>>
>>> Thanks for the reminder. I understand that the rate is only a hint of
>>> the throughput; it's a value used to calculate the sleep time between
>>> each round of key writeback. The higher the rate, the shorter the
>>> sleep time, and most of the time this means more dirty keys can be
>>> written back in a certain amount of time before the hard disk runs
>>> out of speed.
>>>
>>>
>>> Here is the testing data from a run on a 400GB NVME (cache) + 1TB HDD (backing):
>>>
>>
>> Hi Dongdong,
>>
>> Nice charts :-)
>>
>>> Steps:
>>>
>>> 1. make-bcache -B <HDD> -C <NVME> --writeback
>>>
>>> 2. sudo fio --name=random-writers --filename=/dev/bcache0
>>>    --ioengine=libaio --iodepth=1 --rw=randrw --blocksize=64k,8k
>>>    --direct=1 --numjobs=1 --write_lat_log=mix --log_avg_msec=10
>>>
>>> The fio benchmark commands ran for about 20 hours.
>>>
>>
>> The time lengths of the first 3 charts are 7.000e+7; the rest are
>> 1.60930e+9. I guess the time length of the I/O latency chart is
>> 1/100 of the rest.
>>
>> Can you also post the latency charts for 1.60930e+9 seconds? Then I
>> can compare the latency with the dirty data and available cache charts.
>>
>>
>> Thanks.
>>
>>
>> Coly Li
>>
>>
>>
>>
>>
>>>
>>> Let’s have a look at the write latency first:
>>>
>>> Master:
>>>
>>>
>>>
>>> Master+the patch:
>>>
>>> Combine them together:
>>>
>>> Again, the latency (y-axis) is in nanoseconds and the x-axis is the
>>> timestamp in milliseconds. As we can see, the master's latency is
>>> obviously much higher than the one with my patch once the master bcache
>>> hits the cutoff writeback sync, and the master isn't going to get out
>>> of this cutoff writeback sync situation. This graph shows it had
>>> already been stuck at the cutoff writeback sync for about 4 hours
>>> before I finished the testing; it may still need to stay stuck for
>>> days before it can get out of this situation by itself.
>>>
>>>
>>> Note that there are 1 million points for each run; red represents
>>> master, green represents master+my patch. Most of them overlap with
>>> each other, so it may look like this graph has more red points than
>>> green after it hits the cutoff, but that is simply because the latency
>>> has scaled to a bigger range, which represents the HDD latency.
>>>
>>>
>>>
>>> Let’s also have a look at the bcache’s cache available percent and dirty
>>> data percent.
>>>
>>> Master:
>>>
>>> Master+this patch:
>>>
>>> As you can see, this patch can avoid it hitting the cutoff writeback sync.
>>>
>>>
>>> As for the improvement of this patch over the first one, let's
>>> take a look at how the writeback rate changes during the run.
>>>
>>> patch V1:
>>>
>>>
>>>
>>> Patch V2:
>>>
>>>
>>> The Y-axis is the value of the rate. V1 is very aggressive, jumping
>>> instantly from the minimum 8 to around 10 million, while patch V2 can
>>> keep the rate under 5000 during the run, and after the first round of
>>> writeback it can stay even under 2500. This proves we don't need to be
>>> as aggressive as V1 to get out of the high-fragmentation situation that
>>> eventually causes all writes to hit the backing device. This looks very
>>> reasonable to me now.
>>>
>>> Note that the fio command I used consumes buckets quite aggressively,
>>> so it had to hit the third stage, which has the highest aggressiveness,
>>> but I believe this is not true in a real production env; a real
>>> production env won't consume buckets that aggressively, so I expect
>>> stage 3 will rarely need to be hit.
>>>
>>>
>>> As discussed, I'll run the multiple block size testing on at least a
>>> 1TB NVME device later, but it might take some time.
>>>
>>>
>>> Regards,
>>> Dongdong
>>>
>>> On Tue, Jan 5, 2021 at 12:33 PM Coly Li <colyli@suse.de> wrote:
>>>
>>>     On 1/5/21 11:44 AM, Dongdong Tao wrote:
>>>     > Hey Coly,
>>>     >
>>>     > This is the second version of the patch, please allow me to explain a
>>>     > bit for this patch:
>>>     >
>>>     > We accelerate the rate in 3 stages with different aggressiveness, the
>>>     > first stage starts when the dirty buckets percent reaches above
>>>     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW(50), the second is
>>>     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID(57) and the third is
>>>     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH(64). By default the first stage
>>>     > tries to writeback the amount of dirty data in one bucket (on average)
>>>     > in (1 / (dirty_buckets_percent - 50)) second, the second stage
>>>     tries to
>>>     > writeback the amount of dirty data in one bucket in (1 /
>>>     > (dirty_buckets_percent - 57)) * 200 millisecond. The third stage tries
>>>     > to writeback the amount of dirty data in one bucket in (1 /
>>>     > (dirty_buckets_percent - 64)) * 20 millisecond.
>>>     >
>>>     > As we can see, there are two writeback aggressiveness increasing
>>>     > strategies, one strategy is with the increasing of the stage, the
>>>     first
>>>     > stage is the easy-going phase whose initial rate is trying to
>>>     write back
>>>     > dirty data of one bucket in 1 second, the second stage is a bit more
>>>     > aggressive, the initial rate tries to writeback the dirty data of one
>>>     > bucket in 200 ms, and the last stage is even more aggressive, whose initial rate
>>>     tries
>>>     > to writeback the dirty data of one bucket in 20 ms. This makes sense,
>>>     > one reason is that if the preceding stage couldn’t get the
>>>     fragmentation
>>>     > to a fine stage, then the next stage should increase the
>>>     aggressiveness
>>>     > properly, also it is because the later stage is closer to the
>>>     > bch_cutoff_writeback_sync. Another aggressiveness increasing
>>>     strategy is
>>>     > with the increasing of dirty bucket percent within each stage, the
>>>     first
>>>     > strategy controls the initial writeback rate of each stage, while this
>>>     > one increases the rate based on the initial rate, which is
>>>     initial_rate
>>>     > * (dirty bucket percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
>>>     >
>>>     > The initial rate can be controlled by 3 parameters
>>>     > writeback_rate_fp_term_low, writeback_rate_fp_term_mid,
>>>     > writeback_rate_fp_term_high, they are default 1, 5, 50, users can
>>>     adjust
>>>     > them based on their needs.
>>>     >
>>>     > The reason that I choose 50, 57, 64 as the threshold value is because
>>>     > the GC must be triggered at least once during each stage due to the
>>>     > “sectors_to_gc” being set to 1/16 (6.25 %) of the total cache
>>>     size. So,
>>>     > the hope is that the first and second stage can get us back to good
>>>     > shape in most situations by smoothly writing back the dirty data
>>>     without
>>>     > giving too much stress to the backing devices, but it might still
>>>     enter
>>>     > the third stage if the bucket consumption is very aggressive.
>>>     >
>>>     > This patch uses (dirty / dirty_buckets) * fp_term to calculate
>>>     > the rate. This formula means that we want to write back
>>>     > (dirty / dirty_buckets) in 1/fp_term second; fp_term is
>>>     > calculated by the above aggressiveness controller, “dirty” is
>>>     > the current dirty sectors, and “dirty_buckets” is the current
>>>     > dirty buckets, so (dirty / dirty_buckets) is the average dirty
>>>     > sectors in one bucket. The value is between 0 and 1024 for the
>>>     > default setting, so this formula basically gives a hint to
>>>     > reclaim one bucket in 1/fp_term second. By using this semantic,
>>>     > we can have a lower writeback rate when the amount of dirty
>>>     > data is decreasing and overcome the fact that the dirty buckets
>>>     > number is always increasing unless GC happens.
>>>     >
>>>     > *Compared to the first patch:*
>>>     > The first patch is trying to write back all the data in 40 seconds;
>>>     > this will result in a very high writeback rate when the amount of
>>>     dirty
>>>     > data is big, this is mostly true for the large cache devices. The
>>>     basic
>>>     > problem is that the semantic of this patch is not ideal, because we
>>>     > don’t really need to writeback all dirty data in order to solve this
>>>     > issue, and the instant large increase of the rate is something I
>>>     feel we
>>>     > should better avoid (I like things to be smoothly changed unless no
>>>     > choice: )).
>>>     >
>>>     > Before I get to this new patch (which I believe should be optimal
>>>     for me
>>>     > atm), there have been many tuning/testing iterations, eg. I’ve
>>>     tried to
>>>     > tune the algorithm to writeback ⅓ of the dirty data in a certain
>>>     amount
>>>     > of seconds, writeback 1/fragment of the dirty data in a certain amount
>>>     > of seconds, writeback all the dirty data only in those error_buckets
>>>     > (error buckets = dirty buckets - 50% of the total buckets) in a
>>>     certain
>>>     > amount of time. However, those all turn out not to be ideal, only the
>>>     > semantic of the patch makes much sense for me and allows me to control
>>>     > the rate in a more precise way.
>>>     >
>>>     > *Testing data:*
>>>     > I'll provide the visualized testing data in the next couple of days
>>>     > with 1TB NVME devices cache but with HDD as backing device since it's
>>>     > what we mostly used in production env.
>>>     > I have the data for 400GB NVME, let me prepare it and take it for
>>>     you to
>>>     > review.
>>>     [snipped]
>>>
>>>     Hi Dongdong,
>>>
>>>     Thanks for the update and continuous effort on this idea.
>>>
>>>     Please keep in mind the writeback rate is just an advisory rate
>>>     for the writeback throughput; in a real workload, changing the
>>>     writeback rate number does not obviously change the writeback
>>>     throughput.
>>>
>>>     Currently I feel this is an interesting and promising idea in
>>>     your patch, but I am not able to say whether it will take effect
>>>     in a real workload, so we do need convincing performance data on
>>>     a real workload and configuration.
>>>
>>>     Of course I may also help with the benchmark, but my to-do list
>>>     is long enough and it may take a very long time.
>>>
>>>     Thanks.
>>>
>>>     Coly Li
>>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
  2021-01-08  4:05       ` Coly Li
@ 2021-01-08  8:30         ` Dongdong Tao
  2021-01-08  8:39           ` Coly Li
  0 siblings, 1 reply; 24+ messages in thread
From: Dongdong Tao @ 2021-01-08  8:30 UTC (permalink / raw)
  To: Coly Li
  Cc: Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list, Gavin Guo, Gerald Yang, Trent Lloyd,
	Dominique Poulain, Dongsheng Yang

Hi Coly,

They are captured over the same time span, but the meaning of the
timestamp and the time unit on the x-axis are different.
(Sorry, I should have clarified this right after the chart.)

For the latency chart:
The timestamp is the relative time since the beginning of the
benchmark, so the start timestamp is 0 and the unit is milliseconds.

For the dirty data and cache available percent chart:
The timestamp is the UNIX timestamp and the time unit is seconds.
I capture the stats every 5 seconds with the below script:
---
#!/bin/sh
while true; do echo "`date +%s`, `cat /sys/block/bcache0/bcache/dirty_data`, `cat /sys/block/bcache0/bcache/cache/cache_available_percent`, `cat /sys/block/bcache0/bcache/writeback_rate`" >> $1; sleep 5; done
---

Unfortunately, I can't easily make them use the same timestamp, but
I guess I can try to convert the UNIX timestamp to relative time
like in the first one.
But if we ignore the values on the X-axis, we can still roughly
compare them by the length of the X-axis since they cover the
same time span,
and we can see that the master's writes started hitting the backing
device when the cache_available_percent dropped to around 30.

Regards,
Dongdong


On Fri, Jan 8, 2021 at 12:06 PM Coly Li <colyli@suse.de> wrote:
>
> On 1/7/21 10:55 PM, Dongdong Tao wrote:
> > Hi Coly,
> >
> >
> > Thanks for the reminder. I understand that the rate is only a hint of
> > the throughput; it's a value used to calculate the sleep time between
> > each round of key writeback. The higher the rate, the shorter the
> > sleep time, and most of the time this means more dirty keys can be
> > written back in a certain amount of time before the hard disk runs
> > out of speed.
> >
> >
> > Here is the testing data from a run on a 400GB NVME (cache) + 1TB HDD (backing):
> >
>
> Hi Dongdong,
>
> Nice charts :-)
>
> > Steps:
> >
> > 1. make-bcache -B <HDD> -C <NVME> --writeback
> >
> > 2. sudo fio --name=random-writers --filename=/dev/bcache0
> >    --ioengine=libaio --iodepth=1 --rw=randrw --blocksize=64k,8k
> >    --direct=1 --numjobs=1 --write_lat_log=mix --log_avg_msec=10
> >
> > The fio benchmark commands ran for about 20 hours.
> >
>
> The time lengths of the first 3 charts are 7.000e+7; the rest are
> 1.60930e+9. I guess the time length of the I/O latency chart is
> 1/100 of the rest.
>
> Can you also post the latency charts for 1.60930e+9 seconds? Then I
> can compare the latency with the dirty data and available cache charts.
>
>
> Thanks.
>
>
> Coly Li
>
>
>
>
>
> >
> > Let’s have a look at the write latency first:
> >
> > Master:
> >
> >
> >
> > Master+the patch:
> >
> > Combine them together:
> >
> > Again, the latency (y-axis) is in nanoseconds and the x-axis is the
> > timestamp in milliseconds. As we can see, the master's latency is
> > obviously much higher than the one with my patch once the master bcache
> > hits the cutoff writeback sync, and the master isn't going to get out
> > of this cutoff writeback sync situation. This graph shows it had
> > already been stuck at the cutoff writeback sync for about 4 hours
> > before I finished the testing; it may still need to stay stuck for
> > days before it can get out of this situation by itself.
> >
> >
> > Note that there are 1 million points for each run; red represents
> > master, green represents master+my patch. Most of them overlap with
> > each other, so it may look like this graph has more red points than
> > green after it hits the cutoff, but that is simply because the latency
> > has scaled to a bigger range, which represents the HDD latency.
> >
> >
> >
> > Let’s also have a look at the bcache’s cache available percent and dirty
> > data percent.
> >
> > Master:
> >
> > Master+this patch:
> >
> > As you can see, this patch can avoid it hitting the cutoff writeback sync.
> >
> >
> > As for the improvement of this patch over the first one, let's
> > take a look at how the writeback rate changes during the run.
> >
> > patch V1:
> >
> >
> >
> > Patch V2:
> >
> >
> > The Y-axis is the value of the rate. V1 is very aggressive, jumping
> > instantly from the minimum 8 to around 10 million, while patch V2 can
> > keep the rate under 5000 during the run, and after the first round of
> > writeback it can stay even under 2500. This proves we don't need to be
> > as aggressive as V1 to get out of the high-fragmentation situation that
> > eventually causes all writes to hit the backing device. This looks very
> > reasonable to me now.
> >
> > Note that the fio command I used consumes buckets quite aggressively,
> > so it had to hit the third stage, which has the highest aggressiveness,
> > but I believe this is not true in a real production env; a real
> > production env won't consume buckets that aggressively, so I expect
> > stage 3 will rarely need to be hit.
> >
> >
> > As discussed, I'll run the multiple block size testing on at least a
> > 1TB NVME device later, but it might take some time.
> >
> >
> > Regards,
> > Dongdong
> >
> > On Tue, Jan 5, 2021 at 12:33 PM Coly Li <colyli@suse.de> wrote:
> >
> >     On 1/5/21 11:44 AM, Dongdong Tao wrote:
> >     > Hey Coly,
> >     >
> >     > This is the second version of the patch, please allow me to explain a
> >     > bit for this patch:
> >     >
> >     > We accelerate the rate in 3 stages with different aggressiveness, the
> >     > first stage starts when the dirty buckets percent reaches above
> >     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW(50), the second is
> >     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID(57) and the third is
> >     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH(64). By default the first stage
> >     > tries to writeback the amount of dirty data in one bucket (on average)
> >     > in (1 / (dirty_buckets_percent - 50)) second, the second stage
> >     tries to
> >     > writeback the amount of dirty data in one bucket in (1 /
> >     > (dirty_buckets_percent - 57)) * 200 millisecond. The third stage tries
> >     > to writeback the amount of dirty data in one bucket in (1 /
> >     > (dirty_buckets_percent - 64)) * 20 millisecond.
> >     >
> >     > As we can see, there are two writeback aggressiveness increasing
> >     > strategies, one strategy is with the increasing of the stage, the
> >     first
> >     > stage is the easy-going phase whose initial rate is trying to
> >     write back
> >     > dirty data of one bucket in 1 second, the second stage is a bit more
> >     > aggressive, the initial rate tries to writeback the dirty data of one
> >     > bucket in 200 ms, and the last stage is even more aggressive, whose initial rate
> >     tries
> >     > to writeback the dirty data of one bucket in 20 ms. This makes sense,
> >     > one reason is that if the preceding stage couldn’t get the
> >     fragmentation
> >     > to a fine stage, then the next stage should increase the
> >     aggressiveness
> >     > properly, also it is because the later stage is closer to the
> >     > bch_cutoff_writeback_sync. Another aggressiveness increasing
> >     strategy is
> >     > with the increasing of dirty bucket percent within each stage, the
> >     first
> >     > strategy controls the initial writeback rate of each stage, while this
> >     > one increases the rate based on the initial rate, which is
> >     initial_rate
> >     > * (dirty bucket percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
> >     >
> >     > The initial rate can be controlled by 3 parameters
> >     > writeback_rate_fp_term_low, writeback_rate_fp_term_mid,
> >     > writeback_rate_fp_term_high, they are default 1, 5, 50, users can
> >     adjust
> >     > them based on their needs.
> >     >
> >     > The reason that I choose 50, 57, 64 as the threshold value is because
> >     > the GC must be triggered at least once during each stage due to the
> >     > “sectors_to_gc” being set to 1/16 (6.25 %) of the total cache
> >     size. So,
> >     > the hope is that the first and second stage can get us back to good
> >     > shape in most situations by smoothly writing back the dirty data
> >     without
> >     > giving too much stress to the backing devices, but it might still
> >     enter
> >     > the third stage if the bucket consumption is very aggressive.
> >     >
> >     > This patch uses (dirty / dirty_buckets) * fp_term to calculate
> >     > the rate. This formula means that we want to write back
> >     > (dirty / dirty_buckets) in 1/fp_term second; fp_term is
> >     > calculated by the above aggressiveness controller, “dirty” is
> >     > the current dirty sectors, and “dirty_buckets” is the current
> >     > dirty buckets, so (dirty / dirty_buckets) is the average dirty
> >     > sectors in one bucket. The value is between 0 and 1024 for the
> >     > default setting, so this formula basically gives a hint to
> >     > reclaim one bucket in 1/fp_term second. By using this semantic,
> >     > we can have a lower writeback rate when the amount of dirty
> >     > data is decreasing and overcome the fact that the dirty buckets
> >     > number is always increasing unless GC happens.
> >     >
> >     > *Compared to the first patch:*
> >     > The first patch is trying to write back all the data in 40 seconds;
> >     > this will result in a very high writeback rate when the amount of
> >     dirty
> >     > data is big, this is mostly true for the large cache devices. The
> >     basic
> >     > problem is that the semantic of this patch is not ideal, because we
> >     > don’t really need to writeback all dirty data in order to solve this
> >     > issue, and the instant large increase of the rate is something I
> >     feel we
> >     > should better avoid (I like things to be smoothly changed unless no
> >     > choice: )).
> >     >
> >     > Before I get to this new patch (which I believe should be optimal
> >     for me
> >     > atm), there have been many tuning/testing iterations, eg. I’ve
> >     tried to
> >     > tune the algorithm to writeback ⅓ of the dirty data in a certain
> >     amount
> >     > of seconds, writeback 1/fragment of the dirty data in a certain amount
> >     > of seconds, writeback all the dirty data only in those error_buckets
> >     > (error buckets = dirty buckets - 50% of the total buckets) in a
> >     certain
> >     > amount of time. However, those all turn out not to be ideal, only the
> >     > semantic of the patch makes much sense for me and allows me to control
> >     > the rate in a more precise way.
> >     >
> >     > *Testing data:*
> >     > I'll provide the visualized testing data in the next couple of days
> >     > with 1TB NVME devices cache but with HDD as backing device since it's
> >     > what we mostly used in production env.
> >     > I have the data for 400GB NVME, let me prepare it and take it for
> >     you to
> >     > review.
> >     [snipped]
> >
> >     Hi Dongdong,
> >
> >     Thanks for the update and continuous effort on this idea.
> >
> >     Please keep in mind the writeback rate is just an advisory rate
> >     for the writeback throughput; in a real workload, changing the
> >     writeback rate number does not obviously change the writeback
> >     throughput.
> >
> >     Currently I feel this is an interesting and promising idea in
> >     your patch, but I am not able to say whether it will take effect
> >     in a real workload, so we do need convincing performance data on
> >     a real workload and configuration.
> >
> >     Of course I may also help with the benchmark, but my to-do list
> >     is long enough and it may take a very long time.
> >
> >     Thanks.
> >
> >     Coly Li
> >
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
       [not found]     ` <CAJS8hVL2B=RZr8H4jFbz=bX9k_E9ur7kTeue6BJwzm4pwv1+qQ@mail.gmail.com>
@ 2021-01-08  4:05       ` Coly Li
  2021-01-08  8:30         ` Dongdong Tao
  0 siblings, 1 reply; 24+ messages in thread
From: Coly Li @ 2021-01-08  4:05 UTC (permalink / raw)
  To: Dongdong Tao
  Cc: Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list, Gavin Guo, Gerald Yang, Trent Lloyd,
	Dominique Poulain, Dongsheng Yang

On 1/7/21 10:55 PM, Dongdong Tao wrote:
> Hi Coly,
> 
> 
> Thanks for the reminder. I understand that the rate is only a hint of
> the throughput; it's a value used to calculate the sleep time between
> each round of key writeback. The higher the rate, the shorter the
> sleep time, and most of the time this means more dirty keys can be
> written back in a certain amount of time before the hard disk runs
> out of speed.
> 
> 
> Here is the testing data from a run on a 400GB NVME (cache) + 1TB HDD (backing):
> 

Hi Dongdong,

Nice charts :-)

> Steps:
>
> 1. make-bcache -B <HDD> -C <NVME> --writeback
>
> 2. sudo fio --name=random-writers --filename=/dev/bcache0
>    --ioengine=libaio --iodepth=1 --rw=randrw --blocksize=64k,8k
>    --direct=1 --numjobs=1 --write_lat_log=mix --log_avg_msec=10
>
> The fio benchmark commands ran for about 20 hours.
> 

The time lengths of the first 3 charts are 7.000e+7; the rest are
1.60930e+9. I guess the time length of the I/O latency chart is 1/100
of the rest.

Can you also post the latency charts for 1.60930e+9 seconds? Then I
can compare the latency with the dirty data and available cache charts.


Thanks.


Coly Li





> 
> Let’s have a look at the write latency first:
> 
> Master:
> 
> 
> 
> Master+the patch:
> 
> Combine them together:
> 
> Again, the latency (y-axis) is in nanoseconds and the x-axis is the
> timestamp in milliseconds. As we can see, the master's latency is
> obviously much higher than the one with my patch once the master bcache
> hits the cutoff writeback sync, and the master isn't going to get out
> of this cutoff writeback sync situation. This graph shows it had
> already been stuck at the cutoff writeback sync for about 4 hours
> before I finished the testing; it may still need to stay stuck for
> days before it can get out of this situation by itself.
> 
> 
> Note that there are 1 million points for each run; red represents
> master, green represents master+my patch. Most of them overlap with
> each other, so it may look like this graph has more red points than
> green after it hits the cutoff, but that is simply because the latency
> has scaled to a bigger range, which represents the HDD latency.
> 
> 
> 
> Let’s also have a look at the bcache’s cache available percent and dirty
> data percent.
> 
> Master:
> 
> Master+this patch:
> 
> As you can see, this patch can avoid it hitting the cutoff writeback sync.
> 
> 
> As for the improvement of this patch over the first one, let's
> take a look at how the writeback rate changes during the run.
> 
> Patch V1: [writeback rate chart]
> 
> Patch V2: [writeback rate chart]
> 
> The Y-axis is the rate value. V1 is very aggressive: it jumps
> instantly from the minimum of 8 to around 10 million. Patch V2 keeps
> the rate under 5000 during the run, and after the first round of
> writeback it stays even under 2500, which shows we do not need to be
> as aggressive as V1 to get out of the high-fragmentation situation
> that eventually causes all writes to hit the backing device. This
> looks very reasonable to me now.
> 
> Note that the fio command I used consumes buckets quite aggressively,
> so the test had to hit the third stage, which has the highest
> aggressiveness. I believe this is not the case in a real production
> environment, which will not consume buckets that aggressively, so I
> expect the third stage to be needed only rarely.
> 
> 
> As discussed, I'll run tests with multiple block sizes on at least a
> 1TB NVMe device later, but that might take some time.
> 
> 
> Regards,
> Dongdong
> 
> On Tue, Jan 5, 2021 at 12:33 PM Coly Li <colyli@suse.de> wrote:
> 
>     On 1/5/21 11:44 AM, Dongdong Tao wrote:
>     > Hey Coly,
>     >
>     > This is the second version of the patch, please allow me to
>     > explain it a bit:
>     >
>     > We accelerate the rate in 3 stages with different aggressiveness:
>     > the first stage starts when the dirty bucket percentage rises
>     > above BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at
>     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57) and the third at
>     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default the first
>     > stage tries to write back the amount of dirty data in one bucket
>     > (on average) in (1 / (dirty_buckets_percent - 50)) seconds, the
>     > second stage tries to write back the amount of dirty data in one
>     > bucket in (1 / (dirty_buckets_percent - 57)) * 200 milliseconds,
>     > and the third stage tries to write back the amount of dirty data
>     > in one bucket in (1 / (dirty_buckets_percent - 64)) * 20
>     > milliseconds.
>     >
>     > As you can see, there are two strategies for increasing the
>     > writeback aggressiveness. One works across stages: the first
>     > stage is the easy-going phase, whose initial rate tries to write
>     > back the dirty data of one bucket in 1 second; the second stage
>     > is a bit more aggressive, with an initial rate that tries to
>     > write back the dirty data of one bucket in 200 ms; and the last
>     > stage is more aggressive still, with an initial rate that tries
>     > to write back the dirty data of one bucket in 20 ms. This makes
>     > sense: if the preceding stage could not bring the fragmentation
>     > back to a reasonable level, the next stage should increase the
>     > aggressiveness accordingly, and the later stages are also closer
>     > to bch_cutoff_writeback_sync. The other strategy works within
>     > each stage, as the dirty bucket percentage increases: while the
>     > first strategy controls the initial writeback rate of each stage,
>     > this one scales the rate up from that initial value, as
>     > initial_rate * (dirty bucket percent -
>     > BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
>     >
>     > The initial rate can be controlled by 3 parameters:
>     > writeback_rate_fp_term_low, writeback_rate_fp_term_mid and
>     > writeback_rate_fp_term_high. They default to 1, 5 and 50, and
>     > users can adjust them based on their needs.
>     >
>     > The reason I chose 50, 57 and 64 as the threshold values is that
>     > GC must be triggered at least once during each stage, because
>     > "sectors_to_gc" is set to 1/16 (6.25%) of the total cache size.
>     > So the hope is that the first and second stages can get us back
>     > into good shape in most situations by smoothly writing back the
>     > dirty data without putting too much stress on the backing
>     > devices, though we might still enter the third stage if bucket
>     > consumption is very aggressive.
>     >
>     > This patch uses (dirty / dirty_buckets) * fp_term to calculate
>     > the rate. The formula means we want to write back
>     > (dirty / dirty_buckets) in 1/fp_term seconds; fp_term is
>     > calculated by the aggressiveness controller above, "dirty" is the
>     > current number of dirty sectors, and "dirty_buckets" is the
>     > current number of dirty buckets, so (dirty / dirty_buckets) is
>     > the average number of dirty sectors per bucket, a value between 0
>     > and 1024 with the default settings. The formula is essentially a
>     > hint to reclaim one bucket in 1/fp_term seconds. With this
>     > semantic we get a lower writeback rate when the amount of dirty
>     > data is decreasing, and we overcome the fact that the number of
>     > dirty buckets keeps increasing until GC happens.
>     >
>     > *Compared to the first patch:*
>     > The first patch tried to write back all the dirty data in 40
>     > seconds, which results in a very high writeback rate when the
>     > amount of dirty data is large, as is often the case for large
>     > cache devices. The basic problem is that the semantic of that
>     > patch is not ideal: we do not really need to write back all the
>     > dirty data to solve this issue, and an instant large increase of
>     > the rate is something I feel we should avoid (I like things to
>     > change smoothly unless there is no choice :)).
>     >
>     > Before I arrived at this new patch (which I believe should be
>     > optimal for me at the moment), there were many tuning/testing
>     > iterations, e.g. I tried tuning the algorithm to write back 1/3
>     > of the dirty data in a certain number of seconds, to write back
>     > 1/fragment of the dirty data in a certain number of seconds, and
>     > to write back only the dirty data in the error buckets
>     > (error buckets = dirty buckets - 50% of the total buckets) in a
>     > certain amount of time. None of those turned out to be ideal;
>     > only the semantic of this patch makes sense to me and lets me
>     > control the rate in a more precise way.
>     >
>     > *Testing data:*
>     > I'll provide the visualized testing data in the next couple of
>     > days with a 1TB NVMe cache device but an HDD as the backing
>     > device, since that is what we mostly use in production.
>     > I have the data for a 400GB NVMe; let me prepare it and bring it
>     > to you for review.
>     [snipped]
> 
>     Hi Dongdong,
> 
>     Thanks for the update and continuous effort on this idea.
> 
>     Please keep in mind that the writeback rate is just an advisory rate
>     for the writeback throughput; under a real workload, changing the
>     writeback rate number does not obviously change the writeback
>     throughput.
> 
>     Currently I feel this is an interesting and promising idea for your
>     patch, but I am not able to say whether it will take effect under a
>     real workload, so we do need convincing performance data on a real
>     workload and configuration.
> 
>     Of course I may also help with the benchmark, but my to-do list is
>     long enough, so it may take quite a long time.
> 
>     Thanks.
> 
>     Coly Li
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
       [not found] ` <CAJS8hVK-ZCxJt=E3hwR0hmqPYL1T07_WC_nerb-dZodO+DqtDA@mail.gmail.com>
@ 2021-01-05  4:33   ` Coly Li
       [not found]     ` <CAJS8hVL2B=RZr8H4jFbz=bX9k_E9ur7kTeue6BJwzm4pwv1+qQ@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Coly Li @ 2021-01-05  4:33 UTC (permalink / raw)
  To: Dongdong Tao
  Cc: Kent Overstreet, open list:BCACHE (BLOCK LAYER CACHE),
	open list, Gavin Guo, Gerald Yang, Trent Lloyd,
	Dominique Poulain, Dongsheng Yang

On 1/5/21 11:44 AM, Dongdong Tao wrote:
> Hey Coly,
> 
> This is the second version of the patch, please allow me to explain it
> a bit:
> 
> We accelerate the rate in 3 stages with different aggressiveness: the
> first stage starts when the dirty bucket percentage rises above
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57) and the third at
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default the first stage
> tries to write back the amount of dirty data in one bucket (on average)
> in (1 / (dirty_buckets_percent - 50)) seconds, the second stage tries
> to write back the amount of dirty data in one bucket in
> (1 / (dirty_buckets_percent - 57)) * 200 milliseconds, and the third
> stage tries to write back the amount of dirty data in one bucket in
> (1 / (dirty_buckets_percent - 64)) * 20 milliseconds.
> 
> As you can see, there are two strategies for increasing the writeback
> aggressiveness. One works across stages: the first stage is the
> easy-going phase, whose initial rate tries to write back the dirty
> data of one bucket in 1 second; the second stage is a bit more
> aggressive, with an initial rate that tries to write back the dirty
> data of one bucket in 200 ms; and the last stage is more aggressive
> still, with an initial rate that tries to write back the dirty data of
> one bucket in 20 ms. This makes sense: if the preceding stage could
> not bring the fragmentation back to a reasonable level, the next stage
> should increase the aggressiveness accordingly, and the later stages
> are also closer to bch_cutoff_writeback_sync. The other strategy works
> within each stage, as the dirty bucket percentage increases: while the
> first strategy controls the initial writeback rate of each stage, this
> one scales the rate up from that initial value, as initial_rate *
> (dirty bucket percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
> 
> The initial rate can be controlled by 3 parameters:
> writeback_rate_fp_term_low, writeback_rate_fp_term_mid and
> writeback_rate_fp_term_high. They default to 1, 5 and 50, and users
> can adjust them based on their needs.
> 
> The reason I chose 50, 57 and 64 as the threshold values is that GC
> must be triggered at least once during each stage, because
> "sectors_to_gc" is set to 1/16 (6.25%) of the total cache size. So the
> hope is that the first and second stages can get us back into good
> shape in most situations by smoothly writing back the dirty data
> without putting too much stress on the backing devices, though we
> might still enter the third stage if bucket consumption is very
> aggressive.
> 
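> As a rough worked example of that choice (numbers assumed for
> illustration): with sectors_to_gc at 1/16, GC fires after roughly
> 6.25% of the cache has been allocated, while each stage spans 7
> percentage points of bucket usage (50 -> 57 -> 64); since 6.25 < 7,
> at least one GC run should land inside every stage before the next
> threshold is crossed.
> 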
> This patch uses (dirty / dirty_buckets) * fp_term to calculate the
> rate. The formula means we want to write back (dirty / dirty_buckets)
> in 1/fp_term seconds; fp_term is calculated by the aggressiveness
> controller above, "dirty" is the current number of dirty sectors, and
> "dirty_buckets" is the current number of dirty buckets, so
> (dirty / dirty_buckets) is the average number of dirty sectors per
> bucket, a value between 0 and 1024 with the default settings. The
> formula is essentially a hint to reclaim one bucket in 1/fp_term
> seconds. With this semantic we get a lower writeback rate when the
> amount of dirty data is decreasing, and we overcome the fact that the
> number of dirty buckets keeps increasing until GC happens.
> 
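> A minimal userspace sketch of this staged calculation (thresholds and
> default terms taken from the patch; the sample numbers are made up):
> 
>     #include <stdint.h>
>     #include <stdio.h>
> 
>     /* Returns the fragmentation-based proportional term (fps). */
>     static int64_t frag_fps(int64_t in_use_pct, int64_t dirty,
>                             int64_t dirty_buckets)
>     {
>         int64_t fp_term;
> 
>         if (in_use_pct <= 50 || dirty <= 0)
>             return 0;                          /* below the first stage */
> 
>         if (in_use_pct <= 57)
>             fp_term = 1 * (in_use_pct - 50);   /* fp_term_low */
>         else if (in_use_pct <= 64)
>             fp_term = 5 * (in_use_pct - 57);   /* fp_term_mid */
>         else
>             fp_term = 50 * (in_use_pct - 64);  /* fp_term_high */
> 
>         /* average dirty sectors per bucket, times the stage term */
>         return (dirty / dirty_buckets) * fp_term;
>     }
> 
>     int main(void)
>     {
>         /* e.g. 60% buckets in use, 100 dirty sectors/bucket average */
>         printf("fps = %lld\n", (long long)frag_fps(60, 100 * 4096, 4096));
>         return 0;
>     }
> 
> With these sample numbers, fp_term = 5 * (60 - 57) = 15 and
> fps = 100 * 15 = 1500, which would replace proportional_scaled if it
> is larger and the fragment ratio exceeds 3.
> 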
> *Compared to the first patch:*
> The first patch tried to write back all the dirty data in 40 seconds,
> which results in a very high writeback rate when the amount of dirty
> data is large, as is often the case for large cache devices. The basic
> problem is that the semantic of that patch is not ideal: we do not
> really need to write back all the dirty data to solve this issue, and
> an instant large increase of the rate is something I feel we should
> avoid (I like things to change smoothly unless there is no choice :)).
> 
> Before I arrived at this new patch (which I believe should be optimal
> for me at the moment), there were many tuning/testing iterations, e.g.
> I tried tuning the algorithm to write back 1/3 of the dirty data in a
> certain number of seconds, to write back 1/fragment of the dirty data
> in a certain number of seconds, and to write back only the dirty data
> in the error buckets (error buckets = dirty buckets - 50% of the total
> buckets) in a certain amount of time. None of those turned out to be
> ideal; only the semantic of this patch makes sense to me and lets me
> control the rate in a more precise way.
> 
> *Testing data:*
> I'll provide the visualized testing data in the next couple of days
> with a 1TB NVMe cache device but an HDD as the backing device, since
> that is what we mostly use in production.
> I have the data for a 400GB NVMe; let me prepare it and bring it to
> you for review.
[snipped]

Hi Dongdong,

Thanks for the update and continuous effort on this idea.

Please keep in mind that the writeback rate is just an advisory rate
for the writeback throughput; under a real workload, changing the
writeback rate number does not obviously change the writeback
throughput.

Currently I feel this is an interesting and promising idea for your
patch, but I am not able to say whether it will take effect under a
real workload, so we do need convincing performance data on a real
workload and configuration.

Of course I may also help with the benchmark, but my to-do list is
long enough, so it may take quite a long time.

Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] bcache: consider the fragmentation when update the writeback rate
@ 2021-01-05  3:06 Dongdong Tao
       [not found] ` <CAJS8hVK-ZCxJt=E3hwR0hmqPYL1T07_WC_nerb-dZodO+DqtDA@mail.gmail.com>
  2021-01-17  7:11 ` kernel test robot
  0 siblings, 2 replies; 24+ messages in thread
From: Dongdong Tao @ 2021-01-05  3:06 UTC (permalink / raw)
  To: colyli; +Cc: dongdong tao, Kent Overstreet, linux-bcache, linux-kernel

From: dongdong tao <dongdong.tao@canonical.com>

The current way of calculating the writeback rate only considers
dirty sectors. This usually works fine when the fragmentation is
not high, but it gives us an unreasonably small rate when we are in
a situation where very few dirty sectors consume a lot of dirty
buckets. In some cases the dirty buckets can reach
CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not even
reached writeback_percent; the writeback rate will still be the
minimum value (4k), causing all writes to be stuck in non-writeback
mode because of the slow writeback.

We accelerate the rate in 3 stages with different aggressiveness:
the first stage starts when the dirty bucket percentage rises above
BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), and the third at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default
the first stage tries to write back the amount of dirty data
in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) seconds,
the second stage tries to write back the amount of dirty data in one
bucket in (1 / (dirty_buckets_percent - 57)) * 200 milliseconds, and
the third stage tries to write back the amount of dirty data in one
bucket in (1 / (dirty_buckets_percent - 64)) * 20 milliseconds.

Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
---
 drivers/md/bcache/bcache.h    |  3 +++
 drivers/md/bcache/sysfs.c     | 18 ++++++++++++++++
 drivers/md/bcache/writeback.c | 39 +++++++++++++++++++++++++++++++++++
 drivers/md/bcache/writeback.h |  4 ++++
 4 files changed, 64 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d57f48307e6..f8e892208bae 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -385,6 +385,9 @@ struct cached_dev {
 	unsigned int		writeback_rate_update_seconds;
 	unsigned int		writeback_rate_i_term_inverse;
 	unsigned int		writeback_rate_p_term_inverse;
+	unsigned int		writeback_rate_fp_term_low;
+	unsigned int		writeback_rate_fp_term_mid;
+	unsigned int		writeback_rate_fp_term_high;
 	unsigned int		writeback_rate_minimum;
 
 	enum stop_on_failure	stop_when_cache_set_failed;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 554e3afc9b68..130df9406171 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -121,6 +121,9 @@ rw_attribute(writeback_rate);
 rw_attribute(writeback_rate_update_seconds);
 rw_attribute(writeback_rate_i_term_inverse);
 rw_attribute(writeback_rate_p_term_inverse);
+rw_attribute(writeback_rate_fp_term_low);
+rw_attribute(writeback_rate_fp_term_mid);
+rw_attribute(writeback_rate_fp_term_high);
 rw_attribute(writeback_rate_minimum);
 read_attribute(writeback_rate_debug);
 
@@ -205,6 +208,9 @@ SHOW(__bch_cached_dev)
 	var_print(writeback_rate_update_seconds);
 	var_print(writeback_rate_i_term_inverse);
 	var_print(writeback_rate_p_term_inverse);
+	var_print(writeback_rate_fp_term_low);
+	var_print(writeback_rate_fp_term_mid);
+	var_print(writeback_rate_fp_term_high);
 	var_print(writeback_rate_minimum);
 
 	if (attr == &sysfs_writeback_rate_debug) {
@@ -331,6 +337,15 @@ STORE(__cached_dev)
 	sysfs_strtoul_clamp(writeback_rate_p_term_inverse,
 			    dc->writeback_rate_p_term_inverse,
 			    1, UINT_MAX);
+	sysfs_strtoul_clamp(writeback_rate_fp_term_low,
+			    dc->writeback_rate_fp_term_low,
+			    1, UINT_MAX);
+	sysfs_strtoul_clamp(writeback_rate_fp_term_mid,
+			    dc->writeback_rate_fp_term_mid,
+			    1, UINT_MAX);
+	sysfs_strtoul_clamp(writeback_rate_fp_term_high,
+			    dc->writeback_rate_fp_term_high,
+			    1, UINT_MAX);
 	sysfs_strtoul_clamp(writeback_rate_minimum,
 			    dc->writeback_rate_minimum,
 			    1, UINT_MAX);
@@ -502,6 +517,9 @@ static struct attribute *bch_cached_dev_files[] = {
 	&sysfs_writeback_rate_update_seconds,
 	&sysfs_writeback_rate_i_term_inverse,
 	&sysfs_writeback_rate_p_term_inverse,
+	&sysfs_writeback_rate_fp_term_low,
+	&sysfs_writeback_rate_fp_term_mid,
+	&sysfs_writeback_rate_fp_term_high,
 	&sysfs_writeback_rate_minimum,
 	&sysfs_writeback_rate_debug,
 	&sysfs_io_errors,
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index a129e4d2707c..a21485448e8e 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -88,6 +88,42 @@ static void __update_writeback_rate(struct cached_dev *dc)
 	int64_t integral_scaled;
 	uint32_t new_rate;
 
+	/*
+	 * We need to consider the number of dirty buckets as well
+	 * when calculating proportional_scaled; otherwise we might
+	 * get an unreasonably small writeback rate in a highly
+	 * fragmented situation where very few dirty sectors consume
+	 * a lot of dirty buckets. The worst case is when dirty data
+	 * reaches writeback_percent and dirty buckets reach
+	 * cutoff_writeback_sync, yet the rate is still at the minimum
+	 * value (4k), causing writes to be stuck in non-writeback mode.
+	 */
+	struct cache_set *c = dc->disk.c;
+
+	int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets;
+
+	if (c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) {
+		int64_t fragment = (dirty_buckets *  c->cache->sb.bucket_size) / dirty;
+		int64_t fp_term;
+		int64_t fps;
+
+		if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
+			fp_term = dc->writeback_rate_fp_term_low *
+			(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
+		} else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
+			fp_term = dc->writeback_rate_fp_term_mid *
+			(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
+		} else {
+			fp_term = dc->writeback_rate_fp_term_high *
+			(c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
+		}
+		fps = (dirty / dirty_buckets) * fp_term;
+		if (fragment > 3 && fps > proportional_scaled) {
+			/* Only override proportional_scaled when fragment > 3 */
+			proportional_scaled = fps;
+		}
+	}
+
 	if ((error < 0 && dc->writeback_rate_integral > 0) ||
 	    (error > 0 && time_before64(local_clock(),
 			 dc->writeback_rate.next + NSEC_PER_MSEC))) {
@@ -984,6 +1020,9 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
 
 	dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
 	dc->writeback_rate_p_term_inverse = 40;
+	dc->writeback_rate_fp_term_low = 1;
+	dc->writeback_rate_fp_term_mid = 5;
+	dc->writeback_rate_fp_term_high = 50;
 	dc->writeback_rate_i_term_inverse = 10000;
 
 	WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 3f1230e22de0..02b2f9df73f6 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -16,6 +16,10 @@
 
 #define BCH_AUTO_GC_DIRTY_THRESHOLD	50
 
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
+
 #define BCH_DIRTY_INIT_THRD_MAX	64
 /*
  * 14 (16384ths) is chosen here as something that each backing device
-- 
2.17.1
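
For illustration, the new knobs added by this patch would be tuned like
the existing writeback_rate_* attributes (the sysfs path below assumes
a typical bcache layout and is not taken from the patch itself):

    # assumed device name; adjust for your setup
    echo 2  > /sys/block/bcache0/bcache/writeback_rate_fp_term_low
    echo 10 > /sys/block/bcache0/bcache/writeback_rate_fp_term_mid
    echo 60 > /sys/block/bcache0/bcache/writeback_rate_fp_term_high
    cat /sys/block/bcache0/bcache/writeback_rate_fp_term_low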


^ permalink raw reply related	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-01-19 23:40 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-03 12:42 [PATCH] bcache: consider the fragmentation when update the writeback rate Dongdong Tao
2020-11-04  5:06 ` kernel test robot
2020-11-05 16:32 ` Coly Li
2020-11-10  4:19   ` Dongdong Tao
2020-11-11  8:33     ` Coly Li
2020-12-09  2:27 ` Dongsheng Yang
2020-12-09  4:48   ` Dongdong Tao
2020-12-09  7:49     ` Dongsheng Yang
     [not found]       ` <CAJS8hVLMUS1mdrwC8ovzvMO+HWf4xtXRCNJEghtbtW0g93Kh_g@mail.gmail.com>
2020-12-14 17:07         ` Coly Li
     [not found]           ` <CAJS8hVKMjec1cpe_zoeZAJrfY0Pq9bJ51eO6E+g8pgN9jV3Nmw@mail.gmail.com>
2020-12-21  8:08             ` Coly Li
2021-01-05  3:06 Dongdong Tao
     [not found] ` <CAJS8hVK-ZCxJt=E3hwR0hmqPYL1T07_WC_nerb-dZodO+DqtDA@mail.gmail.com>
2021-01-05  4:33   ` Coly Li
     [not found]     ` <CAJS8hVL2B=RZr8H4jFbz=bX9k_E9ur7kTeue6BJwzm4pwv1+qQ@mail.gmail.com>
2021-01-08  4:05       ` Coly Li
2021-01-08  8:30         ` Dongdong Tao
2021-01-08  8:39           ` Coly Li
2021-01-08  8:47             ` Dongdong Tao
2021-01-14  4:55               ` Dongdong Tao
     [not found]               ` <CAJS8hVJDaREvpvG4iO+Xs-KQXQKFi7=k29TrG=NXqjyiPpUCZA@mail.gmail.com>
2021-01-14 10:05                 ` Coly Li
2021-01-14 12:22                   ` Dongdong Tao
2021-01-14 13:31                     ` Coly Li
2021-01-14 15:35                       ` Dongdong Tao
2021-01-17  7:11 ` kernel test robot
2021-01-19 12:56 Dongdong Tao
2021-01-19 14:06 ` Coly Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).