* [PATCH v3] bcache: fix writeback target calc on large devices
@ 2018-01-05 21:17 Michael Lyle
  2018-01-05 21:20 ` Michael Lyle
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Lyle @ 2018-01-05 21:17 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: Michael Lyle

Bcache needs to scale the dirty data in the cache over the multiple
backing disks in order to calculate writeback rates for each.
The previous code did this by multiplying the target number of dirty
sectors by the backing device size, and expected the result to fit into a
uint64_t; this overflows on relatively large backing devices.
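
As a rough illustration with hypothetical sizes: a 1TB dirty target is
2^31 512-byte sectors and a 16TB backing device is 2^35 sectors, so the
old intermediate product is 2^66, which no longer fits in a uint64_t and
wraps around.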

The new approach figures out the bdev's share in 16384ths of the overall
cached data.  This is chosen to cope well when bdevs drastically vary in
size and to ensure that bcache can cross the petabyte boundary for each
backing device.
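
For example, with hypothetical sizes, a 1TB backing device out of 4TB of
total cached devices gets a share of 4096/16384, so its target is the
cache-wide dirty target multiplied by 4096 and divided by 16384; because
the multiplier is at most 16384, the intermediate product stays far below
the uint64_t limit.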

This has been improved based on Tang Junhui's feedback to ensure that
every device gets a share of dirty data, no matter how small it is
compared to the total backing pool; without that minimum share, a device
smaller than 1/16384th of the pool would get a dirty target of zero and
would effectively never be allowed to hold dirty data.

Reported-by: Jack Douglas <jack@douglastechnology.co.uk>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/writeback.c | 35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 61f24c04cebd..c7e35180091e 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -18,17 +18,43 @@
 #include <trace/events/bcache.h>
 
 /* Rate limiting */
-
-static void __update_writeback_rate(struct cached_dev *dc)
+static uint64_t __calc_target_rate(struct cached_dev *dc)
 {
 	struct cache_set *c = dc->disk.c;
+
+	/*
+	 * This is the size of the cache, minus the amount used for
+	 * flash-only devices
+	 */
 	uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
 				bcache_flash_devs_sectors_dirty(c);
+
+	/*
+	 * Unfortunately there is no control of global dirty data.  If the
+	 * user states that they want 10% dirty data in the cache, and has,
+	 * e.g., 5 backing volumes of equal size, we try and ensure each
+	 * backing volume uses about 2% of the cache.
+	 *
+	 * 16384 is chosen here as something that each backing device should
+	 * be a reasonable fraction of the share, and not to blow up until
+	 * individual backing devices are a petabyte.
+	 */
+	uint32_t bdev_share_per16k =
+		div64_u64(16384 * bdev_sectors(dc->bdev),
+				c->cached_dev_sectors);
+
 	uint64_t cache_dirty_target =
 		div_u64(cache_sectors * dc->writeback_percent, 100);
-	int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
-				   c->cached_dev_sectors);
 
+	/* Ensure each backing dev gets at least 1/16384th dirty share */
+	if (bdev_share_per16k < 1)
+		bdev_share_per16k = 1;
+
+	return div_u64(cache_dirty_target * bdev_share_per16k, 16384);
+}
+
+static void __update_writeback_rate(struct cached_dev *dc)
+{
 	/*
 	 * PI controller:
 	 * Figures out the amount that should be written per second.
@@ -49,6 +75,7 @@ static void __update_writeback_rate(struct cached_dev *dc)
 	 * This acts as a slow, long-term average that is not subject to
 	 * variations in usage like the p term.
 	 */
+	int64_t target = __calc_target_rate(dc);
 	int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
 	int64_t error = dirty - target;
 	int64_t proportional_scaled =
-- 
2.14.1


* Re: [PATCH v3] bcache: fix writeback target calc on large devices
  2018-01-05 21:17 [PATCH v3] bcache: fix writeback target calc on large devices Michael Lyle
@ 2018-01-05 21:20 ` Michael Lyle
  0 siblings, 0 replies; 6+ messages in thread
From: Michael Lyle @ 2018-01-05 21:20 UTC (permalink / raw)
  To: linux-bcache, linux-block

On 01/05/2018 01:17 PM, Michael Lyle wrote:
> Bcache needs to scale the dirty data in the cache over the multiple
> backing disks in order to calculate writeback rates for each.
> The previous code did this by multiplying the target number of dirty
> sectors by the backing device size, and expected the result to fit into a
> uint64_t; this overflows on relatively large backing devices.
> 
> The new approach figures out the bdev's share in 16384ths of the overall
> cached data.  This is chosen to cope well when bdevs drastically vary in
> size and to ensure that bcache can cross the petabyte boundary for each
> backing device.
> 
> This has been improved based on Tang Junhui's feedback to ensure that
> every device gets a share of dirty data, no matter how small it is
> compared to the total backing pool.
> 
> Reported-by: Jack Douglas <jack@douglastechnology.co.uk>
> Signed-off-by: Michael Lyle <mlyle@lyle.org>

Commentary:

I don't love this at all.  It really should be the device's share of
the dirty data, not its share of the backing size, that sets its share
of the rate (so that if you have a 100GB cache volume with a 10GB dirty
target, and 5 backing devices of which only 2 are active, those 2 can
use all of the 10GB of dirty data).  But we lack an appropriate
accounting mechanism right now, so it has to be done this way.
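
For the record, here is a rough sketch of what I mean, assuming a helper
that does not exist today (a hypothetical bcache_cached_devs_sectors_dirty()
summing dirty sectors over every backing device); the name and the
fallback are made up and this is not part of the patch:

static uint64_t __calc_target_by_dirty_share(struct cached_dev *dc)
{
	struct cache_set *c = dc->disk.c;
	uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
				bcache_flash_devs_sectors_dirty(c);
	uint64_t cache_dirty_target =
		div_u64(cache_sectors * dc->writeback_percent, 100);
	/* hypothetical accounting of dirty sectors across all backing devs */
	uint64_t total_dirty = bcache_cached_devs_sectors_dirty(c);
	uint64_t this_dirty = bcache_dev_sectors_dirty(&dc->disk);
	uint64_t share_per16k;

	/* nothing dirty anywhere yet: arbitrary fallback for this sketch */
	if (!total_dirty)
		return cache_dirty_target;

	/* this device's current share of the dirty data, in 16384ths */
	share_per16k = div64_u64(this_dirty * 16384, total_dirty);
	if (share_per16k < 1)
		share_per16k = 1;

	return div_u64(cache_dirty_target * share_per16k, 16384);
}

A device currently holding little dirty data would also get little
headroom to accumulate more, so even this needs some thought (plus the
accounting) before it could replace the size-based split.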

This has seen light testing so far: a lot of single-backing-device
tests, and a couple of tests with 2 and 3 backing devices.

Mike


* Re: [PATCH v3] bcache: fix writeback target calc on large devices
  2018-01-08  2:46 tang.junhui
@ 2018-01-08 17:58 ` Michael Lyle
  0 siblings, 0 replies; 6+ messages in thread
From: Michael Lyle @ 2018-01-08 17:58 UTC (permalink / raw)
  To: Junhui Tang; +Cc: linux-bcache, linux-block

Tang Junhui--

Thanks for your feedback, help, and flexibility.  I will try to make
this better overall in the long term.

On Sun, Jan 7, 2018 at 6:46 PM,  <tang.junhui@zte.com.cn> wrote:

> OK, please replace 16384 with a macro, and replace 16384 * bdev_sectors(dc->bdev)
> with a bit-shift operation, e.g.: bdev_sectors(dc->bdev) << MACRO.
>
> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>

I have pulled this into bcache-for-next and will be testing it all
today, hoping to send it out to linux-block/Jens late today or
tomorrow.

Mike


* Re: [PATCH v3] bcache: fix writeback target calc on large devices
@ 2018-01-08  2:46 tang.junhui
  2018-01-08 17:58 ` Michael Lyle
  0 siblings, 1 reply; 6+ messages in thread
From: tang.junhui @ 2018-01-08  2:46 UTC (permalink / raw)
  To: mlyle; +Cc: linux-bcache, linux-block, tang.junhui

From: Tang Junhui <tang.junhui@zte.com.cn>

>On Fri, Jan 5, 2018 at 11:29 PM,  <tang.junhui@zte.com.cn> wrote:
>> From: Tang Junhui <tang.junhui@zte.com.cn>
>>
>> Hello Mike,
>>
>> I have thought it over, and I feel this patch is a little complex and still not very
>> accurate for small backing devices. I think we can resolve it like this:
>>
>>      uint64_t cache_dirty_target =
>>          div_u64(cache_sectors * dc->writeback_percent, 100);
>>
>> -    int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
>> -                   c->cached_dev_sectors);
>> +    int64_t target = div64_u64(cache_dirty_target,
>> +                    div64_u64(c->cached_dev_sectors, bdev_sectors(dc->bdev)));
>>
>> c->cached_dev_sectors is never smaller than bdev_sectors(dc->bdev), so
>> div64_u64(c->cached_dev_sectors, bdev_sectors(dc->bdev)) is never smaller than 1,
>> and this expression has no multiplication, so we do not need to worry
>> about overflows.
>
>I thought about this, but as I think you've noticed below, it's very
>inaccurate for typical cases.  E.g., if you have 1TB and 750GB backing
>volumes, you get (1750/1000) vs (1750/750): the 750GB volume is allowed
>half as much dirty space as the 1TB one, and the per-device targets add
>up to 150% of what was asked for.  Other rounding approaches aren't
>good, either.
>
>> We can also multiply by a value (2^10) to avoid loss of accuracy in the division, like
>> below (it would support all devices smaller than 2^54 bytes):
>> +    int64_t target = div64_u64(cache_dirty_target<<10,
>> +                    div64_u64(c->cached_dev_sectors<<10, bdev_sectors(dc->bdev)));
>>
>> What do you think?
>
>This is a bit better.  It trades the per-backing-volume limit of 1PB in
>my code for a total cached limit of 9PB.
>
>Really, all of these approaches are "wrong" because they don't allow
>the total amount of dirty data to be used if you have quiet volumes, and
>they should be replaced with something better.  I am just trying to get
>something temporary into 4.16, because this is a real bug that is
>preventing real users from using bcache.  I'm pretty sure the new code
>works in all the cases where the old code failed.
>

OK, please replace 16384 with a macro, and replace 16384 * bdev_sectors(dc->bdev)
with a bit-shift operation, e.g.: bdev_sectors(dc->bdev) << MACRO.
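
Roughly like the following; the macro name and its location are only
illustrative, not settled:

#define WRITEBACK_SHARE_SHIFT	14	/* e.g. in writeback.h */

	int64_t bdev_share =
		div64_u64(bdev_sectors(dc->bdev) << WRITEBACK_SHARE_SHIFT,
				c->cached_dev_sectors);
	...
	/* ensure each backing dev gets at least one 1/(2^14) share */
	if (bdev_share < 1)
		bdev_share = 1;

	return (cache_dirty_target * bdev_share) >> WRITEBACK_SHARE_SHIFT;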

Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>

>Simultaneously having a 1GB and 16TB cached volume is a bit of a silly
>case, and what the code does in this silly case isn't completely
>stupid, so...
>
>We're running out of time to get a fix into the package for 4.16, and
>I've tested the current version a fair bit.
>
>Mike

Thanks,
Tang Junhui


* Re: [PATCH v3] bcache: fix writeback target calc on large devices
  2018-01-06  7:29 tang.junhui
@ 2018-01-06 17:47 ` Michael Lyle
  0 siblings, 0 replies; 6+ messages in thread
From: Michael Lyle @ 2018-01-06 17:47 UTC (permalink / raw)
  To: Junhui Tang; +Cc: linux-bcache, linux-block

Tang Junhui--

On Fri, Jan 5, 2018 at 11:29 PM,  <tang.junhui@zte.com.cn> wrote:
> From: Tang Junhui <tang.junhui@zte.com.cn>
>
> Hello Mike,
>
> I have thought it over, and I feel this patch is a little complex and still not very
> accurate for small backing devices. I think we can resolve it like this:
>
>      uint64_t cache_dirty_target =
>          div_u64(cache_sectors * dc->writeback_percent, 100);
>
> -    int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
> -                   c->cached_dev_sectors);
> +    int64_t target = div64_u64(cache_dirty_target,
> +                    div64_u64(c->cached_dev_sectors, bdev_sectors(dc->bdev)));
>
> c->cached_dev_sectors is never smaller than bdev_sectors(dc->bdev), so
> div64_u64(c->cached_dev_sectors, bdev_sectors(dc->bdev)) is never smaller than 1,
> and this expression has no multiplication, so we do not need to worry
> about overflows.

I thought about this, but as I think you've noticed below, it's very
inaccurate for typical cases.  E.g., if you have 1TB and 750GB backing
volumes, you get (1750/1000) vs (1750/750): the 750GB volume is allowed
half as much dirty space as the 1TB one, and the per-device targets add
up to 150% of what was asked for.  Other rounding approaches aren't
good, either.
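
Working that example through with a cache-wide dirty target of D sectors:
div64_u64(1750, 1000) = 1, so the 1TB volume gets a target of D, while
div64_u64(1750, 750) = 2 gives the 750GB volume D/2.  The two targets sum
to 1.5*D, and the split between the volumes is 2:1 where their sizes are
4:3.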

> We can also multiply by a value (2^10) to avoid loss of accuracy in the division, like
> below (it would support all devices smaller than 2^54 bytes):
> +    int64_t target = div64_u64(cache_dirty_target<<10,
> +                    div64_u64(c->cached_dev_sectors<<10, bdev_sectors(dc->bdev)));
>
> What do you think?

This is a bit better.  It trades the per-backing-volume limit of 1PB in
my code for a total cached limit of 9PB.

Really, all of these approaches are "wrong" because they don't allow
the total amount of dirty data to be used if you have quiet volumes, and
they should be replaced with something better.  I am just trying to get
something temporary into 4.16, because this is a real bug that is
preventing real users from using bcache.  I'm pretty sure the new code
works in all the cases where the old code failed.

Simultaneously having a 1GB and 16TB cached volume is a bit of a silly
case, and what the code does in this silly case isn't completely
stupid, so...
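
For example, with hypothetical numbers: a 1GB device alongside 16TB of
other backing storage gets 16384 * 1GB / 16385GB, which rounds down to 0
and is clamped to 1, so with a 10GB cache-wide dirty target it is allowed
about 10GB / 16384, roughly 640KB of dirty data.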

We're running out of time to get a fix into the package for 4.16, and
I've tested the current version a fair bit.

Mike


* Re: [PATCH v3] bcache: fix writeback target calc on large devices
@ 2018-01-06  7:29 tang.junhui
  2018-01-06 17:47 ` Michael Lyle
  0 siblings, 1 reply; 6+ messages in thread
From: tang.junhui @ 2018-01-06  7:29 UTC (permalink / raw)
  To: mlyle; +Cc: linux-bcache, linux-block, tang.junhui

From: Tang Junhui <tang.junhui@zte.com.cn>

Hello Mike,

I have thought it over, and I feel this patch is a little complex and still not very
accurate for small backing devices. I think we can resolve it like this:

     uint64_t cache_dirty_target =
         div_u64(cache_sectors * dc->writeback_percent, 100);

-    int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
-                   c->cached_dev_sectors);
+    int64_t target = div64_u64(cache_dirty_target,
+                    div64_u64(c->cached_dev_sectors, bdev_sectors(dc->bdev)));

c->cached_dev_sectors is never smaller than bdev_sectors(dc->bdev), so
div64_u64(c->cached_dev_sectors, bdev_sectors(dc->bdev)) is never smaller than 1,
and this expression has no multiplication, so we do not need to worry
about overflows.
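
For instance, with a hypothetical 1TB backing device out of 4TB of total
cached devices and a dirty target of D sectors, the inner division gives
4 and the per-device target is D/4, with no intermediate value larger
than its inputs.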

We can also multiply by a value (2^10) to avoid loss of accuracy in the division, like
below (it would support all devices smaller than 2^54 bytes):
+    int64_t target = div64_u64(cache_dirty_target<<10,                            
+                    div64_u64(c->cached_dev_sectors<<10, bdev_sectors(dc->bdev))); 
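
For example, with a hypothetical 750GB backing device out of 1750GB of
cached devices (the ratio is the same in sectors), the scaled divisor is
div64_u64(1750<<10, 750) = 2389 rather than 2, so the per-device target
becomes (D<<10)/2389, about 0.4286*D, very close to the true 750/1750
share.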

What do you think?

> Bcache needs to scale the dirty data in the cache over the multiple
> backing disks in order to calculate writeback rates for each.
> The previous code did this by multiplying the target number of dirty
> sectors by the backing device size, and expected the result to fit into a
> uint64_t; this overflows on relatively large backing devices.
> 
> The new approach figures out the bdev's share in 16384ths of the overall
> cached data.  This is chosen to cope well when bdevs drastically vary in
> size and to ensure that bcache can cross the petabyte boundary for each
> backing device.
> 
> This has been improved based on Tang Junhui's feedback to ensure that
> every device gets a share of dirty data, no matter how small it is
> compared to the total backing pool.
> 
> Reported-by: Jack Douglas <jack@douglastechnology.co.uk>
> Signed-off-by: Michael Lyle <mlyle@lyle.org>
> ---
>  drivers/md/bcache/writeback.c | 35 +++++++++++++++++++++++++++++++----
>  1 file changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 61f24c04cebd..c7e35180091e 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -18,17 +18,43 @@
>  #include <trace/events/bcache.h>
>  
>  /* Rate limiting */
> -
> -static void __update_writeback_rate(struct cached_dev *dc)
> +static uint64_t __calc_target_rate(struct cached_dev *dc)
>  {
>      struct cache_set *c = dc->disk.c;
> +
> +    /*
> +     * This is the size of the cache, minus the amount used for
> +     * flash-only devices
> +     */
>      uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
>                  bcache_flash_devs_sectors_dirty(c);
> +
> +    /*
> +     * Unfortunately there is no control of global dirty data.  If the
> +     * user states that they want 10% dirty data in the cache, and has,
> +     * e.g., 5 backing volumes of equal size, we try and ensure each
> +     * backing volume uses about 2% of the cache.
> +     *
> +     * 16384 is chosen here as something that each backing device should
> +     * be a reasonable fraction of the share, and not to blow up until
> +     * individual backing devices are a petabyte.
> +     */
> +    uint32_t bdev_share_per16k =
> +        div64_u64(16384 * bdev_sectors(dc->bdev),
> +                c->cached_dev_sectors);
> +
>      uint64_t cache_dirty_target =
>          div_u64(cache_sectors * dc->writeback_percent, 100);
> -    int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
> -                   c->cached_dev_sectors);
>  
> +    /* Ensure each backing dev gets at least 1/16384th dirty share */
> +    if (bdev_share_per16k < 1)
> +        bdev_share_per16k = 1;
> +
> +    return div_u64(cache_dirty_target * bdev_share_per16k, 16384);
> +}
> +
> +static void __update_writeback_rate(struct cached_dev *dc)
> +{
>      /*
>       * PI controller:
>       * Figures out the amount that should be written per second.
> @@ -49,6 +75,7 @@ static void __update_writeback_rate(struct cached_dev *dc)
>       * This acts as a slow, long-term average that is not subject to
>       * variations in usage like the p term.
>       */
> +    int64_t target = __calc_target_rate(dc);
>      int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
>      int64_t error = dirty - target;
>      int64_t proportional_scaled =
> -- 
> 2.14.1

Thanks,
Tang Junhui


