* [PATCH] bcache: smooth writeback rate control
@ 2017-09-20  7:15 Michael Lyle
  2017-09-20 10:13 ` Coly Li
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Lyle @ 2017-09-20  7:15 UTC (permalink / raw)
  To: linux-bcache; +Cc: kent.overstreet, Michael Lyle

This works in conjunction with the new PI controller.  Currently, in
real-world workloads, the rate controller attempts to write back 1
sector per second.  In practice, these minimum-rate writebacks are
between 4k and 60k in test scenarios, since bcache aggregates and
attempts to do contiguous writes and because filesystems on top of
bcache typically write 4k or more.

Previously, bcache used to guarantee to write at least once per second.
This means that the actual writeback rate would exceed the configured
amount by a factor of 8-120 or more.

This patch adjusts bcache to be willing to sleep up to 2.5 seconds, and to
target writing 4k/second.  On the smallest writes, it will sleep 1
second like before, but many times it will sleep longer and load the
backing device less.  This keeps the loading on the cache and backing
device related to writeback more consistent when writing back at low
rates.

Signed-off-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/util.c      | 10 ++++++++--
 drivers/md/bcache/writeback.c |  2 +-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 176d3c2ef5f5..4dbe37e82877 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -232,8 +232,14 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done)
 
 	d->next += div_u64(done * NSEC_PER_SEC, d->rate);
 
-	if (time_before64(now + NSEC_PER_SEC, d->next))
-		d->next = now + NSEC_PER_SEC;
+	/* Bound the time.  Don't let us fall further than 2 seconds behind
+	 * (this prevents unnecessary backlog that would make it impossible
+	 * to catch up).  If we're ahead of the desired writeback rate,
+	 * don't let us sleep more than 2.5 seconds (so we can notice/respond
+	 * if the control system tells us to speed up!).
+	 */
+	if (time_before64(now + NSEC_PER_SEC * 5 / 2, d->next))
+		d->next = now + NSEC_PER_SEC * 5 / 2;
 
 	if (time_after64(now - NSEC_PER_SEC * 2, d->next))
 		d->next = now - NSEC_PER_SEC * 2;
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index e2cfa83dba19..60535515acd3 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -511,7 +511,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
 	dc->writeback_percent		= 10;
 	dc->writeback_delay		= 30;
 	dc->writeback_rate.rate		= 1024;
-	dc->writeback_rate_minimum	= 5;
+	dc->writeback_rate_minimum	= 8;
 
 	dc->writeback_rate_update_seconds = 5;
 	dc->writeback_rate_p_term_inverse = 40;
-- 
2.11.0


* Re: [PATCH] bcache: smooth writeback rate control
  2017-09-20  7:15 [PATCH] bcache: smooth writeback rate control Michael Lyle
@ 2017-09-20 10:13 ` Coly Li
  2017-09-20 14:50   ` Kent Overstreet
       [not found]   ` <CAJ+L6qcZVE0hkiN0xe00fFt4ndmha+uOi9x4UpO-7DRD=vh_mQ@mail.gmail.com>
  0 siblings, 2 replies; 9+ messages in thread
From: Coly Li @ 2017-09-20 10:13 UTC (permalink / raw)
  To: Michael Lyle; +Cc: linux-bcache, kent.overstreet

On 2017/9/20 9:15 AM, Michael Lyle wrote:
> This works in conjunction with the new PI controller.  Currently, in
> real-world workloads, the rate controller attempts to write back 1
> sector per second.  In practice, these minimum-rate writebacks are
> between 4k and 60k in test scenarios, since bcache aggregates and
> attempts to do contiguous writes and because filesystems on top of
> bcache typically write 4k or more.
> 
> Previously, bcache used to guarantee to write at least once per second.
> This means that the actual writeback rate would exceed the configured
> amount by a factor of 8-120 or more.
> 

There is no guarantee; bcache just tries to write back 1 key per second
when it reaches the minimum writeback rate. And it tries to calculate a
delay time so that the average writeback throughput stays around the
minimum writeback rate over a longer time period.

So my question is: do you observe 8-120 times more real writeback
throughput in any given 10 second period?


> This patch adjusts bcache to be willing to sleep up to 2.5 seconds, and to
> target writing 4k/second.  On the smallest writes, it will sleep 1
> second like before, but many times it will sleep longer and load the
> backing device less.  This keeps the loading on the cache and backing
> device related to writeback more consistent when writing back at low
> rates.
> 

Again, I'd like to see exact data here, because this patch is about
performance tuning.

I just see the minimum rate increase from 1 to 8 sectors, and the
maximum delay increase from 1 to 2.5 seconds. I don't see an exact
problem this patch wants to solve. Is it to write back dirty data
faster, or to save more power on the cached device?


Your PI controller perfectly settles the writeback rate to the minimum
value when dirty data reaches the target. At this moment there is
normally still quite a lot of dirty data on the cache; increasing the
idle-time writeback rate from 1 to 8 sectors does not help much, as it
still needs hours to clean the dirty data.

And normally a hard disk will spin down in minutes, typically 10 to 30
minutes, so increasing the delay time from 1 to 2.5 seconds won't help
the hard disk save power.

I posted a proposal patch to linux-bcache@vger.kernel.org (see
https://www.spinics.net/lists/linux-bcache/msg04837.html), which sets
the writeback rate to the maximum value if there is no front end I/O
request for a while (normally that suggests more idle time in the
future). Then the dirty data on the cache can be cleaned up ASAP. And
if a front end request comes in, it sets the writeback rate to the
minimum value and lets the PD (now PI) controller adjust it. With this
method, if there is no front end I/O, dirty data can be cleaned much
faster and the hard disk also has a much better chance to spin down.
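
Roughly, the idea looks like this (a sketch only; last_front_end_io
and IDLE_TIMEOUT_MS are made-up names here, see the link above for the
real patch):

	static void maybe_boost_writeback_rate(struct cached_dev *dc)
	{
		/* No front end I/O for a while: clean dirty data ASAP. */
		if (time_after(jiffies, dc->last_front_end_io +
			       msecs_to_jiffies(IDLE_TIMEOUT_MS))) {
			dc->writeback_rate.rate = UINT_MAX;
		} else {
			/*
			 * Front end I/O seen recently: drop back to the
			 * minimum and let the PD (PI) controller adjust
			 * the rate from there.
			 */
			dc->writeback_rate.rate = dc->writeback_rate_minimum;
		}
	}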

Thanks.

-- 
Coly Li


* Re: [PATCH] bcache: smooth writeback rate control
  2017-09-20 10:13 ` Coly Li
@ 2017-09-20 14:50   ` Kent Overstreet
  2017-09-20 20:24     ` Coly Li
       [not found]   ` <CAJ+L6qcZVE0hkiN0xe00fFt4ndmha+uOi9x4UpO-7DRD=vh_mQ@mail.gmail.com>
  1 sibling, 1 reply; 9+ messages in thread
From: Kent Overstreet @ 2017-09-20 14:50 UTC (permalink / raw)
  To: Coly Li; +Cc: Michael Lyle, linux-bcache

On Wed, Sep 20, 2017 at 12:13:50PM +0200, Coly Li wrote:
> On 2017/9/20 9:15 AM, Michael Lyle wrote:
> > This works in conjunction with the new PI controller.  Currently, in
> > real-world workloads, the rate controller attempts to write back 1
> > sector per second.  In practice, these minimum-rate writebacks are
> > between 4k and 60k in test scenarios, since bcache aggregates and
> > attempts to do contiguous writes and because filesystems on top of
> > bcache typically write 4k or more.
> > 
> > Previously, bcache used to guarantee to write at least once per second.
> > This means that the actual writeback rate would exceed the configured
> > amount by a factor of 8-120 or more.
> > 
> 
> There is no guarantee; bcache just tries to write back 1 key per second
> when it reaches the minimum writeback rate. And it tries to calculate a
> delay time so that the average writeback throughput stays around the
> minimum writeback rate over a longer time period.
> 
> So my question is: do you observe 8-120 times more real writeback
> throughput in any given 10 second period?
> 
> 
> > This patch adjusts bcache to be willing to sleep up to 2.5 seconds, and to
> > target writing 4k/second.  On the smallest writes, it will sleep 1
> > second like before, but many times it will sleep longer and load the
> > backing device less.  This keeps the loading on the cache and backing
> > device related to writeback more consistent when writing back at low
> > rates.
> > 
> 
> Again, I'd like to see exact data here, because this patch is about
> performance tuning.

Well, it's not just performance tuning. My PD controller code was always, to be
honest, crap; I'd been hoping for years that someone who actually knew what they
were doing on this subject would rewrite that code, and Michael actually does
know what he's doing with PID controllers.

That said, your point that this is a behavioural change as well is a valid one.
It might be worth trying to tune the new controller to mimic the behaviour of
the old code as much as possible, and _then_ separately possibly change the
default tuning. However, due to the crappiness of the old code and the
difficulty of understanding what it's actually going to do in any given
situation, that might not be practical or worthwhile.

Anyways, I wouldn't look at this patch so much as performance tuning (and I
wouldn't aim for changing behavior any more than is necessary in this patch); I
would view it as a path towards getting something more understandable that
people can actually tune and understand wtf it's going to do.


* Re: [PATCH] bcache: smooth writeback rate control
       [not found]   ` <CAJ+L6qcZVE0hkiN0xe00fFt4ndmha+uOi9x4UpO-7DRD=vh_mQ@mail.gmail.com>
@ 2017-09-20 15:28     ` Michael Lyle
  2017-09-20 21:07       ` Coly Li
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Lyle @ 2017-09-20 15:28 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-bcache, Kent Overstreet

On Wed, Sep 20, 2017 at 3:13 AM Coly Li <i@coly.li> wrote:
>
> So my question is: do you observe 8-120 times more real writeback
> throughput in any given 10 second period?

In real-world workloads I end up with no keys of length less than
4096.  Do you?

I would like the minimum writeback rate to be a realistic value that
may actually be attained.  Right now, in practice, bcache writes
**much faster**.  Previously, you were concerned that setting the
target for the number of writes would result in increased backing disk
workload--- the new values are based on measurements showing that in
all cases this code writes back slower and loads the disk less than
the old code, so it should mitigate your concern.

> Again, I'd like to see exact data here, because this patch is about
> performance tuning.

OK.  Run iostat 1 on any workload.  See if you ever see a write rate
of less than 4k/sec to the backing disk.  I've run a variety of
workloads and never have.  I can send you very long iostat 1 logs if
it would help ;)

When the system is writing much more than the control system is asking
for, the control system is effectively disengaged.  This patch
increases the range of control authority by allowing the minimum-rate
sleep interval to be 2.5x longer.
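
To make the numbers concrete, here is the bound from the patch in
simplified form (a sketch; the real code tracks d->next in absolute
time rather than computing a bare delay):

	delay_ns = div_u64(done * NSEC_PER_SEC, d->rate);

	/*
	 * At the new minimum rate of 8 sectors/sec (4k/sec):
	 *   done = 8 sectors (4k):   wants 1s -> sleeps 1s, as before
	 *   done = 32 sectors (16k): wants 4s -> the old code capped
	 *   this at 1s (4x the configured rate); the new code caps it
	 *   at 2.5s (~1.6x the configured rate).
	 */
	if (delay_ns > NSEC_PER_SEC * 5 / 2)
		delay_ns = NSEC_PER_SEC * 5 / 2;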

> I just see the minimum rate increases from 1 to 8 sectors, and minimum
> delay increase from 1 to 2.5 seconds. I don't see an exact problem this
> patch wants to solve. Is it to write back dirty data faster, or save
> more power from cached device ?

So one thing that's really bad currently is that when the disk is
idle, it repeatedly undergoes load/unload cycles (very long seeks).
The disks I have-- Seagate ST6000VN0041 Ironwolf NAS 6TB-- seek off
the active portion of the platter when idle for 750ms.  So the disks
are making a very loud clunk at one second intervals during idle
writeback, which is not good for the drives.  When writing back
faster, they are quieter.  This will at least do it 2.5x less often.

In a subsequent patch, there's additional things I'd like to do-- like
be willing to do no write after a wakeup if we are very far ahead.
That is, to allow the effective value to be much larger than 2.5
seconds.  This would potentially allow spindown on laptops, etc.  But
this change is still worthwhile on its own.

Another thing that would be helpful would be to issue more than 1
write at a time, so that queue depth doesn't always equal 1.  Queue
depth=4 has about 40-50% more throughput--- that is, it completes the
4 IOPs in about 2.5x the time of one-- when writes are clustered to
the same region of the disk/short-stroked.  However this has a
potential cost in latency, so it needs to be carefully traded off.
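
One shape this could take (a sketch; next_dirty_key() and
issue_writeback_io() are made-up helper names, and the queue depth
constant is purely illustrative):

	#define WRITEBACK_QD	4	/* illustrative only */

	struct keybuf_key *w;
	unsigned batch_sectors = 0, n = 0;

	/*
	 * Issue up to WRITEBACK_QD dirty keys without waiting on each
	 * one, then charge the whole batch to the rate limiter so a
	 * single, longer sleep covers all of them.
	 */
	while (n < WRITEBACK_QD && (w = next_dirty_key(dc))) {
		issue_writeback_io(dc, w);	/* async, don't wait */
		batch_sectors += KEY_SIZE(&w->key);
		n++;
	}
	schedule_timeout_uninterruptible(
		bch_next_delay(&dc->writeback_rate, batch_sectors));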

> [snipped some]
>
> I post a proposal patch to linux-bcache@vger.kernel.org, (see
> https://www.spinics.net/lists/linux-bcache/msg04837.html), which sets
> the writeback rate to maximum value if there is no front end I/O request
> for a while (normally it might means more idle time in future). Then the
> dirty data on cache can be cleaned up ASAP. And if there is frond end
> request coming, set the writeback rate to minimum value, and let the PD
> (now it is PI) controller to adjust its value. By this method, if there
> is no front end I/O, dirty data can be cleaned much faster and hard disk
> also has much more chance to spin down.

While this could clear out dirty data sooner, I'm not sure this is
always a good idea.  For intermittent, write-heavy workloads it's
good, but it substantially increases the chance that occasional random
reads have to wait a long time-- which is usually the opposite of
what an IO scheduler tries to achieve.

>
> Thanks.
>
> --
> Coly Li

Mike


* Re: [PATCH] bcache: smooth writeback rate control
  2017-09-20 14:50   ` Kent Overstreet
@ 2017-09-20 20:24     ` Coly Li
  2017-09-20 20:48       ` Michael Lyle
  0 siblings, 1 reply; 9+ messages in thread
From: Coly Li @ 2017-09-20 20:24 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Michael Lyle, linux-bcache

On 2017/9/20 4:50 PM, Kent Overstreet wrote:
> On Wed, Sep 20, 2017 at 12:13:50PM +0200, Coly Li wrote:
>> On 2017/9/20 9:15 AM, Michael Lyle wrote:
>>> This works in conjunction with the new PI controller.  Currently, in
>>> real-world workloads, the rate controller attempts to write back 1
>>> sector per second.  In practice, these minimum-rate writebacks are
>>> between 4k and 60k in test scenarios, since bcache aggregates and
>>> attempts to do contiguous writes and because filesystems on top of
>>> bcache typically write 4k or more.
>>>
>>> Previously, bcache used to guarantee to write at least once per second.
>>> This means that the actual writeback rate would exceed the configured
>>> amount by a factor of 8-120 or more.
>>>
>>
>> There is no guarantee; bcache just tries to write back 1 key per second
>> when it reaches the minimum writeback rate. And it tries to calculate a
>> delay time so that the average writeback throughput stays around the
>> minimum writeback rate over a longer time period.
>>
>> So my question is: do you observe 8-120 times more real writeback
>> throughput in any given 10 second period?
>>
>>
>>> This patch adjusts bcache to be willing to sleep up to 2.5 seconds, and to
>>> target writing 4k/second.  On the smallest writes, it will sleep 1
>>> second like before, but many times it will sleep longer and load the
>>> backing device less.  This keeps the loading on the cache and backing
>>> device related to writeback more consistent when writing back at low
>>> rates.
>>>
>>
>> Again, I'd like to see exact data here, because this patch is about
>> performance tuning.
> 
> Well, it's not just performance tuning. My PD controller code was always, to be
> honest, crap; I'd been hoping for years that someone who actually knew what they
> were doing on this subject would rewrite that code, and Michael actually does
> know what he's doing with PID controllers.
> 

Hi Kent,

The whole P.I.D controller stuff is about performance tuning, isn't it?

Before we make a performance-related change, we should know the exact
benchmark results for a real workload, not an emulator. This is the
reason why I did the latency distribution test myself. And it is good
to know there is no latency regression, and that the writeback rate
decreases smoothly to the minimum when dirty data gets close to the
dirty target.

It is not difficult to write P.I.D controller code; it is difficult to
implement it well. From my tests I feel Mike's P.I controller is good,
but we need one more person to understand it. This is why I ask many
questions, and some of the questions are quite basic. If one more
person understands why this P.I controller is better, more people will
come to understand it.

> That said, your point that this is a behavioural change as well is a valid one.
> It might be worth trying to tune the new controller to mimic the behaviour of
> the old code as much as possible, and _then_ separately possibly change the
> default tuning. However, due to the crappiness of the old code and the
> difficulty of understanding what it's actually going to do in any given
> situation, that might not be practical or worthwhile.
> 

My point is, we should have exact real data to prove this change makes
things better. Better means a problem gets solved. For this single
patch, I see a description of how the change is made; I don't see what
exact problem is to be solved.


> Anyways, I wouldn't look at this patch so much as performance tuning (and I
> wouldn't aim for changing behavior any more than is necessary in this patch); I
> would view it as a path towards getting something more understandable that
> people can actually tune and understand wtf it's going to do.
> 

The P.I controller patch makes sense. But for this single patch, which
changes the minimum writeback rate from 1 to 8 sectors and the maximum
delay from 1 to 2.5 seconds, I don't see how this modification is a
path towards getting things more understandable, before we see real
data to prove it.

This is why I said: again, I'd like to see exact data here. Let me
discuss with Michael for more details.

Anyway, my first rule is, if you add Reviewed-by or Acked-by to a patch,
I just shut up immediately :-)

Thanks.

-- 
Coly Li


* Re: [PATCH] bcache: smooth writeback rate control
  2017-09-20 20:24     ` Coly Li
@ 2017-09-20 20:48       ` Michael Lyle
  0 siblings, 0 replies; 9+ messages in thread
From: Michael Lyle @ 2017-09-20 20:48 UTC (permalink / raw)
  To: Coly Li; +Cc: Kent Overstreet, linux-bcache

On Wed, Sep 20, 2017 at 1:24 PM, Coly Li <i@coly.li> wrote:
> My point is, we should have exact real data to prove this change makes
> things better. Better means a problem gets solved. For this single
> patch, I see a description of how the change is made; I don't see what
> exact problem is to be solved.

OK.  With this change, my workloads always do fewer IOPs during quiet
times.  It often waits 2.5 seconds, instead of waiting 1 second.  The
utilization is clearly more constant, because when it writes more than
4k, it leaves the disk alone for longer.  Before, you were concerned
that setting the minimum writeback rate above 1 sector would
significantly impact performance, so this variant was chosen to
**always do less I/O**.  It should be fairly evident that it does this
:/  When it writes back a 4k key (the smallest key I get), it behaves
exactly the same and sleeps one second.  When it writes back a longer
key, it waits longer.

My other motivation is-- the writeback behavior is problematic-- right
now it writes at exactly the wrong speed for the IronWolf disks-- it
waits barely long enough for things to go into a power-saving / safe
shutdown mode (unloading the head actuator to the side).  The current
behavior maximizes the number of load/unload cycles that the disk
undergoes.  I don't have proof that this leads to earlier drive
failure, but in general "clunk clunk clunk" is not good.  At least
with this code it will do it half as much, pending further changes to
mitigate this more.

> The P.I controller patch makes sense. But for this single patch, which
> changes the minimum writeback rate from 1 to 8 sectors and the maximum
> delay from 1 to 2.5 seconds, I don't see how this modification is a
> path towards getting things more understandable, before we see real
> data to prove it.
>
> This is why I said: again, I'd like to see exact data here. Let me
> discuss with Michael for more details.

What do you want to see, iostat 1 of background writeback with current
code showing it is never doing less than 4k to the backing disks?
Because I have several thousand seconds of that.  If you have any
installation of current bcache you should be able to just run it and
see.

Mike


* Re: [PATCH] bcache: smooth writeback rate control
  2017-09-20 15:28     ` Michael Lyle
@ 2017-09-20 21:07       ` Coly Li
  2017-09-20 21:50         ` Michael Lyle
  0 siblings, 1 reply; 9+ messages in thread
From: Coly Li @ 2017-09-20 21:07 UTC (permalink / raw)
  To: Michael Lyle; +Cc: linux-bcache, Kent Overstreet

On 2017/9/20 5:28 PM, Michael Lyle wrote:
> On Wed, Sep 20, 2017 at 3:13 AM Coly Li <i@coly.li> wrote:
>>
>> So my question is: do you observe 8-120 times more real writeback
>> throughput in any given 10 second period?
> 
> In real-world workloads I end up with no keys of length less than
> 4096.  Do you?
> 
> I would like the minimum writeback rate to be a realistic value that
> may actually be attained.  Right now, in practice, bcache writes
> **much faster**.  Previously, you were concerned that setting the
> target for the number of writes would result in increased backing disk
> workload--- the new values are based on measurements showing that in
> all cases this code writes back slower and loads the disk less than
> the old code, so it should mitigate your concern.
>

I get your point here. Current code sets the minimum writeback rate to
1 sector, and indeed it writes 4K. But the parameter "done" sent to
bch_next_delay() is 1 sector, when the correct value should be 8
sectors.


>> Again, I'd like to see exact data here, because this patch is about
>> performance tuning.
> 
> OK.  Run iostat 1 on any workload.  See if you ever see a write rate
> of less than 4k/sec to the backing disk.  I've run a variety of
> workloads and never have.  I can send you very long iostat 1 logs if
> it would help ;)

The code is buggy here: if 8 sectors are sent into bch_next_delay()
then the throughput will be as expected, but in this case the minimum
writeback rate should be 8, not 1.

I've come to understand you -- yes, you fix this bug :-)

> 
> When the system is writing much more than the control system is asking
> for, the control system is effectively disengaged.  This patch
> increases the range of control authority by allowing the minimum-rate
> sleep interval to be 2.5x longer.
>

Sure, I agree with you. I had a question about why you chose 2.5x,
but it seems you mention it in the following text.


>> I just see the minimum rate increase from 1 to 8 sectors, and the
>> maximum delay increase from 1 to 2.5 seconds. I don't see an exact
>> problem this patch wants to solve. Is it to write back dirty data
>> faster, or to save more power on the cached device?
> 
> So one thing that's really bad currently is that when the disk is
> idle, it repeatedly undergoes load/unload cycles (very long seeks).
> The disks I have-- Seagate ST6000VN0041 Ironwolf NAS 6TB-- seek off
> the active portion of the platter when idle for 750ms.  So the disks
> are making a very loud clunk at one second intervals during idle
> writeback, which is not good for the drives.  When writing back
> faster, they are quieter.  This will at least do it 2.5x less often.
> 

I see. This is a situation I missed in my environment. On my desktop
machine the drives are all SSDs, and my testing server is too noisy and
drowns out the groan of my hard disks. Thanks for your information :-)

I have one more question about this line: "When writing back faster,
they are quieter.  This will at least do it 2.5x less often."

Current bcache code indeed writes back 4K per second at the minimum
writeback rate, if I understand correctly. Your patch changes the
minimum writeback rate to 4K every 2.5 seconds, which makes writing
back slower, not faster. Am I missing something here?


> In a subsequent patch, there's additional things I'd like to do-- like
> be willing to do no write after a wakeup if we are very far ahead.
> That is, to allow the effective value to be much larger than 2.5
> seconds.  This would potentially allow spindown on laptops, etc.  But
> this change is still worthwhile on its own.
> 

OK, looking forward to the subsequent patch :-)


> Another thing that would be helpful would be to issue more than 1
> write at a time, so that queue depth doesn't always equal 1.  Queue
> depth=4 has about 40-50% more throughput--- that is, it completes the
> 4 IOPs in about 2.5x the time of one-- when writes are clustered to
> the same region of the disk/short-stroked.  However this has a
> potential cost in latency, so it needs to be carefully traded off.
> 

It makes sense. But "dc->writeback_rate_minimum = 8" is one 4KB I/O;
your patch makes the minimum-rate I/O up to 2.5x less frequent, and
does not change the I/O size. So this patch just makes writeback I/O
less frequent, and does not increase the number of I/Os for each
writeback?


>> [snipped some]
>>
>> I posted a proposal patch to linux-bcache@vger.kernel.org (see
>> https://www.spinics.net/lists/linux-bcache/msg04837.html), which sets
>> the writeback rate to the maximum value if there is no front end I/O
>> request for a while (normally that suggests more idle time in the
>> future). Then the dirty data on the cache can be cleaned up ASAP. And
>> if a front end request comes in, it sets the writeback rate to the
>> minimum value and lets the PD (now PI) controller adjust it. With this
>> method, if there is no front end I/O, dirty data can be cleaned much
>> faster and the hard disk also has a much better chance to spin down.
> 
> While this could clear out dirty data sooner, I'm not sure this is
> always a good idea.  For intermittent, write-heavy workloads it's
> good, but it substantially increases the chance that occasional random
> reads have to wait a long time-- which is usually the opposite of
> what an IO scheduler tries to achieve.

The idle timeout is 30 seconds; the writeback rate increases after no
read/write for 30 seconds. And once an I/O comes in, the writeback
rate is set to the minimum rate immediately, taking effect for the
next writeback I/O issued. This is why I suggest setting
dc->writeback_rate_minimum to a very small value. A value of 8 makes
sense; a smaller value would still issue the same 4K I/O size. It
makes sense to set dc->writeback_rate_minimum = 8.

Thanks.


* Re: [PATCH] bcache: smooth writeback rate control
  2017-09-20 21:07       ` Coly Li
@ 2017-09-20 21:50         ` Michael Lyle
  2017-09-21 19:20           ` Coly Li
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Lyle @ 2017-09-20 21:50 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-bcache, Kent Overstreet

On Wed, Sep 20, 2017 at 2:07 PM, Coly Li <colyli@suse.de> wrote:
> I get your point here. Current code sets the minimum writeback rate to
> 1 sector, and indeed it writes 4K. But the parameter "done" sent to
> bch_next_delay() is 1 sector, when the correct value should be 8
> sectors.

Not quite.  The value sent to bch_next_delay is 8 sectors.  So then
the code wants to sleep 8 seconds, but it is only willing to sleep at
most one second.  In other words, when rate is low, the controller
becomes ineffective and does one write of any size no matter what.
This patch keeps the controller effective at low target rates-- if it
does a minimally-sized write, it will sleep one second like before--
if it writes more, it will sleep longer.
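
In numbers (a sketch; the real code adjusts d->next rather than a bare
delay):

	/*
	 * Old minimum: rate = 1 sector/sec.  Any real key is >= 8
	 * sectors, so the computed sleep is >= 8s and always hits the
	 * 1s cap -- one write per second regardless of size.
	 *
	 * New minimum: rate = 8 sectors/sec.  A 4k key (8 sectors)
	 * computes exactly 1s; only larger keys run into the new 2.5s
	 * cap.
	 */
	delay_ns = div_u64(done * NSEC_PER_SEC, d->rate);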

> [snip]
> I see. This is a situation I missed in my environment. On my desktop
> machine the drives are all SSDs, and my testing server is too noisy and
> drowns out the groan of my hard disks. Thanks for your information :-)
>
> I have one more question about this line: "When writing back faster,
> they are quieter.  This will at least do it 2.5x less often."
>
> Current bcache code indeed writes back 4K per second at the minimum
> writeback rate, if I understand correctly. Your patch changes the
> minimum writeback rate to 4K every 2.5 seconds, which makes writing
> back slower, not faster. Am I missing something here?

It is still doing the bad thing, but only 40% of the bad thing and
hopefully will wear out the disk slower.  I want to make further
changes to do it even less (and let disks spin down).

The value of 2.5 seconds is chosen because it is half the default
interval at which we recalculate the target rate.  If the rate
increases, we won't be sleeping too long before we start writing
faster.  (I.e. our typical reaction time is now 5 + 2.5/2 = 6.25
seconds, instead of 5 + 1/2 = 5.5 seconds.)

>> In a subsequent patch, there's additional things I'd like to do-- like
>> be willing to do no write after a wakeup if we are very far ahead.
>> That is, to allow the effective value to be much larger than 2.5
>> seconds.  This would potentially allow spindown on laptops, etc.  But
>> this change is still worthwhile on its own.
>>
>
> OK, looking forward to the subsequent patch :-)

This one is a first step just because it's very simple and
self-contained.  The next step on this path is a bit more difficult,
because it requires rewriting bch_next_delay to store the reciprocal
of what it currently does and bounding both the amount it can be
ahead, and the sleep interval.

>> Another thing that would be helpful would be to issue more than 1
>> write at a time, so that queue depth doesn't always equal 1.  Queue
>> depth=4 has about 40-50% more throughput--- that is, it completes the
>> 4 IOPs in about 2.5x the time of one-- when writes are clustered to
>> the same region of the disk/short-stroked.  However this has a
>> potential cost in latency, so it needs to be carefully traded off.
>>
>
> It makes sense. But "dc->writeback_rate_minimum = 8" is one 4KB I/O;
> your patch makes the minimum-rate I/O up to 2.5x less frequent, and
> does not change the I/O size. So this patch just makes writeback I/O
> less frequent, and does not increase the number of I/Os for each
> writeback?

If writeback is doing minimum-sized I/Os with a minimum target rate,
they will happen at the same rate as before.  If what it finds from
the dirty keybuf is a bigger-than-minimum sized I/O, it will sleep
longer.  E.g. if it finds a contiguous 8k block, it will now sleep 2
seconds (and will achieve the target 4k/second as a result).  If it
finds a contiguous >=10k block, it will sleep the maximum 2.5 seconds.
This keeps the controller doing something meaningful even when stuff
being written is larger than the target amount written in one second.

Note that it has an effect even when the rate is above the minimum.
If we're trying to write 1MB/second, and we find a 1.5MB contiguous
block to write, we'll wait 1.5 seconds.
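
As a sketch of the arithmetic (rates are in 512-byte sectors per
second, as in the patch):

	/*
	 * rate = 8 (4k/sec):
	 *    8k key (16 sectors): 16/8 = 2.0s sleep
	 *   10k key (20 sectors): 20/8 = 2.5s sleep (right at the cap)
	 *   12k key (24 sectors): 24/8 = 3.0s, clamped to 2.5s
	 *
	 * rate = 2048 (1MB/sec):
	 *   1.5MB contiguous (3072 sectors): 3072/2048 = 1.5s sleep
	 */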

Mike


* Re: [PATCH] bcache: smooth writeback rate control
  2017-09-20 21:50         ` Michael Lyle
@ 2017-09-21 19:20           ` Coly Li
  0 siblings, 0 replies; 9+ messages in thread
From: Coly Li @ 2017-09-21 19:20 UTC (permalink / raw)
  To: Michael Lyle; +Cc: Coly Li, linux-bcache, Kent Overstreet

On 2017/9/20 11:50 PM, Michael Lyle wrote:
> On Wed, Sep 20, 2017 at 2:07 PM, Coly Li <colyli@suse.de> wrote:
>> I get your point here. Current code sets the minimum writeback rate to
>> 1 sector, and indeed it writes 4K. But the parameter "done" sent to
>> bch_next_delay() is 1 sector, when the correct value should be 8
>> sectors.
> 
> Not quite.  The value sent to bch_next_delay is 8 sectors.  So then
> the code wants to sleep 8 seconds, but it is only willing to sleep at
> most one second.  In other words, when rate is low, the controller
> becomes ineffective and does one write of any size no matter what.
> This patch keeps the controller effective at low target rates-- if it
> does a minimally-sized write, it will sleep one second like before--
> if it writes more, it will sleep longer.
> 

Hi Mike,

Got it. Now I understand why you set dc->writeback_rate_minimum to 8;
it's reasonable and thoughtful. I agree with you. Thanks for your
patient explanation.

>> [snip]
>> I see. This is a situation I missed in my environment. On my desktop
>> machine the drives are all SSDs, and my testing server is too noisy and
>> drowns out the groan of my hard disks. Thanks for your information :-)
>>
>> I have one more question about this line: "When writing back faster,
>> they are quieter.  This will at least do it 2.5x less often."
>>
>> Current bcache code indeed writes back 4K per second at the minimum
>> writeback rate, if I understand correctly. Your patch changes the
>> minimum writeback rate to 4K every 2.5 seconds, which makes writing
>> back slower, not faster. Am I missing something here?
> 
> It is still doing the bad thing, but only 40% of the bad thing and
> hopefully will wear out the disk slower.  I want to make further
> changes to do it even less (and let disks spin down).
> 
> The value of 2.5 seconds is chosen because it is half the default
> interval at which we recalculate the target rate.  If the rate
> increases, we won't be sleeping too long before we start writing
> faster.  (I.e. our typical reaction time is now 5 + 2.5/2 = 6.25
> seconds, instead of 5 + 1/2 = 5.5 seconds.)
> 

OK, I see. So the interval for idle-time writeback will be longer, but
not too long. Interesting -- you want it longer, I want it shorter :-)
Don't worry, I will resolve the conflict on my side.


>>> In a subsequent patch, there's additional things I'd like to do-- like
>>> be willing to do no write after a wakeup if we are very far ahead.
>>> That is, to allow the effective value to be much larger than 2.5
>>> seconds.  This would potentially allow spindown on laptops, etc.  But
>>> this change is still worthwhile on its own.
>>>
>>
>> OK, looking forward to the subsequent patch :-)
> 
> This one is a first step just because it's very simple and
> self-contained.  The next step on this path is a bit more difficult,
> because it requires rewriting bch_next_delay to store the reciprocal
> of what it currently does and bounding both the amount it can be
> ahead, and the sleep interval.
> 

Then please make this patch as simple and understandable as possible.
Bcache does not deserve a very complicated delay calculation policy.
If you remember the discussions years ago about memory readahead and
writeback, you may recall that simple things always win.

>>> Another thing that would be helpful would be to issue more than 1
>>> write at a time, so that queue depth doesn't always equal 1.  Queue
>>> depth=4 has about 40-50% more throughput--- that is, it completes the
>>> 4 IOPs in about 2.5x the time of one-- when writes are clustered to
>>> the same region of the disk/short-stroked.  However this has a
>>> potential cost in latency, so it needs to be carefully traded off.
>>>
>>
>> It makes sense. But "dc->writeback_rate_minimum = 8" is one 4KB I/O;
>> your patch makes the minimum-rate I/O up to 2.5x less frequent, and
>> does not change the I/O size. So this patch just makes writeback I/O
>> less frequent, and does not increase the number of I/Os for each
>> writeback?
> 
> If writeback is doing minimum-sized I/Os with a minimum target rate,
> they will happen at the same rate as before.  If what it finds from
> the dirty keybuf is a bigger-than-minimum sized I/O, it will sleep
> longer.  E.g. if it finds a contiguous 8k block, it will now sleep 2
> seconds (and will achieve the target 4k/second as a result).  If it
> finds a contiguous >=10k block, it will sleep the maximum 2.5 seconds.
> This keeps the controller doing something meaningful even when stuff
> being written is larger than the target amount written in one second.
> 
> Note that it has an effect even when the rate is above the minimum.
> If we're trying to write 1MB/second, and we find a 1.5MB contiguous
> block to write, we'll wait 1.5 seconds.

It is clear to me now. It was a good discussion; it made things
clearer and more understandable for me and for other people.

Could you please add some of the important information from our
discussion to your patch commit log? I would appreciate this; then
more people will get to know how and why you designed the P.I
controller this way, which is valuable.

Now I know the patch is good, and I have no further comments on it.
Thank you for your informative response.

-- 
Coly Li

